[Haskell-cafe] Optimization problem

Thu Sep 14 11:22:05 EDT 2006

Magnus Jonsson wrote:
> splitStreams::Ord a=>[(a,b)]->[(a,[b])]
> 
> >splitStreams [(3,x),(1,y),(3,z),(2,w)]
> [(3,[x,z]),(1,[y]),(2,[w])]

> I'm afraid this algorithm is O(n*m) time in the worst case, where n is the 
> length of the input list, and m is the number of unique channels.
> 
> But is there any way to write it such that each element is touched only 
> once? Or at least an O(n*log(m)) algorithm?

This can be done. It's an interesting puzzle to make it work for infinite
lists. For finite lists, sortBy + groupBy + map easily do the job.

The problem is dealing with holes in data structures in Haskell, which
are to be filled in later. This can be done by providing blueprints for
them. Let's define a basic map:

> data Map k a  = Node k a (Map k a) (Map k a) | Leaf

We will use Map k () as blueprints - the () indicate where the holes
are. Next we define a lookup function, which takes a blueprint and
evaluates the correct hole in a second map:

> lookup :: Ord k => k -> Map k () -> Map k a -> Maybe a
> lookup _ Leaf            _                  = Nothing
> lookup k (Node k' _ l r) ~(Node _ a' l' r') = case compare k k' of
>     LT -> lookup k l l'
>     EQ -> Just a'
>     GT -> lookup k r r'

As you can see, the structure of the second argument is forced
by the first argument. The lazy pattern assures that we don't look
at the second argument too early.

In a similar fashion, we can define an update function:

> update :: Ord k => k -> (a -> a) -> Map k x -> Map k a -> Map k a
> update k f (Node k' _ l r) ~(Node _ a' l' r') = case compare k k' of
>     LT -> Node k' a' (update k f l l') r'
>     EQ -> Node k' (f a') l' r'
>     GT -> Node k' a' l' (update k f r r')

Next comes insert. Insert takes a blueprint and inserts a key into it.
It also takes a map and removes the corresponding hole from it. To
simplify the code it does no balancing. [1]

> insert :: Ord k => k -> Map k () -> Map k a -> (Map k (), Map k a)
> insert k Leaf            _
>     = (Node k () Leaf Leaf, Leaf)
> insert k (Node k' _ l r) ~(Node _ a' l' r')
>     = case compare k k' of
>         LT -> let (m, m') = insert k l l' in
>               (Node k' () m r, Node k' a' m' r')
>         EQ -> error "inserting existing element"
>         GT -> let (m, m') = insert k r r' in
>               (Node k' () l m, Node k' a' l' m')

We also need a map, defined in the usual fashion, without the blueprint.
A version with blueprint can also be defined, but it isn't necessary for
our problem:

> map :: (a -> b) -> Map k a -> Map k b
> map _ Leaf           = Leaf
> map f (Node k a l r) = Node k (f a) (mapMap f l) (mapMap f r)

We can now build the splitStream function, using the following helper
function:

> splitSeq' :: Ord a => Map a () -> [(a,b)] -> ([(a,[b])], Map a [b])
> splitSeq' bp []         = ([], map (const []) bp)
> splitSeq' bp ((a,b):xs) = case lookup a bp bp of
>     Just _ -> let (l, m) = splitSeq' bp xs in (l, update a (b:) bp m)
>     _      -> let (bp', m) = insert a bp m'
>                   (l, m')  = splitSeq' bp' xs
>                 in
>                   ((a, b : (fromJust $ lookup a bp' m')) : l, m)

splitSeq' takes a blueprint for a map with all keys seen so far, and
a list tail. It returns the result list for all new keys, and a map
(corresponding to the given blueprint) with the tails of the seen
elements.

The in the base case it just fills the blueprint with empty lists and
returns to the caller.

If a new element is seen, insert is used to create a new blueprint,
including the new key a, which is passed to the recursive call of
splitSeq'. The resulting map from that call is threaded back into
insert, which gives us a new map without the a key which matches
the structure of the original blueprint.

Finally we can define splitSeq as follows:

> splitSeq :: Ord a => [(a,b)] -> [(a,[b])]
> splitSeq = fst . splitSeq' Leaf

A quick test:
*Main> let s = splitSeq ([(3,'x'),(1,'y'),(3,'z'),(2,'w')] ++ repeat (4,' '))
*Main> s !! 0
(3,"xzInterrupted.  (I pressed ^C)
*Main> s !! 2
(2,"wInterrupted.   (ditto)
*Main> s !! 3
(4,"       ...

Is there a simpler way to do this? I don't know.

I'm also unsure whether it is a good idea - unless you use several
threads to process the list the code will produce a lot of thunks,
and eat a lot of memory.

The code above provides a maximum amount of lazyness while using
O(n log m) time. Depending on the circumstances processing the list in
chunks and using techniques like the above to combine the result will
be better.

enjoy,

Bertram

[1] Balancing can be done with the information in the blueprint, and
mapping back is easily done by doing the transformation on the tree
in reverse.