[Haskell-cafe] Optimization problem

Bertram Felgenhauer bertram.felgenhauer at googlemail.com
Thu Sep 14 23:13:29 EDT 2006


Ross Paterson wrote:
> On Thu, Sep 14, 2006 at 05:22:05PM +0200, Bertram Felgenhauer wrote:
> > [much subtle code]
> > We can now build the splitStream function, using the following helper
> > function:
> > 
> > > splitSeq' :: Ord a => Map a () -> [(a,b)] -> ([(a,[b])], Map a [b])
> 
> This works for infinite lists if there is no balancing, but if insert does
> balancing, the top of the map will not be available until the last key
> is seen, so splitSeq' could only be used for finite chunks.  Then you'll
> need a way to put the partial answers together.

Just to prove the point, here's the same code with balancing:

>>> SNIP HERE (end marked with <<<) >>>

module SplitSeq (splitSeq) where

import Prelude hiding (lookup, map)

-- our map
data Map k a  = Node !Int k a (Map k a) (Map k a) | Leaf deriving Show

size :: Map k a -> Int
size Leaf             = 0
size (Node s _ _ _ _) = s

member :: Ord k => k -> Map k a -> Bool
member _ Leaf              = False
member k (Node _ k' _ l r) = case compare k k' of
    LT -> member k l
    EQ -> True
    GT -> member k r

-- insert key into blueprint and extract the corresponding value from
-- the second argument, threading it backward through all operations
insert :: Ord k => k -> Map k () -> Map k a -> (Map k (), a, Map k a)
insert k Leaf            ~(Node _ _ a _ _)
    = (Node 1 k () Leaf Leaf, a, Leaf)
insert k (Node s k' _ l r) node
    = case compare k k' of
        LT -> let (m, a, l'')      = insert  k l l'
                  (m', a', l', r') = balance k' m r node
              in  (m', a, Node s k' a' l'' r')
        EQ -> error "inserting existing element"
        GT -> let (m, a, r'')      = insert  k r r'
                  (m', a', l', r') = balance k' l m node
              in  (m', a, Node s k' a' l' r'')

-- balance and co are taken from Data.Map and adapted
balance k l r node
    | size l + size r <= 1 = let Node _ _ a l' r' = node
                             in  (mkNode k () l r, a, l', r')
    | size r >= 5 * size l = rotateL k l r node
    | size l >= 5 * size r = rotateR k l r node
    | otherwise            = let Node _ _ a l' r' = node
                             in  (mkNode k () l r, a, l', r')

rotateL k l r@(Node _ _ _ l' r') node
    | size l' < 2*size r' = singleL k l r node
    | otherwise           = doubleL k l r node

rotateR k l@(Node _ _ _ l' r') r node
    | size r' < 2*size l' = singleR k l r node
    | otherwise           = doubleR k l r node

singleL k l (Node s k' _ m r)
    ~(Node _ _ a ~(Node _ _ a' l' m') r') =
        (mkNode k' () (mkNode k () l m) r,
        a', l', Node s k' a m' r')

singleR k (Node s k' _ l m) r
    ~(Node _ _ a l' ~(Node _ _ a' m' r')) =
        (mkNode k' () l (mkNode k () m r), 
        a', Node s k' a l' m', r')

doubleL k l (Node s k' _ (Node s' k'' _ ml mr) r)
    ~(Node _ _ a ~(Node _ _ a' l' ml') ~(Node _ _ a'' mr' r')) =
        (mkNode k'' () (mkNode k () l ml) (mkNode k' () mr r),
        a', l', Node s k' a'' (Node s' k'' a ml' mr') r')

doubleR k (Node s k' _ l (Node s' k'' _ ml mr)) r
    ~(Node _ _ a ~(Node _ _ a' l' ml') ~(Node _ _ a'' mr' r')) =
        (mkNode k'' () (mkNode k' () l ml) (mkNode k () mr r),
        a'', Node s k' a' l' (Node s' k'' a ml' mr'), r')

-- make a new node with the correct size
mkNode k x l r = Node (size l + size r + 1) k x l r

-- update the element associated with the given key
update :: Ord k => k -> (a -> a) -> Map k x -> Map k a -> Map k a
update k f (Node s k' _ l r) ~(Node _ _ a' l' r') = case compare k k' of
    LT -> Node s k' a' (update k f l l') r'
    EQ -> Node s k' (f a') l' r'
    GT -> Node s k' a' l' (update k f r r')

-- standard map function, no blueprints here
map :: (a -> b) -> Map k a -> Map k b
map _ Leaf             = Leaf
map f (Node s k a l r) = Node s k (f a) (map f l) (map f r)

-- finally, define splitSeq
splitSeq :: Ord a => [(a,b)] -> [(a,[b])]
splitSeq = fst . splitSeq' Leaf

splitSeq' :: Ord a => Map a () -> [(a,b)] -> ([(a,[b])], Map a [b])
splitSeq' bp []         = ([], map (const []) bp)
splitSeq' bp ((a,b):xs) = case member a bp of
    True  -> let (l, m) = splitSeq' bp xs in (l, update a (b:) bp m)
    False -> let (bp', a', m) = insert a bp m'
                 (l, m')      = splitSeq' bp' xs
             in
                 ((a, b : a') : l, m)
<<<

The balancing code is adopted from Data.Map. I added threading back the
result map (from splitSeq) to the balancing operations.

This results in the promised O(n*log m) time. The cost associated with
a lazy pattern (it creates a closure for each bound variable, IIRC)
is quite high but constant, so each call to insert takes O(log m)
time, including future forces of the lazy patterns.

Likewise, member and update are also O(log m).

Example test:

Prelude SplitSeq> print $ take 11 $ snd $ splitSeq ([(n `mod` 10000, 'a') | n <- [1..100000]] ++ [(n, 'b') | n <- [0..]]) !! 9534
"aaaaaaaaaab"
(1.07 secs, 137026252 bytes)

The nonbalancing version would take ages.

regards,

Bertram


More information about the Haskell-Cafe mailing list