[Haskell-cafe] Time consumption nub

Wed Jul 18 08:43:45 EDT 2007

Arie Groeneveld wrote:
> Hi,
> 
> Wondering about time space consuming:  'nub' vs 'map head.group.sort'
> 
> Consider:
> 
> ry = [1..10000] ++ replicate 13 5 ++ replicate 21 34
> 
> 
> 
> *Main> length . nub $ ry
> 10000
> (5.18 secs, 1050000 bytes)
> 
> *Main> length . map head . group . sort $ ry
> 10000
> (0.03 secs, 6293384 bytes)
> 
>        Time     space
> nub     ---      +
> fnub    +++      -
> 
> + is better ;-)
> 
> Thanks
> 
> @@i=arie
> 

nub is working on unsorted input.

If you want (nub.sort) then the best thing to use is a merge sort that discards 
duplicates as it works.  Copying and modifying GHC's Data.List.sort code:

 > -- stolen from http://darcs.haskell.org/packages/base/Data/List.hs
 > -- with 'merge' changed to discard duplicates.

> nsort l = mergesort compare l
> nsortBy cmp l = mergesort compare l
> 
> mergesort :: (a -> a -> Ordering) -> [a] -> [a]
> mergesort cmp = mergesort' cmp . map wrap
> 
> mergesort' :: (a -> a -> Ordering) -> [[a]] -> [a]
> mergesort' cmp [] = []
> mergesort' cmp [xs] = xs
> mergesort' cmp xss = mergesort' cmp (merge_pairs cmp xss)
> 
> merge_pairs :: (a -> a -> Ordering) -> [[a]] -> [[a]]
> merge_pairs cmp [] = []
> merge_pairs cmp [xs] = [xs]
> merge_pairs cmp (xs:ys:xss) = merge cmp xs ys : merge_pairs cmp xss
> 
> merge :: (a -> a -> Ordering) -> [a] -> [a] -> [a]
> merge cmp xs [] = xs
> merge cmp [] ys = ys
> merge cmp (x:xs) (y:ys)
>  = case x `cmp` y of
>         GT -> y : merge cmp (x:xs)   ys
>         LT -> x : merge cmp    xs (y:ys)
>         EQ -> x : merge cmp    xs    ys
> 
> wrap :: a -> [a]
> wrap x = [x]
> 

Then you can use nsort or nsortBy, which benchmark (with -O2) as slightly faster 
than (map head . group . sort)

Cheers,
   Chris