[Haskell-beginners] Code help requested

Tue Jan 12 06:57:55 EST 2010

Am Dienstag 12 Januar 2010 02:22:49 schrieb Joe Van Dyk:
> I've written two versions of the same program, one in ruby and one in
> haskell.  Given words on stdin, find all the anagrams in those words.
> For nicer display, we're only going to display the top 3 results.
>
> I'm obviously new to haskell.  The ruby version runs about 5x as fast
> on a large file.   How can I improve the haskell version?
>
> http://gist.github.com/274774
>
> # Ruby version
> input = STDIN.read.split("\n")
> result = Hash.new([])
> input.each do |word|
>   sorted_word = word.split('').sort.join
>   result[sorted_word] += [word]
> end
> values = result.values.sort { |a, b| b.size <=> a.size }
> p values[0..3]
>
> # Haskell version
> import List
> import qualified Data.Map as Map
>
> -- Given as stdin
> -- presents
> -- serpents
> -- no
> -- on
> -- whatever
> -- Expected Output:
> -- [["serpents","presents"],["on","no"]]
>
> -- This version only displays words that have more than one
> -- match in the list, and sorts by the words that got the most matches.
>
> -- Can we do the map bit better?
>
> main = do
>   input <- getContents
>   print $ anagrams $ lines input
>
> anagrams words =
>   sorted_anagrams
>   where
>     sorted_anagrams = sortBy sorter filtered_anagrams
>     sorter a b = compare (length b)  (length a)
>     filtered_anagrams = Map.elems $ Map.filter filter_function
> all_anagrams filter_function words = length words > 1
>     all_anagrams = do_anagrams words Map.empty
>     do_anagrams [] result = result
>     do_anagrams words result = do_anagrams
>                                  (tail words)

Here be dragons.
unionWith is O(n+m) where n and m are the sizes of the two maps, the insert 
variants are O(log n). So this may be quadratic [actually, if the second 
map in unionWith is a singleton, it behaves better, but it's still much 
slower than inserts] (and very lazy, which means that without 
optimisations, all the unions form a giant thunk which overflows the stack 
for large enough input), using

all_anagrams 
    = foldl' (\m w -> Map.insertWith' (++) (sort w) [w] m) Map.empty words

, you get an O(n*log n) algorithm with sufficient strictness to not blow 
the stack. For

$ wc -l /usr/share/dict/words
380645 /usr/share/dict/words

that is a heck of a difference.

>                                  (Map.unionWith
>                                    (++)

Don't use Map.fromList [(key,value)], use Map.singleton key value instead.

>                                    (Map.fromList
> [(sorted_current_word, [current_word])])
>                                    result)
>       where
>         current_word = head words
>         sorted_current_word = sort current_word

While the original version got a stack overflow without optimisations, it 
ran with -O2, but took a *lot* of memory and was ~10% slower than the Ruby 
version. But it spent 68% of the time garbage collecting.

Change all_anagrams as above, and it uses reasonable memory (about 30% more 
than Ruby if left to choose how much to use, it can run on less than Ruby 
with +RTS -MxM, but that of course increases GC times a bit) and takes 
about a third of the time of the Ruby version (*without optimisations*, -O2 
makes only a small difference [~6%] here).

$ time ruby ./Anagrams.rb < /usr/share/dict/words > /dev/null
26.27user 0.21system 0:26.48elapsed 100%CPU

$ ./AnagramsH +RTS -sstderr < /usr/share/dict/words  > /dev/null
./AnagramsH +RTS -sstderr
   1,807,965,184 bytes allocated in the heap
     577,083,904 bytes copied during GC
      73,277,232 bytes maximum residency (12 sample(s))
         858,644 bytes maximum slop
             166 MB total memory in use (1 MB lost due to fragmentation)

  Generation 0:  3437 collections,     0 parallel,  2.47s,  2.62s elapsed
  Generation 1:    12 collections,     0 parallel,  1.07s,  1.25s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    5.16s  (  5.19s elapsed)
  GC    time    3.54s  (  3.86s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    8.70s  (  9.05s elapsed)

  %GC time      40.7%  (42.7% elapsed)

  Alloc rate    350,629,823 bytes per MUT second

  Productivity  59.3% of total user, 57.0% of total elapsed