[Haskell-beginners] Code help requested
Daniel Fischer
daniel.is.fischer at web.de
Tue Jan 12 06:57:55 EST 2010
Am Dienstag 12 Januar 2010 02:22:49 schrieb Joe Van Dyk:
> I've written two versions of the same program, one in ruby and one in
> haskell. Given words on stdin, find all the anagrams in those words.
> For nicer display, we're only going to display the top 3 results.
>
> I'm obviously new to haskell. The ruby version runs about 5x as fast
> on a large file. How can I improve the haskell version?
>
> http://gist.github.com/274774
>
> # Ruby version
> input = STDIN.read.split("\n")
> result = Hash.new([])
> input.each do |word|
> sorted_word = word.split('').sort.join
> result[sorted_word] += [word]
> end
> values = result.values.sort { |a, b| b.size <=> a.size }
> p values[0..3]
>
> # Haskell version
> import List
> import qualified Data.Map as Map
>
> -- Given as stdin
> -- presents
> -- serpents
> -- no
> -- on
> -- whatever
> -- Expected Output:
> -- [["serpents","presents"],["on","no"]]
>
> -- This version only displays words that have more than one
> -- match in the list, and sorts by the words that got the most matches.
>
> -- Can we do the map bit better?
>
> main = do
> input <- getContents
> print $ anagrams $ lines input
>
> anagrams words =
> sorted_anagrams
> where
> sorted_anagrams = sortBy sorter filtered_anagrams
> sorter a b = compare (length b) (length a)
> filtered_anagrams = Map.elems $ Map.filter filter_function
> all_anagrams filter_function words = length words > 1
> all_anagrams = do_anagrams words Map.empty
> do_anagrams [] result = result
> do_anagrams words result = do_anagrams
> (tail words)
Here be dragons.
unionWith is O(n+m) where n and m are the sizes of the two maps, the insert
variants are O(log n). So this may be quadratic [actually, if the second
map in unionWith is a singleton, it behaves better, but it's still much
slower than inserts] (and very lazy, which means that without
optimisations, all the unions form a giant thunk which overflows the stack
for large enough input), using
all_anagrams
= foldl' (\m w -> Map.insertWith' (++) (sort w) [w] m) Map.empty words
, you get an O(n*log n) algorithm with sufficient strictness to not blow
the stack. For
$ wc -l /usr/share/dict/words
380645 /usr/share/dict/words
that is a heck of a difference.
> (Map.unionWith
> (++)
Don't use Map.fromList [(key,value)], use Map.singleton key value instead.
> (Map.fromList
> [(sorted_current_word, [current_word])])
> result)
> where
> current_word = head words
> sorted_current_word = sort current_word
While the original version got a stack overflow without optimisations, it
ran with -O2, but took a *lot* of memory and was ~10% slower than the Ruby
version. But it spent 68% of the time garbage collecting.
Change all_anagrams as above, and it uses reasonable memory (about 30% more
than Ruby if left to choose how much to use, it can run on less than Ruby
with +RTS -MxM, but that of course increases GC times a bit) and takes
about a third of the time of the Ruby version (*without optimisations*, -O2
makes only a small difference [~6%] here).
$ time ruby ./Anagrams.rb < /usr/share/dict/words > /dev/null
26.27user 0.21system 0:26.48elapsed 100%CPU
$ ./AnagramsH +RTS -sstderr < /usr/share/dict/words > /dev/null
./AnagramsH +RTS -sstderr
1,807,965,184 bytes allocated in the heap
577,083,904 bytes copied during GC
73,277,232 bytes maximum residency (12 sample(s))
858,644 bytes maximum slop
166 MB total memory in use (1 MB lost due to fragmentation)
Generation 0: 3437 collections, 0 parallel, 2.47s, 2.62s elapsed
Generation 1: 12 collections, 0 parallel, 1.07s, 1.25s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 5.16s ( 5.19s elapsed)
GC time 3.54s ( 3.86s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 8.70s ( 9.05s elapsed)
%GC time 40.7% (42.7% elapsed)
Alloc rate 350,629,823 bytes per MUT second
Productivity 59.3% of total user, 57.0% of total elapsed
More information about the Beginners
mailing list