[Haskell-beginners] Code help requested
Daniel Fischer
daniel.is.fischer at web.de
Wed Jan 13 05:56:45 EST 2010
Am Mittwoch 13 Januar 2010 03:38:46 schrieb Tim Perry:
> I compiled the original version, Yusaka's version, and a version I wrote
> and found the following:
>
> $ time ./Anagram_me < /usr/share/dict/words > /dev/null
> real 0m2.197s
> user 0m2.040s
> sys 0m0.160s
>
> $ time ./Anagram_JoeVanDyke < /usr/share/dict/words > /dev/null
> real 0m4.570s
> user 0m4.290s
> sys 0m0.260s
>
> perry at emperor:~/haskell$ time ./Anagram_Yusaku < /usr/share/dict/words >
> /dev/null real 0m1.337s
> user 0m1.230s
> sys 0m0.100s
>
>
> From
> this, it looks like mine version takes less than half the time of the
> original. However, if I run a bigger dictionary (Ubuntu package
> wamerican-large instead of wamerican-small) then I'm only about 30%
> faster than the original. This makes me think I have some sort of
> exponential data structure growth going on. Here is my version. Can
> anyone confirm that data structure growth is the problem with my
> approach? Thanks, Tim
>
>
>
> import Data.List as Lst
> import Data.Map as Map
>
> -- This version only displays words that have more than two
> -- match in the list, and sorts by the words that got the most matches.
>
> -- Can we do the map bit better?
>
> main = do
> input <- getContents
> print $ anagrams $ lines input
>
> anagrams words =
> sorted_anagrams
> where
> sorted_anagrams = sortBy sorter filtered_anagrams
> sorter a b = compare (length b) (length a)
> longEnoughWords = [x | x <- words, length x > 1]
The words are short here, so it's not catastrophic, but
*don't use length unless you really want to know the length*
Here, use (not . null . drop 1) [an input line might be empty, so don't use
tail], in general, instead of
length list > k, use not . null $ drop k list;
if you want to check (length list == k),
case drop (k-1) list of
(_:[]) -> True
_ -> False
is O(min k (length list)), if there's a slight possibility that the list is
much longer than k, it's safer.
However, it's unlikely that there are more than 52 one-letter words in the
word list, so filtering out those shouldn't make it faster.
> filtered_anagrams = [x | x <- Map.elems $ foldr insert empty $
> wordPairs, length x > 2] where
That's bad.
Using foldr to construct the map, you must have the whole list from which
to construct it in the memory at once - since the list takes less memory
than the map, that is not a real problem, if you run out of memory thus,
you would anyway - and can start constructing the map only after the entire
reading is done - this is the real problem.
You build a nice huge thunk that way, which may blow the stack. And it's
slow.
foldr is for the cases where you can start returning output before the
entire list has been consumed, a necessary condition for that is that the
accumulation function is lazy in its second argument, like (++), (&&),
(||).
In practically all other cases, you want foldl' (there might be a few cases
where foldl is what you want, I haven't seen such a case yet, though).
> wordPairs = zip (Prelude.map Lst.sort longEnoughWords)
> longEnoughWords
> insert (sorted, original) = insertWith (++) sorted [original]
changing foldr to foldl' and insert to
insert m (sorted,original) = insertWith (++) sorted [original] m
reduces the time on my /usr/share/dict/words from 31 seconds to 8.
The difference is 0.64s vs 0.76s for 40,000 words, 1.17s vs. 1.80s on
70,000 words, 2.12s vs. 4.22s on 120,000 words.
More information about the Beginners
mailing list