[Haskell-beginners] Code help requested

Wed Jan 13 05:56:45 EST 2010

Am Mittwoch 13 Januar 2010 03:38:46 schrieb Tim Perry:
> I compiled the original version, Yusaka's version, and a version I wrote
> and found the following:
>
> $ time ./Anagram_me < /usr/share/dict/words > /dev/null
> real    0m2.197s
> user    0m2.040s
> sys     0m0.160s
>
> $ time ./Anagram_JoeVanDyke < /usr/share/dict/words > /dev/null
> real    0m4.570s
> user    0m4.290s
> sys     0m0.260s
>
> perry at emperor:~/haskell$ time ./Anagram_Yusaku < /usr/share/dict/words >
> /dev/null real    0m1.337s
> user    0m1.230s
> sys     0m0.100s
>
>
> From
> this, it looks like mine version takes less than half the time of the
> original. However, if I run a bigger dictionary (Ubuntu package
> wamerican-large instead of wamerican-small) then I'm only about 30%
> faster than the original. This makes me think I have some sort of
> exponential data structure growth going on.  Here is my version. Can
> anyone confirm that data structure growth is the problem with my
> approach?  Thanks, Tim
>
>
>
> import Data.List as Lst
> import Data.Map as Map
>
> -- This version only displays words that have more than two
> -- match in the list, and sorts by the words that got the most matches.
>
> -- Can we do the map bit better?
>
> main = do
>   input <- getContents
>   print $ anagrams $ lines input
>
> anagrams words =
>   sorted_anagrams
>   where
>     sorted_anagrams = sortBy sorter filtered_anagrams
>     sorter a b = compare (length b) (length a)
>     longEnoughWords = [x | x <- words, length x > 1]

The words are short here, so it's not catastrophic, but

*don't use length unless you really want to know the length*

Here, use (not . null . drop 1) [an input line might be empty, so don't use 
tail], in general, instead of 

length list > k, use not . null $ drop k list;

if you want to check (length list == k),

case drop (k-1) list of
  (_:[]) -> True
 _      -> False

is O(min k (length list)), if there's a slight possibility that the list is 
much longer than k, it's safer.

However, it's unlikely that there are more than 52 one-letter words in the 
word list, so filtering out those shouldn't make it faster.

>     filtered_anagrams = [x | x <- Map.elems $ foldr insert empty $
> wordPairs, length x > 2] where

That's bad. 

Using foldr to construct the map, you must have the whole list from which 
to construct it in the memory at once - since the list takes less memory 
than the map, that is not a real problem, if you run out of memory thus, 
you would anyway - and can start constructing the map only after the entire 
reading is done - this is the real problem. 
You build a nice huge thunk that way, which may blow the stack. And it's 
slow.

foldr is for the cases where you can start returning output before the 
entire list has been consumed, a necessary condition for that is that the 
accumulation function is lazy in its second argument, like (++), (&&), 
(||).
In practically all other cases, you want foldl' (there might be a few cases 
where foldl is what you want, I haven't seen such a case yet, though).

>          wordPairs =  zip (Prelude.map Lst.sort longEnoughWords)
>                                    longEnoughWords
>          insert (sorted, original) = insertWith (++) sorted [original]

changing foldr to foldl' and insert to

insert m (sorted,original) = insertWith (++) sorted [original] m

reduces the time on my /usr/share/dict/words from 31 seconds to 8.

The difference is 0.64s vs 0.76s for 40,000 words, 1.17s vs. 1.80s on 
70,000 words, 2.12s vs. 4.22s on 120,000 words.