[Haskell-beginners] How would you improve this program?
Daniel Fischer
daniel.is.fischer at googlemail.com
Mon Oct 10 00:08:01 CEST 2011
On Sunday 09 October 2011, 22:11:35, Lorenzo Bolla wrote:
> Hi all,
> I'm new to Haskell and I'd like you to take a look at one of my programs
> and tell me how you would improve it (in terms of efficiency, style,
> and so on!).
>
> The source code is here:
> https://github.com/lbolla/stanford-cs240h/blob/master/lab1/lab1.hs
> The program is an implementation of this problem:
> http://www.scs.stanford.edu/11au-cs240h/labs/lab1.html (basically,
> counting how many times each word appears in a text.)
> (I'm not a Stanford student, so by helping me out you won't be helping
> me cheat on an exam, don't worry!)
A few problems:
1) Your input cleaning is not correct. Running it over the source file, I
get (among others):
-- #################################################
-> ##################################
#############################
$ ########################
Now, they say they aren't going to be tricky, so most of this isn't
strictly necessary to take care of, but dashes/hyphens -- e.g. for
parenthetical remarks -- should probably be dealt with, as should the
style with one em-dash not surrounded by spaces---like this.
Also quotes:
"Bob said: 'I will go there!'", Alice told the listeners.
2) If you have a looong but rare word in the text, you will get too many
spaces between the words and the markers; that's not good.
3) You calculate the number of '#'s using the total line length and then
subtract the length of (word + spaces), skewing your output. You should
calculate (spaceForHashes = lineWidth - maxLength - 1) and then use that
to determine the number of hashes.
[Although really it should be more complicated: maxLength should only
include those words that are actually output.]
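Concretely, something like this sketch (barFor and the parameter names
are made up; they stand for values your program already knows):

    -- count: occurrences of this word; maxCount: occurrences of the
    -- most frequent word; the longest bar exactly fills the free space.
    barFor :: Int -> Int -> Int -> Int -> String
    barFor lineWidth maxLength maxCount count =
        replicate (count * spaceForHashes `div` maxCount) '#'
      where
        -- width left for hashes after the longest word plus one space
        spaceForHashes = lineWidth - maxLength - 1

E.g. barFor 80 10 25 25 gives a full-width bar of 69 '#'s.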
Style is okay, although like Michael Xavier, I'm more of a where-guy.
>
> I've implemented 3 versions of the algorithm:
>
> 1. a Haskell version using the standard "sort": read all the words
> from stdin, sort them and group them.
Not much you can do there.
> 2. a Haskell version using map: read all the words from stdin, stick
> each word in a Data.Map incrementing a counter if the word is already
> present in the map.
You build the map via fromListWith (+). That's bad if you have input with
large counts: fromListWith (+) gives you thunks of the form
((...((1+1)+1)...)+1) as map entries, which costs a lot of space and time.
Better build the map using insertWith', which forces the updated counter
before it goes into the map:

    import Data.List (foldl')
    import Data.Map (empty, insertWith')

    mp = foldl' ins empty ws        -- ws :: [String], the input words
      where
        ins m w = insertWith' (+) w 1 m
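For reference, a self-contained toy version of that approach (just a
sketch; it prints bare counts, no formatting):

    import Data.List (foldl')
    import qualified Data.Map as M

    main :: IO ()
    main = do
        ws <- fmap words getContents
        let mp = foldl' (\m w -> M.insertWith' (+) w 1 m) M.empty ws
        mapM_ (\(w, n) -> putStrLn (w ++ " " ++ show n)) (M.toList mp)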
> 3. a Python version using defaultdict.
>
> I timed the different versions and the results are here:
> https://github.com/lbolla/stanford-cs240h/blob/master/lab1/times.png.
> The Python version is the quickest (I stripped out the fancy formatting
> before benchmarking, so IO is not responsible for the time difference).
> Any comments on the graph, too?
sort is O(n*log n); map and Python are O(n*log d), where n is the number
of words and d the number of distinct words. (Strictly, Python's
defaultdict is hash-based, so it's closer to expected O(n), but that
doesn't change the picture.) In the range of the graph, all of them look
quite linear. The number of distinct words is much smaller than the number
of words, so log d is a smaller 'constant' (logarithmic) factor than
log n.
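To put rough numbers on that (made-up figures): with n = 10^6 words and
d = 10^4 distinct words, log2 n is about 20 while log2 d is about 13, so
the map-based version does roughly a third less comparison work per word
than sort -- a constant-factor gap, which is why the curves look like
straight lines with different slopes.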
Python seems to have a smaller factor than map. I'm not sure whether
that's because Python can mutate the dictionary in place while Haskell's
immutable Map requires some copying on each insert, or whether it's due to
the thunks and the excessive space usage they cause.
It would be interesting to see the graph redone with the more efficient
map-building.