[Haskell-beginners] How would you improve this program?
Daniel Fischer
daniel.is.fischer at googlemail.com
Mon Oct 10 00:08:01 CEST 2011
On Sunday 09 October 2011, 22:11:35, Lorenzo Bolla wrote:
> Hi all,
> I'm new to Haskell and I'd like you to take a look at one of my programs
> and tell me how you would improve it (in terms of efficiency, style,
> and so on!).
>
> The source code is here:
> https://github.com/lbolla/stanford-cs240h/blob/master/lab1/lab1.hs
> The program is an implementation of this problem:
> http://www.scs.stanford.edu/11au-cs240h/labs/lab1.html (basically,
> counting how many times each word appears in a text.)
> (I'm not a Stanford student, so by helping me out you won't be helping
> me cheat on an exam, don't worry!)
A few problems:
1) Your input cleaning is not correct. Running it over the source file, I
get (among others):
-- #################################################
-> ##################################
#############################
$ ########################
Now, they say they aren't going to be tricky, so most of this isn't
strictly necessary to take care of, but dashes/hyphens -- e.g. for
parenthetical remarks -- should probably be dealt with, as should the
style with one em-dash not surrounded by spaces---like this.
Also quotes:
"Bob said: 'I will go there!'", Alice told the listeners.
2) If you have a looong but rare word in the text, you will get too many
spaces between the words and the markers; that's not good.
3) You calculate the number of '#'s using the total line length and then
subtract the length of (word + spaces), skewing your output. You should
calculate (spaceForHashes = lineWidth - maxLength - 1) and then use that
to determine the number of hashes.
[Although really it should be more complicated: maxLength should only
include those words that are actually output.]
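Concretely, something like this sketch (barFor and the parameter names
are made up; they stand for values your program already knows):

    -- count: occurrences of this word; maxCount: occurrences of the
    -- most frequent word; the longest bar exactly fills the free space.
    barFor :: Int -> Int -> Int -> Int -> String
    barFor lineWidth maxLength maxCount count =
        replicate (count * spaceForHashes `div` maxCount) '#'
      where
        -- width left for hashes after the longest word plus one space
        spaceForHashes = lineWidth - maxLength - 1

E.g. barFor 80 10 25 25 gives a full-width bar of 69 '#'s.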
Style is okay, although like Michael Xavier, I'm more of a where-guy.
>
> I've implemented 3 versions of the algorithm:
>
> 1. a Haskell version using the standard "sort": read all the words
> from stdin, sort them and group them.
Not much you can do there.
> 2. a Haskell version using map: read all the words from stdin, stick
> each word in a Data.Map incrementing a counter if the word is already
> present in the map.
You build the map via fromListWith (+). That's bad if you have input with
large counts: fromListWith (+) gives you thunks of the form
((...((1+1)+1)...)+1) as map entries, which costs a lot of space and time.
Better build the map using insertWith', which forces the updated counter
before it goes into the map:

    import Data.List (foldl')
    import Data.Map (empty, insertWith')

    mp = foldl' ins empty ws        -- ws :: [String], the input words
      where
        ins m w = insertWith' (+) w 1 m
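For reference, a self-contained toy version of that approach (just a
sketch; it prints bare counts, no formatting):

    import Data.List (foldl')
    import qualified Data.Map as M

    main :: IO ()
    main = do
        ws <- fmap words getContents
        let mp = foldl' (\m w -> M.insertWith' (+) w 1 m) M.empty ws
        mapM_ (\(w, n) -> putStrLn (w ++ " " ++ show n)) (M.toList mp)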
> 3. a Python version using defaultdict.
>
> I timed the different versions and the results are here:
> https://github.com/lbolla/stanford-cs240h/blob/master/lab1/times.png.
> The Python version is the quickest (I stripped out the fancy formatting
> before benchmarking, so IO is not responsible for the time difference).
> Any comments on the graph, too?
sort is O(n*log n); map and Python are O(n*log d), where n is the number
of words and d the number of distinct words. (Strictly, Python's
defaultdict is hash-based, so it's closer to expected O(n), but that
doesn't change the picture.) In the range of the graph, all of them look
quite linear. The number of distinct words is much smaller than the number
of words, so log d is a smaller 'constant' (logarithmic) factor than
log n.
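To put rough numbers on that (made-up figures): with n = 10^6 words and
d = 10^4 distinct words, log2 n is about 20 while log2 d is about 13, so
the map-based version does roughly a third less comparison work per word
than sort -- a constant-factor gap, which is why the curves look like
straight lines with different slopes.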
Python seems to have a smaller factor than map. I'm not sure whether
that's because Python can mutate the dictionary in place while Haskell's
immutable Map requires some copying on each insert, or whether it's due to
the thunks and the excessive space usage they cause.
It would be interesting to see the graph redone with the more efficient
map-building.