[Haskell-beginners] How would you improve this program?

Mon Oct 10 14:11:29 CEST 2011

On Monday 10 October 2011, 10:13:45, Lorenzo Bolla wrote:
> A part from style, is there any reason why one should prefer "where"
> over "let..in"?

Undoubtedly, some people have come up with theoretical reasons why one is 
generally better than the other. I haven't seen a convincing one yet, 
though.

There are some cases where one is clearly better than the other, bindings 
scoping over guards,

foo x y z
  | a == 0 = ...
  | b <= a = ...
  | a <= b+c = ...
  | otherwise = ...
    where
      a = notTooExpensive x y
      b = soSoExpensive z y
      c = reallyExpensive x y z

are much easier to write and read with where. On the other hand,
let binds in expr is an expression, so it's easily usable in contexts 
requiring an expression such as lambda abstractions and do-blocks,

map (\x -> let a = h x in if b then foo a x else bar a x) list

do a <- b
   c <- d
   let e = f a c
   bar e c
   baz
   foo a e

where a where would be awkward at best (though the map example isn't 
particularly good, there

map quux list
  where
    quux x = if b then foo a x else bar a x
      where
        a = h x

isn't bad and becomes easier to understand if the lambda becomes 
complicated enough).

So there are places where one is (clearly) more convenient/better than the 
other, but in other places it's purely up to personal preference.

> > You build the map via fromListWith (+). That's bad if you have input
> > with large counts. fromListWith (+) gives you thunks of the form
> > (...(1+1)+1)...+1) as map-entries. That costs a lot of space and time.
> > Better build the map using insertWith',
> >
> >  mp = foldl' ins empty words
> >    where
> >      ins m w = insertWith' (+) w 1 m
> 
> Please find the "strict" version here:
> https://github.com/lbolla/stanford-cs240h/blob/master/lab1/lab1.hs.

One further time-saver (in printTokens):

    maxCount = maximum $ map count tokens

you need not traverse the entire list of tokens for that,

    maxCount = case sortedTokens of
                 (Token _ c:_) -> c

> 
> Plot of times is now:
> https://github.com/lbolla/stanford-cs240h/blob/master/lab1/times.png
> 
> Python is still slightly faster, but I'm getting there.
> 
> >    3. a Python version using defaultdict.
> >
> > sort is O(n*log n), map and python are O(n*log d), where n is the
> > number of words and d the number of distinct words.
> 
> I believe Python's defaultdict have (optimistically) O(1) lookup and
> insert, because it is implemented as a hashmap.

Ah, I tend to forget about hashmaps. Although the O(1) isn't quite true 
[calculating the hash isn't bounded time, and the lookup in the bucket need 
not be either], for a good hashtable it's true enough that it scales better 
than a tree-based map.