[Haskell-beginners] Space consumption and strictness in strict map

John Lusk johnlusk4 at gmail.com
Tue Oct 27 13:38:40 UTC 2015


Hello, beginners at haskell.org.  I posted this a few days ago, but it looks
like attachments get pulled out by Mailman, whether or not they are small
text files ( :>( ), so I'm trying again, sans attachment.
_________________________________________

Hey, I have a problem I "solved" by sprinkling !s around like pixie dust,
but I'd like to know what's going on.  My program is below, in all its
messy glory, but the whole ball of wax is at
https://github.com/JohnL4/PassphraseGenerator (mostFrequent.hs).  It takes
as input one of the 2012 Google ngram (specifically, 1-gram) raw data
files.  The expected input format is documented in several places.  (For
testing, I took the first million lines.)

The bang that worked is on the 3rd argument of 'wordCount', the map.  I
tried commenting out the "(Map.insertWith (+) ngram matchCount aMap)" part
(and just returning the input map).  When I did that, boy was space
consumption small, so that line is part of the problem.

So, what's going on without the bang?

I guess I'm just building up a bunch of thunks w/that Map.insertWith call,
but what kind of thunks? Unevaluated calls to splitOn and (!!) and read and
(+) and lines?

I thought a strict Map would avoid that.  I guess WHNF isn't enough?
Looking at https://wiki.haskell.org/Weak_head_normal_form, I don't *think* I
have any constructors in there, but profiling with -hd shows me that (:) is
the most frequently occurring closure. Where is *that* coming from?  Is the
occurrence of (:) so high because it's either a built-in function applied
to too few arguments or a lambda abstraction? Which one?  (Also, is "lambda
abstraction" the same as "lambda expresson"?)

And then... what does that bang on the map argument do?  Does it force
evaluation of the passed argument all the way down to primitives, so that
we truly get a data structure containing only strings and ints (and no
thunks)?  What does that do to time complexity?  Does it have to traverse
the entire map looking for thunks, even though I only added one at some
random location in the map?

Is there a better way?

I guess I need to force strictness somewhere else, but I'm not sure how.  I
tried using seq (a little half-heartedly), and ($!), but I guess I did it
wrong and only wound up with more thunks to seq and id ($!), right?  So, I
wound up with lazy strictness?

Thanks for any help.

John.

My code:

-- | Reads stdin and prints the n most-frequent words occurring after year
y to stdout, along with
-- | their counts.

{-# LANGUAGE BangPatterns #-}

import GHC.IO.Handle (hPutStr)
import GHC.IO.Handle.FD (stderr)
import Data.Char (toLower)
import Data.List (intersperse, sortBy)
import qualified Data.Map.Strict as Map
import Data.List.Split (splitOn)

{-
Test data:
let filetext = "aaa\t1900\t2\naaa\t1950\t3\nbbb\t1950\t5\nbbb_VERB\t1980\t9"
-}

-- | Number of most-frequent words we want.
n :: Int
n = 15000

-- | Earliest year we want to count while building frequency table
y :: Int
y = 1950

main :: IO()
main = do
{-
Read line, split on tabs, take first three fields, which are: 1-gram
(word), year, match-count.
If year >= y:
   Trim parts of speech (POS) from 1-gram (leading, trailing known
fragments delimited by "_")
      Known fragments: NOUN, VERB, ADJ, ADV, PRON (pronouns), DET
(determiners and articles), ADP
      (prepositions, postpositions), NUM, CONJ, PRT (particles), X
(miscellaneous)

   Split on "_", discard expected known fragments, complain if there are
more than one fragments left.  (Note that I
   have verified that each ngram has 0 or 1 trailing attributes, so the
check for an unexpected number of parts
   is unnecessary.)

   Find word in dictionary and add match-count to that entry.
At end of input, sort dictionary entries by match-counts (descending) and
take first n entries for output.
-}
  allInput <- getContents
  putStrLn (concat (intersperse "\n" (map fst (take n (sortBy
countDescending
                                                       (Map.toList
(wordCounts y (lines allInput) Map.empty)))))))
  hPutStr stderr "Done.\n"

-- | Orders inputs by 2nd element (count), descending
countDescending :: (String,Int) -> (String,Int) -> Ordering
countDescending (_, countA) (_, countB)
  | countA < countB  = GT
  | countA == countB = EQ
  | otherwise        = LT

-- | Returns map of all counts (summed) for words occurring on or after
given year.
wordCounts :: Int               -- ^ Year
           -> [String]          -- ^ Lines in form
"word\tyear\tcount\totherStuffWeDontCareAbout"
           -> Map.Map String Int -- ^ Input map
           -> Map.Map String Int -- ^ Output map
wordCounts _ [] aMap = aMap
wordCounts aYear (aLine:restLines) !aMap =
  let fields      = splitOn "\t" aLine
      ngramParts  = splitOn "_" (fields!!0)
      ngram       = map toLower (ngramParts!!0) -- "ngram" is the same as
"word", in this case.
      year        = read (fields!!1) :: Int
      matchCount  = read (fields!!2) :: Int
  in if (year < aYear)
     then (wordCounts aYear restLines aMap)
     else (wordCounts aYear restLines
           (Map.insertWith (+) ngram matchCount aMap)
           -- aMap
           )
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/beginners/attachments/20151027/d7a28da2/attachment.html>


More information about the Beginners mailing list