[Haskell-beginners] Space consumption and strictness in strict map

Tue Oct 27 13:57:40 UTC 2015

Not 100% on this, but I would say the map is strict in keys and values as
it says in the haddocks, but it is still an immutable structure.  If you
don't force the map while you are building it, you end up with unevaluated
thunks, just like you would with +.

insertWith + x y (insertWith + x y (insertWith + x y ...)))

On the bright side because it is strict, the addition you are doing on the
values of this map are being evaluated, so the + you are using is not also
leaving behind thunks like value + value + value.  I'd say the reason (:)
is so common is because the arguments to insertWith both come from a list
that is not being cleaned up until you force the map.

On Tue, Oct 27, 2015 at 9:38 AM, John Lusk <johnlusk4 at gmail.com> wrote:

> Hello, beginners at haskell.org.  I posted this a few days ago, but it looks
> like attachments get pulled out by Mailman, whether or not they are small
> text files ( :>( ), so I'm trying again, sans attachment.
> _________________________________________
>
> Hey, I have a problem I "solved" by sprinkling !s around like pixie dust,
> but I'd like to know what's going on.  My program is below, in all its
> messy glory, but the whole ball of wax is at
> https://github.com/JohnL4/PassphraseGenerator (mostFrequent.hs).  It
> takes as input one of the 2012 Google ngram (specifically, 1-gram) raw data
> files.  The expected input format is documented in several places.  (For
> testing, I took the first million lines.)
>
> The bang that worked is on the 3rd argument of 'wordCount', the map.  I
> tried commenting out the "(Map.insertWith (+) ngram matchCount aMap)"
> part (and just returning the input map).  When I did that, boy was space
> consumption small, so that line is part of the problem.
>
> So, what's going on without the bang?
>
> I guess I'm just building up a bunch of thunks w/that Map.insertWith
> call, but what kind of thunks? Unevaluated calls to splitOn and (!!) and
> read and (+) and lines?
>
> I thought a strict Map would avoid that.  I guess WHNF isn't enough?
> Looking at https://wiki.haskell.org/Weak_head_normal_form, I don't *think* I
> have any constructors in there, but profiling with -hd shows me that (:) is
> the most frequently occurring closure. Where is *that* coming from?  Is
> the occurrence of (:) so high because it's either a built-in function
> applied to too few arguments or a lambda abstraction? Which one?  (Also, is
> "lambda abstraction" the same as "lambda expresson"?)
>
> And then... what does that bang on the map argument do?  Does it force
> evaluation of the passed argument all the way down to primitives, so that
> we truly get a data structure containing only strings and ints (and no
> thunks)?  What does that do to time complexity?  Does it have to traverse
> the entire map looking for thunks, even though I only added one at some
> random location in the map?
>
> Is there a better way?
>
> I guess I need to force strictness somewhere else, but I'm not sure how.
> I tried using seq (a little half-heartedly), and ($!), but I guess I did
> it wrong and only wound up with more thunks to seq and id ($!), right?
> So, I wound up with lazy strictness?
>
> Thanks for any help.
>
> John.
>
> My code:
>
> -- | Reads stdin and prints the n most-frequent words occurring after year
> y to stdout, along with
> -- | their counts.
>
> {-# LANGUAGE BangPatterns #-}
>
> import GHC.IO.Handle (hPutStr)
> import GHC.IO.Handle.FD (stderr)
> import Data.Char (toLower)
> import Data.List (intersperse, sortBy)
> import qualified Data.Map.Strict as Map
> import Data.List.Split (splitOn)
>
> {-
> Test data:
> let filetext =
> "aaa\t1900\t2\naaa\t1950\t3\nbbb\t1950\t5\nbbb_VERB\t1980\t9"
> -}
>
> -- | Number of most-frequent words we want.
> n :: Int
> n = 15000
>
> -- | Earliest year we want to count while building frequency table
> y :: Int
> y = 1950
>
> main :: IO()
> main = do
> {-
> Read line, split on tabs, take first three fields, which are: 1-gram
> (word), year, match-count.
> If year >= y:
>    Trim parts of speech (POS) from 1-gram (leading, trailing known
> fragments delimited by "_")
>       Known fragments: NOUN, VERB, ADJ, ADV, PRON (pronouns), DET
> (determiners and articles), ADP
>       (prepositions, postpositions), NUM, CONJ, PRT (particles), X
> (miscellaneous)
>
>    Split on "_", discard expected known fragments, complain if there are
> more than one fragments left.  (Note that I
>    have verified that each ngram has 0 or 1 trailing attributes, so the
> check for an unexpected number of parts
>    is unnecessary.)
>
>    Find word in dictionary and add match-count to that entry.
> At end of input, sort dictionary entries by match-counts (descending) and
> take first n entries for output.
> -}
>   allInput <- getContents
>   putStrLn (concat (intersperse "\n" (map fst (take n (sortBy
> countDescending
>                                                        (Map.toList
> (wordCounts y (lines allInput) Map.empty)))))))
>   hPutStr stderr "Done.\n"
>
> -- | Orders inputs by 2nd element (count), descending
> countDescending :: (String,Int) -> (String,Int) -> Ordering
> countDescending (_, countA) (_, countB)
>   | countA < countB  = GT
>   | countA == countB = EQ
>   | otherwise        = LT
>
> -- | Returns map of all counts (summed) for words occurring on or after
> given year.
> wordCounts :: Int               -- ^ Year
>            -> [String]          -- ^ Lines in form
> "word\tyear\tcount\totherStuffWeDontCareAbout"
>            -> Map.Map String Int -- ^ Input map
>            -> Map.Map String Int -- ^ Output map
> wordCounts _ [] aMap = aMap
> wordCounts aYear (aLine:restLines) !aMap =
>   let fields      = splitOn "\t" aLine
>       ngramParts  = splitOn "_" (fields!!0)
>       ngram       = map toLower (ngramParts!!0) -- "ngram" is the same as
> "word", in this case.
>       year        = read (fields!!1) :: Int
>       matchCount  = read (fields!!2) :: Int
>   in if (year < aYear)
>      then (wordCounts aYear restLines aMap)
>      else (wordCounts aYear restLines
>            (Map.insertWith (+) ngram matchCount aMap)
>            -- aMap
>            )
>
>
> _______________________________________________
> Beginners mailing list
> Beginners at haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/beginners
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/beginners/attachments/20151027/5c889db0/attachment-0001.html>