<div dir="ltr"><div class="gmail_quote"><div>Hello, <a href="mailto:beginners@haskell.org">beginners@haskell.org</a>. I posted this a few days ago, but it looks like attachments get pulled out by Mailman, whether or not they are small text files ( :>( ), so I'm trying again, sans attachment.</div><div>_________________________________________</div><div dir="ltr"><br></div><div dir="ltr">Hey, I have a problem I "solved" by sprinkling !s around like pixie dust, but I'd like to know what's going on. My program is below, in all its messy glory, but the whole ball of wax is at <a href="https://github.com/JohnL4/PassphraseGenerator" target="_blank">https://github.com/JohnL4/PassphraseGenerator</a> (mostFrequent.hs). It takes as input one of the 2012 Google ngram (specifically, 1-gram) raw data files. The expected input format is documented in several places. (For testing, I took the first million lines.)<div><br></div><div>The bang that worked is on the 3rd argument of '<font face="monospace, monospace">wordCount</font>', the map. I tried commenting out the "<font face="monospace, monospace">(Map.insertWith (+) ngram matchCount aMap)</font>" part (and just returning the input map). When I did that, boy was space consumption small, so that line is part of the problem.</div><div><br></div><div>So, what's going on without the bang?</div><div><br></div><div>I guess I'm just building up a bunch of thunks w/that <font face="monospace, monospace">Map.insertWith</font> call, but what kind of thunks? Unevaluated calls to <font face="monospace, monospace">splitOn</font> and (!!) and <font face="monospace, monospace">read</font> and (+) and <font face="monospace, monospace">lines</font>?</div><div><br></div><div>I thought a strict Map would avoid that. I guess WHNF isn't enough? Looking at <a href="https://wiki.haskell.org/Weak_head_normal_form" target="_blank">https://wiki.haskell.org/Weak_head_normal_form</a>, I don't <i>think</i> I have any constructors in there, but profiling with -hd shows me that (:) is the most frequently occurring closure. Where is <i>that</i> coming from? Is the occurrence of (:) so high because it's either a built-in function applied to too few arguments or a lambda abstraction? Which one? (Also, is "lambda abstraction" the same as "lambda expresson"?)</div><div><br></div><div>And then... what does that bang on the map argument do? Does it force evaluation of the passed argument all the way down to primitives, so that we truly get a data structure containing only strings and ints (and no thunks)? What does that do to time complexity? Does it have to traverse the entire map looking for thunks, even though I only added one at some random location in the map?</div><div><br></div><div>Is there a better way?</div><div><br></div><div>I guess I need to force strictness somewhere else, but I'm not sure how. I tried using <font face="monospace, monospace">seq </font>(a little half-heartedly), and ($!), but I guess I did it wrong and only wound up with more thunks to <font face="monospace, monospace">seq</font> and <font face="monospace, monospace">id ($!)</font>, right? So, I wound up with lazy strictness?</div><div><br></div><div>Thanks for any help.</div><span class=""><font color="#888888"><div><br></div><div>John.</div><div><br></div><div>My code:</div></font></span></div>
</div><br><div><div><font face="monospace, monospace">-- | Reads stdin and prints the n most-frequent words occurring after year y to stdout, along with</font></div><div><font face="monospace, monospace">-- | their counts.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">{-# LANGUAGE BangPatterns #-}</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">import GHC.IO.Handle (hPutStr)</font></div><div><font face="monospace, monospace">import GHC.IO.Handle.FD (stderr)</font></div><div><font face="monospace, monospace">import Data.Char (toLower)</font></div><div><font face="monospace, monospace">import Data.List (intersperse, sortBy)</font></div><div><font face="monospace, monospace">import qualified Data.Map.Strict as Map</font></div><div><font face="monospace, monospace">import Data.List.Split (splitOn)</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">{-</font></div><div><font face="monospace, monospace">Test data:</font></div><div><font face="monospace, monospace">let filetext = "aaa\t1900\t2\naaa\t1950\t3\nbbb\t1950\t5\nbbb_VERB\t1980\t9"</font></div><div><font face="monospace, monospace">-}</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">-- | Number of most-frequent words we want.</font></div><div><font face="monospace, monospace">n :: Int</font></div><div><font face="monospace, monospace">n = 15000</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">-- | Earliest year we want to count while building frequency table</font></div><div><font face="monospace, monospace">y :: Int</font></div><div><font face="monospace, monospace">y = 1950</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">main :: IO()</font></div><div><font face="monospace, monospace">main = do</font></div><div><font face="monospace, monospace">{-</font></div><div><font face="monospace, monospace">Read line, split on tabs, take first three fields, which are: 1-gram (word), year, match-count.</font></div><div><font face="monospace, monospace">If year >= y:</font></div><div><font face="monospace, monospace"> Trim parts of speech (POS) from 1-gram (leading, trailing known fragments delimited by "_")</font></div><div><font face="monospace, monospace"> Known fragments: NOUN, VERB, ADJ, ADV, PRON (pronouns), DET (determiners and articles), ADP</font></div><div><font face="monospace, monospace"> (prepositions, postpositions), NUM, CONJ, PRT (particles), X (miscellaneous)</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace"> Split on "_", discard expected known fragments, complain if there are more than one fragments left. (Note that I</font></div><div><font face="monospace, monospace"> have verified that each ngram has 0 or 1 trailing attributes, so the check for an unexpected number of parts</font></div><div><font face="monospace, monospace"> is unnecessary.)</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace"> Find word in dictionary and add match-count to that entry.</font></div><div><font face="monospace, monospace">At end of input, sort dictionary entries by match-counts (descending) and take first n entries for output.</font></div><div><font face="monospace, monospace">-}</font></div><div><font face="monospace, monospace"> allInput <- getContents</font></div><div><font face="monospace, monospace"> putStrLn (concat (intersperse "\n" (map fst (take n (sortBy countDescending</font></div><div><font face="monospace, monospace"> (Map.toList (wordCounts y (lines allInput) Map.empty)))))))</font></div><div><font face="monospace, monospace"> hPutStr stderr "Done.\n"</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">-- | Orders inputs by 2nd element (count), descending</font></div><div><font face="monospace, monospace">countDescending :: (String,Int) -> (String,Int) -> Ordering</font></div><div><font face="monospace, monospace">countDescending (_, countA) (_, countB)</font></div><div><font face="monospace, monospace"> | countA < countB = GT</font></div><div><font face="monospace, monospace"> | countA == countB = EQ</font></div><div><font face="monospace, monospace"> | otherwise = LT </font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">-- | Returns map of all counts (summed) for words occurring on or after given year.</font></div><div><font face="monospace, monospace">wordCounts :: Int -- ^ Year</font></div><div><font face="monospace, monospace"> -> [String] -- ^ Lines in form "word\tyear\tcount\totherStuffWeDontCareAbout"</font></div><div><font face="monospace, monospace"> -> Map.Map String Int -- ^ Input map</font></div><div><font face="monospace, monospace"> -> Map.Map String Int -- ^ Output map</font></div><div><font face="monospace, monospace">wordCounts _ [] aMap = aMap</font></div><div><font face="monospace, monospace">wordCounts aYear (aLine:restLines) !aMap =</font></div><div><font face="monospace, monospace"> let fields = splitOn "\t" aLine</font></div><div><font face="monospace, monospace"> ngramParts = splitOn "_" (fields!!0)</font></div><div><font face="monospace, monospace"> ngram = map toLower (ngramParts!!0) -- "ngram" is the same as "word", in this case.</font></div><div><font face="monospace, monospace"> year = read (fields!!1) :: Int</font></div><div><font face="monospace, monospace"> matchCount = read (fields!!2) :: Int</font></div><div><font face="monospace, monospace"> in if (year < aYear)</font></div><div><font face="monospace, monospace"> then (wordCounts aYear restLines aMap)</font></div><div><font face="monospace, monospace"> else (wordCounts aYear restLines</font></div><div><font face="monospace, monospace"> (Map.insertWith (+) ngram matchCount aMap)</font></div><div><font face="monospace, monospace"> -- aMap</font></div><div><font face="monospace, monospace"> )</font></div></div><div><br></div></div>