[Haskell-cafe] How on Earth Do You Reason about Space?

malcolm.wallace malcolm.wallace at me.com
Wed Jun 1 18:22:11 CEST 2011


Just out of interest, did you try reading the input as plain old Strings?  They may be unfashionable these days, and have their own type of badness in space and time performance, but might perhaps be a win over ByteStrings for extremely large datasets.
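
Something along these lines, perhaps (an untested sketch, not code from anyone's program; Strings copy their characters out of the input, so they cannot pin input chunks the way ByteString slices can):

> import Data.List (foldl')
> import qualified Data.Map as M
>
> -- Count word frequencies using plain Strings as keys.
> -- insertWith' keeps the counts strict, so thunks don't pile up.
> countWords :: String -> M.Map String Int
> countWords = foldl' (\m w -> M.insertWith' (+) w 1 m) M.empty . words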
Regards,
    Malcolm

On 01 Jun 2011, at 02:49 PM, Aleksandar Dimitrov <aleks.dimitrov at googlemail.com> wrote:

Hi John,
> I think the issue is data sharing, as Brandon mentioned. A bytestring
> consists of an offset, length, and a pointer. You're using a chunk size of
> 64k, which means the generated bytestrings all have a pointer to that 64k of
> data. Suppose there's one new word in that 64k, and it's near the beginning
> of the chunk. Even though the word may only be a few characters long, it'll
> reference the entire chunk and keep it from being GC'd.

This seems to be the issue indeed! If the bytestrings in the hash map are
holding references to the chunks, it is clear that we're going to consume memory
scaling with the size of the input file, as long as *any* of the chunks keep
contributing new words.
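
To make the sharing concrete, here is a toy example (the names and sizes are
invented for illustration, not taken from my program):

> import qualified Data.ByteString.Char8 as C
>
> demo :: IO ()
> demo = do
>   let chunk = C.replicate 65536 'x'  -- stands in for one 64k input chunk
>       word  = C.take 4 chunk         -- a 4-byte "word" from the chunk
>   -- word is (pointer into chunk's buffer, offset 0, length 4); while
>   -- word is reachable, none of the 65536 bytes can be collected.
>   print (C.length word)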

As I understand it, this was not a problem with the multiple copies of War and
Peace, because all byte strings were already present in the hash table! On
reading new input, old entries were found in the hash table, so only the old
chunks were kept in memory, and the new ones could be GC'd.

In *realistic* data, however, the Long Tail is the reason that, after a while,
some chunks of input are kept alive only because a few byte strings still
reference them. New words are rare, but a single new word per chunk is enough to
keep the whole chunk in memory, even though the chunk is mostly useless (most of
the other data in it is already in the hash map.)
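
To put rough numbers on it: with 64k chunks, a 100MB corpus is about 1600
chunks, so as little as one new word per chunk (a few kilobytes of genuinely new
text) would suffice to retain essentially the whole 100MB.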

> There are a few solutions to this. The first is to make a copy of the
> bytestring so only the required data is retained. In my experiments this
> wasn't helpful, but it would depend on your corpus. The second is to start
> with smaller chunks. Using a chunk size of 1024 worked fairly well for me.
> If your corpus is similar to natural language, I think it'll probably work
> better for you as well.
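
For reference, the first suggestion would amount to something like this (my
sketch of the idea, not code from the thread; S.copy allocates a fresh buffer of
exactly the word's length, so the key no longer references the chunk):

> import qualified Data.ByteString as S
> import qualified Data.HashMap.Strict as T
>
> -- Insert a word into the count map, copying it first so the hash map
> -- retains only the word's bytes, not the word's entire 64k chunk.
> insertWord :: T.HashMap S.ByteString Int -> S.ByteString
>            -> T.HashMap S.ByteString Int
> insertWord t s = T.insertWith (+) (S.copy s) 1 t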

I think I solved this problem elegantly: I used Data.Text as hash map keys,
instead of Data.ByteString. See the modified program below:

> import qualified Data.Iteratee as I
> import Data.Iteratee.Char
> import Data.Iteratee.IO
> 
> import qualified Data.HashMap.Strict as T
> 
> import Data.Ord (comparing)
> import Data.List (sortBy)
> import System.Environment (getArgs)
> 
> import qualified Data.ByteString as S
> import qualified Data.Text as X
> import Data.Text.Encoding (decodeUtf8)
> 
> type Wordcounts = T.HashMap X.Text Int
> 
> f' :: Monad m => I.Iteratee S.ByteString m Wordcounts
> f' = I.joinI $ (enumLinesBS I.><> I.filter (not . S.null)) $
>        I.foldl' (\t s -> T.insertWith (+) (convert s) 1 t) T.empty
>   where convert = decodeUtf8
> 
> main :: IO ()
> main = getArgs >>= fileDriverVBuf 65536 f' . head
>        >>= mapM_ (\(w,c) -> putStrLn $ X.unpack w ++ "\t" ++ show c) . sortM
>   where sortM = sortBy (comparing snd) . T.toList
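
Why this helps, as far as I can tell: decodeUtf8 decodes into a freshly
allocated Text array, so unlike a ByteString slice the key shares nothing with
the 64k input chunk. In isolation (a toy example of mine):

> import qualified Data.ByteString.Char8 as C
> import qualified Data.Text as X
> import Data.Text.Encoding (decodeUtf8)
>
> keyDemo :: IO ()
> keyDemo = do
>   let chunk = C.pack "the quick brown fox"
>       key   = decodeUtf8 (C.take 3 chunk)  -- fresh Text, own buffer
>   -- key does not point into chunk's storage, so chunk can be GC'd
>   -- even while key lives on in the hash map.
>   putStrLn (X.unpack key)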

Initial benchmarks on the realistic 100MB Gutenberg corpus I downloaded with my
script yesterday report the following: htop says 120M memory residency towards
the end of the life-cycle.

<<ghc: 40928992256 bytes, 77984 GCs, 20939886/46665600 avg/max bytes residency (455 samples), 133M in use, 0.00 INIT (0.00 elapsed), 26.01 MUT (26.20 elapsed), 32.96 GC (33.09 elapsed) :ghc>>

19MB avg, 44MB max residency, and 133M in use (which is similar to what htop
told me); the heap profile clearly shows allocation and deallocation of the
bytestrings: http://imgur.com/U1nyw (I'm attributing the ruggedness of the
profile, with its frequent little spikes, to the iteratee chunk allocation and
deallocation.) It seems I can't get rid of 50% of the heap being lag state, and
I don't quite understand yet what that is. I also don't know what INHERENT_USE is.

But in any case, I now have a program that I can reasonably expect to run,
provided the input file fits into memory. I might try to implement an analogous
program in C++ or Java, just to see whether it would do better, or only
similarly well, in terms of memory consumption.

> Note that Johan's Ngram code also only keeps the minimum required data,
> giving it a good memory profile. I didn't notice this last night because I
> was testing with different data, and unfortunately the peculiar distribution
> of that data masked this problem.

This is kind of the big problem here: whether or not you see the particular
behaviour I was so upset about seems to depend on the corpus's distributional
properties.

In any case, I would like to thank you all for helping me understand and address
the issue. I probably still have a long way to go in terms of understanding the
space behaviour of Haskell programs, but at least I no longer see ByteStrings as
a black box of efficiency: I now understand how they're structured, what they're
good at, and what they aren't good at.

(Take-home lesson: Data.Text is really nice. Also: if iteratee has a space leak,
I probably didn't hit it. But: when reading byte strings, one should keep in
mind that byte strings really are pointers into larger buffers.)

Regards,
Aleks