Solved: [Was: Re: [Haskell-cafe] Program with ByteStrings leads to memory exhaust]

Don Stewart dons at galois.com
Mon Sep 14 12:00:31 EDT 2009


sargrigory:
> I have a simple program that first generates a large (~ 500 mb) file
> of random numbers and then reads the numbers back to find their sum.
> It uses Data.Binary and Data.ByteString.Lazy.
> 
> The problem is when the program tries to read the data back it quickly
> (really quickly) consumes all memory.
> 
> The source: http://moonpatio.com/fastcgi/hpaste.fcgi/view?id=3607#a3607

I have tweaked this program in a few ways for you.

The big mistake (and the reason it runs out of space) is that you take
ByteString.Lazy.length to compute the block size. That forces the entire
file into memory, so you get none of the benefits of lazy IO.
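
For example, instead of taking BL.length of the lazy ByteString, ask the
filesystem for the size. A minimal sketch (the 'fileSize' helper is only
illustrative; main below does the same thing inline):

    import System.IO

    -- Constant space: the file is never read, only its size is queried.
    fileSize :: FilePath -> IO Integer
    fileSize p = withFile p ReadMode hFileSize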

As a separate matter, calling 'appendFile . encode' once per element will
be very slow. It is much faster to encode the entire list in one go.
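
Schematically, the difference looks like this (a sketch only; the function
names are illustrative, not from the original program):

    import Data.Binary
    import Data.Binary.Put
    import qualified Data.ByteString.Lazy as BL

    -- Slow: one encode/append round trip per element.
    writeIncrementally :: FilePath -> [Double] -> IO ()
    writeIncrementally p = mapM_ (BL.appendFile p . encode)

    -- Fast: build one lazy ByteString for the whole list and write it once.
    writeAtOnce :: FilePath -> [Double] -> IO ()
    writeAtOnce p xs = BL.writeFile p (runPut (mapM_ put xs))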

Finally, System.Random.Mersenne is significantly faster at generating
Doubles than System.Random.
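
The swap is small; the Mersenne interface used below looks roughly like
this (the helper name is illustrative; the fixed seed matches fillFile
below):

    import System.Random.Mersenne

    -- A list of k random Doubles from the Mersenne Twister.
    mersenneDoubles :: Int -> IO [Double]
    mersenneDoubles k = do
        g <- newMTGen (Just 42)
        take k `fmap` (randoms g :: IO [Double])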

With these changes (below), your program runs in constant space (both
writing out and reading back the 0.5GB file), and it is much faster:

    {-# LANGUAGE BangPatterns #-}

    import Data.Binary.Put
    import Data.Binary
    import System.IO
    import Data.Int
    import qualified Data.ByteString.Lazy as BL
    import System.Random.Mersenne

    path = "Results.data"
    n    = 20*1024*1024 :: Int

    -- getBlockSize :: BL.ByteString -> Int64
    -- getBlockSize bs = round $ (fromIntegral $ BL.length bs) / (fromIntegral n)
    --                                     --   ^^^^^ why do you take the length!?
    --                                     -- there's no point doing lazy IO then.

    -- Custom serialization (no length prefix)
    fillFile :: Int -> IO ()
    fillFile n = do
        g  <- newMTGen (Just 42)            -- fixed seed
        rs <- randoms g :: IO [Double]      -- lazy list of random Doubles
        BL.writeFile path $ runPut $ mapM_ put (take n rs)

    -- fillFile :: MTGen -> Int -> IO ()
    -- fillFile _ 0 = return ()
    -- fillFile g i = do
    --     x <- random g :: IO Double
    --     encodeFileAp path x
    --     fillFile g (i-1)

    -- Walk the lazy ByteString one blockSize-sized chunk at a time,
    -- decoding each chunk as a Double and accumulating the running sum.
    processFile :: BL.ByteString -> Int64 -> Int -> Double -> Double
    processFile !bs !blockSize 0 !sum = sum
    processFile bs blockSize i sum = processFile y blockSize (i-1) (sum + decode x)
      where
        (x,y) = BL.splitAt blockSize bs

    main = do
        fillFile n

        -- compute the size without loading the file into memory
        h  <- openFile path ReadMode
        sz <- hFileSize h
        hClose h

        results <- BL.readFile path
        let blockSize = round $ fromIntegral sz / fromIntegral n
        print $ processFile results blockSize n 0
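
To build it, something along the lines of (assuming the source is saved as
A.hs):

    $ ghc -O2 --make A.hs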

------------------------------------------------------------------------

Running this :

    $ ./A +RTS -sstderr
    1.0483476019172292e7

     226,256,100,448 bytes allocated in the heap
         220,413,096 bytes copied during GC
              65,416 bytes maximum residency (1186 sample(s))
             136,376 bytes maximum slop
                   2 MB total memory in use (0 MB lost due to fragmentation)
            ^^^^^^^^^^^^^^^^^
            It now runs in constant space.

      Generation 0: 428701 collections,     0 parallel,  3.17s,  3.49s elapsed
      Generation 1:  1186 collections,     0 parallel,  0.13s,  0.16s elapsed

      INIT  time    0.00s  (  0.00s elapsed)
      MUT   time  118.26s  (129.19s elapsed)
      GC    time    3.30s  (  3.64s elapsed)
      EXIT  time    0.00s  (  0.00s elapsed)
      Total time  121.57s  (132.83s elapsed)

      %GC time       2.7%  (2.7% elapsed)
            ^^^^^^^^^^^^^^^^
            Does very little GC.

      Alloc rate    1,913,172,101 bytes per MUT second

      Productivity  97.3% of total user, 89.0% of total elapsed

-- Don

