[Haskell-cafe] Compress and serialise data with lazy bytestrings, zlib and Data.Binary (was: Allocating enormous amounts of memory)

Donald Bruce Stewart dons at cse.unsw.edu.au
Sun Jul 8 22:47:02 EDT 2007


dons:
> Jefferson Heard wrote:
> 
> > I'm using the Data.AltBinary package to read in a list of 4.8 million
> > floats and 1.6 million ints.  Doing so caused the memory footprint to
> > blow up to more than 2gb, which on my laptop simply causes the program
> > to crash.  I can do it on my workstation, but I'd really rather not,
> > because I want my program to be fairly portable.  
> > 
> > The file that I wrote out in packing the data structure was only 28MB,
> > so I assume I'm just using the wrong data structure, or I'm using full
> > laziness somewhere I shouldn't be.
> 
> Here's a quick example of how to efficiently read and write such a structure
> to disk, compressing and decompressing on the fly.
> 
>     $ time ./A
>     Wrote 4800000 floats, and 1600000 ints
>     Read  4800000 floats, and 1600000 ints
>     ./A  0.93s user 0.06s system 89% cpu 1.106 total
> 
> It uses Data.Binary to provide quick serialisation, and the zlib library to
> compress the resulting stream. It builds the tables in memory, writes and
> compresses the result to disk, reads them back in, and checks that we read
> the right number of CFloats and CInts. You'd then pass the CFloats over to
> the C library that needs them.
> 
> Compressing with zlib is a flourish, but cheap and simple, so we may as well do
> it. With zlib and Data.Binary, the core code just becomes:
> 
>         encodeFile "/tmp/table.gz" table
>         table' <- decodeFile "/tmp/table.gz"
> 
> Which transparently streams the data through zlib, and onto the disk, and back.
> 
> Simple and efficient.
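
For readers who don't have the A.hs source to hand, here is a minimal sketch
of the same pipeline. It is not the program timed above or profiled below,
just an illustration under stated assumptions: the table is a plain pair of
lists rather than whatever structure A.hs uses, the sizes are scaled down by
100, and the names Table, writeTable, readTable and the path
/tmp/table-sketch.gz are made up for the example. Compression is applied
explicitly around encode/decode here, rather than being hidden behind
encodeFile/decodeFile. It needs the binary, bytestring and zlib packages.

```haskell
module Main where

import qualified Codec.Compression.GZip as GZip -- from the zlib package
import           Data.Binary (decode, encode)   -- from the binary package
import qualified Data.ByteString.Lazy as L
import           Data.Int (Int32)

-- Hypothetical stand-in for the table in the post: two plain lists.
type Table = ([Float], [Int32])

-- Serialise to a lazy ByteString, gzip it, and write it out.
-- Each stage produces its output lazily, chunk by chunk.
writeTable :: FilePath -> Table -> IO ()
writeTable path = L.writeFile path . GZip.compress . encode

-- Read the file lazily, gunzip it, and deserialise.
readTable :: FilePath -> IO Table
readTable path = decode . GZip.decompress <$> L.readFile path

main :: IO ()
main = do
  let path  = "/tmp/table-sketch.gz"  -- assumed scratch location
      table = (replicate 48000 0, replicate 16000 0) :: Table
  writeTable path table
  (fs, is) <- readTable path
  putStrLn ("Read " ++ show (length fs) ++ " floats, and "
                    ++ show (length is) ++ " ints")
```

Lists are used only to keep the sketch short. At the full 4.8 million
elements a boxed list costs far more memory than the raw data, so an unboxed
array would be the more sensible choice there.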

Oh, and profiling that program (A.hs):

    $ ghc -prof -auto-all -O2 --make A.hs

    $ ./A +RTS -p                        
    Wrote 4800000 floats, and 1600000 ints
    Read 4800000 floats, and 1600000 ints

    $ cat A.prof 
        Mon Jul  9 12:44 2007 Time and Allocation Profiling Report  (Final)

        total time  =        0.90 secs   (18 ticks @ 50 ms)
        total alloc =  26,087,140 bytes  (excludes profiling overheads)

    COST CENTRE                    MODULE               %time %alloc
    main                           Main                 100.0  100.0

Looks fine. We'd expect at least 25,600,000 bytes, and a little overhead for the 
runtime system. I note that the compressed file on disk is 26k too (yay for
gzip on zeros ;)
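
The 25,600,000 figure is just the raw payload: 4,800,000 floats plus
1,600,000 ints at 4 bytes each. A small sketch of that arithmetic, assuming
CFloat and CInt are both 4 bytes wide (true on common platforms, but not
guaranteed by C); payloadBytes is a name invented for the example:

```haskell
module Main where

import Foreign.C.Types (CFloat, CInt)
import Foreign.Storable (sizeOf)

-- Raw size in bytes of nFloats CFloats followed by nInts CInts,
-- using the platform's own element sizes.
payloadBytes :: Int -> Int -> Int
payloadBytes nFloats nInts =
    nFloats * sizeOf (undefined :: CFloat)
  + nInts   * sizeOf (undefined :: CInt)

main :: IO ()
main = print (payloadBytes 4800000 1600000)
```

With 4-byte elements that gives 25,600,000, which leaves 487,140 bytes of
the reported 26,087,140 as overhead.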

-- Don


