[Haskell-cafe] Compress and serialise data with lazy bytestrings,
zlib and Data.Binary (was: Allocating enormous amounts of memory)
Donald Bruce Stewart
dons at cse.unsw.edu.au
Sun Jul 8 22:47:02 EDT 2007
dons:
> Jefferson Heard wrote:
>
> > I'm using the Data.AltBinary package to read in a list of 4.8 million
> > floats and 1.6 million ints. Doing so caused the memory footprint to
> > blow up to more than 2gb, which on my laptop simply causes the program
> > to crash. I can do it on my workstation, but I'd really rather not,
> > because I want my program to be fairly portable.
> >
> > The file that I wrote out in packing the data structure was only 28MB,
> > so I assume I'm just using the wrong data structure, or I'm using full
> > laziness somewhere I shouldn't be.
>
> Here's a quick example of how to efficiently read and write such a structure to
> disk, compressing and decompressing on the fly.
>
> $ time ./A
> Wrote 4800000 floats, and 1600000 ints
> Read 4800000 floats, and 1600000 ints
> ./A 0.93s user 0.06s system 89% cpu 1.106 total
>
> It uses Data.Binary to provide quick serialisation, and the zlib library to
> compress the resulting stream. It builds the tables in memory, writes and
> compresses the result to disk, reads them back in, and checks we read the right
> number of CFloats and CInts. You'd then pass the CFloats over to your C library
> that needs them.
>
> Compressing with zlib is a flourish, but cheap and simple, so we may as well do
> it. With zlib and Data.Binary, the core code just becomes:
>
> encodeFile "/tmp/table.gz" table
> table' <- decodeFile "/tmp/table.gz"
>
> Which transparently streams the data through zlib, and onto the disk, and back.
>
> Simple and efficient.
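For readers without the original program to hand, here is a minimal sketch of the serialise-then-compress round trip described above. It assumes the binary and zlib packages are installed; the helper names writeCompressed and readCompressed are hypothetical, and the small list-based table is only a stand-in for the real data.

```haskell
module Main (main) where

import qualified Data.ByteString.Lazy as L
import Codec.Compression.GZip (compress, decompress)  -- from the zlib package
import Data.Binary (Binary, encode, decode)           -- from the binary package

-- Hypothetical helper: serialise a value, gzip the lazy bytestring,
-- and write it to disk.
writeCompressed :: Binary a => FilePath -> a -> IO ()
writeCompressed path = L.writeFile path . compress . encode

-- Hypothetical helper: read the file lazily, gunzip it, and deserialise.
readCompressed :: Binary a => FilePath -> IO a
readCompressed path = do
    bytes <- L.readFile path
    return (decode (decompress bytes))

main :: IO ()
main = do
    -- Small stand-in table of zeros; the thread uses 4800000 floats
    -- and 1600000 ints.
    let table = (replicate 1000 0, replicate 500 0) :: ([Float], [Int])
    writeCompressed "/tmp/table.gz" table
    (fs, is) <- readCompressed "/tmp/table.gz" :: IO ([Float], [Int])
    putStrLn ("Read " ++ show (length fs) ++ " floats, and "
              ++ show (length is) ++ " ints")
```

This is not the A.hs measured in this thread. In particular, plain Haskell lists carry far more per-element overhead than the CFloat and CInt tables mentioned above, so the timings and allocation figures quoted here should not be expected from it.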
Oh, and profiling this code:
$ ghc -prof -auto-all -O2 --make A.hs
$ ./A +RTS -p
Wrote 4800000 floats, and 1600000 ints
Read 4800000 floats, and 1600000 ints
$ cat A.prof
Mon Jul 9 12:44 2007 Time and Allocation Profiling Report (Final)
total time = 0.90 secs (18 ticks @ 50 ms)
total alloc = 26,087,140 bytes (excludes profiling overheads)
COST CENTRE                    MODULE               %time %alloc
main                           Main                 100.0  100.0
Looks fine. We'd expect at least 25,600,000 bytes, and a little overhead for the
runtime system. I note that the compressed file on disk is 26k too (yay for
gzip on zeros ;)
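The 25,600,000 figure can be checked with a small sketch like the following. It assumes, as on common platforms, that CFloat and CInt each occupy 4 bytes; the names numFloats, numInts and payloadBytes are made up for illustration.

```haskell
module Main (main) where

import Foreign.C.Types (CFloat, CInt)
import Foreign.Storable (sizeOf)

-- Element counts taken from the program output above.
numFloats, numInts :: Int
numFloats = 4800000
numInts   = 1600000

-- Bytes needed for the raw elements alone, with no per-element overhead.
-- sizeOf never inspects its argument, so undefined is safe here.
payloadBytes :: Int
payloadBytes = numFloats * sizeOf (undefined :: CFloat)
             + numInts   * sizeOf (undefined :: CInt)

main :: IO ()
main = print payloadBytes
```

With 4-byte elements that is 19,200,000 + 6,400,000 = 25,600,000 bytes, leaving roughly 487,000 bytes of the reported total alloc for everything else.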
-- Don