[Haskell-cafe] Compress and serialise data with lazy bytestrings,
zlib and Data.Binary (was: Allocating enormous amounts of memory)
Donald Bruce Stewart
dons at cse.unsw.edu.au
Sun Jul 8 22:38:57 EDT 2007
Jefferson Heard write:
> I'm using the Data.AltBinary package to read in a list of 4.8 million
> floats and 1.6 million ints. Doing so caused the memory footprint to
> blow up to more than 2gb, which on my laptop simply causes the program
> to crash. I can do it on my workstation, but I'd really rather not,
> because I want my program to be fairly portable.
>
> The file that I wrote out in packing the data structure was only 28MB,
> so I assume I'm just using the wrong data structure, or I'm using full
> laziness somewhere I shouldn't be.
Here's a quick example of how to efficient read and write such a structure to
disk, compressing and decompressing on the fly.
$ time ./A
Wrote 4800000 floats, and 1600000 ints
Read 4800000 floats, and 1600000 ints
./A 0.93s user 0.06s system 89% cpu 1.106 total
It uses Data.Binary to provide quick serialisation, and the zlib library to
compress the resulting stream. It builds the tables in memory, writes and
compresses the result to disk, reads them back in, and checks we read the right
amount of CFloats and CInts. You'd then pass the CFloats over to your C library
that needs them.
Compressing with zlib is a flourish, but cheap and simple, so we may as well do
it. With zlib and Data.Binary, the core code just becomes:
encodeFile "/tmp/table.gz" table
table' <- decodeFile "/tmp/table.gz"
Which transparently streams the data through zlib, and onto the disk, and back.
Simple and efficient.
-- Don
Here's the code:
-- some imports
import Text.Printf
import Data.Binary
import Codec.Compression.Zlib
import Control.Monad
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Base as B
import Foreign
import Foreign.C.Types
-- A simple table type for the arrays, that will be easy to manipulate
data Table = Table { floats :: L.ByteString
, ints :: L.ByteString }
-- Serialise this data in gzip-compressed form: a gzipping Binary instance.
instance Binary Table where
put t = do put (compress (floats t))
put (compress (ints t))
get = liftM2 Table (decompress `liftM` get)
(decompress `liftM` get)
-- Default sizes
floatSize = 4800000
intSize = 1600000
-- Build a new empty table
newTable :: IO Table
newTable = do
f <- mallocArray floatSize :: IO (Ptr CFloat)
i <- mallocArray intSize :: IO (Ptr CInt)
-- fill them with data here --
-- convert to bytestrings.
bf <- B.packCStringFinalizer (castPtr f)
(floatSize * sizeOf (undefined :: CFloat)) (return ())
bi <- B.packCStringFinalizer (castPtr i)
(intSize * sizeOf (undefined :: CInt)) (return ())
return $ Table (L.fromChunks [bf]) (L.fromChunks [bi])
-- Now just build the table, serialise it, read it back in, and print the result
main = do
table <- newTable
-- write the data to disk, compressed with gzip as we go.
encodeFile "/tmp/table.gz" table
draw "Wrote" table
-- load it back in, decompressing on the fly
table' <- decodeFile "/tmp/table.gz"
draw "Read" table'
-- now use 'floats' as a Ptr to pass back to C.
where
draw s v = printf "%s %d floats, and %d ints\n" s
(fromIntegral (lengthFloats v) `div` sizeOf (undefined :: CFloat))
(fromIntegral (lengthInts v ) `div` sizeOf (undefined :: CInt))
lengthFloats = L.length . floats
lengthInts = L.length . ints
More information about the Haskell-Cafe
mailing list