[Haskell-cafe] Allocating enormous amounts of memory and wondering why

Tue Jul 10 22:25:13 EDT 2007

jeff:
> I switched to Data.Binary, which dropped me from 2.6GB to 1.5GB, and
> then I switched this afternoon to unboxed arrays from lists of floats,
> and that dropped me again from 1.5GB to 475MB.  I think, all told, that
> I'm in an acceptable range now, and thank you for pointing out the
> library mistake.  I'm also down from 1.5 minutes load time to under 10
> seconds of load time, which is also very very nice.  Incidentally, the
> code I'm now using is:

Good!

> 
> binaryLoadDocumentCoordinates :: 
>   String -> IO (Ptr Float, Array.UArray Int Int)
> binaryLoadDocumentCoordinates path = do
>   putStrLn "File opened"
>   coordinates <- decodeFile (path ++ "/Clusters.bin") :: IO
> (Array.UArray Int Float)
>   print . Array.bounds $ coordinates
>   putStrLn "Got coordinates"
>   galaxies <- decodeFile (path ++ "/Galaxies.bin") :: IO (Array.UArray
> Int Int)
>   putStrLn "Got galaxies"
>   coordinatesArr <- mallocArray . snd . Array.bounds $ coordinates
>   putStrLn "Allocated array"
>   pokeArray coordinatesArr . Array.elems $ coordinates
>   return (coordinatesArr, galaxies)
> 
> binarySaveDocumentCoordinates :: String -> [Point] -> IO ()
> binarySaveDocumentCoordinates path points = do
>   let len = length points
>   encodeFile (path ++ "Clusters.bin") . (Array.listArray (0,len*3) :: 
>     [Float] -> Array.UArray Int Float) . coordinateList . solve $ points
>   encodeFile (path ++ "Galaxies.bin") . (Array.listArray (0,len) :: 
>     [Int] -> Array.UArray Int Int) . galaxyList $ points

You could improve this further by removing the intermediate list
serialisation and construction required by the UArray instance. 

My previous example did that,  using Ptr Int arrays instead of UArray,
which can then be serialised by casting to a bytestring, and writing
those bytes directly, rather than serialising via lists, as UArrays do.

Removing the UArray serialisation means no allocation overhead at all to
serialise the Int array.  That final step would reduce the memory
overhead to exactly the size of the input file, and probably shave at
least 50% off the time, if you need to further improve it.

That is, you'd have:

      binaryLoadDocumentCoordinates :: 
        String -> IO (Ptr Float, IntTable)

where IntTable is a shallow wrapper over Ptr Int, serialised with
the techniques used here,
    http://haskell.org/haskellwiki/Serialisation_and_compression_with_Data_Binary

    newtype IntTable = IntTable (Ptr Int)

with a suitable Binary instance (possibly compressing it on the fly
too).

The general lesson is to avoid lists of any kind (including those
constructed implicitly when serialising Arrays), as soon as you have
more than 1M or so of data stored in those lists.

-- Don