[Haskell-cafe] High memory usage with 1.4 Million records?
Johan Tibell
johan.tibell at gmail.com
Fri Jun 8 22:40:12 CEST 2012
Hi Andrew,
On Thu, Jun 7, 2012 at 5:39 PM, Andrew Myers <asm198 at gmail.com> wrote:
> Hi Cafe,
> I'm working on inspecting some data that I'm trying to represent as records
> in Haskell and seeing about twice the memory footprint I was
> expecting. I've got roughly 1.4 million records in a CSV file (400M on
> disk) that I parse in using bytestring-csv. bytestring-csv returns a
> [[ByteString]] (wrapped in `type`s) which I then convert into a list of
> records that have the following structure:
>
>> 3 Int
>> 1 Text Length 3
>> 1 Text Length 11
>> 12 Float
>> 1 UTCTime
>
> All fields are marked strict and have {-# UNPACK #-} pragmas (I'm guessing
> that doesn't do anything for non-primitives). (Side note: is there a way to
> check if things are actually being unpacked?)
GHC used to warn when you used UNPACK on something that can't be
unpacked, but that warning seems to have been (accidentally) removed
in 7.4.1.
The rules for unpacking are:
* All product types (i.e. types with only one constructor) can be
unpacked. This includes Int, Char, Double, etc. and tuples or records
thereof.
* Sum types (i.e. data types with more than one constructor) and
polymorphic fields can't be unpacked.
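For example, a quick sketch (the field names are made up, not Andrew's
actual record):

    data Rec a = Rec
        { rInt   :: {-# UNPACK #-} !Int     -- product type: gets unpacked
        , rFloat :: {-# UNPACK #-} !Float   -- product type: gets unpacked
        , rMaybe :: !(Maybe Int)            -- sum type: stays a pointer,
                                            --   an UNPACK pragma would be ignored
        , rPoly  :: !a                      -- polymorphic: stays a pointer
        }

As for checking: I believe compiling with -ddump-simpl and looking at
the constructor in the Core output shows unpacked fields as unboxed
types (Int#, Float#); I don't know of a way to see it from :info in
ghci.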
> My back-of-the-napkin memory estimates, based on the assumption that nothing
> is being unpacked (and my very spotty understanding of Haskell data
> structures):
>
> Platform: 64 Bit Linux
> # Type (Sizeof type (occasionally a guess))
>
> 3 * Int (8)
> 14 * Char (4) -- Text is some kind of bytestring but I'm guessing it can't
> be worse than the same number of Char?
> 12 * Float (4)
> 18 * sizeOf (ptr) (8)
> UTC: -- From what I can gather through :info in ghci
> 4 * (ptr) (8)
> 2 * Integer (16) -- Shouldn't be overly large, times are within 2012
All fields in a constructor are word aligned. This means that all
primitive types take 8 bytes on a 64-bit platform, including Char and
Float. You might find the following blog posts by me useful in
computing the size of data structures:
http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html
http://blog.johantibell.com/2011/06/computing-size-of-hashmap.html
http://blog.johantibell.com/2011/11/slides-from-my-guest-lecture-at.html
Here's some more on the topic:
http://stackoverflow.com/questions/3254758/memory-footprint-of-haskell-data-types
http://stackoverflow.com/questions/6574444/how-to-find-out-ghcs-memory-representations-of-data-types
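To make that concrete, here's the rough per-record arithmetic under
those rules (1 word = 8 bytes on 64-bit; the Text and UTCTime costs are
my guesses based on the posts above, not measurements):

    -- Rough per-record estimate, assuming the Int and Float fields are
    -- unpacked and the Text/UTCTime fields stay as pointers to
    -- separately allocated objects. All numbers are in words.
    wordsPerRecord :: Int
    wordsPerRecord = header + unpackedFields + pointerFields + pointees
      where
        header         = 1                  -- constructor header
        unpackedFields = 3 + 12             -- 3 Int + 12 Float, 1 word each
        pointerFields  = 3                  -- 2 Text + 1 UTCTime pointers
        pointees       = textCost 3 + textCost 11 + utcTimeCost
        textCost n     = 6 + (2 * n + 7) `div` 8  -- ~6 words overhead + 2 bytes/char
        utcTimeCost    = 12                 -- several constructors + 2 small Integers

That works out to roughly 47 words (~375 bytes) per record, so on the
order of 500 MB for 1.4 million records, before counting the list spine
(3 more words per cons cell), any thunks, or parser output that is
still being retained.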
> I've written a small driver test program that just parses the CSV, finds the
> minimum value for a couple of the Float fields, and exits. In the process
> monitor, the memory usage is 6.9G before the program exits. I've tried
> profiling with +RTS -hc, but it ran for >3 hours without finishing; the run
> normally finishes within 4 minutes. Anyone have any ideas for me? Things
> to try?
> Thanks,
> Andrew
You could try using a 32-bit GHC, which would use about half the
memory. That said, you're at the limit of the amount of data you can
comfortably fit in memory on a normal desktop machine, so it might be
time to consider a streaming approach (sketched below).
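For example, if all you need is the minimum of a couple of Float
columns, something along these lines avoids ever building the full
record list (a minimal sketch: the column index 4 is a placeholder, and
it assumes no field contains embedded commas, so a real CSV parser
would still be needed in general):

    import qualified Data.ByteString.Lazy.Char8 as BL
    import Data.List (foldl')

    -- Lazily stream the file and fold a strict minimum over one column.
    minOfColumn :: Int -> FilePath -> IO Float
    minOfColumn col path = do
        contents <- BL.readFile path
        let field line = read (BL.unpack (BL.split ',' line !! col)) :: Float
        -- Start from +infinity so the fold is a single strict pass; the
        -- lazily read chunks can then be collected as they are consumed.
        return $! foldl' min (1 / 0) (map field (drop 1 (BL.lines contents)))

    -- e.g. minOfColumn 4 "records.csv"

A streaming library (conduit, enumerator, etc.) would give you the same
constant-space behaviour with better error handling.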
-- Johan