[Haskell-cafe] High memory usage with 1.4 Million records?

Andrew Myers asm198 at gmail.com
Fri Jun 8 22:50:59 CEST 2012


Thanks for the responses everyone, I'll try them out and see what happens :)
Andrew

On Fri, Jun 8, 2012 at 4:40 PM, Johan Tibell <johan.tibell at gmail.com> wrote:

> Hi Andrew,
>
> On Thu, Jun 7, 2012 at 5:39 PM, Andrew Myers <asm198 at gmail.com> wrote:
> > Hi Cafe,
> > I'm working on inspecting some data that I'm trying to represent as
> > records in Haskell and seeing about twice the memory footprint I was
> > expecting.  I've got roughly 1.4 million records in a CSV file (400M
> > on disk) that I parse in using bytestring-csv.  bytestring-csv
> > returns a [[ByteString]] (wrapped in `type`s) which I then convert
> > into a list of records that have the following structure:
> >
> >> 3  Int
> >> 1 Text Length 3
> >> 1 Text Length 11
> >> 12 Float
> >> 1 UTCTime
> >
> > All fields are marked strict and have {-# UNPACK #-} pragmas (I'm
> > guessing that doesn't do anything for non-primitives).  (Side note:
> > is there a way to check if things are actually being unpacked?)
>
> GHC used to warn when you used UNPACK on something that can't be
> unpacked, but that warning seems to have been (accidentally) removed
> in 7.4.1.
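>
> As for checking: as far as I know you can compile with -ddump-simpl
> and look at the constructor's fields in the Core output, or run
> ghc --show-iface on the generated .hi file; unpacked fields show up
> with their {-# UNPACK #-} annotations in the data declaration.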
>
> The rule for unpacking is:
>
> * all product types (i.e. types with only one constructor) can be
> unpacked. This includes Int, Char, Double, etc., and tuples or records
> thereof.
> * sum types (i.e. data types with more than one constructor) and
> polymorphic fields can't be unpacked.
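>
> For example (a rough, untested sketch; the type names are invented):
>
>     data T = T {-# UNPACK #-} !Int {-# UNPACK #-} !Double
>     -- Both fields unpack: Int and Double are single-constructor types.
>
>     data P = P {-# UNPACK #-} !T
>     -- Also unpacks: T is itself a product type.
>
>     data S = S {-# UNPACK #-} !(Maybe Int)
>     -- Does not unpack: Maybe is a sum type, so the pragma is ignored.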
>
> > My back-of-the-napkin memory estimates, based on the assumption
> > that nothing is being unpacked (and my very spotty understanding of
> > Haskell data structures):
> >
> > Platform: 64 Bit Linux
> > #  Type (Sizeof type (occasionally a guess))
> >
> > 3 * Int (8)
> > 14 * Char (4) -- Text is some kind of bytestring, but I'm guessing
> > it can't be worse than the same number of Char?
> > 12 * Float (4)
> > 18 * sizeOf (ptr) (8)
> > UTC:  -- From what I can gather through :info in ghci
> > 4 * (ptr) (8)
> > 2 * Integer (16) -- Shouldn't be overly large, times are within 2012
>
> All fields in a constructor are word aligned. This means that all
> primitive types take 8 bytes on a 64-bit platform, including Char and
> Float. You might find the following blog posts by me useful in
> computing the size of data structures:
>
>
> http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html
> http://blog.johantibell.com/2011/06/computing-size-of-hashmap.html
> http://blog.johantibell.com/2011/11/slides-from-my-guest-lecture-at.html
>
> Here's some more on the topic:
>
>
> http://stackoverflow.com/questions/3254758/memory-footprint-of-haskell-data-types
>
> http://stackoverflow.com/questions/6574444/how-to-find-out-ghcs-memory-representations-of-data-types
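>
> Plugging your record into those rules, a rough back-of-envelope
> (assuming nothing unpacks, and that the Integers inside UTCTime stay
> small):
>
>     constructor:     1 header + 18 pointers = 19 words = 152 bytes
>     3 boxed Ints:    3 * 2 words            =  48 bytes
>     12 boxed Floats: 12 * 2 words           = 192 bytes
>     2 Texts:         ~6 words each + bytes  = ~120 bytes
>     1 UTCTime:       ~7 words               = ~56 bytes
>     list cons cell:  3 words                =  24 bytes
>
> That's on the order of 600 bytes per record, or roughly 850 MB for
> 1.4 million records, before GC headroom (the copying collector can
> need 2-3x that) and anything retained from the parse.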
>
> > I've written a small driver test program that just parses the CSV,
> > finds the minimum value for a couple of the Float fields, and exits.
> > In the process monitor, the memory usage is 6.9G before the program
> > exits.  I've tried profiling with +RTS -hc, but it ran for >3 hours
> > without finishing; it normally finishes within 4 minutes.  Anyone
> > have any ideas for me?  Things to try?
> > Thanks,
> > Andrew
>
> You could try to use a 32-bit GHC, which would use about half the
> memory. You're at the limit of the size of data that you can
> comfortably fit in memory on a normal desktop machine, so it might be
> time to consider a streaming approach.
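>
> A minimal sketch of the streaming idea (untested; it does a naive
> comma split with no quoted-field handling, and the column index is
> made up):
>
>     import qualified Data.ByteString.Lazy.Char8 as L
>     import Data.List (foldl')
>
>     -- Minimum of the Float in the given column. Lazy ByteString
>     -- reads the file in chunks, and foldl' keeps only the running
>     -- minimum live instead of all 1.4M records.
>     minField :: Int -> FilePath -> IO Float
>     minField col path = do
>       contents <- L.readFile path
>       let field ln = read (L.unpack (L.split ',' ln !! col))
>       return $! foldl' (\acc ln -> min acc (field ln)) (1/0)
>                        (L.lines contents)
>
>     main :: IO ()
>     main = minField 5 "data.csv" >>= print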
>
> -- Johan
>