[Haskell-cafe] API for reading a big binary file

Stefan O'Rear stefanor at cox.net
Thu Dec 21 19:07:03 EST 2006


On Thu, Dec 21, 2006 at 01:47:48PM -0800, Ranjan Bagchi wrote:
> I've got a big [around a gigabyte] binary file, filled with identical  
> binary structures (imagine a C process writing structs).  I'd like to  
> process/analyze them efficiently.  In C or even Java, i'd memory map  
> the file and extract the data I need.

Are you sure you want a memory map?  (Disclaimer: I am only familiar with
the Linux VM.)

* IO is usually (drum roll) IO bound.  CPU performance isn't a big deal.
* By using memory mapping, you limit yourself to the largest contiguous
  chunk of free address space, which is at most 3GB on 32-bit Linux.
* Memory mapping doesn't work on pipes.  With today's CPU and disk speeds,
  piping a compressed file through zcat is often faster than reading the
  uncompressed file from disk.
* Haskell is not (yet!) powerful enough to statically check normal array
  access, so you'll be paying for lots of bounds checks.
* mmap's biggest performance advantage, the ability to use disk cache pages
  in place, is probably lost when your dataset doesn't fit into cache.
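
To make the streaming alternative concrete, here is a minimal sketch, not a
tuned implementation.  It reads fixed-size records from stdin with
Data.ByteString.hGet, so it works on a pipe as well as on a regular file.
The names recordSize and foldRecords, and the 16-byte record length, are
made up for illustration.

```haskell
import qualified Data.ByteString as B
import System.IO

-- Hypothetical record size in bytes; substitute sizeof(your C struct).
recordSize :: Int
recordSize = 16

-- Strict left fold over the fixed-size records read from a handle.
-- Only one record is held in memory at a time.
foldRecords :: (a -> B.ByteString -> a) -> a -> Handle -> IO a
foldRecords step = go
  where
    go acc h = do
      chunk <- B.hGet h recordSize
      if B.length chunk < recordSize
        then return acc              -- EOF; a trailing partial record is dropped
        else do
          let acc' = step acc chunk
          acc' `seq` go acc' h       -- force the accumulator to avoid thunk buildup

-- Count the records arriving on stdin.
main :: IO ()
main = do
  hSetBinaryMode stdin True
  n <- foldRecords (\c _ -> c + 1) (0 :: Int) stdin
  print n
```

Compiled to, say, ./count-records, this can be fed with
"zcat data.bin.gz | ./count-records" or "./count-records < data.bin".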

That said, if you actually need memory mapping, it shouldn't be too painful.

> Is there a fast way to do this using ghc?  I can extract fields by  
> using a ByteString, but I may not be using it fast enough:  I've had  
> to write my own routines to extract ints, longs and doubles.

* Define an instance of Storable.  If you are feeling altruistic, get a
  copy of DrIFT and add support for Storable.
* Use Data.Array.Storable.  This provides a mutable array interface to
  a pointer-to-array-of-struct.
* Use the FFI to access mmap(2); AFAIK there is no standard interface to
  it:
    foreign import ccall unsafe "mmap" c_mmap ::
      Ptr a -> CSize -> CInt -> CInt -> CInt -> COff -> IO (Ptr a)
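
As a rough illustration of the Storable suggestion, here is a sketch that
uses only the base library.  The Sample type and its layout (two 32-bit ints
followed by a double, 16 bytes in total) are invented for the example; the
offsets must match whatever your C struct really looks like, and the code
assumes the file was written on a machine with the same endianness.
readSamples and writeSamples are hypothetical helpers, not library functions.

```haskell
import Data.Int (Int32)
import Foreign.Marshal.Array (allocaArray, peekArray, withArrayLen)
import Foreign.Storable
import System.IO

-- Hypothetical record, mirroring
--   struct sample { int32_t id; int32_t flags; double value; };
data Sample = Sample
  { sampleId    :: !Int32
  , sampleFlags :: !Int32
  , sampleValue :: !Double
  } deriving (Show, Eq)

instance Storable Sample where
  sizeOf    _ = 16
  alignment _ = 8
  peek p = do
    i <- peekByteOff p 0
    f <- peekByteOff p 4
    v <- peekByteOff p 8
    return (Sample i f v)
  poke p (Sample i f v) = do
    pokeByteOff p 0 i
    pokeByteOff p 4 f
    pokeByteOff p 8 v

-- Read up to n records from a handle into a temporary buffer and
-- unmarshal however many complete records arrived.
readSamples :: Handle -> Int -> IO [Sample]
readSamples h n =
  allocaArray n $ \buf -> do
    got <- hGetBuf h buf (n * sizeOf (undefined :: Sample))
    peekArray (got `div` sizeOf (undefined :: Sample)) buf

-- Write records in the same binary layout (used here for a round trip).
writeSamples :: Handle -> [Sample] -> IO ()
writeSamples h xs =
  withArrayLen xs $ \n buf ->
    hPutBuf h buf (n * sizeOf (undefined :: Sample))

main :: IO ()
main = do
  withBinaryFile "samples.tmp" WriteMode $ \h ->
    writeSamples h [Sample 1 0 1.5, Sample 2 7 (-2.25)]
  xs <- withBinaryFile "samples.tmp" ReadMode $ \h -> readSamples h 1024
  mapM_ print xs
```

The same instance works unchanged with a pointer obtained from mmap:
cast the result to Ptr Sample and use peekElemOff to index into it.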

> Any help / examples would really be appreciated.

The sources for the standard libraries are generally a good source for
system interfacing questions.

