[Haskell-cafe] Re: zip-archive performance/memmory usage

Pieter Laeremans pieter at laeremans.org
Tue Aug 10 15:47:32 EDT 2010


Thanks for your comments John.
I appreciate your work.  I think pandoc is fantastic!

I'm interested in solving this problem, but time is also an issue.
I'll try toying around with it.

Thanks,

Pieter



On Tue, Aug 10, 2010 at 7:06 PM, John MacFarlane <jgm at berkeley.edu> wrote:

> Hi all,
>
> I'm the author of zip-archive. I wrote it for a fairly special purpose --
> I wanted to create and read ODT files in pandoc -- and I know it could be
> improved.
>
> The main problem is that the parsing algorithm is kind of stupid; it just
> reads the whole archive in sequence, storing the files as it comes to them.
> So a file listing will take almost as much time as a full extract.
>
> There's a better way: the zip archive ends with an "end of central directory
> record", which contains (among other things) the offset of the central
> directory from the beginning of the file. So one could use something like
> the following strategy:
>
> 1. read the "end of central directory record", which should be the last
> 22 bytes of the file. I think it should be possible to do this without
> allocating memory for the whole file.
>
> 2. parse that to determine the offset of the central directory.
>
> 3. seek to the offset of the central directory and parse it. This will give
> you a list of file headers. Each file header tells you the name of a file
> in the archive, how it is compressed, and where to find it (its offset)
> in the file.
>
> At this point you'd have the list of files, and enough information to seek
> to any file and read it from the archive. The API could be changed to
> allow lazy reading of a single file without reading all of them.
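For the curious, steps 1 and 2 above might be sketched roughly as follows. This is a hypothetical sketch built on the binary package's Data.Binary.Get, not code from zip-archive; `EOCD`, `getEOCD`, and `readEOCD` are invented names, and it assumes the archive has no trailing ZIP comment (otherwise the EOCD record starts earlier and must be searched for backwards).

```haskell
-- Sketch: locate the central directory by reading only the last 22 bytes
-- of the archive (the end-of-central-directory record, per the ZIP spec).
import qualified Data.ByteString.Lazy as BL
import           Data.Binary.Get (Get, getWord16le, getWord32le, runGet, skip)
import           Data.Word (Word16, Word32)
import           System.IO (IOMode (ReadMode), SeekMode (SeekFromEnd),
                            hSeek, withBinaryFile)

data EOCD = EOCD
  { cdEntries :: Word16  -- total entries in the central directory
  , cdSize    :: Word32  -- size of the central directory in bytes
  , cdOffset  :: Word32  -- offset of the central directory from file start
  } deriving Show

-- Step 2: parse the fixed 22-byte EOCD record.
getEOCD :: Get EOCD
getEOCD = do
  sig <- getWord32le
  if sig /= 0x06054b50
    then fail "no EOCD signature here (trailing archive comment?)"
    else do
      skip 6                    -- disk number fields, per-disk entry count
      entries <- getWord16le    -- total number of central directory records
      size    <- getWord32le    -- size of the central directory
      offset  <- getWord32le    -- where the central directory begins
      return (EOCD entries size offset)

-- Step 1: seek to the end and read just 22 bytes, never the whole file.
readEOCD :: FilePath -> IO EOCD
readEOCD path = withBinaryFile path ReadMode $ \h -> do
  hSeek h SeekFromEnd (-22)
  bytes <- BL.hGet h 22
  return (runGet getEOCD bytes)
```

From there, step 3 would seek to cdOffset and parse the central directory headers, which carry the per-file names, compression methods, and local offsets.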
>
> I don't think these changes would be too difficult, since you wouldn't
> have to change any of the functions that do the binary parsing -- it would
> just be a matter of changing the top-level functions.
>
> I don't have time to do this right now, but if one of you wants to tackle
> the problem, patches are more than welcome! There's some documentation on
> the ZIP format in comments in the source code.
>
> John
>
>
> +++ Neil Brown [Aug 10 10 12:35 ]:
> > On 10/08/10 00:29, Pieter Laeremans wrote:
> > >Hello,
> > >
> > >I'm trying some haskell scripting. I'm writing a script to print
> > >some information
> > >from a zip archive.  The zip-archive library does look nice but
> > >the performance of zip-archive/lazy bytestring
> > >doesn't seem to scale.
> > >
> > >Executing :
> > >
> > >   eRelativePath $ head $ zEntries archive
> > >
> > >on an archive of around 12 MB with around 20 files yields
> > >
> > >Stack space overflow: current size 8388608 bytes.
> > >
> > >
> > >The script in question can be found at :
> > >
> > >http://github.com/plaeremans/HaskellSnipplets/blob/master/ZipList.hs
> > >
> > >I'm using the latest version of the Haskell Platform.  Are these
> > >libraries not production ready, or am I doing something terribly wrong?
> >
> > I downloaded your program and compiled it (GHC 6.12.1, zip-archive
> > 0.1.1.6, bytestring 0.9.1.5).  I ran it on the JVM src.zip (20MB,
> > ~8000 files); it sat there for a minute (67s), taking 2.2% memory
> > according to top, then completed successfully.  Same behaviour with
> > -O2.  That compares very badly in time to the instant return of
> > unzip -l on the same file, but I didn't see any memory problems.
> > Presumably your archive is valid and works with unzip and other
> > tools?
> >
> > Thanks,
> >
> > Neil.
> >
> > _______________________________________________
> > Haskell-Cafe mailing list
> > Haskell-Cafe at haskell.org
> > http://www.haskell.org/mailman/listinfo/haskell-cafe
>



-- 
Pieter Laeremans <pieter at laeremans.org>

"The future is here. It's just not evenly distributed yet."  W. Gibson