[Haskell-cafe] file splitter with enumerator package
Eric Rasmussen
ericrasmussen at gmail.com
Sat Jul 23 04:56:53 CEST 2011
Hi Felipe,
Thank you for the very detailed explanation and help. Regarding the first
point, for this particular use case it's fine if the user-specified file
size is extended by the length of a partial line (it's a compact csv file so
if the user breaks a big file into 100mb chunks, each chunk would only ever
be about 100mb + up to 80 bytes, which is fine for the user).
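
To make that overrun behaviour concrete, here is a minimal sketch using plain Handle I/O from the bytestring package instead of enumerator. The names copyBytes, copyRestOfLine and splitFile, the 64 KiB block size, and the numbered output file names are all made up for illustration. Each chunk gets n bytes copied in bulk, followed by whatever remains of the current line.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import System.IO

-- Copy up to n bytes from hIn to hOut in fixed-size blocks, so memory use
-- is bounded by the block size rather than by n.
copyBytes :: Handle -> Handle -> Int -> IO ()
copyBytes hIn hOut = loop
  where
    blockSize = 64 * 1024 -- assumed block size, not taken from the thread
    loop remaining
      | remaining <= 0 = return ()
      | otherwise = do
          block <- B.hGetSome hIn (min blockSize remaining)
          if B.null block
            then return () -- input exhausted
            else B.hPut hOut block >> loop (remaining - B.length block)

-- Copy bytes one at a time up to and including the next newline (or EOF).
-- This is the short overrun past n; byte-at-a-time is acceptable only
-- because lines are assumed to be short.
copyRestOfLine :: Handle -> Handle -> IO ()
copyRestOfLine hIn hOut = do
  byte <- B.hGetSome hIn 1
  if B.null byte
    then return ()
    else do
      B.hPut hOut byte
      if byte == BC.singleton '\n'
        then return ()
        else copyRestOfLine hIn hOut

-- Split path into path.1, path.2, ... and return how many chunks were written.
splitFile :: FilePath -> Int -> IO Int
splitFile path n = withBinaryFile path ReadMode (\hIn -> loop hIn 1)
  where
    loop hIn i = do
      eof <- hIsEOF hIn
      if eof
        then return (i - 1)
        else do
          withBinaryFile (path ++ "." ++ show i) WriteMode $ \hOut -> do
            copyBytes hIn hOut n
            copyRestOfLine hIn hOut
          loop hIn (i + 1)
```

If the bulk copy happens to stop exactly on a newline, this version still appends one more full line, so the overrun is at most one line either way.
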
I'm intrigued by the idea of making the bulk copy function with EB.isolate
and EB.iterHandle, but I couldn't find a way to fit these into the larger
context of writing to multiple file handles. I'll keep working on it and see
if I can address the concerns you brought up.
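
One way the bulk copy might fit into a loop over several output handles, as an unverified sketch assuming the enumerator 0.4.x API (EB.isolate, EB.iterHandle, E.isEOF, E.joinI). The names bulkCopy and splitChunks and the numbered file naming are hypothetical, and this version cuts at exact byte counts without regard to line boundaries.

```haskell
import Control.Monad (unless)
import Control.Monad.IO.Class (MonadIO, liftIO)
import qualified Data.ByteString as B
import qualified Data.Enumerator as E
import qualified Data.Enumerator.Binary as EB
import System.IO

-- Feed at most n bytes of the stream to the handle; any remaining input
-- is left in the stream for whatever iteratee runs next.
bulkCopy :: MonadIO m => Handle -> Integer -> E.Iteratee B.ByteString m ()
bulkCopy h n = E.joinI (EB.isolate n E.$$ EB.iterHandle h)

-- Open prefix.1, prefix.2, ... in turn and bulk copy n bytes into each
-- until the input stream reports EOF.
splitChunks :: MonadIO m => FilePath -> Integer -> E.Iteratee B.ByteString m ()
splitChunks prefix n = loop (1 :: Int)
  where
    loop i = do
      done <- E.isEOF
      unless done $ do
        h <- liftIO (openBinaryFile (prefix ++ "." ++ show i) WriteMode)
        bulkCopy h n
        liftIO (hClose h)
        loop (i + 1)
```

It could be run with something like E.run_ (EB.enumFile path E.$$ splitChunks prefix n). A handle is leaked if an exception escapes between the open and the close, so real code would want some form of bracketing.
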
Thanks again!
Eric
On Fri, Jul 22, 2011 at 6:00 PM, Felipe Almeida Lessa <
felipe.lessa at gmail.com> wrote:
> There is one problem with your algorithm. If the user asks for 4 GiB,
> then the program will create files with *at least* 4 GiB. So the user
> would need to ask for less, maybe 3.9 GiB. Even so there's some
> danger, because there could be a 0.11 GiB line on the file.
>
> Now, the biggest problem: your code won't run in constant memory.
> 'EB.take' does not lazily return a lazy ByteString. It strictly
> returns a lazy ByteString [1]. The lazy ByteString is used to avoid
> copying data (as it is basically the same as a linked list of strict
> bytestrings). So if the user asked for 4 GiB files, this program
> would need at least 4 GiB of memory, probably more due to overheads.
>
> If you want to use lazy lazy ByteStrings (lazy ByteStrings with lazy
> I/O, as opposed to lazy ByteStrings with strict I/O), the enumerator
> package doesn't really buy you anything. You should just use
> bytestring package's lazy I/O functions.
>
> If you want the guarantee of no leaks that enumerator gives, then you
> have to use another way of constructing your program. One safe way of
> doing it is something like:
>
> takeNextLine :: E.Iteratee B.ByteString m (Maybe L.ByteString)
> takeNextLine = ...
>
> -- liftIO comes from Control.Monad.IO.Class; L.hPut runs in IO,
> -- so the iteratee's base monad must be a MonadIO.
> go :: MonadIO m => Handle -> Int64
>    -> E.Iteratee B.ByteString m (Maybe L.ByteString)
> go h n = do
>   mline <- takeNextLine
>   case mline of
>     Nothing -> return Nothing
>     Just line
>       | L.length line <= n -> do
>           liftIO (L.hPut h line)
>           go h (n - L.length line)
>       | otherwise -> return mline
>
> So 'go h n' is the iteratee that saves at most 'n' bytes in handle 'h'
> and returns the leftover data. The driver code needs to check its
> results. Case 'Nothing', then the program finishes. Case 'Just
> line', save line on a new file and call 'go h2 (n - L.length line)'.
> It isn't efficient because lines could be small, resulting in many
> small hPuts (bad). But it is correct and will never use more than 'n'
> bytes (great). You could also have some compromise where the user
> says that he'll never have lines longer than 'x' bytes (say, 1 MiB).
> Then you call a bulk copy function for 'n - x' bytes, and then call
> 'go h x'. I think you can make the bulk copy function with EB.isolate
> and EB.iterHandle.
>
> Cheers, =)
>
> [1]
> http://hackage.haskell.org/packages/archive/enumerator/0.4.13.1/doc/html/src/Data-Enumerator-Binary.html#take
>
> --
> Felipe.
>