[Haskell-cafe] file splitter with enumerator package

Sat Jul 23 03:00:00 CEST 2011

There is one problem with your algorithm.  If the user asks for 4 GiB,
then the program will create files with *at least* 4 GiB.  So the user
would need to ask for less, maybe 3.9 GiB.  Even so there's some
danger, because there could be a 0.11 GiB line in the file.

Now, the biggest problem: your code won't run in constant memory.
'EB.take' does not lazily return a lazy ByteString.  It strictly
returns a lazy ByteString [1].  The lazy ByteString is used to avoid
copying data (as it is basically the same as a linked list of strict
bytestrings).  So if the user asked for 4 GiB files, this program
would need at least 4 GiB of memory, probably more due to overheads.

If you want to use lazy lazy ByteStrings (lazy ByteStrings with lazy
I/O, as opposed to lazy ByteStrings with strict I/O), the enumerator
package doesn't really buy you anything.  You should just use
bytestring package's lazy I/O functions.
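
To make the lazy I/O route concrete, here is a minimal sketch that
uses only the bytestring package.  The names 'chunksOfAtMost' and
'splitFileLazily' and the "<prefix>.<index>" output naming are made up
for illustration, and the sketch cuts at byte offsets, ignoring line
boundaries.  It should stream the input instead of loading it all at
once, but I haven't measured its memory use.

```haskell
import qualified Data.ByteString.Lazy as L
import Data.Int (Int64)

-- Cut a lazy ByteString into pieces of at most n bytes each.
-- Hypothetical helper: it does not look at line boundaries.
chunksOfAtMost :: Int64 -> L.ByteString -> [L.ByteString]
chunksOfAtMost n bs
  | n <= 0    = error "chunksOfAtMost: size must be positive"
  | L.null bs = []
  | otherwise = piece : chunksOfAtMost n rest
  where
    (piece, rest) = L.splitAt n bs

-- Read the input with lazy I/O and write piece number i to
-- "<prefix>.<i>".  The naming scheme is an assumption.
splitFileLazily :: Int64 -> FilePath -> FilePath -> IO ()
splitFileLazily n input prefix = do
  contents <- L.readFile input
  mapM_ write (zip [0 :: Int ..] (chunksOfAtMost n contents))
  where
    write (i, piece) = L.writeFile (prefix ++ "." ++ show i) piece
```

For example, 'splitFileLazily (4 * 1024 ^ 3) "big.log" "part"' would
write part.0, part.1 and so on, each with at most 4 GiB.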

If you want the guarantee of no leaks that enumerator gives, then you
have to use another way of constructing your program.  One safe way of
doing it is something like:

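  import Control.Monad.IO.Class (MonadIO, liftIO)
  import Data.Int (Int64)
  import System.IO (Handle)
  import qualified Data.ByteString as B
  import qualified Data.ByteString.Lazy as L
  import qualified Data.Enumerator as E

  takeNextLine :: Monad m => E.Iteratee B.ByteString m (Maybe L.ByteString)
  takeNextLine = ...

  go :: MonadIO m
     => Handle -> Int64 -> E.Iteratee B.ByteString m (Maybe L.ByteString)
  go h n = do
    mline <- takeNextLine
    case mline of
      Nothing -> return Nothing  -- end of input
      Just line
        -- the line still fits: write it and continue with a smaller budget
        | L.length line <= n -> liftIO (L.hPut h line) >> go h (n - L.length line)
        -- the line does not fit: hand it back to the driver
        | otherwise -> return mline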

So 'go h n' is the iteratee that writes at most 'n' bytes to handle 'h'
and returns the leftover line, if any.  The driver code needs to check
its result.  On 'Nothing', the input is exhausted and the program
finishes.  On 'Just line', write the line to a new file and call
'go h2 (n - L.length line)'.  It isn't efficient because lines could
be small, resulting in many small hPuts (bad).  But it is correct and
will never use more than 'n' bytes (great).  You could also have some
compromise where the user promises that no line is longer than 'x'
bytes (say, 1 MiB).  Then you call a bulk copy function for 'n - x'
bytes, and then call 'go h x'.  I think you can make the bulk copy
function with EB.isolate and EB.iterHandle.
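
To see which lines end up in which file, here is a pure model of 'go'
and its driver, working on a plain list of lines instead of an
iteratee.  It is only a sketch of the splitting rule: 'fill' and
'planFiles' are hypothetical names, no I/O is done, and the model puts
the leftover line first in every file, including the first one.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L
import Data.Int (Int64)

-- Pure model of 'go': take lines while they fit in the remaining
-- budget.  Returns the lines taken and the input that is left over
-- (its first line is the one that did not fit).
fill :: Int64 -> [L.ByteString] -> ([L.ByteString], [L.ByteString])
fill _ [] = ([], [])
fill n (l : ls)
  | L.length l <= n =
      let (taken, rest) = fill (n - L.length l) ls
      in (l : taken, rest)
  | otherwise = ([], l : ls)

-- Pure model of the driver: every file starts with the pending line,
-- then 'fill' continues with what remains of the budget.  A line
-- longer than 'n' ends up alone in an oversized file.
planFiles :: Int64 -> [L.ByteString] -> [[L.ByteString]]
planFiles _ [] = []
planFiles n (l : ls) =
  let (taken, rest) = fill (n - L.length l) ls
  in (l : taken) : planFiles n rest
```

With a budget of 8 bytes, the lines "aaa\n", "bb\n", "cccc\n" and
"d\n" are planned as two files: the first two lines (7 bytes), then
the last two (7 bytes).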

Cheers, =)

[1] http://hackage.haskell.org/packages/archive/enumerator/0.4.13.1/doc/html/src/Data-Enumerator-Binary.html#take

-- 
Felipe.