[Haskell-cafe] file splitter with enumerator package

David McBride dmcbride at neondsl.com
Tue Jul 26 02:26:25 CEST 2011


I feel like there is a slightly better way to code this: split the
part that writes the files from the part that counts bytes and checks
for newlines, like so:

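import Control.Monad.IO.Class (liftIO)
import Data.Enumerator (run_, ($$), ($=))
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.Enumerator.Binary as EB
import qualified Data.Enumerator.List as EL

run_ $ (EB.enumFile "file.txt" $= toChunksnl 4096) $$ toFiles filelist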

toFiles [] = error "expected infinite file list"
toFiles (f:fs) = do
  next <- EL.head
  case next of
    Nothing -> return ()
    Just next' -> do
      liftIO $ L.writeFile f next'
      toFiles fs

toChunksnl n = EL.concatMapAccum (somefunc n) L.empty
  where
    somefunc :: Int -> L.ByteString -> B.ByteString
             -> (L.ByteString, [L.ByteString])
    somefunc = undefined

The idea is that it has an accumulator that starts empty.  Each time
it gets a new bytestring, it splits the concatenation of the two into
as many full chunks ending in a newline as it can and returns those as
the second part of the pair; whatever remains unterminated becomes the
first part, i.e. the new accumulator.  I tried to write it myself but
couldn't seem to hit all the necessary edge cases, though it seems
like it should be doable for someone who wants to.  It would be
trivial with strings, but with bytestrings it requires a little elbow
grease.
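
For what it's worth, here is a rough sketch of what such an accumulator
function might look like, using only the bytestring package.  The name
splitChunksNl is made up for illustration, and I am assuming that a
"full chunk" means the shortest prefix that is at least n bytes long
and ends in a newline; a different reading of n would need different
logic.  It uses the Char8 modules, which share their types with
Data.ByteString and Data.ByteString.Lazy.

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as L

-- Hypothetical stand-in for somefunc.  Assumption: a "full chunk" is the
-- shortest prefix that is at least n bytes long and ends in '\n'.
splitChunksNl :: Int -> L.ByteString -> B.ByteString
              -> (L.ByteString, [L.ByteString])
splitChunksNl n acc new = go (acc `L.append` L.fromChunks [new])
  where
    -- Offset of the first byte that is allowed to end a chunk.
    minEnd = fromIntegral (max 1 n) - 1
    go buf =
      let (front, back) = L.splitAt minEnd buf
      in case L.elemIndex '\n' back of
           -- No newline at or past the minimum size: keep it all for later.
           Nothing -> (buf, [])
           Just i  ->
             let (lineEnd, rest) = L.splitAt (i + 1) back
                 (acc', chunks)  = go rest
             in (acc', L.append front lineEnd : chunks)
```

For example, splitChunksNl 4 L.empty (B.pack "ab\ncd\nef") gives
("ef", ["ab\ncd\n"]): the first newline falls before the 4-byte
minimum, so the chunk runs on to the second one, and "ef" is carried
over.  One caveat I have not verified: if concatMapAccum discards the
final accumulator at EOF, whatever is left over at the end of the file
would need to be flushed separately.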

As for your question about whether you should use iteratees inside
other iteratees: yes, of course.  It is all composable.

On Mon, Jul 25, 2011 at 1:38 PM, Eric Rasmussen <ericrasmussen at gmail.com> wrote:
> I just found another solution that seems to work, although I don't
> fully understand why. In my original function where I used EB.take to
> strictly read in a Lazy ByteString and then L.hPut to write it out to
> a handle, I now use this instead (full code in the annotation here:
> http://hpaste.org/49366):
>
> EB.isolate bytes =$ EB.iterHandle handle
>
> It now runs at the same speed but in constant memory, which is exactly
> what I was looking for. Is it recommended to nest iteratees within
> iteratees like this? I'm surprised that it worked, but I can't see a
> cleaner way to do it because of the other parts of the program that
> complicate matters. At this point I've achieved my original goals,
> unusual as they are, but since this has been an interesting learning
> experience I don't want it to stop there if there are more idiomatic
> ways to write code with the enumerator package.
>
> On Mon, Jul 25, 2011 at 4:06 AM, David McBride <dmcbride at neondsl.com> wrote:
>> Well I was going to say:
>>
>> import Data.Enumerator (run_, ($$), ($=))
>> import qualified Data.Text.IO as T
>> import qualified Data.Enumerator.List as EL
>> import qualified Data.Enumerator.Text as ET
>>
>> run_ $ (ET.enumHandle fp $= ET.lines) $$ EL.mapM_ T.putStrLn
>>
>> for example.  But it turns out this actually concatenates the lines
>> together and prints one single string at the end.  The reason is that
>> ET.enumHandle already reads the handle line by line without being
>> asked, and it doesn't add newlines to the end.  ET.lines then looks
>> at each chunk, never sees a newline, and returns the entire thing
>> concatenated together, figuring that was a single line.  I'm kind of
>> surprised that enumHandle fetches linewise rather than letting you
>> handle it.
>>
>> But if you were to make your own enumHandle that wasn't linewise, that
>> would work.
>>
>> On Mon, Jul 25, 2011 at 6:26 AM, Yves Parès <limestrael at gmail.com> wrote:
>>> Okay, so there, the chunks (xs) will be lines of Text, and not just random
>>> blocks.
>>> Isn't there a primitive like printChunks in the enumerator library, or are
>>> we forced to handle Chunks and EOF by hand?
>>>
>>> 2011/7/25 David McBride <dmcbride at neondsl.com>
>>>>
>>>> blah = do
>>>>  fp <- openFile "file" ReadMode
>>>>  run_ $ (ET.enumHandle fp $= ET.lines) $$ printChunks True
>>>>
>>>> printChunks is super duper simple:
>>>>
>>>> printChunks printEmpty = continue loop where
>>>>        loop (Chunks xs) = do
>>>>                let hide = null xs && not printEmpty
>>>>                CM.unless hide (liftIO (print xs))
>>>>                continue loop
>>>>
>>>>        loop EOF = do
>>>>                liftIO (putStrLn "EOF")
>>>>                yield () EOF
>>>>
>>>> Just replace print with whatever IO action you wanted to perform.
>>>>
>>>> On Mon, Jul 25, 2011 at 4:31 AM, Yves Parès <limestrael at gmail.com> wrote:
>>>> > Sorry, I'm only beginning to understand iteratees, but then how do you
>>>> > access each line of text output by the enumeratee "lines" within an
>>>> > iteratee?
>>>> >
>>>> > 2011/7/24 Felipe Almeida Lessa <felipe.lessa at gmail.com>
>>>> >>
>>>> >> On Sun, Jul 24, 2011 at 12:28 PM, Yves Parès <limestrael at gmail.com>
>>>> >> wrote:
>>>> >> > If you used Data.Enumerator.Text, you might benefit from the
>>>> >> > "lines" function:
>>>> >> >
>>>> >> > lines :: Monad m => Enumeratee Text Text m b
>>>> >>
>>>> >> It gets arbitrary blocks of text and outputs lines of text.
>>>> >>
>>>> >> > But there is something I don't get with that signature:
>>>> >> > why isn't it:
>>>> >> > lines :: Monad m => Enumeratee Text [Text] m b
>>>> >> > ??
>>>> >>
>>>> >> Lists of lines of text?
>>>> >>
>>>> >> Cheers, =)
>>>> >>
>>>> >> --
>>>> >> Felipe.
>>>> >
>>>> >
>>>> > _______________________________________________
>>>> > Haskell-Cafe mailing list
>>>> > Haskell-Cafe at haskell.org
>>>> > http://www.haskell.org/mailman/listinfo/haskell-cafe
>>>> >
>>>> >
>>>
>>>
>>


