[Haskell-cafe] How to deal with huge text file?

Daniel Fischer daniel.is.fischer at web.de
Mon May 24 23:06:04 EDT 2010


On Tuesday 25 May 2010 04:26:07, Ivan Miljenovic wrote:
> On 25 May 2010 12:20, Magicloud Magiclouds
> <magicloud.magiclouds at gmail.com> wrote:
> > This is the function. The problem sure seems to be that something is
> > retained unexpectedly. But I cannot find out where the problem is.
> >
> > seperateOutput file =
> >  let content = lines file
> >      indexOfEachOutput_ = fst $ unzip $ filter (\(i, l) ->
> >                                                 " Log for " `isPrefixOf` l )
> >                             $ zip [0..] content
> >      indexOfEachOutput = indexOfEachOutput_ ++ [length content] in
>                                                    ^^^^^^^^^^^^^^
>                                                    Expensive bit
>
> >  map (\(a, b) ->
> >         drop a $ take b content
> >      ) $ zip indexOfEachOutput $ tail indexOfEachOutput
>
> You're not "streaming" the String; you're also keeping it around to
> calculate the length (I'm also not sure how GHC optimises that, if at
> all; it might even re-evaluate the length each time you use
> indexOfEachOutput).

Not that it helps, but it evaluates the length only once.
But it does that at the very end, when dealing with the last log.

>
> The zipping of indexOfEachOutput should be OK without that length at
> the end, as it will lazily construct the zipped list (only evaluating
> up to two values at a time).  However, you'd be better off using
> "zipWith f" rather than "map f . zip".

There'd still be the problem of

drop a $ take b content

Every chunk refers to content from its very first line, so nothing can be
garbage collected before everything's done.

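import Data.List (isPrefixOf, unfoldr)

separateOutput file =
    let contents = lines file
        split = break ("Log for " `isPrefixOf`)
        msplit [] = Nothing
        -- keep the header line out of break, so each step consumes input
        msplit (l:ls) = let (body, rest) = split ls in Just (l : body, rest)
    in unfoldr msplit $ snd (split contents)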

should fix it.
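
For reference, a minimal self-contained sketch of the break/unfoldr idea. The
name splitLogs and the explicit prefix argument are only for illustration and
are not from the code quoted above. It assumes each log starts at a line
beginning with the prefix, and that anything before the first such line can be
dropped. Each step takes the header line off before calling break; otherwise
break would return an empty chunk at a header line and unfoldr would never
advance.

```haskell
import Data.List (isPrefixOf, unfoldr)

-- | Split lines into one chunk per log (hypothetical helper).
-- A chunk starts at a header line (one beginning with the given prefix)
-- and runs up to, but not including, the next header line.
-- Lines before the first header are dropped.
splitLogs :: String -> [String] -> [[String]]
splitLogs prefix = unfoldr step . dropWhile (not . isHeader)
  where
    isHeader = (prefix `isPrefixOf`)

    step []       = Nothing
    step (l : ls) =
      -- l is the header of this chunk; break only scans the lines after it,
      -- so the remaining input shrinks on every step.
      let (body, rest) = break isHeader ls
      in  Just (l : body, rest)
```

Nothing here asks for the length of the input or indexes into it, so the
chunks are produced lazily. Provided the consumer finishes with each chunk
before moving on to the next, and nothing else holds on to the list of lines,
the earlier chunks should become garbage as you go. It could be fed with
something like lines <$> readFile path, where path is a placeholder.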
