[Haskell-cafe] How to deal with huge text file?
Daniel Fischer
daniel.is.fischer at web.de
Mon May 24 23:06:04 EDT 2010
On Tuesday 25 May 2010 04:26:07, Ivan Miljenovic wrote:
> On 25 May 2010 12:20, Magicloud Magiclouds <magicloud.magiclouds at gmail.com> wrote:
> > This is the function. The problem sure seems to be that something is
> > retained unexpectedly, but I cannot find out where the problem is.
> >
> > seperateOutput file =
> >   let content = lines file
> >       indexOfEachOutput_ = fst $ unzip $ filter (\(i, l) ->
> >           " Log for " `isPrefixOf` l ) $ zip [0..] content
> >       indexOfEachOutput =
> >           indexOfEachOutput_ ++ [length content] in
>                                   ^^^^^^^^^^^^^^^^
>                                   Expensive bit
>
> >   map (\(a, b) ->
> >     drop a $ take b content
> >     ) $ zip indexOfEachOutput $ tail indexOfEachOutput
>
> You're not "streaming" the String; you're also keeping it around to
> calculate the length. (I'm also not sure how GHC optimises that, if at
> all; it might even re-evaluate the length each time you use
> indexOfEachOutput.)
Not that it helps, but it evaluates the length only once.
But it does that at the very end, when dealing with the last log.
>
> The zipping of indexOfEachOutput should be OK without that length at
> the end, as it will lazily construct the zipped list (only evaluating up
> to two values at a time). However, you'd be better off using "zipWith
> f" rather than "map f . zip".
There'd still be the problem of

    drop a $ take b content

which refers to the whole of content for every chunk, so nothing can be
garbage collected before everything's done.
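To make that concrete, here is a rough sketch of what the suggested zipWith variant might look like. The name separateOutputZip is made up for illustration, and the prefix "Log for " (without a leading space) is an assumption. Every slice is still cut from the head of content, so the whole list stays reachable until the last chunk has been consumed.

```haskell
import Data.List (isPrefixOf)

-- Hypothetical zipWith variant of the original function.
-- The trailing (length content) index is kept so the last log is included.
separateOutputZip :: String -> [[String]]
separateOutputZip file =
    let content = lines file
        -- indices of all header lines, plus the end-of-input index
        idxs    = [i | (i, l) <- zip [0 ..] content, "Log for " `isPrefixOf` l]
                  ++ [length content]
        -- each slice starts again from the head of content,
        -- which is what keeps content alive
    in zipWith (\a b -> drop a (take b content)) idxs (tail idxs)
```

The result is the same as with map and zip; only the plumbing is tidier, and the space behaviour does not improve.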
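import Data.List (isPrefixOf, unfoldr)

separateOutput file =
    let contents = lines file
        isHeader = ("Log for " `isPrefixOf`)
        msplit [] = Nothing
        -- keep the header line, then take everything up to the next header
        msplit (l:lns) = let (body, rest) = break isHeader lns
                         in Just (l : body, rest)
    in unfoldr msplit (dropWhile (not . isHeader) contents)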
should fix it.
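For trying this out, a self-contained sketch along the same lines follows. It splits on header lines with unfoldr and break, keeps each header with its section, and discards anything before the first header. The type signatures, the helper isHeader and the driver printSectionSizes are assumptions added for illustration, not part of the original function.

```haskell
import Data.List (isPrefixOf, unfoldr)

-- True for a line that starts a new log section.
isHeader :: String -> Bool
isHeader = ("Log for " `isPrefixOf`)

-- Split the input into sections, one per "Log for " header line.
-- Lines before the first header are discarded.
separateOutput :: String -> [[String]]
separateOutput = unfoldr step . dropWhile (not . isHeader) . lines
  where
    step []       = Nothing
    step (l:rest) = let (body, more) = break isHeader rest
                    in Just (l : body, more)

-- Hypothetical driver: read stdin lazily and print the number of lines
-- in each section as soon as that section is complete.
printSectionSizes :: IO ()
printSectionSizes = do
    input <- getContents
    mapM_ (print . length) (separateOutput input)
```

Because each section is produced from the remainder of the previous one and nothing holds on to the start of the list, sections that have already been printed can be collected while the rest of the file is still being read.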