[Haskell-cafe] ANNOUNCE: text 0.8.0.0, fast Unicode text support

Wed Sep 8 07:46:13 EDT 2010

On Wednesday 08 September 2010 06:44:32, Bryan O'Sullivan wrote:
> I have a Replace.hs benchmark in the main text repo, just to be sure
> we're talking about the same thing.

Grabbed the darcs repo, compiled with the darcs version and also with the 
installed text package, both exhibit the same behaviour - what I reported 
before.

> Factoring out the time spent on I/O,
> with GHC HEAD, my replace code takes twice the space and time of that in
> the stringsearch package.

That's very good. Twice the space is a given for mostly ASCII text, twice 
the time is okay, I think.
My timings are quite different, but that's probably because 6.12.3's 
inliner doesn't give the full fusion benefit, so it'll improve 
automatically with the next GHC release.

> Given that the space involved is just 121KB
> maximum residency while processing a 124MB file, I'm not concerned about
> it.

I wouldn't either.

> And the time required isn't a bad place to start from, I think.
>
> By the way, as this implies, I can't reproduce your space behaviour at
> all.
>

That's surprising.
Have you made sure to replace a pattern which does not occur in the text?
Can you reproduce the behaviour with a) Data.List.intersperse instead of 
the lazier version now used, b) ghc-6.12.* instead of HEAD?

Anyway, I would've thought that with

split pat src
    | null pat        = emptyError "split"
    | isSingleton pat = splitBy (== head pat) src
    | otherwise       = go 0 (indices pat src) src
  where
    go  _ []     cs = [cs]
    go !i (x:xs) cs = let h :*: t = splitAtWord (x-i) cs
                      in  h : go (x+l) xs (dropWords l t)
    l = foldlChunks (\a (T.Text _ _ b) -> a + fromIntegral b) 0 pat

you can't start returning chunks before it's known whether the list of 
indices is empty, so split would have O(index of pattern) space behaviour.

If HEAD manages to make the chunks available before they are complete 
(before it's known how long they will be), it's even awesomer than I'd have 
dared to hope.
Okay, so I'll have to try HEAD.

> > I can now say more. Looking at Data.Text.Lazy.replace,
> >
> > replace s d = intercalate d . split s
> >
> > , I also got a space leak with that for BS.Lazy's intercalate and
> >
> > stringsearch's split.
>
> How did you observe the space leak? Looking at -hT charted with hp2ps
> shows me nothing, and the program executes in constant space regardless
> of input size.

top and +RTS -s. I've attached the -hT graph of a run on a 74.3MB file with 
a pattern that does not occur. It shows exactly the behaviour I expected 
from the code of split, pretty constant growth of the heap until about 
twice the file size is reached, then fast and pretty constant shrinking as 
the text is output. The graphs are much more interesting if you do 
replacements of patterns without large gaps between occurrences.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: nbenchText.ps
Type: application/postscript
Size: 7826 bytes
Desc: not available
Url : http://www.haskell.org/pipermail/haskell-cafe/attachments/20100908/35ccf69c/nbenchText.ps