[Haskell-cafe] Data.Text performance problem

Daniel Fischer daniel.is.fischer at web.de
Sun Sep 12 17:01:27 EDT 2010


On Sunday 12 September 2010 21:23:50, Petr Prokhorenkov wrote:
> I experienced a following problem while dealing with some text
> processing.
>
> I have a text and want to get the same text with parts enclosed into {}
> or [] stripped away. Substituting them with a ' ' would also work.
>
> Here is the code I wrote (T is Data.Text):
>
> stripBrackets :: T.Text -> T.Text
> stripBrackets text = snd $ T.mapAccumL f 0 text where
>     f depth c = let
>        depth' = depth + d' c
>         c' | depth > 0 || depth' > 0 = ' '
>
>           | otherwise = c
>
>         in
>        (depth', c')
>
>    d' '{' = 1
>    d' '[' = 1
>     d' '}' = -1
>    d' ']' = -1
>     d' _   = 0
>
> The only problem is that it takes about a minute to complete on 3GHz+
> processor when text is a 30k chars long.
>
> Any ideas how to improve this code?

First, is it intentional that

stripBrackets "a]b[c]" = "a]b[c]" and not "a]b   " ?

Also that it doesn't distinguish between the types of brackets, e.g.

stripBrackets "{a]" = "   " ?

Of course, if you know that the text is properly bracketed, that's not 
important.

Concerning the performance, it seems that mapAccumL actually has quadratic 
complexity instead of linear because it doesn't get fused.
Ouch.

I think

stripBrackets :: T.Text -> T.Text
stripBrackets text = T.concat $ go 0 text
  where
    go depth inp = case T.breakBy (`elem` "{[]}") inp of
                    (pre,post) ->
                      case T.uncons post of
                        Nothing -> if depth > 0
                                    then [T.pack $ replicate (T.length pre) 
' ']
                                    else [pre]
                        Just (c,tl) ->
                            let depth'
                                  | c `elem` "{[" = depth + 1
                                  | otherwise = depth - 1
                            in (if depth > 0
                                 then T.pack (replicate (T.length pre) ' ')
                                 else pre) :
                                    T.pack (if depth > 0 || depth' > 0
                                                then " "
                                                else [c])
                                    : go depth' tl

has the same semantics as your function. It's not blazingly fast either 
(using String for that purpose is much faster), but it's orders of 
magnitude faster than T.mapAccumL.

Note that it's easy to remove the parts enclosed in brackets instead of 
replacing it by ' ' here.

>
> --
> Regards, Petr

Cheers,
Daniel


More information about the Haskell-Cafe mailing list