[Haskell-cafe] Searching of several substrings (with Data.Text ?)

Daniel Fischer daniel.is.fischer at googlemail.com
Tue Jul 5 20:55:50 CEST 2011


On Tuesday 05 July 2011, 20:01:26, Tillmann Vogt wrote:
> Hi,
> 
> For my font library I need A function that can handle ligatures. It can
> be explained best with an example:
> 
> f  ["Th", "ff", "fi", "fl", "ffi"]  "The fluffiest bunny"
> 
> should be evaluated to
> 
> ["Th", "e", " ", "fl", "u", "ffi", "e", "s", "t", " ", "b", "u", "n",
> "n", "y" ]
> 
> I looked at Data.Text
> http://hackage.haskell.org/packages/archive/text/0.5/doc/html/Data-Text.
> html and
> http://hackage.haskell.org/packages/archive/stringsearch/0.3.3/doc/html/
> Data-ByteString-Search.html
> 
> but they don't have a function that can search several substrings in one
> run.

Well, Data.ByteString[.Lazy].KarpRabin does provide simultaneous search for 
multiple substrings - it does not, however, provide direct splitting.

But in my tests, unless the number of substrings was large, multiple 
separate (Boyer-Moore) passes with manual merging of the offset lists were 
much faster [there's a possible space leak for lazy ByteStrings if any 
pattern does not appear in a long substring of the source, so that would be 
a point for Data.ByteString.Lazy.KarpRabin], and I didn't know of any real 
use-case, so I did not pursue it further and considered it just an 
interesting curiosity.

I suppose using a regex package as Johannes Waldmann suggested would be the 
easiest way (probably also faster).

If you submit a feature request, however, I would look into expanding the 
offered functionality (but I'll be on vacation soon, so performance would 
have to wait; I suppose something could be gained there).

> I guess that searching a text again and again for every substring is not
> very efficient and it can be done in one run.

Well, it is surprisingly efficient, compared to (my implementation of) the 
Karp-Rabin algorithm at least.

> Although I may figure this out myself I think such a function could be
> so common that someone has done it or can give me some tips.
> 
> Thanks

Cheers,
Daniel



More information about the Haskell-Cafe mailing list