[Haskell-cafe] Searching of several substrings (with Data.Text ?)
Daniel Fischer
daniel.is.fischer at googlemail.com
Tue Jul 5 20:55:50 CEST 2011
On Tuesday 05 July 2011, 20:01:26, Tillmann Vogt wrote:
> Hi,
>
> For my font library I need A function that can handle ligatures. It can
> be explained best with an example:
>
> f ["Th", "ff", "fi", "fl", "ffi"] "The fluffiest bunny"
>
> should be evaluated to
>
> ["Th", "e", " ", "fl", "u", "ffi", "e", "s", "t", " ", "b", "u", "n",
> "n", "y" ]
>
> I looked at Data.Text
> http://hackage.haskell.org/packages/archive/text/0.5/doc/html/Data-Text.
> html and
> http://hackage.haskell.org/packages/archive/stringsearch/0.3.3/doc/html/
> Data-ByteString-Search.html
>
> but they don't have a function that can search several substrings in one
> run.
Well, Data.ByteString[.Lazy].KarpRabin does provide simultaneous search for
multiple substrings - it does not, however, provide direct splitting.
But in my tests, unless the number of substrings was large, multiple
separate (Boyer-Moore) passes with manual merging of the offset lists were
much faster [there's a possible space leak for lazy ByteStrings if any
pattern does not appear in a long substring of the source, so that would be
a point for Data.ByteString.Lazy.KarpRabin], and I didn't know of any real
use-case, so I did not pursue it further and considered it just an
interesting curiosity.
I suppose using a regex package as Johannes Waldmann suggested would be the
easiest way (probably also faster).
If you submit a feature request, however, I would look into expanding the
offered functionality (but I'll be on vacation soon, so performance would
have to wait; I suppose something could be gained there).
> I guess that searching a text again and again for every substring is not
> very efficient and it can be done in one run.
Well, it is surprisingly efficient, compared to (my implementation of) the
Karp-Rabin algorithm at least.
> Although I may figure this out myself I think such a function could be
> so common that someone has done it or can give me some tips.
>
> Thanks
Cheers,
Daniel
More information about the Haskell-Cafe
mailing list