substring search api

Tue Sep 18 10:58:55 EDT 2007

In message <20070918143444.C08F894047 at webmail220.herald.ox.ac.uk> Duncan Coutts
<duncan.coutts at worc.ox.ac.uk> writes:
> I want to start a discussion about a string searching api, especially with
> respect to ByteStrings.
> compare this to our standard List.break function.
> 
> > break isSpace "foo bar" = ("foo", " bar")
> > break isSpace "foobar"  = ("foo bar", "")

Oops, copy'n'pasto

> break isSpace "foobar"  = ("foobar", "")

obviously.

> findSubstrings :: ByteString -> ByteString -> [ByteString]
> 
> where each result begins an occurrence and includes the trailing text before 
> the next occurrence.

It has been pointed out to me that there are two styles here that correspond
naturally to overlapping or non-overlapping searches.

If we want to look for all occurrences, even occurrences that overlap then it
makes sense for each result to include the whole tail of the string, not just up
to the next occurrence. eg:

> findSubstrings "foo" "blah foo bar foo baz" =
>  ["foo bar foo baz", "foo baz"]

then it's clear we do not need an initial span since that's just the whole string.

If we want to split into non-overlapping spans then it makes more sense to
provide the initial span too:

> findSubstrings "foo" "blah foo bar foo baz" =
>  ("blah ", ["foo bar ", "foo baz"])

Tagsoup makes this kind of distinction:
http://www.cs.york.ac.uk/fp/haddock/tagsoup/Text-HTML-TagSoup.html#v%3Apartitions

I have to say, I'm rather inclined to not prejudge the issue and not provide any
findSubstrings function at the moment and wait and see what the common patterns
are and if any are worth putting in the lib.

So perhaps that's my straw-man proposal:
  * change BS.findSubstring to be :: BS -> BS -> (BS, BS)
     in the style of List.break
  * remove the current BS.findSubstrings

and of course to also add findSubstring with the equivalent type for the
ByteString.Lazy module.

Duncan