substring search api

Tue Sep 18 11:12:03 EDT 2007

My opinions on this are...

Duncan Coutts wrote:
> I want to start a discussion about a string searching api, especially with
> respect to ByteStrings.
> 
> I'm concerned that we're about to release a bytestring package along with
> ghc-6.8 that will freeze a less than ideal string search api in stone. We have
> an opportunity to make that better since at the moment I think there are few
> serious users of that api since the current implementation is very slow. However
> in this release cycle we're going to get optimised string searching algorithms
> and then I expect the number of users to pick up. I'd be a shame if that forces
> us to keep a not very nice api.

I think 'nice' needs to be pinned down.  There are the semantic properties we need:

(I) Must not prevent use of optimized algorithms
(II) Must not force whole lazy string into memory (unless required)
(III) Must not force holding onto start of lazy string

Note that these are only an issue with ByteString.Lazy

And the API characteristics: obvious/intuitive/consistent/easy to
learn/powerful/... (pick some).

> 
> That said, here is the current api.
> 
> -- | Find the indexes of all (possibly overlapping) occurances of a
> -- substring in a string.
> findSubstrings :: ByteString -- ^ String to search for.
>                -> ByteString -- ^ String to seach in.
>                -> [Int]

This style is often what is desired.  It should be offered.

> 
> -- | Get the first index of a substring in another string,
> --   or 'Nothing' if the string is not found.
> findSubstring :: ByteString -- ^ String to search for.
>               -> ByteString -- ^ String to seach in.
>               -> Maybe Int
> findSubstring p s = listToMaybe (findSubstrings p s)
> 
> -- | Check whether one string is a substring of another.
> isSubstringOf :: ByteString -- ^ String to search for.
>               -> ByteString -- ^ String to search in.
>               -> Bool
> isSubstringOf p s = not (null (findSubstrings p s))

Those two are convenience wrappers.

> 
> We only have it implemented for strict ByteStrings at the moment. This api isn't
> too bad for the strict ByteString implementation since we can use indexes to
> select substrings in O(1) time. However for lazy bytestrings index are much less
> good since we have to traverse the list of chunks to use the indexes.
> 
> A more compelling argument is that it's not very Haskelly to use indexes. We
> normally deal in spans rather than offsets and lengths.

Users may very well need the offsets, though.

> 
> One possible alternative that I think is more Haskelly and would suit the strict
> and lazy (and indeed ordinary list []) implementation is:
> 
> findSubstring :: ByteString -- ^ String to search for.
>               -> ByteString -- ^ String to seach in.
>               -> (ByteString,  -- span before search string
>                   ByteString)  -- span beginning search string, or empty

That means one needs to call some "length" function to get the offsets.  This
requires forcing the whole input string to get the length of the second found
location.  This is _not_ an alternative to [Int]

> compare this to our standard List.break function.
> 
>> break isSpace "foo bar" = ("foo", " bar")
>> break isSpace "foobar"  = ("foo bar", "")
> 
> So for findSubstring we'd have:
> 
>> findSubstring "bar" "foo bar baz"      = ("foo ","bar baz")
>> findSubstring "notfound" "foo bar baz" = ("foo bar baz", "")
> 
> I quite like this as the primitive.
> 
> isSubstringOf is easily definable in terms of that.
> 
> I should note that there is really no performance penalty for working in terms
> of spans rather than indexes. Indeed for the lazy case it'd be faster.

For the single return of findSubstring, I agree the performance difference is small.

> 
> findSubstrings is much less clear. What's not clear is what the result type and
> content should be. The current impl that gives us a list of indexes of
> possibly-overlapping occurrences seems straightforward. What is the break/span
> style equivalent?
> 
> A couple possibilities:
> 
> findSubstrings :: ByteString -> ByteString -> [ByteString]
> 
> where each result begins an occurrence and includes the trailing text before the
> next occurrence. Of course that means we loose any initial span. So we could have:
> 
> findSubstrings :: ByteString -> ByteString -> (ByteString, [ByteString])
> 
> that includes the initial span.
> 
> Then there's the issue of overlapping spans or not. Overlapping seems more
> general but also seems confusing when done in terms of spans.
> 
> Or we could just side-step the issue and not provide a findSubstrings at all and
> let people implement the behavior they want in terms of findSubstring. Afterall
> there is no 'many' tokenising variant of List.span or List.break, probably
> because of the wide variation in requirements.

This violates the (I) property: Good search algorithms often work better finding
all the hits for "findSubstrings" and work worse when built from the
"findSubstring" primitive.

> 
> I'd very much appreciate peoples opinions on this. Though bear in mind that I'm
> not looking for a large string searching framework, I want to pick a supportable
> primitive for strict and lazy bytestring searching that can go into the
> bytestring library now so we can avoid freezing a worse api.
> 
> Duncan

The searching API of regex-base could be a model.  It could even be re-used, as
long as your API allows one to make RegexLike instances.  This is just a
slightly degenerate case of regex searching, after all.

The Regular Expression analogy is a good one.  Note that the KMP and Boyer-Moore
algorithms pre-process the searched-for string much like a regular expression is
compiled.  This work can be cached by "saving" the one-argument curried version
of the findSubstrings function when the same searched-for string is sought in
several searched-in strings.  (Imagine looking for the target string in a series
of files).

Thus KMP and Boyer-Moore could just be different back-ends for regex-base.

But since finding literal string is a simpler case, it may well be that a
simpler API is called for, with simpler type classes.

Cheers,
  Chris Kuklewicz