[Haskell-beginners] approximate string search

Tue Apr 26 23:17:02 UTC 2016

Hi,

I'm writing a Haskell program which is looking for whether or not a string
(needle) exists within a larger string (haystack) given a Levenshtein
distance. Note that it shouldn't calculate the Levenshtein distance between
the needle and haystack. For example,

needle: AAA
haystack: TTTTTAATTTTT

The needle would be "found" if the Levenshtein distance is set at 1,
because dist("AAT", "AAA") == 1.

The following almost works:

distance(needle, haystack) - (len(haystack) - len(needle))

But it does not handle deletions correctly.

I've previously written a program which does this approximate search in
Java with the Wu-Manber extension of the Bitap, or shift-or, algorithm. The
same strategy seems difficult to code up in Haskell because it's very
"stateful" and involves a lot of bit-fiddling. (Maybe it's actually quite
simple, but I'm not sure how I would go about doing this.)

As a further complication, I'd prefer to keep the data packed as
ByteStrings, as I'll be dealing with around 200 GiBs of data (split and
parallelized over a cluster, but still a lot). I don't really know how to
deal with ByteStrings without using the functions that come along in the
ByteString module or unpacking the ByteString.

I've been using the language-spelling package, which has a module which can
calculate the Levenshtein distance between two ByteStrings. I see that it
uses ListLike. I'm not really sure how it works, but I assume it makes
ByteString an instance of ListLike, and then you have access to all of the
ListLike methods?

Does anyone have any advice on how I should proceed with this? I don't mind
learning new things, but I don't really know what the best strategy is or
where to look.

-- Mason
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/beginners/attachments/20160426/5bda2698/attachment.html>