[Haskell-beginners] approximate string search
Mason Lai
masonmlai at gmail.com
Tue Apr 26 23:17:02 UTC 2016
Hi,
I'm writing a Haskell program which is looking for whether or not a string
(needle) exists within a larger string (haystack) given a Levenshtein
distance. Note that it shouldn't calculate the Levenshtein distance between
the needle and haystack. For example,
needle: AAA
haystack: TTTTTAATTTTT
The needle would be "found" if the Levenshtein distance is set at 1,
because dist("AAT", "AAA") == 1.
The following almost works:
distance(needle, haystack) - (len(haystack) - len(needle))
But it does not handle deletions correctly.
I've previously written a program which does this approximate search in
Java with the Wu-Manber extension of the Bitap, or shift-or, algorithm. The
same strategy seems difficult to code up in Haskell because it's very
"stateful" and involves a lot of bit-fiddling. (Maybe it's actually quite
simple, but I'm not sure how I would go about doing this.)
As a further complication, I'd prefer to keep the data packed as
ByteStrings, as I'll be dealing with around 200 GiBs of data (split and
parallelized over a cluster, but still a lot). I don't really know how to
deal with ByteStrings without using the functions that come along in the
ByteString module or unpacking the ByteString.
I've been using the language-spelling package, which has a module which can
calculate the Levenshtein distance between two ByteStrings. I see that it
uses ListLike. I'm not really sure how it works, but I assume it makes
ByteString an instance of ListLike, and then you have access to all of the
ListLike methods?
Does anyone have any advice on how I should proceed with this? I don't mind
learning new things, but I don't really know what the best strategy is or
where to look.
-- Mason
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/beginners/attachments/20160426/5bda2698/attachment.html>
More information about the Beginners
mailing list