[Haskell-beginners] approximate string search

Sylvain Henry sylvain at haskus.fr
Wed Apr 27 03:50:04 UTC 2016


Hi,

The ListLike instance for ByteString is defined here: 
https://hackage.haskell.org/package/ListLike-3.1.7.1/docs/src/Data-ListLike-Instances.html

The implementation of Levenshtein distance in language-spelling is here: 
https://hackage.haskell.org/package/language-spelling-0.3.2/docs/src/Language-Distance.html#levenshtein

It seems to use only O(1) operations on the Bytestring: length and index 
(!!).
It uses an unboxed mutable vector in the ST monad.
It is specialized for ByteString.

So I think it should be quite efficient already. If you want you can use 
the same approach to implement the other algorithm, with Data.Bits for 
the bit-fiddling: 
https://hackage.haskell.org/package/base-4.8.2.0/docs/Data-Bits.html

Cheers,
Sylvain


On 27/04/2016 01:17, Mason Lai wrote:
> Hi,
>
> I'm writing a Haskell program which is looking for whether or not a 
> string (needle) exists within a larger string (haystack) given a 
> Levenshtein distance. Note that it shouldn't calculate the Levenshtein 
> distance between the needle and haystack. For example,
>
> needle: AAA
> haystack: TTTTTAATTTTT
>
> The needle would be "found" if the Levenshtein distance is set at 1, 
> because dist("AAT", "AAA") == 1.
>
> The following almost works:
>
> distance(needle, haystack) - (len(haystack) - len(needle))
>
> But it does not handle deletions correctly.
>
> I've previously written a program which does this approximate search 
> in Java with the Wu-Manber extension of the Bitap, or shift-or, 
> algorithm. The same strategy seems difficult to code up in Haskell 
> because it's very "stateful" and involves a lot of bit-fiddling. 
> (Maybe it's actually quite simple, but I'm not sure how I would go 
> about doing this.)
>
> As a further complication, I'd prefer to keep the data packed as 
> ByteStrings, as I'll be dealing with around 200 GiBs of data (split 
> and parallelized over a cluster, but still a lot). I don't really know 
> how to deal with ByteStrings without using the functions that come 
> along in the ByteString module or unpacking the ByteString.
>
> I've been using the language-spelling package, which has a module 
> which can calculate the Levenshtein distance between two ByteStrings. 
> I see that it uses ListLike. I'm not really sure how it works, but I 
> assume it makes ByteString an instance of ListLike, and then you have 
> access to all of the ListLike methods?
>
> Does anyone have any advice on how I should proceed with this? I don't 
> mind learning new things, but I don't really know what the best 
> strategy is or where to look.
>
> -- Mason
>
>
> _______________________________________________
> Beginners mailing list
> Beginners at haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/beginners

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/beginners/attachments/20160427/aec95706/attachment-0001.html>


More information about the Beginners mailing list