finding sublist

Ketil Z. Malde ketil@ii.uib.no
06 May 2002 09:30:02 +0200


"Garner, Robin" <Robin.Garner@crsrehab.gov.au> writes:

> See Dijkstra's 'Discipline of Programming' for an o(M + N) algorithm - naive
> approches are o(MN) where M and N are the length of the list and substring
> respectively.

> Essentially the algorithm calculates a sort of autocorrelation table of the
> substring, showing where to resume the search after a failed match.

There's also the KMP (Knuth, Morris, Pratt) algorithm, which is
similar.  See Dan Gusfield: "Algorithms of strings, trees and
sequences" for lots of this stuff.

However: it is very hard to beat the naive implementation
(i.e. something like '\p -> or . map isPrefixOf p . tails', untested)
in the expected case, at least in my experience.  With an alphabet
size of s, you will statistically match the first character only in
1/s of the cases, the first and the second 1/s^2, and so on, so unless
your data are a bit peculiar (e.g. looking for "aaa...aab" in a
sequence of 'a's), the constant factors of the more complex algorithms
will probably not make it worthwhile.

On the other hand, if you need to search for many different patterns
in the same data, look at the suffix tree algorithms.  If they're too
difficult to implement effectively in a functional language, it seems
you can get similar results (in the expected case) by using tries.

-kzm
-- 
If I haven't seen further, it is by standing in the footprints of giants