[Haskell-cafe] Re: Knuth Morris Pratt for Lazy Bytestrings implementation

Wed Aug 1 15:23:58 EDT 2007

I am still working on improving your code.

And I have a "Is this a bug?" question:

The "lookup = computeLookup pat" defines lookup to take an Int which represents
the index into pat, where this index is 0 based and the 0th item is the head of pat.

Now look at "matchLength :: Int; matchLength = let ... in matchLength' (B.drop
(fromIntegral patShift) pat) str 0"

That starts the "pat" from a shorter version that starts at 'patShift' and sets
the last argument to matchLength' to 0.

matchLength' is in the ... as:

let matchLength' pat str cnt | B.null pat = cnt
    matchLength' pat str cnt | B.null str = 0
    matchLength' pat str cnt | (B.head pat == B.head str) =
      matchLength' (B.drop 1 pat) (B.drop 1 str) (cnt + 1)
                             |   otherwise = cnt

I see that cnt is incremented until
  * all of pat is matched, return cnt
  * all of str is used, return 0
  * a mismatch is found, return cnt

So a mismatch means that the character at index (patShift+cnt) in the original
'pat' had the mismatch.

Now I see this that uses the 'cnt' returned from matchLength:

shiftAmt = {-# SCC "kmpMatch_shiftAmt" #-} lookup matchLength

And this is the bug.  lookup should be given (shiftAmt + matchLength).

This was not clear or obvious since
  * Nearly everything is Int or Int64
  * The name "pat" in matchLength' shadows "pat" in kmpMatch

This was not caught by quickcheck since it is an internal value and only rarely
will trigger an error where a legitimate match is missed.  If there is no match
then this bug does not cause an incorrect answer (I think).

I will fix this and continue hacking on it.

-- 
Chris