Announcing Very Fast Searching of ByteStrings
ChrisK
haskell at list.mightyreason.com
Fri Aug 17 09:49:28 EDT 2007
Attached is the Boyer-Moore algorithm implemented for strict and lazy
bytestrings (and combinations thereof). It finds all the overlapping instances
of the pattern inside the target.
I have performance tuned it. But the performance for searching a strict
bytestring is better then for a lazy bytestring (even if they only had a single
strict chunk), which almost certainly means I was not clever enough to get GHC
to produce the optimal code.
There is much more description in the module's haddock header.
Hopefully Don or other ByteString experts/maintainers can tweak this even further.
Also attaches is a Knuth-Morris-Pratt module which find non-overlapping
instances and is slightly slower on my benchmarks.
Happy Searching,
Chris Kuklewicz
-------------- next part --------------
{-# OPTIONS_GHC -fbang-patterns -O2 #-}
-- | BoyerMoore module for searching String and Lazy ByteStrings
--
-- Authors: Daniel Fischer <daniel.is.fischer `at` web.de>
-- Chris Kuklewicz <haskell `at` list.mightyreason.com>
--
-- License: BSD3
--
-- This module exports 4 search functions: 'bmMatchLL', 'bmMatchLS',
-- 'bmMatchSL', and 'bmMatchSS'.
--
-- The first parameter is always the pattern string. The second
-- parameter is always the target string to be searched. The returned
-- list is all the locations of overlapping patterns. A returned Int
-- or Int64 is an index into the target string which is aligned to the
-- head of the pattern string. Strict targets return Int indices and
-- lazy targets return Int64 indices. All returned lists are computed
-- and returned in a lazy fashion.
--
-- 'bmMatchLL' and 'bmMatchLS' take lazy bytestrings as patterns. For
-- performance, if the pattern is not a single strict chunk then all
-- the the pattern chunks will copied into a concatenated strict
-- bytestring. This limits the patterns to a length of (maxBound ::
-- Int).
--
-- 'bmMatchLL' and 'bmMatchSL' take lazy bytestrings are targets.
-- These are written so that while they work they will not retain a
-- reference to all the earlier parts of the the lazy bytestring.
-- This means the garbage collector would be able to keep only a small
-- amount of the target string and free the rest.
--
-- If given an empty pattern the search will always return an empty
-- list.
--
-- These can all be usefully curried. Given only a pattern the
-- curried version will compute the supporting lookup tables only
-- once, allowing for efficient re-use. Similarly, the curried
-- 'bmMatchLL' and 'bmMatchLS' will compute the concatenated pattern
-- only once.
--
-- Overflow warning: the current code uses Int to keep track of the
-- locations in the target string. If the length of the pattern plus
-- the length of any strict chunk of the target string is greater or
-- equal to (maxBound :: Int) then this will overflow causing an
-- error. I try to detect this and call 'error' before a segfault
-- occurs.
--
-- Performance: Operating on a strict target string is faster than a
-- lazy target string. It is unclear why the performance gap is as
-- large as it is (patches welcome). To slightly ameliorate this, if
-- the lazy string is a single chunk then a copy of the strict
-- algorithm is used.
--
-- Complexity: Preprocessing the pattern string is O(patternLength).
-- The search performance is O(targetLength / patternLength) in the
-- best case, allowing it to go faster than a Knuth-Morris-Pratt
-- algorithm. With a non-periodic pattern the worst case uses
-- (3*targetLength) comparisons. The periodic pattern worst case is
-- quadratic O(targetLength*patternLength) complexity. Improvements
-- (e.g. Turbo-Boyer-Moore) to catch and linearize worst case
-- performance slow down the loop significantly.
--
-- Descriptions of the algorithm can be found at
-- http://www-igm.univ-mlv.fr/~lecroq/string/node14.html#SECTION00140
-- and
-- http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm
module BoyerMoore ( bmMatchLL
, bmMatchLS
, bmMatchSL
, bmMatchSS
) where
import qualified Data.ByteString as S (ByteString,null,length,concat,unpack)
import qualified Data.ByteString.Lazy as L (ByteString,toChunks)
import qualified Data.ByteString.Base as B (unsafeIndex)
import qualified Data.ByteString.Char8 as SC (pack) -- used for testing
import Data.Array.Base (unsafeAt,unsafeRead,unsafeWrite)
import Data.Array.ST (newArray,newArray_,runSTUArray)
import Data.Array.IArray (array,accumArray,assocs)
import Data.Array.Unboxed (UArray)
import Data.Word (Word8)
import Data.Int (Int64)
{-# INLINE bmMatchLL #-}
bmMatchLL :: L.ByteString -> L.ByteString -> [Int64]
bmMatchLL pat = let search = bmMatchSSsd (S.concat (L.toChunks pat))
in search . L.toChunks
{-# INLINE bmMatchLS #-}
bmMatchLS :: L.ByteString -> S.ByteString -> [Int]
bmMatchLS pat = bmMatchSSd (S.concat (L.toChunks pat))
{-# INLINE bmMatchSL #-}
bmMatchSL :: S.ByteString -> L.ByteString -> [Int64]
bmMatchSL pat = let search = bmMatchSSsd pat
in search . L.toChunks
{-# INLINE bmMatchSS #-}
bmMatchSS :: S.ByteString -> S.ByteString -> [Int]
bmMatchSS pat = bmMatchSSd pat
bmMatchSSd :: S.ByteString -> S.ByteString -> [Int]
bmMatchSSd pat | S.null pat = const []
| otherwise =
let !patLen = S.length pat
!patEnd = pred patLen
!maxStrLen = maxBound - patLen
!occT = occurs pat -- used to compute bad-character shift
!suffT = suffShifts pat -- used to compute good-suffix shift
!skip = unsafeAt suffT 0 -- used after each matching position is found
-- 0 < skip <= patLen
{-# INLINE patAt #-}
patAt :: Int -> Word8
patAt !i = B.unsafeIndex pat i
searcher str | maxStrLen <= S.length str = error "Overflow error in BoyerMoore.bmMatchSSd"
| otherwise =
let !strLen = S.length str
!maxDiff = strLen-patLen
{-# INLINE strAt #-}
strAt :: Int -> Word8
strAt !i = B.unsafeIndex str i
findMatch !diff !patI =
case strAt (diff+patI) of
c | c==patAt patI -> if patI == 0
then diff :
let diff' = diff + skip
in if maxDiff < diff'
then []
else findMatch diff' patEnd
else findMatch diff (pred patI)
| otherwise -> let {-# INLINE badShift #-}
badShift = patI - unsafeAt occT (fromIntegral c)
-- (-patEnd) < badShift <= patLen
{-# INLINE goodShift #-}
goodShift = unsafeAt suffT patI
-- 0 < goodShift <= patLen
diff' = diff + max badShift goodShift
in if maxDiff < diff'
then []
else findMatch diff' patEnd
in if maxDiff < 0
then []
else findMatch 0 patEnd
in searcher
-- release is used to keep the zipper in bmMatchSSs from remembering
-- the leading part of the searched string. The deep parameter is the
-- number of characters that the past needs to hold. This ensures
-- lazy streaming consumption of the searched string.
{-# INLINE release #-}
release :: Int -> [S.ByteString] -> [S.ByteString]
release !deep _ | deep <= 0 = []
release !deep (!x:xs) = let !rest = release (deep-S.length x) xs in x : rest
release _ [] = error "BoyerMoore 'release' could not find enough past of length deep!"
bmMatchSSsd :: S.ByteString -> [S.ByteString] -> [Int64]
bmMatchSSsd pat | S.null pat = const []
| otherwise =
let !patLen = S.length pat
!patEnd = pred patLen
!longestStr = maxBound - patLen
!occT = occurs pat -- used to compute bad-character shift
!suffT = suffShifts pat -- used to compute good-suffix shift
!skip = unsafeAt suffT 0 -- used after each matching position is found
-- 0 < skip <= patLen
{-# INLINE patAt #-}
patAt :: Int -> Word8
patAt !i = B.unsafeIndex pat i
searcher string =
let -- seek is used to position the "zipper" of
-- (past,str,future) to the correct S.ByteString to search
-- with matcher. This is done by ensuring 0 <= strPos <
-- strLen where (strPos == diffPos+patPos). Note that
-- future is not a strict parameter. The character being
-- compared will then be (strAt strPos) and (patAt
-- patPos). Splitting this into specialized versions
-- seems like going too, and is only useful if pat is
-- close to (or larger than) the chunk size.
seek :: Int64 -> [S.ByteString] -> S.ByteString -> [S.ByteString] -> Int -> Int -> [Int64]
seek !prior !past !str future !diffPos !patPos | (diffPos+patPos) < 0 = {-# SCC "seek/past" #-}
case past of
[] -> error "seek back too far!"
(h:t) -> let hLen = S.length h
in seek (prior - fromIntegral hLen) t h (str:future) (diffPos + hLen) patPos
| strLen <= (diffPos+patPos) = {-# SCC "seek/future" #-}
case future of
[] -> []
(h:t) -> let {-# INLINE prior' #-}
prior' = prior + fromIntegral strLen
!diffPos' = diffPos - strLen
{-# INLINE past' #-}
past' = release (-diffPos') (str:past)
in if maxStrLen <= S.length h
then error "Overflow in BoyerMoore.bmMatchSSsd"
else seek prior' past' h t diffPos' patPos
| otherwise = {-# SCC "seek/str" #-}
-- matcher is the tight loop that walks backwards from the end
-- of the pattern checking for matching characters. The upper
-- bound of strLen is checked only when strI is shifted
-- upwards to strI'. The lower bound must be checked.
let matcher !diff !patI =
case strAt (diff+patI) of
c | c==patAt patI ->
if patI == 0
then prior + fromIntegral (diff+patI) :
let !diff' = (diff+patI) + skip -- Assert : diff < diff'
in if maxDiff < diff'
then seek prior past str future diff' patEnd
else if diff' < 0
then matcher diff' patEnd
else matcherF diff' patEnd
else if (diff+patI) == 0 -- diff < 0 means need to check underflow
then seek prior past str future diff (pred patI)
else matcher diff (pred patI)
| otherwise ->
let {-# INLINE badShift #-}
badShift = patI - unsafeAt occT (fromIntegral c)
-- (-patEnd) < badShift <= patLen
{-# INLINE goodShift #-}
goodShift = unsafeAt suffT patI
-- 0 < goodShift <= patLen
-- Assert : diff < diff'
!diff' = diff + max badShift goodShift
in if maxDiff < diff'
then seek prior past str future diff' patEnd
else if diff' < 0
then matcher diff' patEnd
else matcherF diff' patEnd
-- mathcherF only needs to check overflow since 0<=diff
matcherF !diff !patI =
case strAt (diff+patI) of
c | c==patAt patI ->
if patI == 0
then prior + fromIntegral (diff+patI) :
let !diff' = (diff+patI) + skip -- Assert : diff < diff'
in if maxDiff < diff'
then seek prior past str future diff' patEnd
else matcherF diff' patEnd
else matcherF diff (pred patI) -- 0 <= diff means no need to check underflow
| otherwise ->
let {-# INLINE badShift #-}
badShift = patI - unsafeAt occT (fromIntegral c)
-- (-patEnd) < badShift <= patLen
{-# INLINE goodShift #-}
goodShift = unsafeAt suffT patI
-- 0 < goodShift <= patLen
-- Assert : diff < diff'
!diff' = diff + max badShift goodShift
in if maxDiff < diff'
then seek prior past str future diff' patEnd
else matcherF diff' patEnd
in if diffPos < 0
then matcher diffPos patPos
else matcherF diffPos patPos
where !strLen = S.length str
!maxDiff = strLen - patLen
!maxStrLen = pred ((maxBound::Int) - patLen)
{-# INLINE strAt #-}
strAt :: Int -> Word8
strAt !i = B.unsafeIndex str i
in case string of
[] -> []
[str] -> -- Steal the quick findMatch from bmMatchSSd for this case:
let findMatch !diff !patI =
case strAt (diff+patI) of
c | c==patAt patI -> if patI == 0
then fromIntegral diff :
let diff' = diff + skip
in if maxDiff < diff'
then []
else findMatch diff' patEnd
else findMatch diff (pred patI)
| otherwise -> let {-# INLINE badShift #-}
badShift = patI - unsafeAt occT (fromIntegral c)
-- (-patEnd) < badShift <= patLen
{-# INLINE goodShift #-}
goodShift = unsafeAt suffT patI
-- 0 < goodShift <= patLen
diff' = diff + max badShift goodShift
in if maxDiff < diff'
then []
else findMatch diff' patEnd
!strLen = S.length str
!maxDiff = strLen - patLen
!maxStrLen = ((maxBound::Int) - patLen)
{-# INLINE strAt #-}
strAt :: Int -> Word8
strAt !i = B.unsafeIndex str i
in if maxStrLen <= strLen
then error "Overflow in BoyerMoore.bmMatchSSsd"
else findMatch 0 patEnd
(str:future) -> if ((maxBound::Int) - patLen) <= S.length str
then error "Overflow in BoyerMoore.bmMatchSSsd"
else seek 0 [] str future 0 patEnd
in searcher
{- Format of bad character table generated by occurs:
Index is good for Word8 / ASCII searching only.
The last character (at the last index) in pat is ignored.
Excluding that last element, the value is largest index of occurances of that Word8 in the pat.
The default value for Word8's not in the pattern is (-1).
Range of values: -1 <= value < length of pattern
-}
{-# INLINE occurs #-}
occurs :: S.ByteString -> UArray Word8 Int
occurs !pat | patEnd < 0 = emptyOccurs
| otherwise = runSTUArray
(do ar <- newArray (minBound,maxBound) (-1)
let loop !i | i == patEnd = return ar
| otherwise = do unsafeWrite ar (fromEnum $ pat `B.unsafeIndex` i) i
loop (succ i)
loop 0)
where
!patEnd = pred (S.length pat)
emptyOccurs :: UArray Word8 Int
emptyOccurs = accumArray const (-1) (minBound,maxBound) []
{- Non ST variants of occurs
occurs' :: S.ByteString -> UArray Word8 Int
occurs' !pat = accumArray (flip const) (-1) (0,255)
[ (pat `B.unsafeIndex` i, i) | i <- [0..pred (S.length pat)] ]
occurs'' :: S.ByteString -> UArray Word8 Int
occurs'' !pat = accumArray (flip const) (-1) (minBound,maxBound) $ zip (init $ S.unpack pat) [0..]
-}
{-
suffLengths uses a ST array to allow for strict querying of previously
filled in values durring the fill loops.
Format for suffLengths array:
Valid index range is the same as for the pat.
The value at index k is used when there is a mismatch at index k in
pat after checking that all indices j where j > k correctly match.
For all indices consider the prefix of pat that ends with the
character at that index. Now the value of suffLength is the number of
character at the end of this prefix that are identical to the end of
pat.
By the above definition, the last index has the length of the pattern
as its value, since the whole pattern is compared to itself and the
overlap is always the whole pattern length. And the maximum value at
index k is (k+1).
This value itself is a non-negative integer less than the length of
pat except for the last index, where the value is the length of pat.
For most positions the value will be 0. Aside from the at the last
index the value can be non-zero only at indices where the last
character of the pat occurs earlier in pat.
-}
{-# INLINE suffLengths #-}
suffLengths :: S.ByteString -> UArray Int Int
suffLengths !pat | 0==patLen = array (0,-1) []
| otherwise = runSTUArray
(do ar <- newArray_ (0,patEnd)
unsafeWrite ar patEnd patLen
let {-# INLINE matchSuffix #-}
matchSuffix !idx !from = do
let !d = patEnd - idx
helper !i | i < 0 || (pat `B.unsafeIndex` i) /= (pat `B.unsafeIndex` (i+d)) = i
| otherwise = helper (pred i)
pre' = helper from
unsafeWrite ar idx (idx-pre')
idxLoop (pred idx) pre' start
idxLoop !idx !pre !end
| idx < 0 = return ar
| pre < idx = do matching <- unsafeRead ar end -- try and reuse old result
if pre + matching < idx -- check if old matching length is too long for current idx
then do unsafeWrite ar idx matching
idxLoop (pred idx) pre (pred end)
else matchSuffix idx pre
| otherwise = matchSuffix idx idx
idxLoop start start start) -- the third argument, the initial value of "end", is never used and does not matter.
where
!patLen = S.length pat
!patEnd = pred patLen
!start = pred patEnd
{- Format for suffShifts:
The valid index range is the same as for pat.
The index k is used when there is a mismatch at pat index k and all
indices j where j > k have matched.
The value is the smallest number of characters one can advance the
pattern such that there the shifted pattern agrees at the already
checked positions j>k.
Thus the value range is : 0 < value <= length of pattern
-}
{-# INLINE suffShifts #-}
suffShifts :: S.ByteString -> UArray Int Int
suffShifts !pat | patLen == 0 = array (0,-1) []
| otherwise = runSTUArray
(do ar <- newArray (0,patEnd) patLen
let preShift !idx !j -- idx counts down and j starts at 0 and is non-decreasing
| idx < 0 = return ()
| suff `unsafeAt` idx == idx+1 =
do let !shf = patEnd - idx
fill_to_shf !i | i==shf = return ()
| otherwise = do unsafeWrite ar i shf
fill_to_shf (succ i)
fill_to_shf j
preShift (pred idx) shf
| otherwise = preShift (pred idx) j
sufShift !idx
| idx == patEnd = return ar
| otherwise = do unsafeWrite ar (patEnd - (suff `unsafeAt` idx)) (patEnd - idx)
sufShift (succ idx)
preShift start 0
sufShift 0)
where
!patLen = S.length pat
!patEnd = pred patLen
!start = pred patEnd
!suff = suffLengths pat
{- TESTING SECTION for suffLengths suffShifts occurs -}
testPats =
[ "ANPANMAN"
, "A"
, "AA"
, "AAA"
, "AAAA"
, "AAABBB"
, "BBBAAABBB"
, "ABC"
, "AB"
, "ABCD"
, "ABCDABCD"
, "DCBAABCD"
, "GCAGAGAG"
, "AGAGAG"
, "GAGAG"
, "AGAG"
, "GAG"
, ""
]
rightLens =
[ [(0,0),(1,2),(2,0),(3,0),(4,2),(5,0),(6,0),(7,8)]
, [(0,1)]
, [(0,1),(1,2)]
, [(0,1),(1,2),(2,3)]
, [(0,1),(1,2),(2,3),(3,4)]
, [(0,0),(1,0),(2,0),(3,1),(4,2),(5,6)]
, [(0,1),(1,2),(2,3),(3,0),(4,0),(5,0),(6,1),(7,2),(8,9)]
, [(0,0),(1,0),(2,3)]
, [(0,0),(1,2)]
, [(0,0),(1,0),(2,0),(3,4)]
, [(0,0),(1,0),(2,0),(3,4),(4,0),(5,0),(6,0),(7,8)]
, [(0,1),(1,0),(2,0),(3,0),(4,0),(5,0),(6,0),(7,8)]
, [(0,1),(1,0),(2,0),(3,2),(4,0),(5,4),(6,0),(7,8)]
, [(0,0),(1,2),(2,0),(3,4),(4,0),(5,6)]
, [(0,1),(1,0),(2,3),(3,0),(4,5)]
, [(0,0),(1,2),(2,0),(3,4)]
, [(0,1),(1,0),(2,3)]
, []
]
rightSuffs =
[ [(0,6),(1,6),(2,6),(3,6),(4,6),(5,3),(6,8),(7,1)]
, [(0,1)]
, [(0,1),(1,2)]
, [(0,1),(1,2),(2,3)]
, [(0,1),(1,2),(2,3),(3,4)]
, [(0,6),(1,6),(2,6),(3,1),(4,2),(5,3)]
, [(0,6),(1,6),(2,6),(3,6),(4,6),(5,6),(6,1),(7,2),(8,3)]
, [(0,3),(1,3),(2,1)]
, [(0,2),(1,1)]
, [(0,4),(1,4),(2,4),(3,1)]
, [(0,4),(1,4),(2,4),(3,4),(4,8),(5,8),(6,8),(7,1)]
, [(0,7),(1,7),(2,7),(3,7),(4,7),(5,7),(6,7),(7,1)]
, [(0,7),(1,7),(2,7),(3,2),(4,7),(5,4),(6,7),(7,1)]
, [(0,2),(1,2),(2,4),(3,4),(4,6),(5,1)]
, [(0,2),(1,2),(2,4),(3,4),(4,1)]
, [(0,2),(1,2),(2,4),(3,1)]
, [(0,2),(1,2),(2,1)]
, []
]
prop_occurs :: String -> Bool
prop_occurs [] = occurs (SC.pack []) == accumArray (flip const) (-1) (minBound,maxBound) []
prop_occurs x = let s = SC.pack x
in occurs s == occurs' s
where occurs' :: S.ByteString -> UArray Word8 Int
occurs' !pat = accumArray (flip const) (-1) (minBound,maxBound) $ zip (init $ S.unpack pat) [0..]
testOccurs = all prop_occurs testPats
testLens = rightLens == map (assocs . suffLengths . SC.pack) testPats
testSuffs = rightSuffs == map (assocs . suffShifts . SC.pack) testPats
-------------- next part --------------
{-# OPTIONS_GHC -fbang-patterns -O2 #-}
{-|
Code by Justin Bailey (jgbailey at gmail.com) and
Chris Kuklewicz (haskell at list.mightyreason.com).
License: BSD3
Execute 'allTests' to run quickcheck and regression tests.
This finds non-overlapping patterns with a Knuth-Morris-Pratt
algorithm.
-}
module KMPSeq ( kmpMatchLL
, kmpMatchLS
, kmpMatchSS
, kmpMatchSL) where
import qualified Data.Array.Base as Base (unsafeAt)
import qualified Data.Array.Unboxed as Unboxed (UArray)
import qualified Data.Array.IArray as IArray (array)
import Data.List as List (map, filter, length, null, take, maximum, foldr, all, drop)
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString as S
import qualified Data.ByteString.Base as B (unsafeHead,unsafeTail,unsafeDrop,unsafeIndex)
import GHC.Int (Int64)
import Test.QuickCheck
import Debug.Trace (trace)
{-|
Returns list of indices for a given substring in a search string, or the empty
list if none were found.
Uses the Knuth-Morris-Pratt fast string matching algorithm.
-}
{-# INLINE kmpMatchLL #-}
kmpMatchLL :: L.ByteString -- ^ Pattern to search for.
-> L.ByteString -- ^ String to search.
-> [Int64] -- ^ List of indices where string was found.
kmpMatchLL pat = let search = kmpMatchSSs' (S.concat (L.toChunks pat)) in search . L.toChunks
{-|
Returns list of indices for a given substring in a search string, or the empty
list if none were found.
Uses the Knuth-Morris-Pratt fast string matching algorithm.
-}
{-# INLINE kmpMatchLS #-}
kmpMatchLS :: L.ByteString -- ^ Pattern to search for.
-> S.ByteString -- ^ String to search.
-> [Int64] -- ^ List of indices where string was found.
kmpMatchLS pat = let search = kmpMatchSSs' (S.concat (L.toChunks pat)) in search . (:[])
{-|
Returns list of indices for a given substring in a search string, or the empty
list if none were found.
Uses the Knuth-Morris-Pratt fast string matching algorithm.
-}
{-# INLINE kmpMatchSS #-}
kmpMatchSS :: S.ByteString -- ^ Pattern to search for.
-> S.ByteString -- ^ String to search.
-> [Int64] -- ^ List of indices where string was found.
kmpMatchSS pat = let search = kmpMatchSSs' pat in search . (:[])
{-|
Returns list of indices for a given substring in a search string, or the empty
list if none were found.
Uses the Knuth-Morris-Pratt fast string matching algorithm.
-}
{-# INLINE kmpMatchSL #-}
kmpMatchSL :: S.ByteString -- ^ Pattern to search for.
-> L.ByteString -- ^ String to search.
-> [Int64] -- ^ List of indices where string was found.
kmpMatchSL pat = let search = kmpMatchSSs' pat in search . L.toChunks
kmpMatchSSs' :: S.ByteString -> [S.ByteString] -> [Int64]
kmpMatchSSs' pat | S.null pat = const []
| otherwise =
let !patLen = S.length pat -- Evaluate S.length once;
!lookupTable = computeLookup pat -- lower bound of UArray must be 0 for Base.unsafeAt, but index 0 will never be looked up
searcher :: Int64 -> Int -> [S.ByteString] -> [Int64]
searcher _ _ [] = []
searcher !prior !patStart (!str:strRest) =
let !strLen = S.length str -- Evaluate S.length once;
findMatch :: Int -> Int -> [Int64]
findMatch !strIndex !patIndex | patIndex == patLen = (prior + fromIntegral strIndex - fromIntegral patLen) : findMatch strIndex 0
| strIndex == strLen = searcher (prior + fromIntegral strLen) patIndex strRest
| otherwise =
if (B.unsafeIndex str strIndex) == (B.unsafeIndex pat patIndex)
then findMatch (succ strIndex) (succ patIndex)
else if patIndex == 0
then findMatch (succ strIndex) 0
else findMatch strIndex (Base.unsafeAt lookupTable patIndex) -- here 1 <= patIndex <= patLen-1
in
findMatch 0 patStart
in searcher 0 0
{-|
Given our pattern, get all the prefixes of the pattern. For each of those
prefixes, find the longest prefix from the original pattern that is also a
suffix of the prefix segment being considered, and is not equal to it. The
argument given to overlap is the length of the prefix matched so far, and the
length of the longest prefix, which is a suffix and is not equal to it, is the
value overlap returns.
If a given prefix has no possible overlap, it is mapped to -1.
-}
overlap :: S.ByteString -> [(Int, Int)]
overlap pat =
let patternLength = S.length pat
-- Given an index into the pattern (representing a substring), find the longest prefix of
-- the pattern which is a suffix of the substring given, without being
-- equal to it.
--
-- patIdx represents the index of the last character in the prefix, not the
-- character after it. Therefore, compare the pattern starting at the first
-- character of the prefix, not the zeroth.
longestSuffix !patIdx =
let longestSuffix' !shiftPrefix !prefixIdx
| shiftPrefix == patIdx = 0 -- No match
| shiftPrefix + prefixIdx == patIdx = prefixIdx -- Suffix found.
-- Compare pattern to itself, but shifted, here.
| B.unsafeIndex pat (shiftPrefix + prefixIdx) == B.unsafeIndex pat prefixIdx = longestSuffix' shiftPrefix (prefixIdx + 1)
| otherwise = longestSuffix' (shiftPrefix + 1) 0
in
longestSuffix' 1 0
in
(0, 0) : [(matchLen, longestSuffix matchLen) | matchLen <- [1 .. patternLength - 1]]
-- List.map (\prefix -> (fromIntegral $ S.length prefix, fromIntegral $ longestPreSuffix prefix)) prefixes
{-|
Given a string representing a search pattern, this function
returns a function which represents, for each prefix of that
pattern, the maximally long prefix of the pattern which is a suffix
of the indicated pattern segment.
If there is no such prefix, 0 is returned.
-}
computeLookup :: S.ByteString -> Unboxed.UArray Int Int
computeLookup pat =
let patLen = fromIntegral $ S.length pat
table :: Unboxed.UArray Int Int
table = {-# SCC "computeLookup_table" #-} IArray.array (0, patLen - 1) (overlap pat)
in table
-- Types, instances and utility functions for testing purposes.
newtype PatternChar = PatternChar Char
deriving Show
instance Arbitrary PatternChar where
arbitrary = oneof (List.map (return . PatternChar) ['a', 'b', 'c', 'd'])
coarbitrary = undefined
-- Holds the search pattern, the search string, and the expected
-- position. Used to test for strings that have failed in the past.
data Regressions = R String String Int64
deriving Show
instance Arbitrary Regressions where
arbitrary =
oneof $ map return
[R "ccb" "abcdcccb" 5,
R "bbaa" "bdcdbbbbaa" 6,
R "cccadadc" "adaccccadadc" 4,
R "bbdbbdaabdb" "dbbdbbbdbbdaabdb" 5,
R "bbdbbd" "dbbdbbbdbbd" 5,
R "ccbb" "acccbb" 2,
R "bbbb" "dddcaaaddaaabdacbcccabbada" (-1), -- This string has caused an infinite loop.
R "bbc" "bbbbc" 2,
R "" "" (-1)]
coarbitrary = undefined
patternsToString :: [PatternChar] -> L.ByteString
patternsToString chars = L.pack $ List.foldr (\(PatternChar char) str -> (toEnum $ fromEnum char) : str) [] chars
patternsToStrictString :: [PatternChar] -> S.ByteString
patternsToStrictString chars = S.pack $ List.foldr (\(PatternChar char) str -> (toEnum $ fromEnum char) : str) [] chars
regressionsToLazy :: Regressions -> (L.ByteString, L.ByteString, Int64)
regressionsToLazy (R pat srch idx) = (toLazyBS pat, toLazyBS srch, idx)
toLazyBS :: String -> L.ByteString
toLazyBS = L.pack . List.map (toEnum . fromEnum)
toStrictBS :: String -> S.ByteString
toStrictBS = S.pack . List.map (toEnum . fromEnum)
-- Test that 0 and 1 element always return 0, if present.
prop_testZero :: [PatternChar] -> Property
prop_testZero pat =
let table = computeLookup (patternsToStrictString pat)
in
not (List.null pat) ==>
if List.length pat > 1
then Base.unsafeAt table 0 == 0 && Base.unsafeAt table 1 == 0
else Base.unsafeAt table 0 == 0
-- Test that all overlaps found are actually prefixes of the
-- pattern string
prop_testSubset :: [PatternChar] -> Property
prop_testSubset pat =
let patStr = patternsToString pat
table = computeLookup (patternsToStrictString pat)
prefix len = L.take (fromIntegral len)
testPrefix len =
if Base.unsafeAt table len == 0
then True
else (prefix (Base.unsafeAt table len) patStr) `L.isPrefixOf` (prefix len patStr)
in
not (List.null pat) ==>
trivial (L.null patStr) $
List.all testPrefix [0 .. List.length pat - 1]
-- Test that the prefix given is the maximal prefix. That is,
-- add one more character makes it either equal to the string
-- or not a prefix.
prop_testCorrectPrefix :: [PatternChar] -> Property
prop_testCorrectPrefix pat =
let patStr = patternsToString pat
table = computeLookup (patternsToStrictString pat)
isNeverSuffix len =
let origPrefix = prefix len patStr
-- Drop 1 to remove empty list
allPrefixes = List.drop 1 $ L.inits origPrefix
in
List.all (\p -> L.null p || p == origPrefix || not ((L.reverse p) `L.isPrefixOf` (L.reverse origPrefix))) allPrefixes
prefix len = L.take (fromIntegral len)
-- True if the prefix returned from table for the length given produces
-- a string which is a suffix of the original prefix.
isRealSuffix len = (L.reverse (prefix (Base.unsafeAt table len) patStr)) `L.isPrefixOf` (L.reverse $ prefix len patStr)
isLongestSuffix len =
let prefixPlus = prefix (Base.unsafeAt table len + 1) patStr
inputPrefix = prefix len patStr
in
prefixPlus == inputPrefix ||
not ((L.reverse prefixPlus) `L.isPrefixOf` (L.reverse inputPrefix))
testTable len =
if Base.unsafeAt table len == 0
then isNeverSuffix len
else isRealSuffix len &&
isLongestSuffix len
in
not (List.null pat) ==>
List.all testTable [0 .. List.length pat - 1]
-- Verify that, if a match is found, it is where it's supposed to be in
-- the string and it can be independently found by other means.
prop_testMatch :: [PatternChar] -> [PatternChar] -> Property
prop_testMatch pat search =
let patStr = patternsToString pat
searchStr = patternsToString search
strictStr = patternsToStrictString search
strictPat = patternsToStrictString pat
patLen = L.length patStr
searchLen = L.length searchStr
matches = kmpMatchLL patStr searchStr
verify matchIdx =
if matchIdx > -1
then (L.take patLen $ L.drop matchIdx $ searchStr) == patStr && strictPat `S.isSubstringOf` strictStr
else (L.null patStr && L.null searchStr) || not (strictPat `S.isSubstringOf` strictStr)
in
not (List.null pat) ==>
trivial (L.null searchStr) $
classify (patLen > searchLen) "Bigger pattern than search" $
classify (patLen < searchLen) "Smaller pattern than search" $
classify (patLen == searchLen) "Equal pattern and search" $
classify (not (null matches)) "Match Exists" $
classify (null matches) "Match Doesn't Exist" $
all verify matches
-- Test a pattern that was known to fail.
prop_testBadPat :: [PatternChar] -> Property
prop_testBadPat search =
let patStr = toLazyBS "bbc"
patLen = L.length patStr
searchStr = patternsToString search
matches = kmpMatchLL patStr searchStr
strictStr = patternsToStrictString search
strictPat = toStrictBS "bbc"
verify matchIdx =
if matchIdx > -1
then (L.take patLen $ L.drop matchIdx $ searchStr) == patStr && strictPat `S.isSubstringOf` strictStr
else (L.null patStr && L.null searchStr) || not (strictPat `S.isSubstringOf` strictStr)
in
trivial (List.null search) $
all verify matches
-- Test that a pattern on the end of the string is found OK.
prop_testEndPattern :: [PatternChar] -> [PatternChar] -> Property
prop_testEndPattern pat search =
let patStr = patternsToString pat
searchStr = patternsToString (search ++ pat)
matches = kmpMatchLL patStr searchStr
strictStr = patternsToStrictString (search ++ pat)
strictPat = patternsToStrictString pat
patLen = L.length patStr
verify matchIdx = (L.take patLen $ L.drop matchIdx $ searchStr) == patStr && strictPat `S.isSubstringOf` strictStr
in
not (List.null pat) ==> all verify matches
prop_testRegressions :: Regressions -> Bool
prop_testRegressions r =
let (pat, srch, idx) = regressionsToLazy r
matches = kmpMatchLL pat srch
in
if null matches
then idx == -1
else (head matches) == idx
props1 = [ prop_testZero
, prop_testSubset
, prop_testCorrectPrefix
, prop_testBadPat
]
props2 = [ prop_testMatch
, prop_testEndPattern
]
allTests = do
mapM_ quickCheck props1
mapM_ quickCheck props2
quickCheck prop_testRegressions
More information about the Libraries
mailing list