[Haskell-cafe] Substring replacements
Daniel Fischer
daniel.is.fischer at web.de
Thu Dec 15 15:07:11 EST 2005
Am Donnerstag, 15. Dezember 2005 02:39 schrieben Sie:
> From: "Branimir Maksimovic" <bmaxa at hotmail.com>
>
> >To: daniel.is.fischer at web.de
> >CC: Haskell-Cafe at haskell.org
> >Subject: Re: [Haskell-cafe] Substring replacements
> >Date: Thu, 15 Dec 2005 00:55:02 +0000
> >
> >>From: Daniel Fischer <daniel.is.fischer at web.de>
> >>To: "Branimir Maksimovic" <bmaxa at hotmail.com>
> >>CC: Haskell-Cafe at haskell.org
> >>Subject: Re: [Haskell-cafe] Substring replacements
> >>Date: Wed, 14 Dec 2005 20:40:06 +0100
> >>
> >>Hi Bane,
> >>
> >>nice algorithm. Since comparing chars _is_ cheap, it is to be expected
> >>that
> >>all the hash-rotating is far more costly for short search patterns. The
> >>longer the pattern, the better this gets, I think -- though nowhere near
> >>KMP
> >>(or would it?).
> >
> >Yes,KMP is superior in single pattern search. We didn't tried Boyer-Moore
> >algorithm yet, though. But I think it would be
> >difficult to implement it in Haskell efficiently as it searches backwards
> >and jumps around, and we want memory savings.
> >Though, I even didn't tried yet, but it is certainly very interesting.
>
> Forget what I've said.
> Boyer-Moore *can* be implemented efficiently, it is similar to KMP it goes
> forward, but when it finds last character in pattern, than starts to search
> backwards.
> This can be implemented easilly as Haskell lists naturaly reverse order
> when putting from one list to other.
> Heh, never say never :)
> As I see from documents Boyer-Moore has best performance on average
> and should be better than KMP.
>
> Greetings,Bane.
>
Well, I also thought that all the jumping around in Boyer-Moore wasn't too
good (after each shift we must bite off a chunk from the remaining input,
pushing that onto the stack, which costs something). But I gave it a try
today and here's what I came up with:
import Data.List (tails)
import Data.Map (Map)
import qualified Data.Map as Map
import Data.Array.Unboxed
searchRep :: String -> String -> String -> String
searchRep src rp str = run (reverse $ take len1 str) $ drop len1 str
where
len = length src
len1 = len-1
pat :: UArray Int Char
pat = listArray (0,len1) src
ch = pat!len1
badChar :: Map Char Int
badChar = Map.fromList $ zip src [0 .. ]
getBc c = case Map.lookup c badChar of
Just n -> n
Nothing -> -1
suffs :: UArray Int Int
suffs = listArray (0,len1) $! init $! map (pr 0 crs) $! tails crs
where
crs = reverse src
pr n (x:xs) (y:ys) | x == y = pr (n+1) xs ys
pr n _ _ = n
bmGs0 :: UArray Int Int
bmGs0 = array (0,len1) [(j,k) | (k,k') <- zip (tail $! help) help, j <-
[k' .. k-1]]
help = [k | k <- [0 .. len], k == len || suffs!k == len-k]
bmGs :: UArray Int Int
bmGs = bmGs0 // [(len1-suffs!k,k) | k <- [len1,len-2 .. 1]]
run by "" = reverse by
run by (c:cs)
| c == ch = process (c:by) cs
| otherwise = run (c:by) cs
roll n xs ys | n <= 0 = (xs, ys)
roll n xs (y:ys) = roll (n-1) (y:xs) ys
roll _ xs "" = (xs, "")
walk n "" = (n,"")
walk n st@(c:cs)
| n < 0 = (n,st)
| c == pat!n = walk (n-1) cs
| otherwise = (n,st)
process con left
| i < 0 = reverse pass ++ rp ++ run "" left
| otherwise = {- bye ++ -} run ncon nleft
where
(i,pass) = walk len1 con
d = if null pass then i+1 else max (bmGs!i) (i - getBc (head pass))
-- bye = reverse $! drop (len-d) con
(ncon,nleft) = roll (d-1) {- (take (len-d) con) -} con left
it's not as fast as KMP for the tests, but not too bad.
Commenting out 'bye' gives a bit of extra speed, but if it's _long_ before a
match (if any), we'd be better off relieving our memory with 'bye', I think.
Any improvements are welcome, certainly some of you can do much better.
Cheers,
Daniel
P.S. As an algorithm, I prefer KMP, it's quite elegant whereas BM is somewhat
fussy.
More information about the Haskell-Cafe
mailing list