[Haskell-cafe] Substring replacements

Thu Dec 15 15:07:11 EST 2005

Am Donnerstag, 15. Dezember 2005 02:39 schrieben Sie:
> From: "Branimir Maksimovic" <bmaxa at hotmail.com>
>
> >To: daniel.is.fischer at web.de
> >CC: Haskell-Cafe at haskell.org
> >Subject: Re: [Haskell-cafe] Substring replacements
> >Date: Thu, 15 Dec 2005 00:55:02 +0000
> >
> >>From: Daniel Fischer <daniel.is.fischer at web.de>
> >>To: "Branimir Maksimovic" <bmaxa at hotmail.com>
> >>CC: Haskell-Cafe at haskell.org
> >>Subject: Re: [Haskell-cafe] Substring replacements
> >>Date: Wed, 14 Dec 2005 20:40:06 +0100
> >>
> >>Hi Bane,
> >>
> >>nice algorithm. Since comparing chars _is_ cheap, it is to be expected
> >>that
> >>all the hash-rotating is far more costly for short search patterns. The
> >>longer the pattern, the better this gets, I think -- though nowhere near
> >>KMP
> >>(or would it?).
> >
> >Yes,KMP is superior in single pattern search. We didn't tried Boyer-Moore
> >algorithm yet, though. But I think it would be
> >difficult to implement it in Haskell efficiently as it searches backwards
> >and jumps around, and we want memory savings.
> >Though, I even didn't tried yet, but it is certainly very interesting.
>
> Forget what I've said.
> Boyer-Moore *can* be implemented efficiently, it is similar to KMP it goes
> forward, but when it finds last character in pattern, than starts to search
> backwards.
> This can be implemented easilly as Haskell lists naturaly reverse order
> when putting from one list to other.
> Heh, never say never :)
> As I see from documents Boyer-Moore has best performance on average
> and should be better than KMP.
>
> Greetings,Bane.
>
Well, I also thought that all the jumping around in Boyer-Moore wasn't too 
good (after each shift we must bite off a chunk from the remaining input, 
pushing that onto the stack, which costs something). But I gave it a try 
today and here's what I came up with:

import Data.List (tails)
import Data.Map (Map)
import qualified Data.Map as Map
import Data.Array.Unboxed

searchRep :: String -> String -> String -> String
searchRep src rp str = run (reverse $ take len1 str) $ drop len1 str
    where
      len = length src
      len1 = len-1
      pat :: UArray Int Char
      pat = listArray (0,len1) src
      ch = pat!len1
      badChar :: Map Char Int
      badChar = Map.fromList $ zip src [0 .. ]
      getBc c = case Map.lookup c badChar of
                   Just n  -> n
		   Nothing -> -1
      suffs :: UArray Int Int
      suffs = listArray (0,len1) $! init $! map (pr 0 crs) $! tails crs
              where
	        crs = reverse src
		pr n (x:xs) (y:ys) | x == y = pr (n+1) xs ys
		pr n _ _ = n
      bmGs0 :: UArray Int Int
      bmGs0 = array (0,len1) [(j,k) | (k,k') <- zip (tail $! help) help, j <- 
[k' .. k-1]]
      help = [k | k <- [0 .. len], k == len || suffs!k == len-k]
      bmGs :: UArray Int Int
      bmGs = bmGs0 // [(len1-suffs!k,k) | k <- [len1,len-2 .. 1]]
      run by "" = reverse by
      run by (c:cs)
        | c == ch   = process (c:by) cs
	| otherwise = run (c:by) cs
      roll n xs ys | n <= 0 = (xs, ys)
      roll n xs (y:ys) = roll (n-1) (y:xs) ys
      roll _ xs "" = (xs, "")
      walk n "" = (n,"")
      walk n st@(c:cs)
        | n < 0      = (n,st)
	| c == pat!n = walk (n-1) cs
	| otherwise  = (n,st)
      process con left
        | i < 0     = reverse pass ++ rp ++ run "" left
	| otherwise = {- bye ++ -} run ncon nleft
	  where
	     (i,pass) = walk len1 con
	     d = if null pass then i+1 else max (bmGs!i) (i - getBc (head pass))
	     -- bye = reverse $! drop (len-d) con
	     (ncon,nleft) = roll (d-1) {- (take (len-d) con) -} con left

it's not as fast as KMP for the tests, but not too bad.
Commenting out 'bye' gives a bit of extra speed, but if it's _long_ before a 
match (if any), we'd be better off relieving our memory with 'bye', I think.

Any improvements are welcome, certainly some of you can do much better.

Cheers,
Daniel

P.S. As an algorithm, I prefer KMP, it's quite elegant whereas BM is somewhat 
fussy.