[Haskell-cafe] Substring replacements

Thu Dec 15 21:36:47 EST 2005

>From: Daniel Fischer <daniel.is.fischer at web.de>
>To: "Branimir Maksimovic" <bmaxa at hotmail.com>
>CC: Haskell-Cafe at haskell.org
>Subject: Re: [Haskell-cafe] Substring replacements
>Date: Thu, 15 Dec 2005 21:07:11 +0100
>
>Am Donnerstag, 15. Dezember 2005 02:39 schrieben Sie:
> > From: "Branimir Maksimovic" <bmaxa at hotmail.com>
> >
> > >To: daniel.is.fischer at web.de
> > >CC: Haskell-Cafe at haskell.org
> > >Subject: Re: [Haskell-cafe] Substring replacements
> > >Date: Thu, 15 Dec 2005 00:55:02 +0000
> > >
> > >>From: Daniel Fischer <daniel.is.fischer at web.de>
> > >>To: "Branimir Maksimovic" <bmaxa at hotmail.com>
> > >>CC: Haskell-Cafe at haskell.org
> > >>Subject: Re: [Haskell-cafe] Substring replacements
> > >>Date: Wed, 14 Dec 2005 20:40:06 +0100
> > >>
> > >>Hi Bane,
> > >>
> > >>nice algorithm. Since comparing chars _is_ cheap, it is to be expected
> > >>that
> > >>all the hash-rotating is far more costly for short search patterns. 
>The
> > >>longer the pattern, the better this gets, I think -- though nowhere 
>near
> > >>KMP
> > >>(or would it?).
> > >
> > >Yes,KMP is superior in single pattern search. We didn't tried 
>Boyer-Moore
> > >algorithm yet, though. But I think it would be
> > >difficult to implement it in Haskell efficiently as it searches 
>backwards
> > >and jumps around, and we want memory savings.
> > >Though, I even didn't tried yet, but it is certainly very interesting.
> >
> > Forget what I've said.
> > Boyer-Moore *can* be implemented efficiently, it is similar to KMP it 
>goes
> > forward, but when it finds last character in pattern, than starts to 
>search
> > backwards.
> > This can be implemented easilly as Haskell lists naturaly reverse order
> > when putting from one list to other.
> > Heh, never say never :)
> > As I see from documents Boyer-Moore has best performance on average
> > and should be better than KMP.
> >
> > Greetings,Bane.
> >
>Well, I also thought that all the jumping around in Boyer-Moore wasn't too
>good (after each shift we must bite off a chunk from the remaining input,
>pushing that onto the stack, which costs something). But I gave it a try
>today and here's what I came up with:
>
>import Data.List (tails)
>import Data.Map (Map)
>import qualified Data.Map as Map
>import Data.Array.Unboxed
>
>searchRep :: String -> String -> String -> String
>searchRep src rp str = run (reverse $ take len1 str) $ drop len1 str
>     where
>       len = length src
>       len1 = len-1
>       pat :: UArray Int Char
>       pat = listArray (0,len1) src
>       ch = pat!len1
>       badChar :: Map Char Int
>       badChar = Map.fromList $ zip src [0 .. ]
>       getBc c = case Map.lookup c badChar of
>                    Just n  -> n
>		   Nothing -> -1
>       suffs :: UArray Int Int
>       suffs = listArray (0,len1) $! init $! map (pr 0 crs) $! tails crs
>               where
>	        crs = reverse src
>		pr n (x:xs) (y:ys) | x == y = pr (n+1) xs ys
>		pr n _ _ = n
>       bmGs0 :: UArray Int Int
>       bmGs0 = array (0,len1) [(j,k) | (k,k') <- zip (tail $! help) help, j 
><-
>[k' .. k-1]]
>       help = [k | k <- [0 .. len], k == len || suffs!k == len-k]
>       bmGs :: UArray Int Int
>       bmGs = bmGs0 // [(len1-suffs!k,k) | k <- [len1,len-2 .. 1]]
>       run by "" = reverse by
>       run by (c:cs)
>         | c == ch   = process (c:by) cs
>	| otherwise = run (c:by) cs
>       roll n xs ys | n <= 0 = (xs, ys)
>       roll n xs (y:ys) = roll (n-1) (y:xs) ys
>       roll _ xs "" = (xs, "")
>       walk n "" = (n,"")
>       walk n st@(c:cs)
>         | n < 0      = (n,st)
>	| c == pat!n = walk (n-1) cs
>	| otherwise  = (n,st)
>       process con left
>         | i < 0     = reverse pass ++ rp ++ run "" left
>	| otherwise = {- bye ++ -} run ncon nleft
>	  where
>	     (i,pass) = walk len1 con
>	     d = if null pass then i+1 else max (bmGs!i) (i - getBc (head pass))
>	     -- bye = reverse $! drop (len-d) con
>	     (ncon,nleft) = roll (d-1) {- (take (len-d) con) -} con left
>
>it's not as fast as KMP for the tests, but not too bad.
>Commenting out 'bye' gives a bit of extra speed, but if it's _long_ before 
>a
>match (if any), we'd be better off relieving our memory with 'bye', I 
>think.
>
>Any improvements are welcome, certainly some of you can do much better.

It is fast on my machine except that you are using Map to lookup
for badChar which is O(log n).
I;ve placed this instead:
      badChar :: UArray Int Int
      badChar  = array (0,255) ([(i,-1) | i <- [0..255]] ++ proc src 0)
      proc [] _ = []
      proc (s:st) i = (ord s,i):proc  st (i+1)
      getBc c = badChar ! ord c

which gaved it significant boost, O(1) lookup.
Now it's faster then brute force method but 10% slower then KMP
with my test.
I've also performed tests on dual Xeon linux box and results are 
proportionally
the same as on my intel windows box.
KMP wins again 10% better then BM and 20-30% better then straightforward 
search,
which means that KMP is well suited for non indexed strings.

>
>Cheers,
>Daniel
>
>P.S. As an algorithm, I prefer KMP, it's quite elegant whereas BM is 
>somewhat
>fussy.

Yes, BM is for indexed structures.

Greetings, Bane.

_________________________________________________________________
Express yourself instantly with MSN Messenger! Download today it's FREE! 
http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/