[Haskell-cafe] Substring replacements

Tue Dec 13 05:23:29 EST 2005

Am Montag, 12. Dezember 2005 16:28 schrieben Sie:
> From: Daniel Fischer <daniel.is.fischer at web.de>
>
> >To: "Branimir Maksimovic" <bmaxa at hotmail.com>
> >CC: Haskell-Cafe at haskell.org
> >Subject: Re: [Haskell-cafe] Substring replacements
> >Date: Mon, 12 Dec 2005 16:15:46 +0100
> >
> >Earlier today:
> > > Sorry, but
> > > Prelude SearchRep> searchReplace "abaaba" "##" "abababaaba"
> > > "abababaaba"
> > >
> > > I haven't analyzed the algorithm, so I don't know why exactly this
> >
> >fails.
> >
> > > I'll take a look sometime soon.
> >
> >I found the problem (one at least).
> >Say the pattern to be replaced begins with 'a' and we have a sufficiently
> >long
> >match with the pattern starting at the first 'a' in the String. Upon
> >encountering the second 'a', while the first pattern still matches, you
> >start
> >pushing onto the rollback-stack. But that isn't inspected anymore, so if
> >the
> >actual occurence of the pattern starts at the third (or fourth, n-th)
> >occurence of 'a' and that is already pushed onto the rollback, you miss
> > it.
>
> I've corrected this with adjusting rollback position. if rollBack is null
> then
> search for rollback starts at second character if not starts at same as
> searhed
> character because I skip what was searched. That's all.
> Though I'm not so sure now when I read this.
>
Still not working:

*New> searchReplace "abababc" "#" "ababababababc"
"ababababababc"
*New> searchReplace1 "abababc" "#" "ababababababc"
"ababababababc"

> >So the question is, can we find a cheap test to decide whether to use KMP
> >or
> >Bulat's version?
>
> In real world situation your KMP will always be fastest on average.
> I like that we are not using C arrays as then we have advantage
> of lazyness and save on memory usage. C++ program will be faster
> on shorter strings but on this large strings will loose due memory
> latency. and with your test, both programs are very fast.
>
> Greetings, Bane.
>

On my 256MB RAM AMD Duron 1200 MHz, Bulat's version is consistently about 20% 
faster than my KMP on your test -- btw, I unboxed the pat array, which gave a 
bit of extra speed, but not much. 
And apologies to Sebastian Sylvan, I also included an unboxed version of bord, 
built from the boxed version, and that sped things further up -- not much, 
again, but there it is.
I wonder about this difference, -10% on one system and +20% on another system, 
ist that normal?

Cheers, Daniel
------------------------------------------------------------------------------------------
Up-To-Date version of KMP:

import Data.Array.Unboxed (UArray, listArray, (!))
import qualified Data.Array as A (array, (!), elems)

searchReplace :: String -> String -> String -> String
searchReplace "" _ str = str
searchReplace src@(c:cs) dst str = process 0 str ""
    where
      len = {-# scc "len" #-} length src
      pat :: UArray Int Char
      pat = {-# scc "pat" #-} listArray (0,len-1) src
      bord ={-# scc "bord" #-} A.array (0,len) $ (0,-1):(1,0):
                 [(i+1,getBord (pat!i) i + 1) | i <- [1 .. len-1]]
      getBord s n
         | m < 0      = m
	 | s == pat!m = m
	 | otherwise  = getBord s m
	   where
	     m = bord A.! n
      bor :: UArray Int Int
      bor = listArray (0,len) $ A.elems bord
      getBor s n
         | m < 0 || s == pat!m = m
	 | otherwise = getBor s m
	   where
	     m = bor!n
      process n str _ | n >= len = {-# scc "process" #-} dst ++ process 0 str 
""
      process _ "" mat = {-# scc "process" #-} reverse mat
      process 0 (s:st) _
         | s == c    = {-# scc "process" #-} process 1 st [s]
	 | otherwise = {-# scc "process" #-} s:process 0 st ""
      process n str@(s:st) mat
         | s == pat!n = {-# scc "process" #-} process (n+1) st (s:mat)
	 | otherwise  = {-# scc "process" #-}
	                let j = getBor s n
	                    (ret,skip) = splitAt j mat
			in if j < 0 then reverse mat ++ process 0 str ""
			   else reverse skip ++ process (j+1) st (s:ret)