[Haskell-cafe] Substring replacements

Daniel Fischer daniel.is.fischer at web.de
Wed Dec 14 14:40:06 EST 2005


Hi Bane,

nice algorithm. Since comparing chars _is_ cheap, it is to be expected that 
all the hash-rotating is far more costly for short search patterns. The 
longer the pattern, the better this gets, I think -- though nowhere near KMP 
(or would it?). However, I don't see how to (efficiently) do a multiple 
pattern search with KMP, so there -- if all patterns have the same length, 
otherwise I don't see -- Rabin-Karp would probably be the method of choice.

Am Mittwoch, 14. Dezember 2005 10:16 schrieben Sie:
> From: Daniel Fischer <daniel.is.fischer at web.de>
>
> >To: "Branimir Maksimovic" <bmaxa at hotmail.com>
> >CC: Haskell-Cafe at haskell.org
> >Subject: Re: [Haskell-cafe] Substring replacements
> >Date: Tue, 13 Dec 2005 11:23:29 +0100
>
> After seeing that your program is fastest (I've also tried one from
> http://haskell.org/hawiki/RunTimeCompilation but perhaps I'm not
> that good in converting to search replace?) I've decided to
> try with Rabin-Karp algorithm.
> This algorithm performs same operation as straightforward search,
> but compares hashes instead of chars.
> With ability to rotate hash (remove first, add next) characters
> there is also optimisation, that hash is calculated only for single
> next character rather again for whole substring.
> Unfortunatelly on my machine it is very cheap to compare
> characters so with my test hashing overweights character compare,
> except in your test when hash searching is faster then straightforward
> search.
>
> This is best I can write in terms of performance and readability.
> I've tried with getFst that returns Maybe but it was slower so I decided
> to return '\0' in case that argument is empty list, which renders '\0'
> unusable, but then I really doubt that 0 will be used in strings.
>
> -- Rabin-Karp string search algorithm, it is very effective in searching of
> set
> -- of patterns of length n on same string
> -- this program is for single pattern search, but can be crafted
> -- for multiple patterns of length m
>

I tuned it up somewhat:
import Data.List (isPrefixOf)
import Data.Char (ord)  -- using ord instead of fromEnum oddly makes it
-- faster for my test, but slower for yours, but only a whiff.

searchrep :: String -> String -> String -> String
searchrep "" _ str = str    -- or cycle rp, or error?
searchrep sr rp xs = hSearchRep xs  -- don't carry more around than necessary
    where
       len = length sr       -- we don't want that to be recomputed 
       hsrch = hash sr     -- neither that
       hSearchRep  "" = ""
       hSearchRep xs
           | null remaining = passed
	   | otherwise      = passed ++ rp ++ hSearchRep (drop len remaining)
	     where
	       (passed,remaining) = hSearch xs  -- ' xs (hash $ take len xs) ""
       hSearch xs = hSearch' xs hcmp "" -- since hSearch will be optimised
           where                                     -- away anyway, we might
	      hcmp = hash $ take len xs   -- as well eliminate it ourselves
       hSearch' "" _ got = (reverse got, "")
       hSearch' xxs@(x:xs) hcd got
           | hcd == hsrch && (sr `isPrefixOf` xxs) = (reverse got, xxs)
	   | otherwise                             = searchAgain -- one test less
	     where
	       searchAgain = case drop len xxs of
	        []     -> (reverse got ++ xxs, "")   -- then we know we're done
		(y:_)  -> hSearch' xs (hashRotate x y hcd) (x:got)
-- no need for fancy getFst anymore
-- making hashRotate local eliminates one argument, makes it faster
       hashRotate :: Char -> Char -> Int -> Int
       hashRotate cout cin hsh = 101*(hsh - 101^(len-1)*ord cout) + ord cin
-- using foldl for hash is an enormous boost
       hash :: String -> Int
       hash = foldl ((. ord) . (+) . (*101)) 0
-- hash str = foldl (\n c -> 101*n+ord c) 0 str
-- this is equally fast as the point-free version, easier to read, probably,
-- but I like an occasional pointless pointfreeness.

Now this beats everything but KMP on my test very clearly.
dafis at linux:~/Documents/haskell/Allotria/Search> time myhash; time myhash2
Working: seasearch replace  able seaseaseasearch baker ssseasearch charlie
True
Done


real    0m50.401s
user    0m49.990s
sys     0m0.060s
Working very long
True
Done

real    0m15.747s
user    0m15.630s
sys     0m0.020s

Still poor on your test, though.

Cheers,
Daniel



More information about the Haskell-Cafe mailing list