[Haskell-cafe] Substring replacements
daniel.is.fischer at web.de
Wed Dec 14 14:40:06 EST 2005
nice algorithm. Since comparing chars _is_ cheap, it is to be expected that
all the hash-rotating is far more costly for short search patterns. The
longer the pattern, the better this gets, I think -- though nowhere near KMP
(or would it?). However, I don't see how to (efficiently) do a multiple
pattern search with KMP, so there -- if all patterns have the same length,
otherwise I don't see -- Rabin-Karp would probably be the method of choice.
Am Mittwoch, 14. Dezember 2005 10:16 schrieben Sie:
> From: Daniel Fischer <daniel.is.fischer at web.de>
> >To: "Branimir Maksimovic" <bmaxa at hotmail.com>
> >CC: Haskell-Cafe at haskell.org
> >Subject: Re: [Haskell-cafe] Substring replacements
> >Date: Tue, 13 Dec 2005 11:23:29 +0100
> After seeing that your program is fastest (I've also tried one from
> http://haskell.org/hawiki/RunTimeCompilation but perhaps I'm not
> that good in converting to search replace?) I've decided to
> try with Rabin-Karp algorithm.
> This algorithm performs same operation as straightforward search,
> but compares hashes instead of chars.
> With ability to rotate hash (remove first, add next) characters
> there is also optimisation, that hash is calculated only for single
> next character rather again for whole substring.
> Unfortunatelly on my machine it is very cheap to compare
> characters so with my test hashing overweights character compare,
> except in your test when hash searching is faster then straightforward
> This is best I can write in terms of performance and readability.
> I've tried with getFst that returns Maybe but it was slower so I decided
> to return '\0' in case that argument is empty list, which renders '\0'
> unusable, but then I really doubt that 0 will be used in strings.
> -- Rabin-Karp string search algorithm, it is very effective in searching of
> -- of patterns of length n on same string
> -- this program is for single pattern search, but can be crafted
> -- for multiple patterns of length m
I tuned it up somewhat:
import Data.List (isPrefixOf)
import Data.Char (ord) -- using ord instead of fromEnum oddly makes it
-- faster for my test, but slower for yours, but only a whiff.
searchrep :: String -> String -> String -> String
searchrep "" _ str = str -- or cycle rp, or error?
searchrep sr rp xs = hSearchRep xs -- don't carry more around than necessary
len = length sr -- we don't want that to be recomputed
hsrch = hash sr -- neither that
hSearchRep "" = ""
| null remaining = passed
| otherwise = passed ++ rp ++ hSearchRep (drop len remaining)
(passed,remaining) = hSearch xs -- ' xs (hash $ take len xs) ""
hSearch xs = hSearch' xs hcmp "" -- since hSearch will be optimised
where -- away anyway, we might
hcmp = hash $ take len xs -- as well eliminate it ourselves
hSearch' "" _ got = (reverse got, "")
hSearch' xxs@(x:xs) hcd got
| hcd == hsrch && (sr `isPrefixOf` xxs) = (reverse got, xxs)
| otherwise = searchAgain -- one test less
searchAgain = case drop len xxs of
 -> (reverse got ++ xxs, "") -- then we know we're done
(y:_) -> hSearch' xs (hashRotate x y hcd) (x:got)
-- no need for fancy getFst anymore
-- making hashRotate local eliminates one argument, makes it faster
hashRotate :: Char -> Char -> Int -> Int
hashRotate cout cin hsh = 101*(hsh - 101^(len-1)*ord cout) + ord cin
-- using foldl for hash is an enormous boost
hash :: String -> Int
hash = foldl ((. ord) . (+) . (*101)) 0
-- hash str = foldl (\n c -> 101*n+ord c) 0 str
-- this is equally fast as the point-free version, easier to read, probably,
-- but I like an occasional pointless pointfreeness.
Now this beats everything but KMP on my test very clearly.
dafis at linux:~/Documents/haskell/Allotria/Search> time myhash; time myhash2
Working: seasearch replace able seaseaseasearch baker ssseasearch charlie
Working very long
Still poor on your test, though.
More information about the Haskell-Cafe