[Haskell-cafe] Substring replacements
Daniel Fischer
daniel.is.fischer at web.de
Wed Dec 14 14:40:06 EST 2005
Hi Bane,
nice algorithm. Since comparing chars _is_ cheap, it is to be expected that
all the hash-rotating is far more costly for short search patterns. The
longer the pattern, the better this gets, I think -- though nowhere near KMP
(or would it?). However, I don't see how to (efficiently) do a multiple
pattern search with KMP, so there -- if all patterns have the same length,
otherwise I don't see -- Rabin-Karp would probably be the method of choice.
Am Mittwoch, 14. Dezember 2005 10:16 schrieben Sie:
> From: Daniel Fischer <daniel.is.fischer at web.de>
>
> >To: "Branimir Maksimovic" <bmaxa at hotmail.com>
> >CC: Haskell-Cafe at haskell.org
> >Subject: Re: [Haskell-cafe] Substring replacements
> >Date: Tue, 13 Dec 2005 11:23:29 +0100
>
> After seeing that your program is fastest (I've also tried one from
> http://haskell.org/hawiki/RunTimeCompilation but perhaps I'm not
> that good in converting to search replace?) I've decided to
> try with Rabin-Karp algorithm.
> This algorithm performs same operation as straightforward search,
> but compares hashes instead of chars.
> With ability to rotate hash (remove first, add next) characters
> there is also optimisation, that hash is calculated only for single
> next character rather again for whole substring.
> Unfortunatelly on my machine it is very cheap to compare
> characters so with my test hashing overweights character compare,
> except in your test when hash searching is faster then straightforward
> search.
>
> This is best I can write in terms of performance and readability.
> I've tried with getFst that returns Maybe but it was slower so I decided
> to return '\0' in case that argument is empty list, which renders '\0'
> unusable, but then I really doubt that 0 will be used in strings.
>
> -- Rabin-Karp string search algorithm, it is very effective in searching of
> set
> -- of patterns of length n on same string
> -- this program is for single pattern search, but can be crafted
> -- for multiple patterns of length m
>
I tuned it up somewhat:
import Data.List (isPrefixOf)
import Data.Char (ord) -- using ord instead of fromEnum oddly makes it
-- faster for my test, but slower for yours, but only a whiff.
searchrep :: String -> String -> String -> String
searchrep "" _ str = str -- or cycle rp, or error?
searchrep sr rp xs = hSearchRep xs -- don't carry more around than necessary
where
len = length sr -- we don't want that to be recomputed
hsrch = hash sr -- neither that
hSearchRep "" = ""
hSearchRep xs
| null remaining = passed
| otherwise = passed ++ rp ++ hSearchRep (drop len remaining)
where
(passed,remaining) = hSearch xs -- ' xs (hash $ take len xs) ""
hSearch xs = hSearch' xs hcmp "" -- since hSearch will be optimised
where -- away anyway, we might
hcmp = hash $ take len xs -- as well eliminate it ourselves
hSearch' "" _ got = (reverse got, "")
hSearch' xxs@(x:xs) hcd got
| hcd == hsrch && (sr `isPrefixOf` xxs) = (reverse got, xxs)
| otherwise = searchAgain -- one test less
where
searchAgain = case drop len xxs of
[] -> (reverse got ++ xxs, "") -- then we know we're done
(y:_) -> hSearch' xs (hashRotate x y hcd) (x:got)
-- no need for fancy getFst anymore
-- making hashRotate local eliminates one argument, makes it faster
hashRotate :: Char -> Char -> Int -> Int
hashRotate cout cin hsh = 101*(hsh - 101^(len-1)*ord cout) + ord cin
-- using foldl for hash is an enormous boost
hash :: String -> Int
hash = foldl ((. ord) . (+) . (*101)) 0
-- hash str = foldl (\n c -> 101*n+ord c) 0 str
-- this is equally fast as the point-free version, easier to read, probably,
-- but I like an occasional pointless pointfreeness.
Now this beats everything but KMP on my test very clearly.
dafis at linux:~/Documents/haskell/Allotria/Search> time myhash; time myhash2
Working: seasearch replace able seaseaseasearch baker ssseasearch charlie
True
Done
real 0m50.401s
user 0m49.990s
sys 0m0.060s
Working very long
True
Done
real 0m15.747s
user 0m15.630s
sys 0m0.020s
Still poor on your test, though.
Cheers,
Daniel
More information about the Haskell-Cafe
mailing list