[Haskell-cafe] Spelling checker exercise
Daniel Fischer
daniel.is.fischer at web.de
Fri Jan 22 08:41:14 EST 2010
Am Freitag 22 Januar 2010 07:51:27 schrieb Matthew Phillips:
> Hello all,
>
> sorry to bring up an old chestnut, but I’m trying to improve my
> Haskell-fu by writing a small program, and chose Peter Norvig’s spelling
> checker as my exercise material (http://norvig.com/spell-correct.html).
>
> While I’ve gotten it working, and found it quite illuminating, I also
> found it to to be very slow. And then I discovered that, of course,
> others have been here well before me ([1] and [2]). Those discussions
> were very interesting, but the code they refer to is mostly not
> available, so the most I’ve gotten out of it so far is that:
>
> (a) I should be using strict left folds and strict Map insertions for
> the word frequency map (this shaved off about a second: ~5s -> ~4s for a
> single word on my 2.4GHz MacBook Pro, GHC 6.10.4) (b) I should probably
> be using ByteString’s
That does help, but the worst part is building the map. That takes a couple
of seconds in Python, too. Just building the map takes 1.95s for Python,
3.6s (including GC) with strict ByteStrings, 4.2s with lazy ByteStrings and
6s with plain Strings here.
So I'd go with strict ByteStrings, although that takes a little more memory
than lazy, but waay less than Strings.
> (c) Using Set’s for the edit permutations probably isn’t worth it
> (although I found using plain lists made it about a second slower)
Might make a difference once you need to take two edit steps on a not very
short word.
>
> (b) is difficult because I’ve used matching patterns plus list
> comprehensions to generate the potential edits, and I really like how
> elegantly it pans out that way. Because ByteString’s are not lists, I
> can’t see a way to keep the current structure and use them.
Train with ByteStrings, then do the edits on Strings and pack for lookup.
>
> The code is at [3] (link to version at time of post). Profiling [4]
> shows:
>
> $ ./spelling becuase +RTS -p
> becuase -> because
> $ cat spelling.prof
> total time = 4.02 secs (201 ticks @ 20 ms)
> total alloc = 1,544,257,792 bytes (excludes profiling overheads)
>
> COST CENTRE MODULE %time %alloc
>
> train Main 52.7 19.7
> readFile Main 28.9 8.6
> wordsBy Main 10.9 49.5
> toLower Main 7.0 21.8
> ...
>
> So it appears that “train" (building the freq map) and “readFile” in
> “nwords" are the places to hone.
readFile does not appear in my profile.
If you insert an SCC for updateMap,
where updateMap model word = {-# SCC "updateMap" #-} insertWith' (+) word 1
model
, you'll see that the really bad citizen is updateMap (splitWords is rather
bad, too, together they take some 95% of the time in that profile).
But once you start needing two edits (try korrekt), correct and edits1
start to show up. That shows that Norvig's algorithm isn't really good.
With two edit steps, you create a _lot_ of strings you need to look up, far
more than there are in the map. That takes time. It'll be better to scan
the map for entries with an edit distance (< 3) if you have a good method
to check that
(http://old.nabble.com/haskell-in-online-contests-td26546989.html contains
pointers for that).
Another thing is
allWords = keysSet wordCounts
Ouch. For each correction, you construct that set anew. Just use member
from Data.Map instead of Data.Set.member and look up the words in the map.
> I will look at using Bloom Filters or
> Trie’s instead of Data.Map, but I wonder if readFile should be taking
> nearly %30 of the run time, even for a 6MB file?
No way. But it doesn't seem to, from my GHC's point of view.
>
> Sorry to dump such a long post on the list — I’ll understand if no one
> can be bothered rehashing this. But, in summary I’d like to know:
>
> (a) how could I use ByteString’s for this to speed up I/O and reduce
> memory usage without losing the nice readability?
A small rewrite of your code, I would have designed it slightly differently
for using ByteStrings from the beginning, the packing in known and
known_edits2 isn't too beautiful.
----------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where
import qualified Data.ByteString.Char8 as B
import Data.Char (toLower)
import Data.Map (Map, findWithDefault, insertWith', member)
import qualified Data.Map as Map (empty)
import Data.Set as Set (Set, fromList, toList, fold, null)
import Data.List (inits, tails, foldl')
import System.Environment (getArgs)
dataFile = "big.txt"
alphabet = "abcdefghijklmnopqrstuvwxyz"
splitWords :: B.ByteString -> [B.ByteString]
splitWords = filter (not . B.null) . B.splitWith isNogud . B.map toLower
isNogud :: Char -> Bool
isNogud c = c < 'a' || 'z' < c
train :: [B.ByteString] -> Map B.ByteString Int
train = foldl' updateMap Map.empty
where
updateMap model word = insertWith' (+) word 1 model
nwords :: IO (Map B.ByteString Int)
nwords = return . train . splitWords =<< B.readFile dataFile
edits1 :: String -> [String]
edits1 s = toList . fromList $ deletes ++ transposes ++ replaces ++ inserts
where
deletes = [a ++ bs | (a, _:bs) <- splits]
transposes = [a ++ (b2:b1:bs) | (a, b1:b2:bs) <- splits]
replaces = [a ++ (c:bs) | (a, _:bs) <- splits, c <- alphabet]
inserts = [a ++ (c:b) | (a, b) <- splits, c <- alphabet]
splits = zip (inits s) (tails s)
correct :: Map B.ByteString Int -> String -> String
correct wordCounts word = B.unpack . fst $ fold maxCount (B.pack "?", 0)
candidates
where
candidates :: Set B.ByteString
candidates =
known [word] `or` ((known $ edits1 word) `or` known_edits2 word)
known_edits2 :: String -> Set B.ByteString
known_edits2 w =
fromList [w3 | w1 <- edits1 w, w2 <- edits1 w1
, let w3 = B.pack w2, w3 `member` wordCounts]
known :: [String] -> Set B.ByteString
known ws = fromList [w | w <- map B.pack ws, w `member` wordCounts]
maxCount :: B.ByteString -> (B.ByteString, Int) -> (B.ByteString, Int)
maxCount word current@(_, currentMax)
| count > currentMax = (word, count)
| otherwise = current
where
count = findWithDefault 1 word wordCounts
or :: Set B.ByteString -> Set B.ByteString -> Set B.ByteString
or a b | Set.null a = b
| otherwise = a
main :: IO ()
main = do
args <- getArgs
wordCounts <- nwords
mapM_ (printCorrect wordCounts) args
where
printCorrect :: Map B.ByteString Int -> String -> IO ()
printCorrect wordCounts word =
putStrLn $ word ++ " -> " ++ correct wordCounts word
----------------------------------------------------------------------
>
> (b) should readFile be so slow?
>
> (c) any other tips
Choose a better algorithm for the two-edit case.
>
> Possibly all my questions could be answered if someone has the code from
> the old posts.
>
> Cheers,
>
> Matthew.
More information about the Haskell-Cafe
mailing list