[Haskell-cafe] Spelling checker exercise

Fri Jan 22 08:41:14 EST 2010

Am Freitag 22 Januar 2010 07:51:27 schrieb Matthew Phillips:
> Hello all,
>
> sorry to bring up an old chestnut, but I’m trying to improve my
> Haskell-fu by writing a small program, and chose Peter Norvig’s spelling
> checker as my exercise material (http://norvig.com/spell-correct.html).
>
> While I’ve gotten it working, and found it quite illuminating, I also
> found it to to be very slow. And then I discovered that, of course,
> others have been here well before me ([1] and [2]). Those discussions
> were very interesting, but the code they refer to is mostly not
> available, so the most I’ve gotten out of it so far is that:
>
>   (a) I should be using strict left folds and strict Map insertions for
> the word frequency map (this shaved off about a second: ~5s -> ~4s for a
> single word on my 2.4GHz MacBook Pro, GHC 6.10.4) (b) I should probably
> be using ByteString’s

That does help, but the worst part is building the map. That takes a couple 
of seconds in Python, too. Just building the map takes 1.95s for Python, 
3.6s (including GC) with strict ByteStrings, 4.2s with lazy ByteStrings and 
6s with plain Strings here.

So I'd go with strict ByteStrings, although that takes a little more memory 
than lazy, but waay less than Strings.

>   (c) Using Set’s for the edit permutations probably isn’t worth it
> (although I found using plain lists made it about a second slower)

Might make a difference once you need to take two edit steps on a not very 
short word.

>
> (b) is difficult because I’ve used matching patterns plus list
> comprehensions to generate the potential edits, and I really like how
> elegantly it pans out that way. Because ByteString’s are not lists, I
> can’t see a way to keep the current structure and use them.

Train with ByteStrings, then do the edits on Strings and pack for lookup.

>
> The code is at [3] (link to version at time of post). Profiling [4]
> shows:
>
>   $ ./spelling becuase +RTS -p
>   becuase -> because
>   $ cat spelling.prof
>     total time =  4.02 secs (201 ticks @ 20 ms)
>     total alloc = 1,544,257,792 bytes (excludes profiling overheads)
>
>   COST CENTRE                    MODULE               %time %alloc
>
>   train                          Main                  52.7   19.7
>   readFile                       Main                  28.9    8.6
>   wordsBy                        Main                  10.9   49.5
>   toLower                        Main                   7.0   21.8
>   ...
>
> So it appears that “train" (building the freq map) and “readFile” in
> “nwords" are the places to hone.

readFile does not appear in my profile.
If you insert an SCC for updateMap,

where updateMap model word = {-# SCC "updateMap" #-} insertWith' (+) word 1 
model

, you'll see that the really bad citizen is updateMap (splitWords is rather 
bad, too, together they take some 95% of the time in that profile).

But once you start needing two edits (try korrekt), correct and edits1 
start to show up. That shows that Norvig's algorithm isn't really good.
With two edit steps, you create a _lot_ of strings you need to look up, far 
more than there are in the map. That takes time. It'll be better to scan 
the map for entries with an edit distance (< 3) if you have a good method 
to check that 
(http://old.nabble.com/haskell-in-online-contests-td26546989.html contains 
pointers for that).

Another thing is

allWords = keysSet wordCounts

Ouch. For each correction, you construct that set anew. Just use member 
from Data.Map instead of Data.Set.member and look up the words in the map.

> I will look at using Bloom Filters or
> Trie’s instead of Data.Map, but I wonder if readFile should be taking
> nearly %30 of the run time, even for a 6MB file?

No way. But it doesn't seem to, from my GHC's point of view.

>
> Sorry to dump such a long post on the list — I’ll understand if no one
> can be bothered rehashing this. But, in summary I’d like to know:
>
>   (a) how could I use ByteString’s for this to speed up I/O and reduce
> memory usage without losing the nice readability?

A small rewrite of your code, I would have designed it slightly differently 
for using ByteStrings from the beginning, the packing in known and 
known_edits2 isn't too beautiful.

----------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString.Char8 as B
import Data.Char (toLower)
import Data.Map (Map, findWithDefault, insertWith', member)
import qualified Data.Map as Map (empty)
import Data.Set as Set (Set, fromList, toList, fold, null)
import Data.List (inits, tails, foldl')
import System.Environment (getArgs)

dataFile = "big.txt"
alphabet = "abcdefghijklmnopqrstuvwxyz"

splitWords :: B.ByteString -> [B.ByteString]
splitWords = filter (not . B.null) . B.splitWith isNogud . B.map toLower

isNogud :: Char -> Bool
isNogud c = c < 'a' || 'z' < c

train :: [B.ByteString] -> Map B.ByteString Int
train = foldl' updateMap Map.empty
  where 
    updateMap model word = insertWith' (+) word 1 model

nwords :: IO (Map B.ByteString Int)
nwords = return . train . splitWords =<< B.readFile dataFile

edits1 :: String -> [String]
edits1 s = toList . fromList $ deletes ++ transposes ++ replaces ++ inserts
  where
    deletes = [a ++ bs | (a, _:bs) <- splits]
    transposes = [a ++ (b2:b1:bs) | (a, b1:b2:bs) <- splits]
    replaces = [a ++ (c:bs) | (a, _:bs) <- splits, c <- alphabet]
    inserts = [a ++ (c:b) | (a, b) <- splits, c <- alphabet]
    splits = zip (inits s) (tails s)

correct :: Map B.ByteString Int -> String -> String
correct wordCounts word = B.unpack . fst $ fold maxCount (B.pack "?", 0) 
candidates
  where
    candidates :: Set B.ByteString
    candidates =
      known [word] `or` ((known $ edits1 word) `or` known_edits2 word)

    known_edits2 :: String -> Set B.ByteString
    known_edits2 w =
      fromList [w3 | w1 <- edits1 w, w2 <- edits1 w1
                    , let w3 = B.pack w2, w3 `member` wordCounts]

    known :: [String] -> Set B.ByteString
    known ws = fromList [w | w <- map B.pack ws, w `member` wordCounts]

    maxCount :: B.ByteString -> (B.ByteString, Int) -> (B.ByteString, Int)
    maxCount word current@(_, currentMax)
      | count > currentMax = (word, count)
      | otherwise = current
        where 
          count = findWithDefault 1 word wordCounts

    or :: Set B.ByteString -> Set B.ByteString -> Set B.ByteString
    or a b | Set.null a = b
           | otherwise = a

main :: IO ()
main = do
  args <- getArgs
  wordCounts <- nwords
  mapM_ (printCorrect wordCounts) args
  where
    printCorrect :: Map B.ByteString Int -> String -> IO ()
    printCorrect wordCounts word =
      putStrLn $ word ++ " -> " ++ correct wordCounts word
----------------------------------------------------------------------

>
>   (b) should readFile be so slow?
>
>   (c) any other tips

Choose a better algorithm for the two-edit case.

>
> Possibly all my questions could be answered if someone has the code from
> the old posts.
>
> Cheers,
>
> Matthew.