[Haskell-cafe] Spelling checker exercise
Daniel Fischer
daniel.is.fischer at web.de
Tue Jan 26 15:06:10 EST 2010
On Tuesday, 26 January 2010, 16:46:42, Eduard Sergeev wrote:
> Daniel Fischer-4 wrote:
> > But that's the point, these checks aren't unnecessary (unless the word
> > under inspection is known). You want to propose the most likely
> > correct word.
>
> I just wanted to rewrite Norvig's original Python code in Haskell :)
> (maybe I misunderstood the algorithm?).
Seems so.
NWORDS is the frequency map built from the corpus.
return max(candidates, key=NWORDS.get)
returns the candidate with the highest value in NWORDS, i.e. the candidate
that occurred most often in the corpus (if several candidates share the
highest count, I think the one found first is taken; since the order in
which an iterator traverses a Python set is not specified, IIRC, that
might be any of them).
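For comparison, the same selection step in Haskell is just a maximumBy
keyed on the corpus frequency. A minimal sketch (best and freqs are names
I've made up here, not taken from the code further down):

import Data.List (maximumBy)
import Data.Ord (comparing)
import qualified Data.Map as M

-- pick the candidate with the highest corpus count;
-- candidates missing from the map count as 0
best :: M.Map String Int -> [String] -> String
best freqs = maximumBy (comparing (\w -> M.findWithDefault 0 w freqs))

(maximumBy is partial on an empty list, and tie-breaking among equal
counts is nothing to rely on either, much like the unspecified set order
on the Python side.)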
> Of course it is far from being able to produce 'most likely correct'
> result.
Even taking word frequency into account doesn't get really close.
You'd also have to take into account that some errors are more common
than others, for example:
- award a penalty for words starting with a different letter;
- substitution cost should be lower for letters adjacent on common
  keyboards than for letters far apart, and also for letter pairs of
  similar sound (e <-> i, d <-> t, and so on);
- insertion/deletion cost should be lower for double letters ("diging"
  is more likely to be a misspelling of "digging" than of "diving",
  although g and v are neighbours on qwerty and qwertz keyboards);
- -able <-> -ible confusion is extremely common.
It's really hairy. But a combination of edit distance and word frequency is
a good start.
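To make the cost idea concrete, a weighted substitution cost could look
like the toy sketch below. The adjacency and sound-alike tables are
invented for illustration (a real model would be fitted to actual error
data), and none of this is part of the code further down:

import Data.List (elemIndex)

-- toy substitution cost: cheap for keyboard neighbours and
-- sound-alike pairs, expensive otherwise
subCost :: Char -> Char -> Int
subCost a b
  | a == b         = 0
  | adjacent a b   = 1
  | soundAlike a b = 1
  | otherwise      = 2
  where
    rows = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
    -- same qwerty row, neighbouring columns (ignores cross-row neighbours)
    adjacent x y = any near rows
      where
        near row = case (elemIndex x row, elemIndex y row) of
                     (Just i, Just j) -> abs (i - j) == 1
                     _                -> False
    -- tiny, hand-picked list of similar-sounding letter pairs
    soundAlike x y = (x, y) `elem` pairs || (y, x) `elem` pairs
      where
        pairs = [('e','i'), ('d','t'), ('b','p')]

Plugging such costs into a distance function like the one below would
mean summing weights instead of counting each edit as 1.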
>
> Btw, where can I find the source for this super-fast 'nLDBSWSpelling'
> variant?
Nowhere, unless you come over with a sixpack or two ;)
It originated from a contest-related question (on codechef,
www.codechef.com , a fork of or similar to SPOJ) at the end of November.
To avoid spoiling the contest, I didn't post the code then. When I first
mentioned the idea in this thread, I hadn't ported the code to the current
setting yet, so I couldn't post it even if I'd wanted to; besides, I
didn't want to distract from the topic of porting Norvig's algorithm.
But since you ask, and it's been long enough ago (and it's not directly
applicable to the contest), here comes the modified source. I've added
comments and a few further improvements; the time for the 400 words is now
4.02user 0.04system 0:04.07elapsed 100%CPU
of which 2.8s go into building the map, so on average about 3 milliseconds
per correction :D
----------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Data.ByteString.Unsafe (unsafeIndex)
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString as BS
import Data.Char (toLower)
import Data.Map (Map, findWithDefault, insertWith', member, assocs, empty)
import Data.List (inits, tails, foldl')
import System.Environment (getArgs)
import Data.Word (Word8)
import Data.Bits ((.|.))

dataFile :: FilePath
dataFile = "big.txt"

alphabet :: String
alphabet = "abcdefghijklmnopqrstuvwxyz"

infixl 9 !
{-# INLINE (!) #-}
(!) :: B.ByteString -> Int -> Word8
(!) = unsafeIndex

{- Lazily calculate Levenshtein distance, cut off at 3, modified
   to have transpositions count as one edit. -}
distance :: B.ByteString -> B.ByteString -> Int
distance start target = go 0 m n
  where
    m = B.length start
    n = B.length target
    go l i j
      {- if the number of edits so far plus the difference of the
         remaining lengths is larger than 2, the total number of
         edits will be at least 3 -}
      | l+i > j+2 || l+j > i+2 = 3
      {- if start is completely consumed, we need j additional inserts -}
      | i == 0 = l+j
      {- if target is completely consumed, we need i additional deletions -}
      | j == 0 = l+i
      {- no edit nor branch if we look at identical letters -}
      | a == b = go l (i-1) (j-1)
      | otherwise =
        let -- replace
            x = go (l+1) (i-1) (j-1)
            -- insert
            y = go (l+1) i (j-1)
            -- delete
            z = go (l+1) (i-1) j
            -- transpose
            w = go (l+1) (i-2) (j-2)
            -- but only if the letters match
            t | i > 1 && j > 1 && b == start!(i-2) && a == target!(j-2) = w
              | otherwise = 3
        in case compare i j of
             -- if there's more of target left than of start, a deletion
             -- can't give a path of length < 3, since after that we'd
             -- need at least two inserts
             LT -> t `seq` x `seq` y `seq` min x (min y t)
             -- if both remaining segments have the same length,
             -- we must try all edit steps
             EQ -> t `seq` x `seq` y `seq` z `seq` min x (min y (min t z))
             -- if there's more of start left than of target, an
             -- insert would be pointless
             GT -> t `seq` x `seq` z `seq` min x (min z t)
      where
        a = start!(i-1)
        b = target!(j-1)

splitWords :: B.ByteString -> [B.ByteString]
splitWords = filter (not . BS.null) . BS.splitWith isNogud . BS.map mkLow

{- quick and dirty toLower for ASCII letters -}
mkLow :: Word8 -> Word8
mkLow x = x .|. 32

{- not a lowercase ASCII letter -}
isNogud :: Word8 -> Bool
isNogud c = c < 97 || 122 < c

{- build map (word -> how often seen) -}
train :: [B.ByteString] -> Map B.ByteString Int
train = foldl' updateMap empty
  where
    updateMap model word = insertWith' (+) word 1 model

{- read corpus and build map -}
nwords :: IO (Map B.ByteString Int)
nwords = (return $!) . train . splitWords =<< B.readFile dataFile

{- single edit modifications, don't reproduce the original word -}
edits1 :: String -> [String]
edits1 s = deletes ++ transposes ++ replaces ++ inserts
  where
    deletes    = [a ++ bs | (a, _:bs) <- splits]
    transposes = [a ++ (b2:b1:bs) | (a, b1:b2:bs) <- splits, b1 /= b2]
    replaces   = [a ++ (c:bs) | (a, l:bs) <- splits, c <- alphabet, c /= l]
    inserts    = [a ++ (c:b) | (a, b) <- splits, c <- alphabet]
    splits     = zip (inits s) (tails s)

correct :: Map B.ByteString Int -> String -> String
correct wordCounts word
  -- known word, trivial case
  | wrd `member` wordCounts = word
  -- no known single edit modification, so scan the corpus
  -- of known words for entries of distance 2
  | null ed1 = mxBy2 qm 0 (assocs wordCounts)
  -- at least one known single edit modification, look for the
  -- most frequent of them
  | otherwise = mxBy qm 0 ed1
  where
    wrd = B.pack word
    qm  = B.pack "?"
    -- list of known single edit modifications and their counts
    ed1 = [(pw,c) | w <- edits1 word
                  , let { pw = B.pack w
                        ; c = findWithDefault 0 pw wordCounts }
                  , c > 0]
    mxBy w _ [] = B.unpack w
    mxBy w m ((n,c):ps)
      | m < c     = mxBy n c ps   -- new highest count
      | otherwise = mxBy w m ps
    -- if we land here, all known words have a distance of at least 2;
    -- we want the one with distance 2 and the highest count among those
    -- (if there are any), so we start with the unknown-marker and count 0
    mxBy2 w _ [] = B.unpack w
    mxBy2 w f ((n,c):ps)
      -- if the new word's count isn't larger than the best we've
      -- found so far, we can discard it immediately;
      -- otherwise, calculate the distance: if that's larger than 2,
      -- discard the word, otherwise we've found a new best
      | c <= f || d > 2 = mxBy2 w f ps
      | otherwise       = mxBy2 n c ps
      where
        d = distance wrd n

main :: IO ()
main = do
    args <- getArgs
    wordCounts <- nwords
    mapM_ (printCorrect wordCounts) $ map (map toLower) args
  where
    printCorrect :: Map B.ByteString Int -> String -> IO ()
    printCorrect wordCounts word =
        putStrLn $ word ++ " -> " ++ correct wordCounts word
----------------------------------------------------------------------
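In case anyone wants to try it: assuming the source is saved as Main.hs
and Norvig's big.txt sits in the working directory, a session looks
roughly like

$ ghc -O2 -o correct Main.hs
$ ./correct speling
speling -> spelling

where the actual corrections of course depend on the corpus.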