[Haskell-cafe] Spelling checker exercise
Daniel Fischer
daniel.is.fischer at web.de
Tue Jan 26 15:06:10 EST 2010
On Tuesday, 26 January 2010, 16:46:42, Eduard Sergeev wrote:
> Daniel Fischer-4 wrote:
> > But that's the point, these checks aren't unnecessary (unless the word
> > under inspection is known). You want to propose the most likely
> > correct word.
>
> I just wanted to rewrite Norvig's original Python code in Haskell :)
> (maybe I misunderstood the algorithm?).
Seems so.
NWORDS is the frequency map built from the corpus.
return max(candidates, key=NWORDS.get)
returns the candidate with the highest value in NWORDS, i.e. the candidate
that occurred most often in the corpus (if several candidates share the
highest count, I think the one found first is taken; since the order in
which an iterator traverses a Python set is not specified, IIRC, that
might be any of them).
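For comparison, the same selection step in Haskell is just a maximumBy
keyed on the corpus frequency. A minimal sketch (best and freqs are names
I've made up here, not taken from the code further down):

import Data.List (maximumBy)
import Data.Ord (comparing)
import qualified Data.Map as M

-- pick the candidate with the highest corpus count;
-- candidates missing from the map count as 0
best :: M.Map String Int -> [String] -> String
best freqs = maximumBy (comparing (\w -> M.findWithDefault 0 w freqs))

(maximumBy is partial on an empty list, and tie-breaking among equal
counts is nothing to rely on either, much like the unspecified set order
on the Python side.)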
> Of course it is far from being able to produce 'most likely correct'
> result.
Even taking word frequency into account doesn't get really close.
You'd also have to take into account that some errors are more common
than others, for example:
- award a penalty for words starting with a different letter;
- substitution cost should be lower for letters adjacent on common
  keyboards than for letters far apart, and also for letter pairs of
  similar sound (e <-> i, d <-> t, and so on);
- insertion/deletion cost should be lower for double letters ("diging"
  is more likely to be a misspelling of "digging" than of "diving",
  although g and v are neighbours on qwerty and qwertz keyboards);
- -able <-> -ible confusion is extremely common.
It's really hairy. But a combination of edit distance and word frequency is
a good start.
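To make the cost idea concrete, a weighted substitution cost could look
like the toy sketch below. The adjacency and sound-alike tables are
invented for illustration (a real model would be fitted to actual error
data), and none of this is part of the code further down:

import Data.List (elemIndex)

-- toy substitution cost: cheap for keyboard neighbours and
-- sound-alike pairs, expensive otherwise
subCost :: Char -> Char -> Int
subCost a b
  | a == b         = 0
  | adjacent a b   = 1
  | soundAlike a b = 1
  | otherwise      = 2
  where
    rows = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
    -- same qwerty row, neighbouring columns (ignores cross-row neighbours)
    adjacent x y = any near rows
      where
        near row = case (elemIndex x row, elemIndex y row) of
                     (Just i, Just j) -> abs (i - j) == 1
                     _                -> False
    -- tiny, hand-picked list of similar-sounding letter pairs
    soundAlike x y = (x, y) `elem` pairs || (y, x) `elem` pairs
      where
        pairs = [('e','i'), ('d','t'), ('b','p')]

Plugging such costs into a distance function like the one below would
mean summing weights instead of counting each edit as 1.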
>
> Btw, where can I find the source for this super-fast 'nLDBSWSpelling'
> variant?
Nowhere, unless you come over with a sixpack or two ;)
It originated from a contest-related question (on codechef,
www.codechef.com , a fork of or similar to SPOJ) at the end of November.
To avoid spoiling the contest, I didn't post the code then. When I first
mentioned the idea in this thread, I hadn't ported the code to the current
setting yet, so I couldn't post it even if I'd wanted to; besides, I
didn't want to distract from the topic of porting Norvig's algorithm.
But since you ask, and it's been long enough ago (and it's not directly
applicable to the contest), here comes the modified source. I've added
comments and a few further improvements; the time for the 400 words is now
4.02user 0.04system 0:04.07elapsed 100%CPU
of which 2.8s go into building the map, so on average about 3 milliseconds
per correction :D
----------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Data.ByteString.Unsafe (unsafeIndex)
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString as BS
import Data.Char (toLower)
import Data.Map (Map, findWithDefault, insertWith', member, assocs, empty)
import Data.List (inits, tails, foldl')
import System.Environment (getArgs)
import Data.Word (Word8)
import Data.Bits ((.|.))

dataFile :: FilePath
dataFile = "big.txt"

alphabet :: String
alphabet = "abcdefghijklmnopqrstuvwxyz"

infixl 9 !
{-# INLINE (!) #-}
(!) :: B.ByteString -> Int -> Word8
(!) = unsafeIndex

{- Lazily calculate Levenshtein distance, cut off at 3, modified
   to have transpositions count as one edit. -}
distance :: B.ByteString -> B.ByteString -> Int
distance start target = go 0 m n
  where
    m = B.length start
    n = B.length target
    go l i j
      {- if the number of edits so far plus the difference of the
         remaining lengths is larger than 2, the total number of
         edits will be at least 3 -}
      | l+i > j+2 || l+j > i+2 = 3
      {- if start is completely consumed, we need j additional inserts -}
      | i == 0 = l+j
      {- if target is completely consumed, we need i additional deletions -}
      | j == 0 = l+i
      {- no edit nor branch if we look at identical letters -}
      | a == b = go l (i-1) (j-1)
      | otherwise =
        let -- replace
            x = go (l+1) (i-1) (j-1)
            -- insert
            y = go (l+1) i (j-1)
            -- delete
            z = go (l+1) (i-1) j
            -- transpose
            w = go (l+1) (i-2) (j-2)
            -- but only if the letters match
            t | i > 1 && j > 1 && b == start!(i-2) && a == target!(j-2) = w
              | otherwise = 3
        in case compare i j of
             -- if there's more of target left than of start, a deletion
             -- can't give a path of length < 3, since after that we'd
             -- need at least two inserts
             LT -> t `seq` x `seq` y `seq` min x (min y t)
             -- if both remaining segments have the same length,
             -- we must try all edit steps
             EQ -> t `seq` x `seq` y `seq` z `seq` min x (min y (min t z))
             -- if there's more of start left than of target, an
             -- insert would be pointless
             GT -> t `seq` x `seq` z `seq` min x (min z t)
      where
        a = start!(i-1)
        b = target!(j-1)

splitWords :: B.ByteString -> [B.ByteString]
splitWords = filter (not . BS.null) . BS.splitWith isNogud . BS.map mkLow

{- quick and dirty toLower for ASCII letters -}
mkLow :: Word8 -> Word8
mkLow x = x .|. 32

{- not a lowercase ASCII letter -}
isNogud :: Word8 -> Bool
isNogud c = c < 97 || 122 < c

{- build map (word -> how often seen) -}
train :: [B.ByteString] -> Map B.ByteString Int
train = foldl' updateMap empty
  where
    updateMap model word = insertWith' (+) word 1 model

{- read corpus and build map -}
nwords :: IO (Map B.ByteString Int)
nwords = (return $!) . train . splitWords =<< B.readFile dataFile

{- single edit modifications, don't reproduce the original word -}
edits1 :: String -> [String]
edits1 s = deletes ++ transposes ++ replaces ++ inserts
  where
    deletes    = [a ++ bs | (a, _:bs) <- splits]
    transposes = [a ++ (b2:b1:bs) | (a, b1:b2:bs) <- splits, b1 /= b2]
    replaces   = [a ++ (c:bs) | (a, l:bs) <- splits, c <- alphabet, c /= l]
    inserts    = [a ++ (c:b) | (a, b) <- splits, c <- alphabet]
    splits     = zip (inits s) (tails s)

correct :: Map B.ByteString Int -> String -> String
correct wordCounts word
  -- known word, trivial case
  | wrd `member` wordCounts = word
  -- no known single edit modification, so scan the corpus
  -- of known words for entries of distance 2
  | null ed1 = mxBy2 qm 0 (assocs wordCounts)
  -- at least one known single edit modification, look for the
  -- most frequent of them
  | otherwise = mxBy qm 0 ed1
  where
    wrd = B.pack word
    qm  = B.pack "?"
    -- list of known single edit modifications and their counts
    ed1 = [(pw,c) | w <- edits1 word
                  , let { pw = B.pack w
                        ; c = findWithDefault 0 pw wordCounts }
                  , c > 0]
    mxBy w _ [] = B.unpack w
    mxBy w m ((n,c):ps)
      | m < c     = mxBy n c ps   -- new highest count
      | otherwise = mxBy w m ps
    -- if we land here, all known words have a distance of at least 2;
    -- we want the one with distance 2 and the highest count among those
    -- (if there are any), so we start with the unknown-marker and count 0
    mxBy2 w _ [] = B.unpack w
    mxBy2 w f ((n,c):ps)
      -- if the new word's count isn't larger than the best we've
      -- found so far, we can discard it immediately;
      -- otherwise, calculate the distance: if that's larger than 2,
      -- discard the word, otherwise we've found a new best
      | c <= f || d > 2 = mxBy2 w f ps
      | otherwise       = mxBy2 n c ps
      where
        d = distance wrd n

main :: IO ()
main = do
    args <- getArgs
    wordCounts <- nwords
    mapM_ (printCorrect wordCounts) $ map (map toLower) args
  where
    printCorrect :: Map B.ByteString Int -> String -> IO ()
    printCorrect wordCounts word =
        putStrLn $ word ++ " -> " ++ correct wordCounts word
----------------------------------------------------------------------
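In case anyone wants to try it: assuming the source is saved as Main.hs
and Norvig's big.txt sits in the working directory, a session looks
roughly like

$ ghc -O2 -o correct Main.hs
$ ./correct speling
speling -> spelling

where the actual corrections of course depend on the corpus.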