[Haskell-cafe] Spelling checker exercise
Daniel Fischer
daniel.is.fischer at web.de
Sun Jan 24 06:52:10 EST 2010
On Sunday, 24 January 2010, 06:19:58, Matthew Phillips wrote:
> Thanks very much, Daniel, for giving my (amateurish!) exercise such an
> in-depth look-over. Comments inline below.
>
> On 23/01/2010, at 12:11 AM, Daniel Fischer wrote:
> >
> > That does help, but the worst part is building the map. That takes a
> > couple of seconds in Python, too. Just building the map takes 1.95s
> > for Python, 3.6s (including GC) with strict ByteStrings, 4.2s with
> > lazy ByteStrings and 6s with plain Strings here.
Correction: I had taken the total time of a no-argument run as the time it
took Python to build the map. Adding explicit timing for the map-building
shows it takes ~1.45 seconds. I'm clueless as to what Python needs the
remaining half second for.
>
> I get the Python version running in about 1s, compared to 5.6s for the
> Haskell:
$ python --version
Python 2.6
With psyco.full():
$ time python ./norvig.py becuase
Trained in 1.16967606544 seconds
./norvig.py
because
1.52user 0.08system 0:01.60elapsed 100%CPU
without psyco:
$ time python ./norvig.py becuase
Trained in 1.45706319809 seconds
./norvig.py
because
1.95user 0.08system 0:02.03elapsed 100%CPU
$ time python ./norvig.py
Trained in 1.46250891685 seconds
./norvig.py
1.95user 0.09system 0:02.04elapsed 100%CPU
>
> $ time python spelling.py
> because
>
> real 0m1.071s
> user 0m0.821s
> sys 0m0.139s
>
> $ time ./spelling becuase
> becuase -> because
>
> real 0m5.589s
> user 0m4.554s
> sys 0m0.307s
>
> And, strangely, the rewrite you provided (I called it "spelling_df")
> runs a fair bit slower:
>
> $ time ./spelling_df becuase
> becuase -> because
>
> real 0m8.087s
> user 0m6.966s
> sys 0m0.193s
>
> $ time ./spelling korrekt
> korrekt -> correct
>
> real 0m5.970s
> user 0m4.885s
> sys 0m0.300s
>
> $ time ./spelling_df korrekt
> korrekt -> correct
>
> real 0m8.616s
> user 0m7.538s
> sys 0m0.187s
>
I think I know what happened here:
$ ghc -fforce-recomp --make matthew -o matthew0
[1 of 1] Compiling Main ( matthew.hs, matthew.o )
Linking matthew0 ...
$ ghc -O2 -fforce-recomp --make matthew -o matthew2
[1 of 1] Compiling Main ( matthew.hs, matthew.o )
Linking matthew2 ...
$ time ./matthew0 becuase
becuase -> because
7.07user 0.21system 0:07.28elapsed 99%CPU
$ time ./matthew2 becuase
becuase -> because
6.01user 0.19system 0:06.21elapsed 100%CPU
$ ghc -fforce-recomp --make spellingBS -o spelling0
[1 of 1] Compiling Main ( spellingBS.hs, spellingBS.o )
Linking spelling0 ...
$ ghc -O2 -fforce-recomp --make spellingBS -o spelling2
[1 of 1] Compiling Main ( spellingBS.hs, spellingBS.o )
Linking spelling2 ...
$ time ./spelling0 becuase
becuase -> because
9.78user 0.09system 0:09.87elapsed 100%CPU
$ time ./spelling2 becuase
becuase -> because
3.57user 0.03system 0:03.60elapsed 100%CPU
I habitually compile all code with -O2, unless I have a specific reason not
to, and I tend to forget that some people compile without optimisations.
For some kinds of code that makes hardly any difference; for others it
makes a big one.
*** Don't even think of using ByteStrings without optimising. ***
> >
> > readFile does not appear in my profile.
>
> Apologies, I should have said that I'd inserted some SCCs to try to
> tease out the cost of readFile (i.e. {-# SCC "readFile" #-}).
>
> > If you insert an SCC for updateMap,
> >
> > where updateMap model word = {-# SCC "updateMap" #-} insertWith' (+)
> > word 1 model
> >
> > , you'll see that the really bad citizen is updateMap (splitWords is
> > rather bad, too, together they take some 95% of the time in that
> > profile).
>
> Maybe I'm doing this wrong, but I see "splitWords" in spelling_df taking
> 80% of runtime. Adding SCC's like this:
>
> splitWords = {-# SCC "filter" #-} filter (not . B.null)
>            . {-# SCC "splitWith" #-} B.splitWith isNogud
>            . {-# SCC "toLower" #-} B.map toLower
>
> gives me:
>
>                                               individual    inherited
> COST CENTRE   MODULE   no.   entries   %time %alloc    %time %alloc
> splitWords    Main     216         1     0.0    0.0     78.6   91.8
>  filter       Main     217         1     1.9    3.0     78.6   91.8
>   splitWith   Main     218         1    28.4   36.8     76.7   88.8
>    isNogud    Main     221   6488666     4.2    4.1      4.2    4.1
>   toLower     Main     219         1    44.2   47.9     44.2   47.9
>
> i.e. it seems that "splitWith" and "toLower" (!) are the culprits. Why,
> I have no idea.
>
> Am I reading this wrong?
>
No, you're just compiling it wrong :)
If I profile without optimising, I get
Sun Jan 24 11:37 2010 Time and Allocation Profiling Report (Final)
pspellingBS0 +RTS -P -RTS becuase
total time = 16.46 secs (823 ticks @ 20 ms)
total alloc = 4,088,410,184 bytes (excludes profiling overheads)
COST CENTRE   MODULE   %time  %alloc   ticks      bytes
toLower       Main      36.3    45.9     299  468806205
splitWith     Main      30.6    33.9     252  346657565
train         Main      25.8    13.5     212  138466403
isNogud       Main       4.3     3.8      35   38931996
filter        Main       2.1     2.7      17   27466769
which is compatible with your results.
With -O2:
Sun Jan 24 11:33 2010 Time and Allocation Profiling Report (Final)
pspellingBS +RTS -P -RTS becuase
total time = 5.66 secs (283 ticks @ 20 ms)
total alloc = 708,686,372 bytes (excludes profiling overheads)
COST CENTRE   MODULE   %time  %alloc   ticks      bytes
updateMap     Main      68.6    76.9     194  136314135
toLower       Main      16.6     0.9      47    1622182
splitWith     Main       6.7    14.3      19   25366835
train         Main       3.5     2.5      10    4362826
filter        Main       2.1     4.4       6    7736998
splitWords    Main       1.8     0.0       5          0
which gives a completely different picture.
Still, toLower takes a significant part of the time; we can drastically
reduce that by exploiting the fact that we don't need to handle the full
Unicode complexity:
Sun Jan 24 11:53 2010 Time and Allocation Profiling Report (Final)
pspellingBSW +RTS -P -RTS becuase
total time = 4.72 secs (236 ticks @ 20 ms)
total alloc = 708,686,372 bytes (excludes profiling overheads)
COST CENTRE   MODULE   %time  %alloc   ticks      bytes
updateMap     Main      76.3    76.9     180  136314135
splitWith     Main       9.7    14.3      23   25366835
filter        Main       5.1     4.4      12    7736998
train         Main       3.0     2.5       7    4362826
splitWords    Main       2.1     0.0       5          0
isNogud       Main       2.1     0.0       5          0
mkLow         Main       1.7     0.9       4    1622182
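The gain comes from replacing Data.Char.toLower on Chars with byte-level
lowercasing; the relevant lines, excerpted from the full program below:

mkLow :: Word8 -> Word8
mkLow x = x .|. 32   -- set bit 5: 'A'..'Z' (65..90) become 'a'..'z' (97..122)

Anything mkLow maps outside 'a'..'z' is thrown away afterwards by the
isNogud filter, so non-letter bytes need no careful case handling.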
And, without profiling:
$ time ./spellingBSW becuase
becuase -> because
2.84user 0.03system 0:02.88elapsed 99%CPU
Finally, building the set of two-step edits takes longer than the
additional lookups:
$ time ./spellingBSW becuase
becuase -> because
2.84user 0.03system 0:02.88elapsed 99%CPU
$ time ./spellingBSW korrekt
korrekt -> correct
3.50user 0.02system 0:03.52elapsed 100%CPU
vs.
$ time ./spellingBSWL becuase
becuase -> because
2.79user 0.04system 0:02.83elapsed 100%CPU
$ time ./spellingBSWL3 korrekt
korrekt -> correct
3.20user 0.02system 0:03.23elapsed 99%CPU
That is achieved with the following code:
----------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString as BS
import Data.Bits ((.|.))
import Data.Word (Word8)
import Data.Map (Map, findWithDefault, insertWith', member)
import qualified Data.Map as Map (empty)
import Data.Set (toList, fromList)
import Data.List (inits, tails, foldl')
import System.Environment (getArgs)

dataFile :: FilePath
dataFile = "big.txt"

alphabet :: String
alphabet = "abcdefghijklmnopqrstuvwxyz"
splitWords :: B.ByteString -> [B.ByteString]
splitWords = {-# SCC "filter" #-} filter (not . BS.null)
           . {-# SCC "splitWith" #-} BS.splitWith isNogud
           . {-# SCC "mkLow" #-} BS.map mkLow
-- ASCII-only lowercasing: setting bit 5 maps 'A'..'Z' (65..90) to
-- 'a'..'z' (97..122); whatever else it produces is removed by isNogud
mkLow :: Word8 -> Word8
mkLow x = x .|. 32

-- treat every byte that is not a lowercase ASCII letter as a separator
isNogud :: Word8 -> Bool
isNogud c = c < 97 || 122 < c
train :: [B.ByteString] -> Map B.ByteString Int
train = foldl' updateMap Map.empty
  where
    updateMap model word = {-# SCC "updateMap" #-} insertWith' (+) word 1 model

-- ($!) makes sure the trained map is evaluated before nwords returns,
-- so it is built here rather than at the first lookup
nwords :: IO (Map B.ByteString Int)
nwords = (return $!) . train . splitWords =<< B.readFile dataFile
edits1 :: String -> [String]
edits1 s = deletes ++ transposes ++ replaces ++ inserts
  where
    deletes    = [a ++ bs         | (a, _:bs)     <- splits]
    transposes = [a ++ (b2:b1:bs) | (a, b1:b2:bs) <- splits]
    replaces   = [a ++ (c:bs)     | (a, _:bs)     <- splits, c <- alphabet]
    inserts    = [a ++ (c:b)      | (a, b)        <- splits, c <- alphabet]
    splits     = zip (inits s) (tails s)
correct :: Map B.ByteString Int -> String -> String
correct wordCounts word = B.unpack $ maxCount (B.pack "?") 0 candidates
  where
    candidates :: [B.ByteString]
    candidates = known [word] `or` ((known e1) `or` known_edits2)

    e1 :: [String]
    e1 = toList . fromList $ edits1 word

    known_edits2 :: [B.ByteString]
    known_edits2 = [w3 | w1 <- e1, w2 <- edits1 w1
                       , let w3 = B.pack w2, w3 `member` wordCounts]

    known :: [String] -> [B.ByteString]
    known ws = {-# SCC "known" #-} [w | w <- map B.pack ws, w `member` wordCounts]

    maxCount :: B.ByteString -> Int -> [B.ByteString] -> B.ByteString
    maxCount best cmax (word:more)
      | cmax < count = maxCount word count more
      | otherwise    = maxCount best cmax more
      where
        count = findWithDefault 1 word wordCounts
    maxCount best _ _ = best

    or :: [B.ByteString] -> [B.ByteString] -> [B.ByteString]
    or a b | null a    = b
           | otherwise = a
main :: IO ()
main = do
    args <- getArgs
    wordCounts <- nwords
    mapM_ (printCorrect wordCounts) args
  where
    printCorrect :: Map B.ByteString Int -> String -> IO ()
    printCorrect wordCounts word =
        putStrLn $ word ++ " -> " ++ correct wordCounts word
----------------------------------------------------------------------
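For reference, the program is built and used like the versions timed above
(big.txt is Norvig's training corpus, expected in the working directory;
the source file name is of course up to you):

$ ghc -O2 --make spellingBSW.hs -o spellingBSW
$ ./spellingBSW becuase korrekt
becuase -> because
korrekt -> correct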
> > But once you start needing two edits (try korrekt), correct and edits1
> > start to show up. That shows that Norvig's algorithm isn't really
> > good. With two edit steps, you create a _lot_ of strings you need to
> > look up, far more than there are in the map. That takes time. It'll be
> > better to scan the map for entries with an edit distance (< 3) if you
> > have a good method to check that
Indeed:
$ time ./nLDBSWSpelling becuase
becuase -> because
2.84user 0.02system 0:02.86elapsed 100%CPU
$ time ./nLDBSWSpelling korrekt
korrekt -> correct
2.83user 0.05system 0:02.88elapsed 100%CPU
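For scale: for a word of length n, edits1 yields n deletes, n-1
transposes, 26n replaces and 26(n+1) inserts, i.e. 54n+25 strings. For
"korrekt" (n = 7) that is 403 candidates after one step, and on the order
of 400 * 430 ~ 170,000 after two (before deduplication), nearly all of
which then have to be packed and looked up in the map.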
> > (http://old.nabble.com/haskell-in-online-contests-td26546989.html
> > contains pointers for that).
>
> Will have a look at that: it looks like it'll be very informative.
>
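To make the idea concrete, here is a minimal sketch of such a bounded
check (my illustration only, not the code behind nLDBSWSpelling; the
thread linked above has pointers to better methods, e.g. aborting the DP
as soon as a whole row exceeds the bound):

-- True iff the Levenshtein distance between s1 and s2 is < 3;
-- the length test prunes cheaply before doing any real work
withinTwo :: String -> String -> Bool
withinTwo s1 s2 = abs (length s1 - length s2) < 3 && levenshtein s1 s2 < 3

-- textbook one-row-at-a-time Levenshtein dynamic programming
levenshtein :: String -> String -> Int
levenshtein s1 s2 = last (foldl transform [0 .. length s1] s2)
  where
    transform row@(n:ns) c = scanl calc (n + 1) (zip3 s1 row ns)
      where
        calc z (c1, x, y) = minimum [y + 1, z + 1, x + fromEnum (c1 /= c)]
    transform [] _ = []  -- unreachable: the row is never empty

Scanning the map would then be something like
[w | w <- Data.Map.keys wordCounts, withinTwo word (B.unpack w)].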
> > Another thing is
> >
> > allWords = keysSet wordCounts
> >
> > Ouch. For each correction, you construct that set anew. Just use
> > member from Data.Map instead of Data.Set.member and look up the words
> > in the map.
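In code, the difference is roughly this (a sketch; the helper names are
mine, the rest matches the posted program):

import qualified Data.Map as Map
import qualified Data.Set as Set

-- before: a whole Set of keys is rebuilt for every correction
knownViaSet :: Map.Map String Int -> [String] -> [String]
knownViaSet wordCounts ws = [w | w <- ws, w `Set.member` allWords]
  where allWords = Map.keysSet wordCounts

-- after: one lookup per candidate, no intermediate Set at all
knownViaMap :: Map.Map String Int -> [String] -> [String]
knownViaMap wordCounts ws = [w | w <- ws, w `Map.member` wordCounts]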
>
> Whoops! Just to be clear though: Haskell will memoise the result of
> "allWords" for a given invocation of "correct"?
Yes. But not across different invocations.
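A tiny standalone demonstration of that sharing behaviour, using
Debug.Trace:

import Debug.Trace (trace)

f :: Int -> Int
f x = shared + shared   -- used twice, but computed only once per call of f
  where
    -- depends on x, just as allWords depends on wordCounts
    shared = trace "building" (sum [1 .. x])

main :: IO ()
main = print (f 1000000) >> print (f 1000001)
-- "building" appears twice, once per call of f; within a single call
-- the two uses of 'shared' share one result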
> So this would only make a large difference for multiple corrections?
Right. But that's the interesting use case, isn't it?
> (which I wasn't worrying about for the moment).
> The change seems to wipe off about 0.2s on average.
>
Which is pretty bad for multiple corrections.
> >> I will look at using Bloom filters or
> >> tries instead of Data.Map, but I wonder if readFile should be taking
> >> nearly 30% of the run time, even for a 6MB file?
> >
> > No way. But it doesn't seem to, from my GHC's point of view.
>
> Just to be sure I wasn't using the SCC incorrectly, I split out the
> readFile call into "myReadFile". The prof line comes out as:
>
> myReadFile   Main   210   1   35.8   8.6   35.8   8.6
>
> i.e. 35.8% of runtime.
>
Can I see the exact code which gives that profile?
Somehow, costs that shouldn't be there must be getting attributed to
readFile.
> >> Sorry to dump such a long post on the list — I’ll understand if no
> >> one can be bothered rehashing this. But, in summary I’d like to know:
> >>
> >> (a) how could I use ByteStrings for this to speed up I/O and reduce
> >> memory usage without losing the nice readability?
>
> And thanks for the other small tips (e.g. findWithDefault), and I didn't
> know you could use let the way you did either.
>
> Cheers,
>
> Matthew.