[Haskell-cafe] Spelling checker exercise
Daniel Fischer
daniel.is.fischer at web.de
Sun Jan 24 06:52:10 EST 2010
On Sunday, 24 January 2010, 06:19:58, Matthew Phillips wrote:
> Thanks very much, Daniel, for giving my (amateurish!) exercise such an
> in-depth look-over. Comments inline below.
>
> On 23/01/2010, at 12:11 AM, Daniel Fischer wrote:
> >
> > That does help, but the worst part is building the map. That takes a
> > couple of seconds in Python, too. Just building the map takes 1.95s
> > for Python, 3.6s (including GC) with strict ByteStrings, 4.2s with
> > lazy ByteStrings and 6s with plain Strings here.
Correction: I had taken the total time of a no-argument run as the time it
took Python to build the map. Adding explicit timing for the map-building
shows it takes ~1.45 seconds. I'm clueless as to what Python needs the
remaining half second for.
>
> I get the Python version running in about 1s, compared to 5.6s for the
> Haskell:
$ python --version
Python 2.6
With psyco.full():
$ time python ./norvig.py becuase
Trained in 1.16967606544 seconds
./norvig.py
because
1.52user 0.08system 0:01.60elapsed 100%CPU
without psyco:
$ time python ./norvig.py becuase
Trained in 1.45706319809 seconds
./norvig.py
because
1.95user 0.08system 0:02.03elapsed 100%CPU
$ time python ./norvig.py
Trained in 1.46250891685 seconds
./norvig.py
1.95user 0.09system 0:02.04elapsed 100%CPU
>
> $ time python spelling.py
> because
>
> real 0m1.071s
> user 0m0.821s
> sys 0m0.139s
>
> $ time ./spelling becuase
> becuase -> because
>
> real 0m5.589s
> user 0m4.554s
> sys 0m0.307s
>
> And, strangely, the rewrite you provided (I called it "spelling_df")
> runs a fair bit slower:
>
> $ time ./spelling_df becuase
> becuase -> because
>
> real 0m8.087s
> user 0m6.966s
> sys 0m0.193s
>
> $ time ./spelling korrekt
> korrekt -> correct
>
> real 0m5.970s
> user 0m4.885s
> sys 0m0.300s
>
> $ time ./spelling_df korrekt
> korrekt -> correct
>
> real 0m8.616s
> user 0m7.538s
> sys 0m0.187s
>
I think I know what happened here:
$ ghc -fforce-recomp --make matthew -o matthew0
[1 of 1] Compiling Main ( matthew.hs, matthew.o )
Linking matthew0 ...
$ ghc -O2 -fforce-recomp --make matthew -o matthew2
[1 of 1] Compiling Main ( matthew.hs, matthew.o )
Linking matthew2 ...
$ time ./matthew0 becuase
becuase -> because
7.07user 0.21system 0:07.28elapsed 99%CPU
$ time ./matthew2 becuase
becuase -> because
6.01user 0.19system 0:06.21elapsed 100%CPU
$ ghc -fforce-recomp --make spellingBS -o spelling0
[1 of 1] Compiling Main ( spellingBS.hs, spellingBS.o )
Linking spelling0 ...
$ ghc -O2 -fforce-recomp --make spellingBS -o spelling2
[1 of 1] Compiling Main ( spellingBS.hs, spellingBS.o )
Linking spelling2 ...
$ time ./spelling0 becuase
becuase -> because
9.78user 0.09system 0:09.87elapsed 100%CPU
$ time ./spelling2 becuase
becuase -> because
3.57user 0.03system 0:03.60elapsed 100%CPU
I habitually compile all code with -O2, unless I have a specific reason not
to, and I tend to forget that some people compile without optimisations.
For some kinds of code that makes hardly any difference; for others it
makes a big one.
*** Don't even think of using ByteStrings without optimising. ***
> >
> > readFile does not appear in my profile.
>
> Apologies, I should have said that I'd inserted some SCCs to try to
> tease out the cost of readFile (i.e. {-# SCC "readFile" #-}).
>
> > If you insert an SCC for updateMap,
> >
> > where updateMap model word = {-# SCC "updateMap" #-} insertWith' (+)
> > word 1 model
> >
> > , you'll see that the really bad citizen is updateMap (splitWords is
> > rather bad, too, together they take some 95% of the time in that
> > profile).
>
> Maybe I'm doing this wrong, but I see "splitWords" in spelling_df taking
> 80% of runtime. Adding SCC's like this:
>
> splitWords = {-# SCC "filter" #-} filter (not . B.null)
>            . {-# SCC "splitWith" #-} B.splitWith isNogud
>            . {-# SCC "toLower" #-} B.map toLower
>
> gives me:
>
>                                               individual    inherited
> COST CENTRE   MODULE   no.   entries   %time %alloc    %time %alloc
> splitWords    Main     216         1     0.0    0.0     78.6   91.8
>  filter       Main     217         1     1.9    3.0     78.6   91.8
>   splitWith   Main     218         1    28.4   36.8     76.7   88.8
>    isNogud    Main     221   6488666     4.2    4.1      4.2    4.1
>   toLower     Main     219         1    44.2   47.9     44.2   47.9
>
> i.e. it seems that "splitWith" and "toLower" (!) are the culprits. Why,
> I have no idea.
>
> Am I reading this wrong?
>
No, you're just compiling it wrong :)
If I profile without optimising, I get
Sun Jan 24 11:37 2010 Time and Allocation Profiling Report (Final)
pspellingBS0 +RTS -P -RTS becuase
total time = 16.46 secs (823 ticks @ 20 ms)
total alloc = 4,088,410,184 bytes (excludes profiling overheads)
COST CENTRE   MODULE   %time  %alloc   ticks      bytes
toLower       Main      36.3    45.9     299  468806205
splitWith     Main      30.6    33.9     252  346657565
train         Main      25.8    13.5     212  138466403
isNogud       Main       4.3     3.8      35   38931996
filter        Main       2.1     2.7      17   27466769
which is compatible with your results.
With -O2:
Sun Jan 24 11:33 2010 Time and Allocation Profiling Report (Final)
pspellingBS +RTS -P -RTS becuase
total time = 5.66 secs (283 ticks @ 20 ms)
total alloc = 708,686,372 bytes (excludes profiling overheads)
COST CENTRE   MODULE   %time  %alloc   ticks      bytes
updateMap     Main      68.6    76.9     194  136314135
toLower       Main      16.6     0.9      47    1622182
splitWith     Main       6.7    14.3      19   25366835
train         Main       3.5     2.5      10    4362826
filter        Main       2.1     4.4       6    7736998
splitWords    Main       1.8     0.0       5          0
which gives a completely different picture.
Still, toLower takes a significant part of the time; we can drastically
reduce that by exploiting the fact that we don't need to handle the full
Unicode complexity:
Sun Jan 24 11:53 2010 Time and Allocation Profiling Report (Final)
pspellingBSW +RTS -P -RTS becuase
total time = 4.72 secs (236 ticks @ 20 ms)
total alloc = 708,686,372 bytes (excludes profiling overheads)
COST CENTRE   MODULE   %time  %alloc   ticks      bytes
updateMap     Main      76.3    76.9     180  136314135
splitWith     Main       9.7    14.3      23   25366835
filter        Main       5.1     4.4      12    7736998
train         Main       3.0     2.5       7    4362826
splitWords    Main       2.1     0.0       5          0
isNogud       Main       2.1     0.0       5          0
mkLow         Main       1.7     0.9       4    1622182
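The gain comes from replacing Data.Char.toLower on Chars with byte-level
lowercasing; the relevant lines, excerpted from the full program below:

mkLow :: Word8 -> Word8
mkLow x = x .|. 32   -- set bit 5: 'A'..'Z' (65..90) become 'a'..'z' (97..122)

Anything mkLow maps outside 'a'..'z' is thrown away afterwards by the
isNogud filter, so non-letter bytes need no careful case handling.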
And, without profiling:
$ time ./spellingBSW becuase
becuase -> because
2.84user 0.03system 0:02.88elapsed 99%CPU
Finally, building the set of two-step edits takes longer than the
additional lookups:
$ time ./spellingBSW becuase
becuase -> because
2.84user 0.03system 0:02.88elapsed 99%CPU
$ time ./spellingBSW korrekt
korrekt -> correct
3.50user 0.02system 0:03.52elapsed 100%CPU
vs.
$ time ./spellingBSWL becuase
becuase -> because
2.79user 0.04system 0:02.83elapsed 100%CPU
$ time ./spellingBSWL3 korrekt
korrekt -> correct
3.20user 0.02system 0:03.23elapsed 99%CPU
That is achieved with the following code:
----------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString as BS
import Data.Bits ((.|.))
import Data.Word (Word8)
import Data.Map (Map, findWithDefault, insertWith', member)
import qualified Data.Map as Map (empty)
import Data.Set (toList, fromList)
import Data.List (inits, tails, foldl')
import System.Environment (getArgs)

dataFile :: FilePath
dataFile = "big.txt"

alphabet :: String
alphabet = "abcdefghijklmnopqrstuvwxyz"
splitWords :: B.ByteString -> [B.ByteString]
splitWords = {-# SCC "filter" #-} filter (not . BS.null)
           . {-# SCC "splitWith" #-} BS.splitWith isNogud
           . {-# SCC "mkLow" #-} BS.map mkLow
-- ASCII-only lowercasing: setting bit 5 maps 'A'..'Z' (65..90) to
-- 'a'..'z' (97..122); whatever else it produces is removed by isNogud
mkLow :: Word8 -> Word8
mkLow x = x .|. 32

-- treat every byte that is not a lowercase ASCII letter as a separator
isNogud :: Word8 -> Bool
isNogud c = c < 97 || 122 < c
train :: [B.ByteString] -> Map B.ByteString Int
train = foldl' updateMap Map.empty
  where
    updateMap model word = {-# SCC "updateMap" #-} insertWith' (+) word 1 model

-- ($!) makes sure the trained map is evaluated before nwords returns,
-- so it is built here rather than at the first lookup
nwords :: IO (Map B.ByteString Int)
nwords = (return $!) . train . splitWords =<< B.readFile dataFile
edits1 :: String -> [String]
edits1 s = deletes ++ transposes ++ replaces ++ inserts
  where
    deletes    = [a ++ bs         | (a, _:bs)     <- splits]
    transposes = [a ++ (b2:b1:bs) | (a, b1:b2:bs) <- splits]
    replaces   = [a ++ (c:bs)     | (a, _:bs)     <- splits, c <- alphabet]
    inserts    = [a ++ (c:b)      | (a, b)        <- splits, c <- alphabet]
    splits     = zip (inits s) (tails s)
correct :: Map B.ByteString Int -> String -> String
correct wordCounts word = B.unpack $ maxCount (B.pack "?") 0 candidates
  where
    candidates :: [B.ByteString]
    candidates = known [word] `or` ((known e1) `or` known_edits2)

    e1 :: [String]
    e1 = toList . fromList $ edits1 word

    known_edits2 :: [B.ByteString]
    known_edits2 = [w3 | w1 <- e1, w2 <- edits1 w1
                       , let w3 = B.pack w2, w3 `member` wordCounts]

    known :: [String] -> [B.ByteString]
    known ws = {-# SCC "known" #-} [w | w <- map B.pack ws, w `member` wordCounts]

    maxCount :: B.ByteString -> Int -> [B.ByteString] -> B.ByteString
    maxCount best cmax (word:more)
      | cmax < count = maxCount word count more
      | otherwise    = maxCount best cmax more
      where
        count = findWithDefault 1 word wordCounts
    maxCount best _ _ = best

    or :: [B.ByteString] -> [B.ByteString] -> [B.ByteString]
    or a b | null a    = b
           | otherwise = a
main :: IO ()
main = do
    args <- getArgs
    wordCounts <- nwords
    mapM_ (printCorrect wordCounts) args
  where
    printCorrect :: Map B.ByteString Int -> String -> IO ()
    printCorrect wordCounts word =
        putStrLn $ word ++ " -> " ++ correct wordCounts word
----------------------------------------------------------------------
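For reference, the program is built and used like the versions timed above
(big.txt is Norvig's training corpus, expected in the working directory;
the source file name is of course up to you):

$ ghc -O2 --make spellingBSW.hs -o spellingBSW
$ ./spellingBSW becuase korrekt
becuase -> because
korrekt -> correct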
> > But once you start needing two edits (try korrekt), correct and edits1
> > start to show up. That shows that Norvig's algorithm isn't really
> > good. With two edit steps, you create a _lot_ of strings you need to
> > look up, far more than there are in the map. That takes time. It'll be
> > better to scan the map for entries with an edit distance (< 3) if you
> > have a good method to check that
Indeed:
$ time ./nLDBSWSpelling becuase
becuase -> because
2.84user 0.02system 0:02.86elapsed 100%CPU
$ time ./nLDBSWSpelling korrekt
korrekt -> correct
2.83user 0.05system 0:02.88elapsed 100%CPU
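For scale: for a word of length n, edits1 yields n deletes, n-1
transposes, 26n replaces and 26(n+1) inserts, i.e. 54n+25 strings. For
"korrekt" (n = 7) that is 403 candidates after one step, and on the order
of 400 * 430 ~ 170,000 after two (before deduplication), nearly all of
which then have to be packed and looked up in the map.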
> > (http://old.nabble.com/haskell-in-online-contests-td26546989.html
> > contains pointers for that).
>
> Will have a look at that: it looks like it'll be very informative.
>
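To make the idea concrete, here is a minimal sketch of such a bounded
check (my illustration only, not the code behind nLDBSWSpelling; the
thread linked above has pointers to better methods, e.g. aborting the DP
as soon as a whole row exceeds the bound):

-- True iff the Levenshtein distance between s1 and s2 is < 3;
-- the length test prunes cheaply before doing any real work
withinTwo :: String -> String -> Bool
withinTwo s1 s2 = abs (length s1 - length s2) < 3 && levenshtein s1 s2 < 3

-- textbook one-row-at-a-time Levenshtein dynamic programming
levenshtein :: String -> String -> Int
levenshtein s1 s2 = last (foldl transform [0 .. length s1] s2)
  where
    transform row@(n:ns) c = scanl calc (n + 1) (zip3 s1 row ns)
      where
        calc z (c1, x, y) = minimum [y + 1, z + 1, x + fromEnum (c1 /= c)]
    transform [] _ = []  -- unreachable: the row is never empty

Scanning the map would then be something like
[w | w <- Data.Map.keys wordCounts, withinTwo word (B.unpack w)].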
> > Another thing is
> >
> > allWords = keysSet wordCounts
> >
> > Ouch. For each correction, you construct that set anew. Just use
> > member from Data.Map instead of Data.Set.member and look up the words
> > in the map.
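In code, the difference is roughly this (a sketch; the helper names are
mine, the rest matches the posted program):

import qualified Data.Map as Map
import qualified Data.Set as Set

-- before: a whole Set of keys is rebuilt for every correction
knownViaSet :: Map.Map String Int -> [String] -> [String]
knownViaSet wordCounts ws = [w | w <- ws, w `Set.member` allWords]
  where allWords = Map.keysSet wordCounts

-- after: one lookup per candidate, no intermediate Set at all
knownViaMap :: Map.Map String Int -> [String] -> [String]
knownViaMap wordCounts ws = [w | w <- ws, w `Map.member` wordCounts]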
>
> Whoops! Just to be clear though: Haskell will memoise the result of
> "allWords" for a given invocation of "correct"?
Yes. But not across different invocations.
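A tiny standalone demonstration of that sharing behaviour, using
Debug.Trace:

import Debug.Trace (trace)

f :: Int -> Int
f x = shared + shared   -- used twice, but computed only once per call of f
  where
    -- depends on x, just as allWords depends on wordCounts
    shared = trace "building" (sum [1 .. x])

main :: IO ()
main = print (f 1000000) >> print (f 1000001)
-- "building" appears twice, once per call of f; within a single call
-- the two uses of 'shared' share one result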
> So this would only make a large difference for multiple corrections?
Right. But that's the interesting use case, isn't it?
> (which I wasn't worrying about for the moment).
> The change seems to wipe off about 0.2s on average.
>
Which is pretty bad for multiple corrections.
> >> I will look at using Bloom filters or
> >> tries instead of Data.Map, but I wonder if readFile should be taking
> >> nearly 30% of the run time, even for a 6MB file?
> >
> > No way. But it doesn't seem to, from my GHC's point of view.
>
> Just to be sure I wasn't using the SCC incorrectly, I split out the
> readFile call into "myReadFile". The prof line comes out as:
>
> myReadFile   Main   210   1   35.8   8.6   35.8   8.6
>
> i.e. 35.8% of runtime.
>
Can I see the exact code which gives that profile?
Somehow, costs that shouldn't be there must be getting attributed to
readFile.
> >> Sorry to dump such a long post on the list — I’ll understand if no
> >> one can be bothered rehashing this. But, in summary I’d like to know:
> >>
> >> (a) how could I use ByteStrings for this to speed up I/O and reduce
> >> memory usage without losing the nice readability?
>
> And thanks for the other small tips (e.g. findWithDefault), and I didn't
> know you could use let the way you did either.
>
> Cheers,
>
> Matthew.