[Haskell-beginners] Fwd: Implementing a spellchecker - problem with Data.HashTable performance

Fri Apr 20 12:30:38 CEST 2012

W dniu 20 kwietnia 2012 11:33 użytkownik Radosław Szymczyszyn
<lavrin at gmail.com> napisał:
> Hello fellow Haskellers,
>
> It's my first post to this list, so at first: Greetings to everyone!
>
> I've wanted to pick up Haskell for some time and I've run in
> to a perfect (at least it seemed so) oocasion at my Natural Language
> Processing course. At first I tried using Python, but it turned out
> that some extra performance wouldn't hurt and I hoped Haskell would
> give me some boost without forcing me to sacrifice high-level
> expressiveness at the same time.
>
> The first lesson was to use ByteStrings instead of ordinary Strings
> for IO. Then I needed to build a dictionary of words found in the
> source text. That's the problematic part - I've tried using a few
> structures, namely Data.Map and Data.HashTable from base package and
> Data.HashMap from unordered-containers. In all cases I couldn't reach
> acceptable performance. As of that, I'd like to ask about some advice
> and comments about the following examples.
>
> First example [1] just builds a HashTable of ~1 500 000 ByteStrings.
> The program takes about 11 seconds to execute. Second example [2] does
> roughly the same, but taking the HashTable data from a file (just a
> dictionary of valid Polish forms). Reading data itself is not a
> problem, as I've measured the execution time of a program which just
> reads the file (and prints its last line to actually force the read,
> hence "handle" Haskell's laziness) and it takes no more than a second.
> Coupled together reading and building, however, takes unbearably long.
> I didn't wait till the program stopped and just terminated it, but it
> had been running for ~10 minutes already.
>
> The most disappointing fact is that a roughly equivalent (and rather
> effortless to write in contrast to the Haskell example - I'm just
> learning though) Python test which just reads everything into a dict
> takes about 10 seconds to execute (reading and updating the dict of
> course). Writing the spellchecker in Python seemed pretty
> straightforward for me (but its certainly not the only language I
> know; I've had some Java (who didn't nowadays), system level C and am
> daily working in Erlang). I understand of course, that the most
> convenient tool is generally the one you know the best, but my
> experience writing the Haskell equivalent reached both extremes -
> effortless expression of the logic, but frustration because of the
> unexplainable (for me at least) performance behaviour.
>
> The question is why is it so slow? (or AM I RILY DAT DUMB? ;)
>
> The link to examples:
> [1] https://gist.github.com/2411699 TestHashTableArtificial
> [2] https://gist.github.com/2411699 TestReadIntoHashTable
>
> I'm looking forward to somebody shedding some light on this. Thanks in advance!
>
> Best regards,
> Radek Szymczyszyn

Hi!

Did you try with -O2?

I've created file "formy.utf8" filled with 1400000 lines of word
"żyźniejszymi" and here is my output:

$ ghc --make -O2 TestReadIntoHashTable.hs -o test -rtsopts
$ time ./test +RTS -K100m -RTS
1400000

real	0m3.265s
user	0m3.097s
sys	0m0.143s

Best,
Karol Samborski