[Haskell-cafe] Reading files efficiently

Donald Bruce Stewart dons at cse.unsw.edu.au
Sun Mar 19 09:53:50 EST 2006


dons:
> 1:
> > I've got another n00b question, thanks for all the help you have been 
> > giving me!
> > 
> > I want to read a text file.  As an example, let's use 
> > /usr/share/dict/words and try to print out the last line of the file. 
> > First of all I came up with this program:
> > 
> > import System.IO
> > main = readFile "/usr/share/dict/words" >>= putStrLn.last.lines
> > 
> > This program gives the following error, presumably because there is an 
> > ISO-8859-1 character in the dictionary:
> > "Program error: <handle>: IO.getContents: protocol error (invalid 
> > character encoding)"
> > 
> > How can I tell the Haskell system that it is to read ISO-8859-1 text 
> > rather than UTF-8?
> > 
> > I now used iconv to convert the file to UTF-8 and tried again.  This 
> > time it worked, but it seems horribly inefficient -- Hugs took 2.8 
> > seconds to read a 96,000 line file.  By contrast the equivalent Python 
> > program:
> > 
> > print open("words", "r").readlines()[-1]
> > 
> > took 0.05 seconds.  I assume I must be doing something wrong here, and 
> > somehow causing Haskell to use a particularly inefficient algorithm. 
> > Can anyone give me any clues what I should be doing instead?
> 
> a) Compile your code with GHC instead of interpreting it. GHC is blazing fast.
> 
>     $ ghc -O A.hs
>     $ time ./a.out
>     Zyzzogeton
>     ./a.out  0.23s user 0.01s system 91% cpu 0.257 total
> 
> b) If not satisfied with the result, use packed strings (as Python does).
> 
> http://www.cse.unsw.edu.au/~dons/fps.html
> 
>     import qualified Data.FastPackedString as P
>     import IO
>     main = P.readFile "/usr/share/dict/words" >>= P.hPut stdout . last . P.lines
> 
>     $ ghc -O2 -package fps B.hs
>     $ time ./a.out
>     Zyzzogeton./a.out  0.04s user 0.02s system 86% cpu 0.063 total
> 
> 0.06s is ok with me  :)

Faster still: don't split the file into lines at all. Here we're following the
"How to optimise Haskell code by posting to haskell-cafe@" law:

import qualified Data.FastPackedString as P
import IO

-- P.init drops the final newline; spanEnd then splits at the last remaining one.
main = do P.readFile "/usr/share/dict/words" >>=
              P.hPut stdout . snd . P.spanEnd (/='\n') . P.init
          putChar '\n'

$ time ./a.out 
Zyzzogeton
./a.out  0.00s user 0.01s system 60% cpu 0.013 total
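
For anyone trying this with the later bytestring package, here is a rough
sketch of the same idea. It assumes Data.ByteString.Char8 as the successor to
Data.FastPackedString; the helper name lastLine and the in-memory sample are
made up for illustration. Unlike the one-liner above, it only drops the final
character when that character really is a newline, so an empty file or a file
without a trailing newline should be handled too.

```haskell
import qualified Data.ByteString.Char8 as B
import System.Environment (getArgs)

-- | Last line of the input, found by scanning back from the end
-- instead of building a list of every line.
-- (lastLine is a hypothetical helper name, not part of any library.)
lastLine :: B.ByteString -> B.ByteString
lastLine s = snd (B.spanEnd (/= '\n') body)
  where
    -- Drop one trailing newline, but only if there is one.
    body
      | not (B.null s) && B.last s == '\n' = B.init s
      | otherwise                          = s

main :: IO ()
main = do
  args <- getArgs
  contents <- case args of
    (path:_) -> B.readFile path
    -- No argument given: fall back to a made-up sample.
    []       -> return (B.pack "alpha\nbeta\ngamma\n")
  B.putStrLn (lastLine contents)
```

Run it with a path, e.g. ./a.out /usr/share/dict/words, to get the last line
of that file. Keeping lastLine pure means it can be checked without touching
the disk.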

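On the ISO-8859-1 part of the question quoted above: with a reasonably recent
GHC (System.IO has offered hSetEncoding and latin1 since base 4.2, as far as I
know), one option is to set the encoding on the handle before reading. This is
only a sketch for GHC, not for Hugs, and lastLineLatin1 is a made-up name.

```haskell
import System.Environment (getArgs)
import System.IO

-- | Read a file as ISO-8859-1 (Latin-1) text and return its last line.
-- (lastLineLatin1 is a hypothetical helper name.)
lastLineLatin1 :: FilePath -> IO String
lastLineLatin1 path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h latin1        -- decode as ISO-8859-1, not the locale default
    contents <- hGetContents h   -- lazy: nothing has been read yet
    let result = case lines contents of
                   [] -> ""
                   ls -> last ls
    -- Force the whole result before withFile closes the handle.
    length result `seq` return result

main :: IO ()
main = do
  args <- getArgs
  case args of
    (path:_) -> lastLineLatin1 path >>= putStrLn
    []       -> hPutStrLn stderr "usage: pass the file to read as an argument"
```

This goes through String, so expect it to be slower than the packed-string
versions. Those work on raw bytes and do no decoding, which is why they never
hit the encoding error in the first place.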

