[Haskell-cafe] Hexpat: Lazy I/O problem with huge input files

Aleksandar Dimitrov aleks.dimitrov at googlemail.com
Wed Oct 13 17:06:04 EDT 2010


Hello Haskell Cafe,

I really hope this is the right list for this sort of question. I've
bugged the folks in #haskell and they said to ask here, so I'm turning
to you.

I want to use Hexpat to read in some humongous XML files (linguistic
corpora), since it's the only Haskell XML library I could find that
takes ByteStrings as input. I stumbled on a problem when using one of
the examples from the docs of Text.XML.Expat.Tree. The "cookbook
recipe" there suggests *first* processing the data, and only then
checking the parser's error value to see whether the parse failed. As
I understand it, this should keep the parse tree from being fully
evaluated before use. Unfortunately, that is not what happens on my
system (GHC 6.12.1, if that matters).

This is the code from the docs, which I modified to read from a file:

> import Text.XML.Expat.Tree
> import System.Environment (getArgs)
> import Control.Monad (liftM)
> import qualified Data.ByteString.Lazy as C
>
> -- This is the recommended way to handle errors in lazy parses
> main = do
>     f <- liftM head getArgs >>= C.readFile
>     let (tree, mError) = parse defaultParseOptions f
>     print (tree :: UNode String)
>
>     -- Note: We check the error _after_ we have finished our processing
>     -- on the tree.
>     case mError of
>         Just err -> putStrLn $ "It failed : "++show err
>         Nothing -> putStrLn "Success!"

Given a 42 MB test file, an invocation like this:

% ghc --make -O2 Hexpat.hs
% ./Hexpat input.xml > dump.xml

will gobble up at least 2 GB of RAM. (I usually kill it before it
starts thrashing the swap space, since that almost crashes my entire
machine.) If I remove the error check at the end:

> import Text.XML.Expat.Tree
> import System.Environment (getArgs)
> import Control.Monad (liftM)
> import qualified Data.ByteString.Lazy as C
>
> main = do
>     f <- liftM head getArgs >>= C.readFile
>     let (tree, mError) = parse defaultParseOptions f
>     print (tree :: UNode String)

the same invocation and input file barely uses a megabyte or two of
RAM and finishes really quickly.

Why is that? Is this a mistake in the Hexpat docs, or am I doing
something wrong? Lazy IO has always been a little bit of a mystery to
me, and just when I thought I had it...
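
To try to take Hexpat out of the picture, the same shape can be
sketched with nothing but base. Note that fakeParse below is a made-up
stand-in, not Hexpat's API, and I am only assuming that the real
parser's error value hangs on to the tree in a similar way. The idea
is that a pair of (lazy result, error known only at the end) may keep
the whole result alive for as long as the error is still waiting to be
inspected:

```haskell
import Data.List (foldl')

-- Hypothetical stand-in for a lazy parser (NOT Hexpat's API): yields a
-- lazily generated list of "nodes" and an error that is only known once
-- the whole input has been looked at.
fakeParse :: Int -> ([Int], Maybe String)
fakeParse n = (nodes, err)
  where
    nodes = [1 .. n]
    -- The error thunk refers to `nodes`, so while `err` is unevaluated
    -- and still reachable, the head of `nodes` is reachable as well.
    err | any (< 0) nodes = Just "negative node"
        | otherwise       = Nothing

-- Consume the nodes first and return the error for inspection
-- afterwards, as in the cookbook recipe.
processThenCheck :: Int -> (Int, Maybe String)
processThenCheck n =
  let (nodes, err) = fakeParse n
      total        = foldl' (+) 0 nodes
  in total `seq` (total, err)

-- Consume the nodes and never look at the error.
processOnly :: Int -> Int
processOnly n = foldl' (+) 0 (fst (fakeParse n))

main :: IO ()
main = do
  print (processThenCheck 1000000)
  print (processOnly 1000000)
```

Running each variant on its own with +RTS -s should show whether the
first one really holds on to the list while the second does not. I
would expect that, but the optimiser may change the picture, and it
says nothing definite about what Hexpat does internally.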

Thanks for any help on the matter!
Aleks

