[Haskell-cafe] HXT is slow?

Wed Jan 12 16:26:08 CET 2011

Hi Patrick,

> Is it just me, or is HXT slow? I noticed that both reading a document
> from a file, as well as running computations, are exceedingly slow,
> with simple stuff like 'get the contents of everything with a given
> class' taking .3 seconds for a 400KB HTML file in Python using lxml
> and 2 seconds using HXT with tagSoup and compiled with -O2.

The tagsoup parser is currently the slowest parser in HXT.
The native one is about twice as fast, but there are still some
performance problems due to unwanted laziness.
We are working on this. Usually most of the runtime is spent in parsing,
because handling character input is expensive. Traversing a tree
and selecting some components is rather efficient compared to
parsing.

In the upcoming release there will be a binding to the expat parser via
hexpat. The head version is already available on github
( https://github.com/UweSchmidt/hxt ).

When you compare runtimes of various parsers, please take into account
what kind of functionality the parsers provide. If you want a
standards-conforming parser, and not just a parser that scans a few angle
brackets, you have to do a bit more than reading a few chars and checking
whether they are in a specific char range. These checks and
transformations are not for free.
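
To give a rough idea of one such per-character check, here is a minimal
sketch of the "Char" production from the XML 1.0 specification. It uses
only base and is not the code HXT uses; the names isXmlChar and
firstIllegalChar are made up for this illustration.

```haskell
import Data.Char (ord)
import Data.List (findIndex)

-- | XML 1.0 "Char" production: tab, LF, CR and three code point ranges.
-- Hypothetical helper for illustration, not taken from HXT.
isXmlChar :: Char -> Bool
isXmlChar c =
     n == 0x9 || n == 0xA || n == 0xD
  || (n >= 0x20    && n <= 0xD7FF)
  || (n >= 0xE000  && n <= 0xFFFD)
  || (n >= 0x10000 && n <= 0x10FFFF)
  where
    n = ord c

-- | Zero-based index of the first character that is not legal in XML,
-- or Nothing if every character passes the check.
firstIllegalChar :: String -> Maybe Int
firstIllegalChar = findIndex (not . isXmlChar)
```

A conforming parser has to apply a test like this to every input
character, in addition to decoding the input encoding, normalizing line
ends and substituting entities, so the cost adds up for a 400KB document.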

Cheers,

  Uwe