Parsing HTML

Wed Dec 10 15:09:23 EST 2003

What are the options for parsing/lexing (X)HTML?  As far as I can see...

- the HTML library in GHC (or from Andy Gill) is for creating documents,
not parsing them

- HaXml looks like it might do what I want, but (1) it seems tricky to
install (it needs "make", which isn't that cool for Windows); (2) it has a
load of fancy-schmancy combinator stuff, when all I want is a stream of
tokens (something like the Java SAX interface); (3) it doesn't seem that
solid on the basics: it doesn't seem to handle namespaces (maybe they
appear as part of the attribute name?), and I haven't yet worked out what
it does about other "esoteric" things like character entities, XML
declarations, CDATA, comments, etc.  (No offense implied - it's a cool
piece of work, it just doesn't seem to be what I'm looking for; this is
all from reading the docs and API rather than looking at the code, so I
may be mistaken.)

- nothing else on the haskell.org page appears to do parsing.
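
For illustration, a token stream of the kind described in (2) above might
look roughly like this.  It is a minimal sketch, not taken from HaXml or
any other existing library: the names Token, tokenize, attributes,
attrValue and breakOn are all invented here.  It assumes no ">" inside
attribute values and ignores CDATA, entities, DOCTYPE and XML
declarations.  A "<" that doesn't start a tag is passed through as text
rather than rejected.

```haskell
import Data.Char (isAlpha, isAlphaNum, isSpace)
import Data.List (isPrefixOf)

-- Hypothetical token type, loosely modelled on SAX events.
data Token
  = TagOpen String [(String, String)]  -- element name, attributes
  | TagClose String
  | Text String
  | Comment String
  deriving (Eq, Show)

-- Turn a document into a flat list of tokens.  Adjacent Text tokens
-- are not merged.
tokenize :: String -> [Token]
tokenize [] = []
tokenize ('<':'!':'-':'-':rest) =
  let (body, rest') = breakOn "-->" rest
  in Comment body : tokenize rest'
tokenize ('<':'/':rest) =
  let (name, rest') = break (== '>') rest
  in TagClose (filter (not . isSpace) name) : tokenize (drop 1 rest')
tokenize ('<':rest@(c:_))
  | isAlpha c =
      let (inside, rest')  = break (== '>') rest
          (name, attrText) = break (\x -> isSpace x || x == '/') inside
      in TagOpen name (attributes attrText) : tokenize (drop 1 rest')
-- Anything else is text, including a stray '<' that opens no tag.
tokenize (c:rest) =
  let (more, rest') = break (== '<') rest
  in Text (c : more) : tokenize rest'

-- Parse name="value" pairs; a bare name gets an empty value.
attributes :: String -> [(String, String)]
attributes s =
  case dropWhile (not . isNameChar) s of
    [] -> []
    s' ->
      let (name, afterName) = span isNameChar s'
      in case dropWhile isSpace afterName of
           '=':v -> let (value, rest) = attrValue (dropWhile isSpace v)
                    in (name, value) : attributes rest
           rest  -> (name, "") : attributes rest
  where
    isNameChar x = isAlphaNum x || x `elem` ":-_"

-- Read a quoted value, or an unquoted one up to the next space.
attrValue :: String -> (String, String)
attrValue (q:rest)
  | q == '"' || q == '\'' =
      let (v, rest') = break (== q) rest in (v, drop 1 rest')
attrValue s = break isSpace s

-- Split at the first occurrence of a marker, dropping the marker.
breakOn :: String -> String -> (String, String)
breakOn marker = go
  where
    go [] = ([], [])
    go s@(c:cs)
      | marker `isPrefixOf` s = ([], drop (length marker) s)
      | otherwise             = let (a, b) = go cs in (c : a, b)
```

With these definitions, tokenize "<p class=\"x\">a < b</p>" gives
[TagOpen "p" [("class","x")], Text "a ", Text "< b", TagClose "p"].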

I'd write it myself, but (X)HTML is deceptively complex, in my experience.
You start off thinking it's going to be trivial (S-expressions), then you
realise that HTML isn't XML, then there are character entities, weird
CDATA things and namespaces, and finally you find that what you have isn't
robust enough to parse the typical malformed pages that browsers accept
(unescaped "<" in text; unescaped characters such as "&" in URLs inside
links; etc).
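
As a small example of the leniency involved, here is one way to decode
character entities while tolerating a bare "&".  Again this is only a
sketch under stated assumptions: the names entities and decodeLenient are
made up, only five named entities are known, and numeric references like
"&#38;" are not handled.  Anything that isn't a recognised entity is left
exactly as it was.

```haskell
import Data.Char (isAlphaNum)

-- A deliberately tiny entity table; real HTML defines many more.
entities :: [(String, Char)]
entities =
  [ ("amp", '&'), ("lt", '<'), ("gt", '>'), ("quot", '"'), ("apos", '\'') ]

-- Replace known "&name;" entities; keep any other '&' unchanged.
decodeLenient :: String -> String
decodeLenient [] = []
decodeLenient ('&':rest) =
  case span isAlphaNum rest of
    (name, ';':rest')
      | Just c <- lookup name entities -> c : decodeLenient rest'
    _ -> '&' : decodeLenient rest
decodeLenient (c:rest) = c : decodeLenient rest
```

So decodeLenient "a.cgi?x=1&y=2 &amp; &lt;b&gt;" gives
"a.cgi?x=1&y=2 & <b>": the unescaped "&" in the URL survives, while the
proper entities are decoded.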

Maybe that's why there doesn't seem to be anything?!

(I'm writing a simple tool that generates web pages from templates; the
tool data appears in attributes with a namespace (this is the standard
trick for mixing code generation with HTML in a way that web authoring
tools can parse).  Hence the mix of requirements for HTML and XML.)
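
Given attributes as (name, value) pairs, picking out the tool's own
attributes could be as simple as the sketch below.  The names splitQName
and partitionAttrs and the prefix "tpl" are hypothetical.  Note that it
matches on the literal prefix only; proper XML namespace handling would
resolve the prefix to a URI via the xmlns declarations in scope.

```haskell
import Data.List (partition)

-- Split "prefix:local" at the first colon; a name with no colon
-- gets an empty prefix.
splitQName :: String -> (String, String)
splitQName name =
  case break (== ':') name of
    (prefix, ':' : local) -> (prefix, local)
    (local, _)            -> ("", local)

-- Separate attributes carrying the tool's prefix from plain HTML ones.
partitionAttrs
  :: String
  -> [(String, String)]
  -> ([(String, String)], [(String, String)])
partitionAttrs toolPrefix =
  partition (\(name, _) -> fst (splitQName name) == toolPrefix)
```

For example, partitionAttrs "tpl" [("class","x"),("tpl:repeat","items")]
gives ([("tpl:repeat","items")], [("class","x")]).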

Cheers,
Andrew

-- 
personal web site: http://www.acooke.org/andrew
personal mail list: http://www.acooke.org/andrew/compute.html