Parsing HTML

Wed Dec 10 16:07:38 EST 2003

Thanks - you should have replied to the list, because I think I did your
package a disservice.  I've just been looking at the Haskell XML Toolbox,
and comparing the two, and now that I understand a little more it seems
like either will be fine for me.

In fact I will copy this to the list, hope that's OK, because maybe
someone will find this info via Google one day and find it useful.

Cheers,
Andrew

Malcolm Wallace said:
> "andrew cooke" <andrew at acooke.org> writes:
>
>> - HaXml looks like it might do what I want, but
>> (1) seems tricky to install (needs "make", which isn't that cool for
>> Windows);
>
> Until the general Haskell Library Infrastructure project is
> sufficiently mature, I'm afraid 'make' is going to be pretty
> de rigueur for any build-from-source library.
>
> Having said that, in the case of HaXml I reckon it would be pretty
> straightforward to dispense with 'make' and issue a couple of 'ghc
> --make' commands by hand.  Especially since you seem only to want a
> few of HaXml's facilities, not the complete set.
>
> Another alternative is simply to copy the small number of modules
> you need into your local build tree, and ignore the standard package
> mechanism altogether.
>
>> (2) has a load of fancy-schmancy combinator stuff, when all I want is a
>> stream of tokens (something like the Java SAX interface);
>
> If you really want only a stream of tokens, have a look at
> Text.XML.HaXml.Lex.  For an error-correcting parse into a generic
> tree-like XML data structure, use Text.XML.HaXml.Html.Parse.  You don't
> need the Combinators, Haskell2Xml, Xml2Haskell stuff at all.
>
>> (3) doesn't seem that solid on the basics
>> (doesn't seem to handle namespaces (maybe they appear as part
>> of the attribute name?)
>
> Namespaces are transparent, in the sense that the namespace is part
> of the element or attribute name, but there is no further automatic
> processing of it.  So basically HaXml doesn't do anything fancy with
> namespaces, but it doesn't crash, or discard them either.
>
>> (and I haven't yet worked out what it does about
>> other "esoteric" things like character entities, XML declarations,
>> CDATA, comments, etc)).
>
> All of these are stored in the 'generic' XML data structure
> representation, so you can use them or discard them as you wish.
>
>     data Element   = Elem Name [Attribute] [Content]
>     type Attribute = (Name, AttValue)
>     data Content   = CElem Element
>                    | CString Bool CharData -- Bool: whether whitespace is significant
>                    | CRef Reference        -- character and entity references
>                    | CMisc Misc            -- comments, processing instructions, etc.
>
>
>> (No offense implied - it's a cool piece of work, just
>> doesn't seem to be what I'm looking for;
>
> None taken.  I'm sure it looks complicated from the outside, but
> really it is just a collection of individual pieces that can be
> mixed and matched to suit the needs of any particular application.
>
>> I'd write it myself, but (X)HTML is deceptively complex, ...
>>                    HTML isn't XML,
>
> HaXml's special error-correcting HTML parser deals with most of this
> stuff, for instance self-closing tags (IMG), implicitly closed tags (P),
> improperly nested tags, and so on.
>
>>       typical malformed pages (unescaped "<" in text; unescaped data in
>> URLs inside links (eg "&"), etc)
>
> These two examples of error situations might be beyond the current
> capability of the error-correcting parser, but I haven't checked in
> a long while.
>
> So in summary, I think HaXml will get you a long way towards your
> goal, but you will probably want to be selective about what you use,
> and there may be extra things you need to code for yourself on top.
>
> Regards,
>     Malcolm
>
>
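For anyone finding this later: here is a minimal sketch of how the generic
tree Malcolm quotes above could be flattened into the SAX-like token stream I
was originally after.  The types below are simplified stand-ins that I have
redefined locally so the example is self-contained; they are not HaXml's real
definitions (in particular AttValue, Reference and Misc are richer there).
Token, tokens, splitQName and example are hypothetical names of my own, not
part of any library.

```haskell
-- Simplified stand-ins for the types quoted in the message above.
-- Assumption: these are NOT HaXml's real definitions, only an illustration.
type Name      = String
type CharData  = String
type AttValue  = String            -- assumption: the real AttValue is richer
type Attribute = (Name, AttValue)

data Element = Elem Name [Attribute] [Content]
  deriving (Eq, Show)

data Content
  = CElem Element
  | CString Bool CharData          -- Bool: whether whitespace is significant
  | CRef String                    -- assumption: a reference kept as its name
  | CMisc String                   -- assumption: comment/PI kept as raw text
  deriving (Eq, Show)

-- Hypothetical SAX-like events.
data Token
  = Open Name [Attribute]
  | Close Name
  | Text CharData
  | Ref String
  deriving (Eq, Show)

-- Depth-first walk that turns the tree into a (lazy) list of events.
-- Comments and processing instructions are simply dropped here.
tokens :: Element -> [Token]
tokens (Elem name attrs children) =
  Open name attrs : concatMap content children ++ [Close name]
  where
    content (CElem e)     = tokens e
    content (CString _ s) = [Text s]
    content (CRef r)      = [Ref r]
    content (CMisc _)     = []

-- Namespaces are left inside the name, so splitting off the prefix is
-- up to the caller.  Only the first ':' is treated as the separator.
splitQName :: Name -> (Maybe String, String)
splitQName n = case break (== ':') n of
  (prefix, ':' : local) -> (Just prefix, local)
  _                     -> (Nothing, n)

-- A tiny hand-built document to try the functions on.
example :: Element
example =
  Elem "p" [("xmlns:x", "urn:example")]
    [ CString False "a "
    , CRef "amp"
    , CElem (Elem "x:b" [] [])
    , CMisc "note"
    ]
```

In GHCi, `tokens example` gives the open/close/text events in document order,
and `map splitQName ["p", "x:b"]` separates the namespace prefix where there
is one.  Walking the tree this way keeps the parser's error correction while
still handing the application a flat stream.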

-- 
personal web site: http://www.acooke.org/andrew
personal mail list: http://www.acooke.org/andrew/compute.html