[Haskell-cafe] Accumulating related XML nodes using HXT

Daniel McAllansmith dm.maillists at gmail.com
Wed Nov 1 16:53:05 EST 2006


Apologies if this is a duplicate, the original appears to have gone astray.

On Wednesday 01 November 2006 10:57, Albert Lai wrote:
> Daniel McAllansmith <dm.maillists at gmail.com> writes:
> > Hello.
> >
> > I have some html from which I want to extract records.
> > Each record is represented within a number of <tr> nodes, and all records'
> > <tr> nodes are contained by the same parent node.
>
> This is very poorly written HTML.  The original structure of the data
> is destroyed - the parse tree no longer reflects the data structure.
> (If a record is to be displayed in several rows, there are proper
> ways.)  It is syntactically incorrect: nested <tr>, and color in <hr>.
> (Just ask http://validator.w3.org/ .)  

Indeed.  The original is even worse, with overlapping nodes and other such
treasures which make navigation in HXT tricky at times.

> I trust that you are parsing 
> this because you realize it is all wrong and you want to
> programmatically convert it to proper markup.

Yep!  I sure wouldn't be doing this if I had control of the original HTML.

>
> Since the file is unstructured, I choose not to sweat over restoring
> the structure in an HXT arrow.  The HXT arrow will return a flat list,
> just as the file is a flat ensemble.

I was about to write a follow-up just as your mail came in... I've ended up 
with the same solution as you've kindly suggested.

Another option I came across is Control.Arrow.ArrowTree.changeChildren, which
could be used to restore a more normalised structure, ready for further
processing.
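
To make the regrouping idea concrete, here is a rough sketch of the kind of
list-to-list function one might hand to changeChildren.  It does not use HXT
itself: Node is a stand-in type rather than HXT's XmlTree, changeChildrenOf
only mimics the shape of changeChildren on that stand-in, and the rule that a
row containing a <th> starts a new record is purely an assumption for
illustration (the real HTML may mark record boundaries differently).

```haskell
module Main where

-- Stand-in for an XML tree: a tag name plus children.
-- Hypothetical type for illustration; this is NOT HXT's XmlTree.
data Node = Node String [Node]
  deriving (Eq, Show)

-- Split a list into runs, each run continuing until the next element
-- that satisfies p. The first element always opens a run, whether or
-- not it satisfies p.
groupFrom :: (a -> Bool) -> [a] -> [[a]]
groupFrom _ []       = []
groupFrom p (x : xs) = (x : run) : groupFrom p rest
  where
    (run, rest) = break p xs

-- Wrap each run of sibling nodes in a synthetic "record" parent.
regroup :: (Node -> Bool) -> [Node] -> [Node]
regroup startsRecord = map (Node "record") . groupFrom startsRecord

-- Apply a list-to-list function to a node's children. This mirrors the
-- general shape of changeChildren, but on the stand-in type above.
changeChildrenOf :: ([Node] -> [Node]) -> Node -> Node
changeChildrenOf f (Node tag kids) = Node tag (f kids)

-- Assumed boundary rule: a <tr> whose first child is a <th> starts a record.
isHeaderRow :: Node -> Bool
isHeaderRow (Node "tr" (Node "th" _ : _)) = True
isHeaderRow _                             = False

-- A flat table: two records, each spread over several sibling <tr> nodes.
flatTable :: Node
flatTable = Node "table"
  [ Node "tr" [Node "th" []]
  , Node "tr" [Node "td" []]
  , Node "tr" [Node "th" []]
  , Node "tr" [Node "td" []]
  , Node "tr" [Node "td" []]
  ]

main :: IO ()
main = print (changeChildrenOf (regroup isHeaderRow) flatTable)
```

With real HXT the same regroup-style function would have to work on XmlTree
values and decide record boundaries from whatever the actual markup offers.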


Thanks
Daniel

