[Haskell-cafe] Accumulating related XML nodes using HXT
Daniel McAllansmith
dm.maillists at gmail.com
Tue Oct 31 17:32:35 EST 2006
On Wednesday 01 November 2006 10:57, Albert Lai wrote:
> Daniel McAllansmith <dm.maillists at gmail.com> writes:
> > Hello.
> >
> > I have some html from which I want to extract records.
> > Each record is represented within a number of <tr> nodes, and all records'
> > <tr> nodes are contained by the same parent node.
>
> This is very poorly written HTML. The original structure of the data
> is destroyed - the parse tree no longer reflects the data structure.
> (If a record is to be displayed in several rows, there are proper
> ways.) It is syntactically incorrect: nested <tr>, and color in <hr>.
> (Just ask http://validator.w3.org/ .)
Indeed. The original is even worse, with overlapping nodes and other such
treasures that make navigation in HXT tricky at times.
> I trust that you are parsing
> this because you realize it is all wrong and you want to
> programmatically convert it to proper markup.
Yep! I sure wouldn't be doing this if I had control of the original HTML.
>
> Since the file is unstructured, I choose not to sweat over restoring
> the structure in an HXT arrow. The HXT arrow will return a flat list,
> just as the file is a flat ensemble.
I was about to write a follow-up just as your mail came in... I've ended up
with the same solution as you've kindly suggested.
Another option I came across is Control.Arrow.ArrowTree.changeChildren which
could be used to restore a more normalised structure ready for more
processing.
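For what it's worth, a minimal sketch of the changeChildren idea might look
like the following. It assumes, purely for illustration, that the parent node
is a <table>, that every record spans exactly two consecutive <tr> children,
and that the regrouped rows get wrapped in a made-up <record> element; none of
those details come from the real page. The names regroup, restructure and demo
are hypothetical as well.

```haskell
import Text.XML.HXT.Core
import qualified Text.XML.HXT.DOM.XmlNode as XN

-- Pure function over the child list of the parent node.
-- Assumption: a record is exactly two consecutive <tr> rows.
-- Anything that is not a <tr> (e.g. whitespace text) is dropped.
regroup :: XmlTrees -> XmlTrees
regroup = map wrap . chunk 2 . filter isTr
  where
    isTr t = XN.getLocalPart t == Just "tr"
    -- <record> is an invented wrapper element, no attributes.
    wrap rows = XN.mkElement (mkName "record") [] rows
    chunk _ [] = []
    chunk n xs = let (h, r) = splitAt n xs in h : chunk n r

-- Apply the regrouping to the children of every <table> in the tree.
restructure :: ArrowXml a => a XmlTree XmlTree
restructure = processTopDown (changeChildren regroup `when` hasName "table")

-- Tiny in-memory example, parsed with xread instead of reading a file.
demo :: [String]
demo = runLA (xshow (xread >>> restructure))
         "<table><tr><td>a</td></tr><tr><td>b</td></tr><tr><td>c</td></tr><tr><td>d</td></tr></table>"
```

With the rows regrouped like this, later arrows can treat each <record> as one
unit instead of counting <tr> siblings. A real page would need a smarter
boundary test than "every two rows", but that only changes regroup.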
Thanks
Daniel