[Haskell-cafe] Accumulating related XML nodes using HXT

Albert Lai trebla at vex.net
Tue Oct 31 16:57:05 EST 2006


Daniel McAllansmith <dm.maillists at gmail.com> writes:

> Hello.
> 
> I have some html from which I want to extract records.  
> Each record is represented within a number of <tr> nodes, and all records' <tr>
> nodes are contained by the same parent node.

This is very poorly written HTML.  The original structure of the data
is destroyed - the parse tree no longer reflects the data structure.
(If a record is to be displayed in several rows, there are proper
ways.)  It is syntactically incorrect: nested <tr>, and color in <hr>.
(Just ask http://validator.w3.org/ .)  I trust that you are parsing
this because you realize it is all wrong and you want to
programmatically convert it to proper markup.

Since the file is unstructured, I choose not to sweat over restoring
the structure in an HXT arrow.  The HXT arrow will return a flat list,
just as the file is a flat ensemble.  The list looks like:

["/prod17", "Television", " (code: 17)", "A very nice telly.",
 "/prod24", "Cyclotron", " (code: 24)", "Mind your fillings."]

I then use a pure function to decompose this list four items at a time
to emit the desired records.  This is trivial outside HXT arrows.  I
use tuples, and every field is a string; you can easily change the
code to produce Prod values, turn " (code: 17)" into the number 17, etc.
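
As a rough sketch of that change, assuming a hypothetical Prod record:
the field names, parseCode, and toProds are all made up here for
illustration and are not part of the program below.  It uses only base
(Data.Char and Text.Read) and takes the flat list directly.

```haskell
import Data.Char (isDigit)
import Text.Read (readMaybe)

-- Hypothetical record type; the field names are illustrative only.
data Prod = Prod
    { prodUrl   :: String
    , prodName  :: String
    , prodCode  :: Int
    , prodDescr :: String
    } deriving (Show, Eq)

-- Pull the first run of digits out of a string such as " (code: 17)".
-- Returns Nothing when the string contains no digits.
parseCode :: String -> Maybe Int
parseCode = readMaybe . takeWhile isDigit . dropWhile (not . isDigit)

-- Consume the flat list four items at a time.  A group whose code does
-- not parse is skipped; an incomplete trailing group ends the result.
toProds :: [String] -> [Prod]
toProds (url : name : code : descr : rest) =
    case parseCode code of
        Just n  -> Prod url name n descr : toProds rest
        Nothing -> toProds rest
toProds _ = []
```

Skipping a group with an unparseable code is only one possible policy;
returning Maybe [Prod] or an Either with an error message may suit you
better if a bad row should stop the whole conversion.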

Here is a complete, validated HTML 4 file containing the table, just
so that my program below actually has valid input.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<title>Products</title>
</head>
<body>

<table>
  <tr>
    <td><strong>Product:</strong></td>
    <td><strong><a href="/prod17">Television</a></strong> (code: 17)</td>
  </tr>
  <tr>
    <td><strong>Description:</strong></td>
    <td>A very nice telly.</td>
  </tr>

  <tr>
    <td><hr></td>
  </tr>

  <tr>
    <td><strong>Product:</strong></td>
    <td><strong><a href="/prod24">Cyclotron</a></strong> (code: 24)</td>
  </tr>
  <tr>
    <td><strong>Description:</strong></td>
    <td>Mind your fillings.</td>
  </tr>

  <tr>
    <td><hr></td>
  </tr>
</table>
</body>
</html>

Here is my program:

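import Text.XML.HXT.Arrow

-- Run the arrow to get the flat list of strings, then regroup it purely.
main :: IO ()
main =
    do { unstructured <- runX (p "table.html")
       ; let structured = s unstructured
       ; print structured
       }

-- Parse the file as HTML and visit every child node of every <td> that
-- sits in a <tr> directly under a <table>.  Since <+> binds tighter
-- than >>>, the last two lines read as getChildren >>> (p1 <+> p2).
p filename =
    readDocument [(a_parse_html,"1")] filename >>>
    deep (isElem >>> hasName "table") >>>
    getChildren >>> isElem >>> hasName "tr" >>>
    getChildren >>> isElem >>> hasName "td" >>>
    getChildren >>>
    p1 <+> p2

-- From <strong><a href="...">name</a></strong>, emit the href value and
-- then the link text.
p1 =
    isElem >>> hasName "strong" >>>
    getChildren >>> isElem >>> hasName "a" >>>
    getAttrValue "href" <+> (getChildren >>> getText)

-- From a text node directly inside a <td>, emit its text.
p2 =
    getText

-- Regroup the flat list four items at a time; an incomplete trailing
-- group is dropped.
s :: [a] -> [(a,a,a,a)]
s (a:b:c:d: rest) = (a,b,c,d) : s rest
s _ = []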
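
Note that s quietly drops a trailing group of fewer than four items,
which is what you get if some row does not match the expected shape.
Here is a sketch of a variant (groupsOf4 is a name invented here) that
also hands back the leftover items, so the caller can at least notice
that the count is off.  It cannot tell you which record was malformed.

```haskell
-- Like s, but also returns the items that did not fill a complete group.
groupsOf4 :: [a] -> ([(a, a, a, a)], [a])
groupsOf4 (a : b : c : d : rest) =
    let (groups, leftover) = groupsOf4 rest
    in  ((a, b, c, d) : groups, leftover)
groupsOf4 leftover = ([], leftover)
```

A non-empty second component means the flat list did not divide evenly
into records of four fields.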

