[Haskell-cafe] Accumulating related XML nodes using HXT
Albert Lai
trebla at vex.net
Tue Oct 31 16:57:05 EST 2006
Daniel McAllansmith <dm.maillists at gmail.com> writes:
> Hello.
>
> I have some html from which I want to extract records.
> Each record is represented within a number of <tr> nodes, and all records <tr>
> nodes are contained by the same parent node.
This is very poorly written HTML. The original structure of the data
is destroyed: the parse tree no longer reflects the data structure.
(If a record is to be displayed in several rows, there are proper
ways to mark that up.) It is also syntactically incorrect: nested
<tr>, and color in <hr>.
(Just ask http://validator.w3.org/ .) I trust that you are parsing
this because you realize it is all wrong and you want to
programmatically convert it to proper markup.
Since the file is unstructured, I choose not to sweat over restoring
the structure in an HXT arrow. The HXT arrow will return a flat list,
just as the file is a flat ensemble. The list looks like:
["/prod17", "Television", " (code: 17)", "A very nice telly.",
"/prod24", "Cyclotron", " (code: 24)", "Mind your fillings."]
I then use a pure function to decompose this list four items at a time
and emit the desired records. This is trivial outside HXT arrows. I
use tuples, and every field is a string; you can easily change the
code to produce Prod values, turn " (code: 17)" into the number 17, etc.
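As a rough sketch of that last remark, the four-at-a-time decomposition could build typed records instead of tuples. The Prod type and its field names below are assumptions made for illustration (the actual definition from the original question is not shown here), and parseCode and toProds are hypothetical helper names.

```haskell
import Data.Char (isDigit)
import Text.Read (readMaybe)

-- Hypothetical record type; the field names are assumptions, not taken
-- from the original question.
data Prod = Prod
  { prodUrl  :: String
  , prodName :: String
  , prodCode :: Maybe Int
  , prodDesc :: String
  } deriving (Show, Eq)

-- Pull the first run of digits out of a string such as " (code: 17)".
-- Yields Nothing when the string contains no digits.
parseCode :: String -> Maybe Int
parseCode str = readMaybe (takeWhile isDigit (dropWhile (not . isDigit) str))

-- Consume the flat list four items at a time. A trailing group of
-- fewer than four items is silently dropped, as in the tuple version.
toProds :: [String] -> [Prod]
toProds (url : name : code : desc : rest) =
  Prod url name (parseCode code) desc : toProds rest
toProds _ = []
```

Applied to the flat list shown earlier, toProds gives two records whose prodCode fields are Just 17 and Just 24. Keeping the code as a Maybe Int is one possible choice; it avoids a crash if a row carries no number.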
Here is a complete, validated HTML 4 file containing the table, just
so that my program below actually has valid input.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<title>Products</title>
</head>
<body>
<table>
<tr>
<td><strong>Product:</strong></td>
<td><strong><a href="/prod17">Television</a></strong> (code: 17)</td>
</tr>
<tr>
<td><strong>Description:</strong></td>
<td>A very nice telly.</td>
</tr>
<tr>
<td><hr></td>
</tr>
<tr>
<td><strong>Product:</strong></td>
<td><strong><a href="/prod24">Cyclotron</a></strong> (code: 24)</td>
</tr>
<tr>
<td><strong>Description:</strong></td>
<td>Mind your fillings.</td>
</tr>
<tr>
<td><hr></td>
</tr>
</table>
</body>
</html>
Here is my program:
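import Text.XML.HXT.Arrow

main =
  do { unstructured <- runX (p "table.html")
     ; let structured = s unstructured
     ; print structured
     }

-- Parse the file as HTML, walk table > tr > td, and collect the
-- interesting strings from each cell in document order.
p filename =
  readDocument [(a_parse_html,"1")] filename >>>
  deep (isElem >>> hasName "table") >>>
  getChildren >>> isElem >>> hasName "tr" >>>
  getChildren >>> isElem >>> hasName "td" >>>
  getChildren >>>
  p1 <+> p2

-- <strong><a href="...">text</a></strong>: emit the href, then the link text.
p1 =
  isElem >>> hasName "strong" >>>
  getChildren >>> isElem >>> hasName "a" >>>
  getAttrValue "href" <+> (getChildren >>> getText)

-- A bare text node: emit its text.
p2 =
  getText

-- Group the flat list into 4-tuples; leftover items are dropped.
s (a:b:c:d: rest) = (a,b,c,d) : s rest
s _ = []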
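To see why p yields a flat list in document order, it may help to model the arrows as plain list-returning functions. The sketch below does not use HXT at all: Node, children, named, attr, text and cell are made-up names, and Kleisli [] from Control.Arrow stands in for HXT's arrow type, on the assumption that the two behave alike for this purpose.

```haskell
import Control.Arrow (Kleisli (..), (>>>), (<+>))

-- A toy stand-in for an XML tree; this is not HXT's XmlTree.
data Node
  = Elem String [(String, String)] [Node]  -- name, attributes, children
  | Text String

-- An arrow that maps one input to zero or more outputs.
type Arr a b = Kleisli [] a b

-- All direct children of an element; a text node has none.
children :: Arr Node Node
children = Kleisli f
  where
    f (Elem _ _ cs) = cs
    f (Text _)      = []

-- Pass an element through only if it has the given name.
named :: String -> Arr Node Node
named n = Kleisli f
  where
    f e@(Elem m _ _) | m == n = [e]
    f _                       = []

-- The value of the named attribute; no result if it is absent.
attr :: String -> Arr Node String
attr k = Kleisli f
  where
    f (Elem _ attrs _) = [v | (k', v) <- attrs, k' == k]
    f _                = []

-- The content of a text node; no result for an element.
text :: Arr Node String
text = Kleisli f
  where
    f (Text t) = [t]
    f _        = []

-- Same shape as "getChildren >>> p1 <+> p2" in the program above.
cell :: Arr Node String
cell = children >>> (linkParts <+> text)
  where
    linkParts =
      named "strong" >>> children >>> named "a" >>>
      (attr "href" <+> (children >>> text))

-- Models <td><strong><a href="/prod17">Television</a></strong> (code: 17)</td>
sampleCell :: Node
sampleCell =
  Elem "td" []
    [ Elem "strong" [] [Elem "a" [("href", "/prod17")] [Text "Television"]]
    , Text " (code: 17)"
    ]
```

Here runKleisli cell sampleCell evaluates to ["/prod17","Television"," (code: 17)"]. Each child of the cell is tried against both branches of <+>, a branch that does not match contributes nothing, and the surviving results are concatenated in order.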