[Haskell] (HS)XML queries via SYB3-like generic transformations

oleg at pobox.com oleg at pobox.com
Fri Apr 21 18:29:36 EDT 2006


This message demonstrates SYB3-like generic term processing for
transformations and selections from (HS)XML-like documents. Our
running example is an HSXML document that includes Haskell sample
code, which is to be formatted specially. Moreover, we wish to
guarantee that the sample code is syntactically and
type-correct. Furthermore, we wish to evaluate expressions in the
context of the code -- and include the result in the rendered HTML
document. The present message is inspired by the example
posted by shelarcy at gmail.com on the Haskell list, Mar 25, 2006, and
benefited from discussions with shelarcy at gmail.com.

We show a quite non-trivial generic term processing: extracting
textual data from a particular subterm. The subterm in question is
identified by its type; as there may be several occurrences of
subterms of a given type, we further identify the desired subterm by
its content (i.e., a string label). Some of the extracted strings must
be concatenated, while the others must be kept distinct -- depending
on their occurrences within terms of a particular type. Thus our
extraction has both static aspects (i.e., type-directed) as well as
dynamic ones (i.e., guards).  We should point out that in HSXML, the
notions of the element type (in the sense of the XML Recommendation)
and of the Haskell data type coincide.

Our generic term processing is like SYB3 in spirit, but different in
implementation. In particular, we do not require a GHC extension of
recursive instances, although this point is moot here as HSXML so far
has had no recursive datatypes. Like SYB3, we do need overlapping
instances to specify generic term processing, which the user can later
override for particular types of terms. Our HSXML transformations are
like SXSLT, SXML pre-post-order transformations described in
	http://pobox.com/~oleg/ftp/Scheme/xml.html#SXSLT
only more modular.
The complete code discussed in the present message is in the file
sample-code.hs included in the (updated) archive
	http://pobox.com/~oleg/ftp/Haskell/HSXML.tar.gz

Here's our running example:

> test_haskell =
>     (document
>      (head
>       [meta_tag [description "Haskell"]])
>     (body
>      [h1 "Haskell examples"]
>      [h2 "The first sample code"]
>      [hcode "sample1"
>       "module Main where"
>       "fact n = product [1..n]"
>      ]
>      [hcode "sample2"
>       [[thide "module Main where"]]
>       [[thcode "fix f = f $" [[em "fix f"]]]]
>       "fact self n = if n <= 0 then 1 else n * (self (n-1))"
>      ]
>      [p "the result:" [[thrunh test_haskell "sample2"
>                         "show $ fix fact 5"]]]
>     ))


It includes two pieces of Haskell code, gabelled "sample1" and
"sample2". The sample code is comprised of lines, which may include
formatting annotations such as 'em' to emphasize certain parts of
code. Some other lines of code, like module declarations, can be
hidden in rendering for the sake of clarity.

We may wish to extract a particular piece of code from the HSXML
document, with formatting annotations removed. For example, we may wish to
type-check that code, and evaluate some expressions in its context. We
may want to do that while we render the HSXML document into HTML.
For instance, [[thrunh test_haskell "sample2" "show $ fix fact 5"]],
when rendered on a particular formatter, will extract "sample2",
evaluate the expression "show $ fix fact 5" in its context, and insert
the result in the rendered HTML document:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
 "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta name="description" content="Haskell"></head>

<body bgcolor="#FFFFFF">
<h1>Haskell examples</h1>

<h2>The first sample code</h2>

<pre>
&gt; module Main where
&gt; fact n = product [1..n]</pre>

<pre>
&gt; fix f = f $ <em>fix f</em>
&gt; fact self n = if n &lt;= 0 then 1 else n * (self (n-1))</pre>

<p>the result:  <code>show $ fix fact 5</code> is 120</p>
</body>
</html>

To accomplish evaluation during rendering, we do invoke a Haskell
system (runghc). To render Haskell code in a particular way, we define
a new HSXML context, CT_hcode, and provide for interpretation of the
text data and elements in that context:

> -- Rendering of Strings in the HCODE context.
> instance Render (HW CT_hcode String) where
>     render (HW str) = emit "\n> " >> emit str
>
> -- `Transparent' grouping of inline elements into one
> -- hcode element
> -- This is an administrative element
> newtype THCODE a = THCODE a deriving Show
> thcode x = build (as_hcode . HW . THCODE . rev'apppend HNil) nil_inline x
>
> -- An THCODE element is rendered as one string
> instance RenderInline b => Render (HW CT_hcode (THCODE b)) where
>     render (HW (THCODE b)) = emit "\n> " >> (render_inline False b
>                                              >> return ())

Of a particular interest is SYB3-like generic processing for
extracting sections of HSXML document. Just like in SYB3, the generic
term processing code naturally splits into three distinct modules. In
sample-code.hs file however, all three `modules' are placed within one
file, for convenience. Again, our processing is SYB3 is spirit only --
but not in the implementation.

The first part of the library defines the class Dat and its instances
for term types of interest. This part of the library depends
exclusively on the structure of terms -- irrespective of operations
one may wish to perform on terms.

> class Dat gctx a where
>   gmapq :: ([gctx] ->w) -> a -> w

We also define a class of generic functions, identified by labels. We
don't provide any instances of this class here.

> class FN label a where
>   fn :: a -> label


Unlike SYB3, Dat is not a subclass of any class (cf. the SAT class in
SYB3).  Thus the dictionary for Dat does not depend on other
dictionaries.  Furthermore, the function gmapq has _no_ higher-rank
type.  The type gctx is the label that identifies a particular generic
function to apply. Within this part of the library, we treat gctx as
an (abstract) type variable.

Here are a few instances of the Dat class

> instance Dat gctx (HW ctx String) where
>   gmapq u _ = u []

> instance Dat gctx HNil where
>   gmapq u _ = u []

> instance (FN gctx a, FN gctx b) => Dat gctx (HCons a b) where
>   gmapq u (HCons a b)  = u [fn a, fn b]

> -- Instances of Dat for more specific containers
> instance (FN gctx h, FN gctx b) => Dat gctx (HW ctx (Document h b)) where
>     gmapq u (HW (Document h b)) = u [fn h, fn b]
>
> instance (FN gctx h) => Dat gctx (HW ctx (Head h)) where
>     gmapq u (HW (Head h)) = u [fn h]

As we can see, all the instances are quite regular and could well be
automatically derived (e.g., via DrIFT).

The second part of the generic term traversal library defines generic
functions, irrespective of the structure of the types to which they
may apply. Here we define a traversal function to extract a part of
the term according to some criterion (to be specified later by the
user of the library)

> class ExtractD a r where
>   extractd :: a -> r

> newtype EXD r = EXD{unEXD :: r}

We use EXD in two ways. First, we use its type as a label to
select the proper instance of FN. Second, we use EXD to encapsulate
the result type. So, we can select the proper instance of generic
function based on the argument type and the result type.

The following `registers' ExtractD to be a generic `function'.

> instance ExtractD a r => FN (EXD r) a where fn = EXD . extractd 

That's all there is to the second part of the library. Incidentally,
none of the previous parts needed overlapping instances.

Finally, we wish to instantiate the ExtractD function to extract all
strings from a part of the term that has the type (HW CT_block (HCode
b)) and which has the desired label (given as a string):

First we define a generic extraction, which applies to any term type:

> instance Dat (EXD (String->[String])) a 
>     => ExtractD a (String->[String]) where
>     extractd a label = gmapq (concatMap (\x -> unEXD x $ label)) a

This traversal rule obviously does nothing but propagating accumulated
strings. To get any result we need to define specific extraction
rules. In our case, we have to find the HCode sub-term of the desired
label and extract all strings from it:

> instance ExtractC b => 
>     ExtractD (HW CT_block (HCode b)) (String->[String]) where
>     extractd (HW (HCode label b)) label' = 
>         if label == label' 
>            then extractc b
>            else []                      -- otherwise, ignore the whole block

Here extractc is yet another generic function, which extracts the
string content from terms and returns the array of lines. The
following is the entire definition.

A three-line boiler-plate:

> class ExtractC a  where extractc :: a -> [String]
> newtype EXC = EXC{unEXC :: [String]}
> instance ExtractC a => FN EXC a where fn = EXC . extractc

General extraction instance, which only propagates what other
instances might have produced:

> instance Dat EXC a => ExtractC a where
>     extractc a = gmapq (concatMap unEXC) a

Specific instances:

> instance ExtractC (HW ctx String) where
>     extractc (HW str) = [str]
> instance ExtractC b => ExtractC (HW CT_hcode (THIDE b)) where
>     extractc (HW (THIDE b)) = [concat $ extractc b]
> instance ExtractC b => ExtractC (HW CT_hcode (THCODE b)) where
>     extractc (HW (THCODE b)) = [concat $ intersperse " " (extractc b) ]



More information about the Haskell mailing list