[Haskell-cafe] ANN: vxml (validating xml lib) - proof of concept - need some guidance - bad type level performance (ghc)

Marc Weber marco-oweber at gmx.de
Fri Aug 29 21:58:46 EDT 2008


Hi @ll.

This shouldn't have been an announce yet.. it's more crying for help to
get to know how / wether to continue :-)

I had an idea: a xml generating library validating the result against a
given dtd (to be used in web frameworks etc .. )

After one week of coding I got the result

git clone git://mawercer.de/vxml (1) (see README)
git hash as of writing : 4dc53

A minimal example looks like this (taken from test.hs):

  import Text.XML.Validated.TH

  $( dtdToTypes "dtds/xhtml1-20020801/DTD/xhtml1-strict_onefile.dtd" 
                (XmlIds (Just "-//W3C//DTD XHTML 1.0 Strict//EN") 
                        (Just "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd") ) )

  main = putStrLn $ xml $ ((html << (head << (title <<< "vxml hello world")))
                                 << (body << (div <<< "PCDATA text" )) )

Which takes about 7 secs to compile on my (fast) 2.5 Ghz machine.
Before Oleg telling me about how to use the TypeEq implementation found
in HList it took several hours (switch the cabal configure flag to see
experience the incredible slow TypeToNat implementation)

However type checking the simple small document (2) already takes 35
seconds, duplicating the tags within the body raises compilation
time to 74 seconds. (Have a look at test/XmlToQ.hs to automatically
create the code generating such a given xml file (its used in
run-testcases.hs)

The dtd representation used is very easy to understand:
(3a) shows allowed subelements of the head tag
  Seq = sequence
  Star = zero or more
  ...
(3b)
  shows the list of allowed attributes

for a more commonly used tag such as div the element list (3a) is 324 lines
long and there are approximately 15 attributes.

Use a line such as 
  putStrLn $ xml $ debugEl div
to get this view.


All the work is done by

  class Consume st el r | st el -> r -- result is on of C,CS,R,F 
  class Retry elType st el st' | st el -> st'

  st = state as given in (3a)
  el = child to be added (Elem Html_T) or PCDATA

  C = element consumed, end
  CS a = element consumed, continue with state a
  R a = retry with given state (can happen after removing a Star = ()*
        on a no match )
  F a = no match, show failure a

  instance Retry elType C el C
  instance Retry elType (CS st) el st
  instance ( -- retry 
    Consume st el r
    , Retry elType r el st'
    ) => Retry elType (R st) el  st'

Maybe you knowing much more about ghc internals have some more
ideas how to optimize? I only came up with

a) implement
  xml HTrue $ xmldoc -- type checking variant 
  xml HFalse $ xmldoc -- unchecked but faster variant
  so you can run xml HTrue once a day only while having lunch..
 ( doubt Maybe you have to take lunch two or more times on intermediate 
  web projects .. )

b)  change (A attrType) to attrType only
     and (Elem e) to e
  which does not work for the same reason as
    class TypeEq a b c | a b -> c
    instance TypeEq a a HTrue
    instance TypeEq a b HFalse
  doesn't.
  Of course a a will match a b as well, but the result is totally
  different..
  So would it be possible to either tell ghc to ignore this happily
  taking the result of the better matching instance?

  Or to explicitely tell ghc when an instance matches the way it does in
  current implementations make it a no match if some constraints can or
  cannot be satisfied such as this:

    class TypeEq a b c | a b -> c
    instance TypeEq a a HTrue
    instance [ NoMatchOn a b -- if this can be satisfied ignore this instance 
             ] => TypeEq a b HFalse
    class NoMatchOn a b
    instance NoMatchOn a a

c) Having a reduction rule such as this:
  If the left side of (Or a b) returnns C (consumed) tell ghc to not
  evaluate the result of (Consume b element result) which must be done
  to see wether there is closer instance match (Is this correct ?)


Of course not everything is implemented yet.. but except speed I
consider it beeing quite usable.

He, thanks for reading till the end. Any feedback and suggestions
are appreciated. If you have any trouble contact me on irc.

I've been using
  $ ghc --version
  The Glorious Glasgow Haskell Compilation System, version 6.8.2

Sincerly
Marc Weber


(1) If you are interested and either don't have git or are not familiar
using it I can send you a tarball.

(2)

  <!DOCTYPE html SYSTEM "/tmp/dtd.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
  <head>
    <title>Example 6 - XHTML 1.0 Strict as application/xhtml+xml</title>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8" />
    <link rel="stylesheet" type="text/css" href="style.css" />
  </head>
  <body>
    <h1>Example 6 - XHTML 1.0 Strict as application/xhtml+xml</h1>
    <p>
     This document is valid XHTML 1.0 Strict served as
     <code>application/xhtml+xml</code>.
    </p>
   
    <p>
    This document references CSS rules contained in an external
    stylesheet via <code>link</code>.
    </p>
   
    <p>
    Note how the CSS rules for the background are applied in Netscape 7.x,
    Mozilla, Opera 7 but that Internet Explorer can not display the page at all.
    </p>
   
    <p>
      <a href="http://validator.w3.org/check/referer"><img
          src="http://www.w3.org/Icons/valid-xhtml10"
          alt="Valid XHTML 1.0!" height="31" width="88" /></a>
    </p>
   
  </body>
  </html>

(3a)

    (Seq
       (Star
          (Or
             (Elem Script_T)
             (Or
                (Elem Style_T)
                (Or (Elem Meta_T) (Or (Elem Link_T) (Elem Object_T))))))
       (Or
          (Seq
             (Elem Title_T)
             (Seq
                (Star
                   (Or
                      (Elem Script_T)
                      (Or
                         (Elem Style_T)
                         (Or
                            (Elem Meta_T)
                            (Or (Elem Link_T) (Elem Object_T))))))
                (Query
                   (Seq
                      (Elem Base_T)
                      (Star
                         (Or
                            (Elem Script_T)
                            (Or
                               (Elem Style_T)
                               (Or
                                  (Elem Meta_T)
                                  (Or
                                     (Elem Link_T) (Elem Object_T))))))))))
          (Seq
             (Elem Base_T)
             (Seq
                (Star
                   (Or
                      (Elem Script_T)
                      (Or
                         (Elem Style_T)
                         (Or
                            (Elem Meta_T)
                            (Or (Elem Link_T) (Elem Object_T))))))
                (Seq
                   (Elem Title_T)
                   (Star
                      (Or
                         (Elem Script_T)
                         (Or
                            (Elem Style_T)
                            (Or
                               (Elem Meta_T)
                               (Or (Elem Link_T) (Elem Object_T)))))))))))

(3b)

  (HCons (A Lang_A)
    (HCons (A Xml:lang_A)
      (HCons (A Dir_A)
        (HCons (A Id_A)
          (HCons (A Profile_A) HNil))))) 


More information about the Haskell-Cafe mailing list