[Haskell-cafe] Munging wiki articles with tagsoup

Gwern Branwen gwern0 at gmail.com
Mon Sep 8 01:31:05 EDT 2008

Hiya Neil. So recently I've been trying to come up with some automated system to turn The Monad Reader articles like those in <http://sneezy.cs.nott.ac.uk/darcs/TMR/Issue11> into wiki-formatted articles for putting on Haskell.org. Thus far, I've had the most success with SVN Pandoc.

Pandoc does a good job - you can see an example conversion at <http://haskell.org/haskellwiki/?title=User:Gwern/kenn&oldid=22808>. Modulo the errors which are largely due to haskell.org problems and a few limitations in Pandoc (no comments, no real support for references), it's fine.

But Pandoc's author will not support <haskell></haskell> tags inasmuch as they are an extension to MediaWiki and not universal; he prefers <pre> or <pre class="haskell"> tags. He suggested I use TagSoup to convert them into <haskell> tags. Well, alright. They're tags, TagSoup does tags - seems natural.

After an hour, I came up with a nice clean little script:


import Text.HTML.TagSoup.Render
import Text.HTML.TagSoup

main :: IO ()
main = interact convertPre

convertPre :: String -> String
convertPre = renderTags . map convertToHaskell . canonicalizeTags . parseTags

convertToHaskell :: Tag -> Tag
convertToHaskell x
               | isTagOpenName  "pre" x = TagOpen  "haskell" (extractAttribs x)
               | isTagCloseName "pre" x = TagClose "haskell"
               | otherwise              = x
               where -- Only reached from the TagOpen guard above.
                     extractAttribs :: Tag -> [Attribute]
                     extractAttribs (TagOpen _ y) = y
                     extractAttribs _             = error "The impossible happened."


As an aside, may I note that TagSoup doesn't seem to support transformations particularly well? Or if it does, I didn't notice any examples. I spent most of my time just figuring out how to convert the 'x' from a <pre>stuff to <haskell>stuff. Also, it might be nice to define an 'interact'-alike which yields a (String -> String), defined, I suppose, as 'interact f = renderTags . f . canonicalizeTags . parseTags'. Extraction functions would be good as well - you'd only need 3 groups, I think: 1 for the 2 items in TagOpen, 1 for TagPosition's 2 positions, and 1 which extracts the String from the rest.
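To make the suggestion concrete, here is a rough sketch of what such helpers might look like. It assumes a TagSoup release in which Tag is parameterised over the string type and renderTags is exported from Text.HTML.TagSoup; the names interactTags, tagAttribs and renameTag are made up for illustration and are not part of the library.

```haskell
import Text.HTML.TagSoup

-- Hypothetical 'interact'-alike: lifts a function on tag lists
-- to a function on the document text.
interactTags :: ([Tag String] -> [Tag String]) -> String -> String
interactTags f = renderTags . f . canonicalizeTags . parseTags

-- Hypothetical extractor: the attribute list of an opening tag,
-- or the empty list for every other kind of tag.
tagAttribs :: Tag String -> [Attribute String]
tagAttribs (TagOpen _ attrs) = attrs
tagAttribs _                 = []

-- Rename opening and closing tags, keeping any attributes.
renameTag :: String -> String -> Tag String -> Tag String
renameTag from to t
  | isTagOpenName  from t = TagOpen to (tagAttribs t)
  | isTagCloseName from t = TagClose to
  | otherwise             = t

main :: IO ()
main = interact (interactTags (map (renameTag "pre" "haskell")))
```

Returning an empty list from tagAttribs rather than calling error keeps the extractor total, at the cost of not distinguishing "no attributes" from "not an opening tag".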

Anyway, so my script seems to work. I ran the wiki output through it and this is the diff: <http://haskell.org/haskellwiki/?title=User%3AGwern%2Fkenn&diff=22827&oldid=22811>.

Ok, good, it replaces all the tags... But wait, what's all this other stuff? It is replacing all my apostrophes with &apos;! No doubt this has something to do with XML/HTML/SGML or whatever, but it's not ideal. Even if it doesn't break the formatting (as I think it does), it's still cluttering up the source.
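One possible direction, sketched under assumptions: TagSoup releases that export renderTagsOptions and a RenderOptions record with an optEscape field let the caller choose the escaping function, so passing id should leave apostrophes alone. The version used for the script above may not offer this, and convertPreRaw and rename are names invented here. Whether that is the right fix is unclear, since entities that parseTags decodes on the way in would then be written back out undecoded.

```haskell
import Text.HTML.TagSoup

-- Render with escaping switched off (assumes renderTagsOptions and
-- optEscape are available in the installed TagSoup).
renderRaw :: [Tag String] -> String
renderRaw = renderTagsOptions renderOptions { optEscape = id }

-- Hypothetical variant of the conversion that uses the raw renderer.
convertPreRaw :: String -> String
convertPreRaw = renderRaw . map rename . canonicalizeTags . parseTags
  where
    rename (TagOpen "pre" attrs) = TagOpen "haskell" attrs
    rename (TagClose "pre")      = TagClose "haskell"
    rename t                     = t

main :: IO ()
main = interact convertPreRaw
```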

So, how can I fix this? Am I just barking up the wrong tree and should be writing a simple-minded search-and-replace sed script which replaces <pre> with <haskell>, </pre> with </haskell>...?
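For comparison, the simple-minded search-and-replace could be sketched in Haskell itself with nothing beyond base. The helper names replaceAll and convertPreText are invented for this sketch. It only matches the exact strings <pre> and </pre>, so a tag with attributes such as <pre class="haskell"> or a differently cased tag would slip through, but it never touches apostrophes or any other text.

```haskell
import Data.List (isPrefixOf)

-- Replace every occurrence of a non-empty needle in the input.
-- (An empty needle would loop forever, so callers must avoid it.)
replaceAll :: String -> String -> String -> String
replaceAll needle replacement = go
  where
    go [] = []
    go s@(c:cs)
      | needle `isPrefixOf` s = replacement ++ go (drop (length needle) s)
      | otherwise             = c : go cs

-- Swap the literal <pre> and </pre> strings for their <haskell> forms.
convertPreText :: String -> String
convertPreText = replaceAll "</pre>" "</haskell>" . replaceAll "<pre>" "<haskell>"

main :: IO ()
main = interact convertPreText
```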
