[Haskell-cafe] Re: Munging wiki articles with tagsoup

Neil Mitchell ndmitchell at gmail.com
Tue Sep 9 14:49:49 EDT 2008


Hi Gwern,

Sorry for not noticing this sooner, my haskell-cafe@ reading is
somewhat behind right now!


>  After an hour, I came up with a nice clean little script:
>
>  ----
>
>  import Text.HTML.TagSoup.Render
>  import Text.HTML.TagSoup
>
>  main :: IO ()
>  main = interact convertPre
>
>  convertPre :: String -> String
>  convertPre = renderTags . map convertToHaskell . canonicalizeTags . parseTags
>
>  convertToHaskell :: Tag -> Tag
>  convertToHaskell x
>                | isTagOpenName  "pre" x = TagOpen  "haskell" (extractAttribs x)
>                | isTagCloseName "pre" x = TagClose "haskell"
>                | otherwise              = x
>                              where
>                                extractAttribs :: Tag -> [Attribute]
>                                extractAttribs (TagOpen _ y) = y
>                                extractAttribs _             = error "The impossible happened."


convertToHaskell (TagOpen "pre" atts) = TagOpen "haskell" atts
convertToHaskell (TagClose "pre") = TagClose "haskell"
convertToHaskell x = x

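Direct pattern matching is much easier and simpler: the attributes are
bound in the pattern itself, so there is no need for extractAttribs or
its "impossible" error case.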
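
To try the pattern-matching version without tagsoup installed, here is a
minimal sketch. The Tag and Attribute types below are stand-ins written for
this example (an assumption, not the library's own definitions); they only
mirror the constructors used in this thread.

```haskell
-- Stand-in types (hypothetical, for illustration only) so that the
-- sketch loads without tagsoup; the real library defines its own Tag.
type Attribute = (String, String)

data Tag
  = TagOpen String [Attribute]
  | TagClose String
  | TagText String
  deriving (Eq, Show)

-- Rename <pre> to <haskell>, keeping the attributes that the pattern
-- binds; every other tag is returned unchanged.
convertToHaskell :: Tag -> Tag
convertToHaskell (TagOpen "pre" atts) = TagOpen "haskell" atts
convertToHaskell (TagClose "pre")     = TagClose "haskell"
convertToHaskell x                    = x
```

Loaded into GHCi, map convertToHaskell over a list of tags rewrites only
the "pre" entries; text and other tags pass through untouched.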

>  Anyway, so my script seems to work. I ran the wiki output through it and this is the diff: <http://haskell.org/haskellwiki/?title=User%3AGwern%2Fkenn&diff=22827&oldid=22811>.
>
>  Ok, good, it replaces all the tags... But wait, what's all this other stuff? It is replacing all my apostrophes with &apos;! No doubt this has something to do with XML/HTML/SGML or whatever, but it's not ideal. Even if it doesn't break the formatting (as I think it does), it's still cluttering up the source.

The escaping of ' is caused by renderTags, so instead call:


renderTagsOptions (renderOptions{optEscape = (:[])})

That gives no escaping of any characters. More likely you will want to
escape just the <, > and & characters and leave the rest alone. See the docs:
http://hackage.haskell.org/packages/archive/tagsoup/0.6/doc/html/Text-HTML-TagSoup-Render.html
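
As a rough sketch of the "escape only <, > and &" option: assuming
optEscape has type Char -> String in this version of tagsoup (which is what
the (:[]) above suggests, but check the docs), a function like the
hypothetical escapeMinimal below could be supplied in its place, i.e.
renderOptions{optEscape = escapeMinimal}.

```haskell
-- Hypothetical helper, not part of tagsoup: escape only the three
-- characters that are significant in markup and pass every other
-- character (including the apostrophe) through unchanged.
escapeMinimal :: Char -> String
escapeMinimal '<' = "&lt;"
escapeMinimal '>' = "&gt;"
escapeMinimal '&' = "&amp;"
escapeMinimal c   = [c]

-- Convenience wrapper for trying the escaper on a whole string.
escapeString :: String -> String
escapeString = concatMap escapeMinimal
```

With this, escapeString "it's <b>" yields "it's &lt;b&gt;", so the
apostrophes in the wiki source would be left as they are.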

> Am I just barking up the wrong tree and should be writing a simple-minded search-and-replace sed script which replaces <pre> with <haskell>, </pre> with </haskell>...?

Not necessarily. If you literally just want to replace "<pre>"
with "<haskell>" then sed is probably the easy choice. However, it's quite
likely you'll want to make more fixes, and tagsoup gives you the
flexibility to extend in that direction.

Thanks

Neil

