[Haskell-cafe] Unescaping with HaXmL (or anything else!)
gale at sefer.org
Tue Apr 1 11:12:44 EDT 2008
On Fri, Mar 28, 2008 at 4:26 AM, Anton van Straaten wrote:
> I want to unescape an encoded XML or HTML string, e.g. converting &quot;
> to the quote character, etc.
> Since I'm using HaXml anyway, I tried using xmlUnEscapeContent with no
I only noticed your post today, sorry for the delay.
I also need this. In fact, it seems to me that it would be
generally useful. I hope that simple functions to escape/unescape
a string will be added to the API.
In the meantime, you are right that it is a bit tricky
to do this in HaXml. Besides the wrappers that you found
to be needed, there are two other issues:
One issue is that you need to lex and then parse the text first.
If you tell HaXml that your string is a CString, it
will believe you and just use the text the way it is without
any further processing.
The other issue is that HaXml's lexer currently can only
deal with XML content that begins with an XML tag. (I've
pointed this out to Malcolm Wallace, the author of HaXml.)
So in order to use it, you need to wrap your content in a
tag and then unwrap it after parsing.
The code below works for me (obviously it would be better to
remove the "error" calls):
import Text.XML.HaXml.Types
import Text.XML.HaXml.Escape (xmlUnEscapeContent, stdXmlEscaper)
import Text.XML.HaXml.Parse (xmlParseWith, document)
import Text.XML.HaXml.Lex (xmlLex)

unEscapeXML :: String -> String
unEscapeXML = concatMap ctext . xmlUnEscapeContent stdXmlEscaper .
              unwrapTag . either error id . fst . xmlParseWith document .
              xmlLex "oops, lexer failed" . wrapWithTag "t"

-- Render one content item back to text.
ctext (CString _ txt _) = txt
ctext (CRef (RefEntity name) _) = '&' : name ++ ";" -- skipped by escaper
ctext (CRef (RefChar num) _) = '&' : '#' : show num ++ ";" -- ditto
ctext _ = error "oops, can't unescape non-cdata"

-- Wrap the text in a dummy tag so that the lexer accepts it.
wrapWithTag t s = concat ["<", t, ">", s, "</", t, ">"]

-- Extract the content of the dummy tag after parsing.
unwrapTag (Document _ _ (Elem _ _ c) _) = c
unwrapTag _ = error "oops, not wrapped"
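
If you only need the simple escape/unescape function mentioned above and
not a full XML parse, something like the following rough sketch might be
enough. It uses only base and does not touch HaXml at all. The names
unescapeEntities, decodeRef and toChar are made up for this example and
are not part of any library. It assumes that only the five predefined
XML entities and numeric character references matter; entities declared
in a DTD, and the many named HTML entities, are left as they are.

```haskell
import Data.Char (chr, isDigit, isHexDigit)
import Numeric (readHex)

-- Replace predefined XML entities and numeric character references
-- with the characters they denote. Anything that is not recognised
-- is copied through unchanged.
unescapeEntities :: String -> String
unescapeEntities [] = []
unescapeEntities ('&' : rest) =
  case break (== ';') rest of
    (name, ';' : after)
      | Just c <- decodeRef name -> c : unescapeEntities after
    _ -> '&' : unescapeEntities rest
unescapeEntities (c : rest) = c : unescapeEntities rest

-- Decode the text between '&' and ';' (hypothetical helper).
decodeRef :: String -> Maybe Char
decodeRef "amp"  = Just '&'
decodeRef "lt"   = Just '<'
decodeRef "gt"   = Just '>'
decodeRef "quot" = Just '"'
decodeRef "apos" = Just '\''
decodeRef ('#' : 'x' : hex)
  | not (null hex), all isHexDigit hex
  , [(n, "")] <- readHex hex = toChar n
decodeRef ('#' : dec)
  | not (null dec), all isDigit dec = toChar (read dec)
decodeRef _ = Nothing

-- Reject code points outside the Unicode range.
toChar :: Integer -> Maybe Char
toChar n
  | n >= 0 && n <= 0x10FFFF = Just (chr (fromInteger n))
  | otherwise = Nothing
```

For example, unescapeEntities "&lt;b&gt; &amp; &#65;" gives "<b> & A",
while an unknown reference such as "&nbsp;" is returned untouched.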