[Haskell-beginners] remove XML tags using Text.Regex.Posix
Patrick LeBoutillier
patrick.leboutillier at gmail.com
Tue Sep 29 16:29:52 EDT 2009
Robert,
On Tue, Sep 29, 2009 at 3:25 PM, Robert Ziemba <rziemba at gmail.com> wrote:
> I have been working with the regular expression package (Text.Regex.Posix).
> My hope was to find a simple way to remove a pair of XML tags from a short
> string.
>
> I have something like this "<tag>Data</tag>" and would like to extract
> 'Data'. There is only one tag pair, no nesting, and I know exactly what the
> tag is.
>
> My first attempt was this:
>
> "<tag>123</tag>" =~ "[^<tag>].+[^</tag>]"::String
>
> result: "123"
>
> Upon further experimenting I realized that it only works with more than 2
> digits in 'Data'. It occurred to me that my thinking on how this regular
> expression works was not correct - but I don't understand why it works at
> all for 3 or more digits.
>
> Can anyone help me understand this result and perhaps suggest another
> strategy? Thank you.
>
The regex you are using here can be described as follows:

"Match a character not in the set '<,t,a,g,>', followed by 1 or more of
anything, followed by a character not in the set '<,/,t,a,g,>'."

The square brackets define sets of single characters, not literal strings,
so the tag name is never matched as a whole. As a result, the regex will not
match if your data has fewer than 3 characters, and it is probably not the
correct regex for this job: for example, it would also match all of "x123x".
What you need is regex capturing, but I don't know if that is available in
that regex library (I'm not an expert Haskeller).
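To make the behaviour described above concrete, here is a minimal sketch using Text.Regex.Posix from the regex-posix package. The names `original`, `demo` and `captureTag` are made up for illustration. It assumes the tag name contains no regex metacharacters, and it relies on the `[[String]]` result type of `=~`, which returns each match as a list holding the whole match followed by its parenthesised groups.

```haskell
import Text.Regex.Posix ((=~))

-- The pattern from the question: two character sets around ".+".
original :: String
original = "[^<tag>].+[^</tag>]"

-- Whole-match results for three inputs; a failed match yields "".
-- Evaluates to ["123", "", "x123x"].
demo :: [String]
demo = map (=~ original) ["<tag>123</tag>", "<tag>12</tag>", "x123x"]

-- Hypothetical helper: capture the text between <tag> and </tag>.
-- Assumes `tag` has no regex metacharacters in it.
captureTag :: String -> String -> Maybe String
captureTag tag s =
    case s =~ ("<" ++ tag ++ ">(.*)</" ++ tag ++ ">") :: [[String]] of
        ((_whole : inner : _) : _) -> Just inner   -- first match, first group
        _                          -> Nothing      -- no match at all
```

With a capture group the length of the data no longer matters, so `captureTag "tag" "<tag>12</tag>"` gives `Just "12"`.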
If you really need a regex to locate the tag, you could use a function like
this to extract it:
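import Text.Regex.Posix ((=~))

-- Return the text between <tag> and </tag>, if that pair occurs in s.
getTagData :: String -> String -> Maybe String
getTagData tag s =
    let match   = s =~ ("<" ++ tag ++ ">.*</" ++ tag ++ ">") :: String
        -- Trim the match rather than s, so text before the tag is ignored.
        dropTag = drop (length tag + 2) match
        getData = take (length match - (2 * length tag + 5)) dropTag
    in if not (null match)
         then Just getData
         else Nothing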
*Main> getTagData "tag" "<tag>123</tag>"
Just "123"
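If a regex is not strictly needed, the same result can be had with plain list functions. This is only a sketch of that alternative, not something from the regex library: `stripTag` is a made-up name, and it assumes the whole string is exactly one tag pair, as in the original question, so it does not search inside a longer string.

```haskell
import Data.List (isSuffixOf, stripPrefix)

-- Hypothetical helper: strip "<tag>" from the front and "</tag>" from
-- the end of s. Returns Nothing unless s is exactly one such pair.
stripTag :: String -> String -> Maybe String
stripTag tag s = do
    -- Fails with Nothing if s does not start with the opening tag.
    rest <- stripPrefix ("<" ++ tag ++ ">") s
    let close = "</" ++ tag ++ ">"
    if close `isSuffixOf` rest
        then Just (take (length rest - length close) rest)
        else Nothing
```

Here `stripTag "tag" "<tag>123</tag>"` gives `Just "123"`, and a string with text before or after the tag pair gives `Nothing`.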
Patrick
--
=====================
Patrick LeBoutillier
Rosemère, Québec, Canada