[Haskell-beginners] remove XML tags using Text.Regex.Posix

Jan Jakubuv jakubuv at gmail.com
Wed Sep 30 12:11:46 EDT 2009


Hi Robert,

On Tue, Sep 29, 2009 at 12:25:07PM -0700, Robert Ziemba wrote:
> I have been working with the regular expression package (Text.Regex.Posix).
>  My hope was to find a simple way to remove a pair of XML tags from a short
> string.
> 
> I have something like this "<tag>Data</tag>" and would like to extract
> 'Data'.  There is only one tag pair, no nesting, and I know exactly what the
> tag is.
> 

This is so simple that I would not recommend anything other than regular
expressions. Use the following pattern:

    pat = "<tag>(.*)</tag>"

It creates a group within the matched string containing the data (this is
done using parentheses). Use `[[String]]` as the result type and you receive a
list of matches, where each match is described by a list of strings: the
first member is the whole matched string (including <tag> and </tag>), and it
is followed by the values of the groups (in our case just one group). Thus:

    *Main> "text<tag>data</tag>text" =~ pat :: [[String]]
    [["<tag>data</tag>","data"]]

It is easy to extract the data using `(!!)` and `head`:

    *Main> (!! 1) . head $ ("text<tag>7</tag>text" =~ pat :: [[String]]) 
    "7"
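
Note that `head` and `(!!)` raise an exception when the string contains no
match. If that can happen with your input, here is a minimal sketch of a safer
variant that pattern-matches on the `[[String]]` result instead. The name
`extractTag` is my own choice for illustration, not something provided by the
library:

```haskell
import Text.Regex.Posix ((=~))

-- Hypothetical helper: returns the text between <tag> and </tag>,
-- or Nothing when the pattern does not match at all.
extractTag :: String -> Maybe String
extractTag s =
  case (s =~ "<tag>(.*)</tag>" :: [[String]]) of
    -- First match: whole matched string, then the value of the first group.
    ((_whole : inner : _) : _) -> Just inner
    _                          -> Nothing
```

With this, `extractTag "text<tag>7</tag>text"` gives `Just "7"`, and a string
without the tag pair gives `Nothing` rather than an error.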

> My first attempt was this:
> 
>   "<tag>123</tag>" =~ "[^<tag>].+[^</tag>]"::String
> 
> result:  "123"
> 

The problem with your pattern is that `[^<tag>]` doesn't mean what you think
it does. It is a bracket expression whose meaning is “one character which is
not `<`, `t`, `a`, `g`, or `>`”, as Patrick already described in his mail.

> Upon further experimenting I realized that it only works with more than 2
> digits in 'Data'.  I occured to me that my thinking on how this regular
> expression works was not correct - but I don't understand why it works at
> all for 3 or more digits.
> 

It doesn't work for every input of 3 or more characters either:
        
    *Main> "<tag>tag</tag>" =~ "[^<tag>].+[^</tag>]" :: String
    ""

Briefly, it doesn't work when the data begins with one of the characters `<`,
`t`, `a`, `g`, `>`, or ends with one of those characters or with `/`.
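
To see the behaviour of your original pattern on a few inputs, you can try
something like the sketch below. The names `origPat` and `tryOrig` are only
illustrative helpers of mine; the pattern itself is the one from your mail:

```haskell
import Text.Regex.Posix ((=~))

-- The pattern from the original question: one character outside the first
-- set, then one or more arbitrary characters, then one character outside
-- the second set. It therefore needs at least three characters to match.
origPat :: String
origPat = "[^<tag>].+[^</tag>]"

-- Hypothetical helper: wrap the data in <tag>...</tag> and apply origPat.
tryOrig :: String -> String
tryOrig dat = ("<tag>" ++ dat ++ "</tag>") =~ origPat

-- tryOrig "123"  == "123"   first and last characters are outside both sets
-- tryOrig "12"   == ""      fewer than three usable characters
-- tryOrig "tag"  == ""      every character is excluded by the sets
-- tryOrig "t123" == "123"   the leading 't' is silently dropped
```

The last case is the most dangerous one, since the match succeeds but
returns only part of the data.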

Finally, consider using

    pat = "<tag>([^<]*)</tag>"

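which also works when the same line contains more than one tag pair. With
`.*` the match is as long as possible, so it would run from the first <tag>
to the last </tag>; `[^<]*` stops at the next `<`.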
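
As a rough sketch of the difference, the two helpers below collect the first
group of every match for each pattern. The names `greedyTags` and `allTags`
are hypothetical, chosen only for this example:

```haskell
import Text.Regex.Posix ((=~))

-- Hypothetical helper: the first group of every match of a given pattern.
groupsOf :: String -> String -> [String]
groupsOf pat s = [inner | (_whole : inner : _) <- (s =~ pat :: [[String]])]

-- Using (.*): a single match spanning from the first <tag> to the last </tag>.
greedyTags :: String -> [String]
greedyTags = groupsOf "<tag>(.*)</tag>"

-- Using ([^<]*): one match per tag pair, because the group cannot cross a '<'.
allTags :: String -> [String]
allTags = groupsOf "<tag>([^<]*)</tag>"

-- greedyTags "<tag>a</tag><tag>b</tag>" == ["a</tag><tag>b"]
-- allTags    "<tag>a</tag><tag>b</tag>" == ["a","b"]
```

The price of `[^<]*` is that the data itself must not contain `<`, which
should be fine for the simple case you describe.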

Sincerely,
    jan.



