[Haskell-cafe] strange behavior in Text.Regex.Posix

Tue Jan 23 05:38:42 EST 2007

John MacFarlane wrote:
> Can anyone help me understand this odd behavior in Text.Regex.Posix (GHC 6.6)?
> 
> Prelude Text.Regex.Posix Text.Regex> subRegex (mkRegex "\\^") "he\350llo" "@"
> "he@llo"
> 
> Why does /\^/ match \350 here?  Generally Text.Regex.Posix seems to work
> fine with unicode characters.  For example, \350 is treated as a single
> character here:
> 
> Prelude Text.Regex.Posix Text.Regex> subRegex (mkRegex "e.l") "he\350llo" "@"
> "h@lo"
> 
> The problem is specific to \350 and doesn't happen with, say, \351:
> 
> Prelude Text.Regex> subRegex (mkRegex "\\^") "he\351llo" "@"
> "he\351llo"
> 
> Is this a bug, or just something I'm not understanding?
> 
> John

The Text.Regex API calls the regex-posix backend in Text.Regex.Posix, which hands
off the matching to the (very slow) POSIX C library.

And this library does not know Unicode from a hole in the ground: every Char is
truncated to a single byte:

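chr (ord '\350' `mod` 256) is '^'

since 350 `mod` 256 = 94, which is the code for '^'.

Thus your pattern, which matches the literal character '^', will also match
'\350'.  By the same arithmetic '\351' is truncated to '_' (code 95), so it is
left alone.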
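
If you want to check which characters will collide, the truncation can be
modelled in plain Haskell.  This is only a sketch of the effect described
above, not the actual marshalling code in regex-posix; the names truncateToByte
and truncateString are made up for illustration.

```haskell
import Data.Char (chr, ord)

-- Hypothetical helper: keep only the low 8 bits of a Char's code point,
-- which models the single-byte truncation described above.
truncateToByte :: Char -> Char
truncateToByte c = chr (ord c `mod` 256)

-- Apply the same truncation to a whole String, giving an approximation of
-- what the C library would see.
truncateString :: String -> String
truncateString = map truncateToByte
```

Running truncateString over the subject string before matching shows which
characters have collapsed onto ASCII ones.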

http://darcs.haskell.org/packages/
http://darcs.haskell.org/packages/regex-unstable/

For a regex backend that matches on full Char values you should get
regex-parsec.  The regex-dfa backend has problems for which I have not yet
uploaded the fix.

The regex-pcre backend ought to handle UTF-8, but you have to do the
conversion to UTF-8 yourself, for which Data.ByteString will come in handy.
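
As a rough sketch of that conversion, a String can be encoded by hand and
packed into a strict ByteString.  The names encodeChar and toUtf8 are
hypothetical, the code assumes only base and bytestring, and it does not show
the regex-pcre call itself.  Surrogate code points are not rejected here.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: encode one Char as its UTF-8 byte sequence.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]                            -- 1 byte (ASCII)
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]                     -- 2 bytes
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]            -- 3 bytes
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]   -- 4 bytes
  where
    n = ord c
    -- Leading bits of the code point, placed in the first byte.
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six bits of the code point.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Hypothetical helper: encode a whole String into a strict ByteString.
toUtf8 :: String -> B.ByteString
toUtf8 = B.pack . concatMap encodeChar
```

With this encoding '\350' becomes two bytes, neither of which is the byte for
'^', so a UTF-8 aware backend given the encoded string has no reason to
confuse the two.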

The unstable library regex-tdfa is much faster than regex-parsec and is more
POSIX compliant than regex-posix.  It should go stable within a week.