[Haskell-cafe] strange behavior in Text.Regex.Posix
haskell at list.mightyreason.com
Tue Jan 23 05:38:42 EST 2007
John MacFarlane wrote:
> Can anyone help me understand this odd behavior in Text.Regex.Posix (GHC 6.6)?
> Prelude Text.Regex.Posix Text.Regex> subRegex (mkRegex "\\^") "he\350llo" "@"
> "he@llo"
> Why does /\^/ match \350 here? Generally Text.Regex.Posix seems to work
> fine with unicode characters. For example, \350 is treated as a single
> character here:
> Prelude Text.Regex.Posix Text.Regex> subRegex (mkRegex "e.l") "he\350llo" "@"
> "h@lo"
> The problem is specific to \350 and doesn't happen with, say, \351:
> Prelude Text.Regex> subRegex (mkRegex "\\^") "he\351llo" "@"
> Is this a bug, or just something I'm not understanding?
The Text.Regex API calls the regex-posix backend in Text.Regex.Posix, which hands
off the matching to the (very slow) POSIX C library.
And this library does not know Unicode from a hole in the ground -- every Char is
truncated to a single byte:
chr (ord '\350' `mod` 256) is '^'
Thus your pattern, which matches the character '^', will also match '\350'.
By the same arithmetic '\351' is truncated to '_', so it is not affected.
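
To make the truncation concrete, here is a small sketch that mimics it in plain
Haskell. The helper truncateToByte is a hypothetical name of mine, not part of
any regex package; it only assumes, as described above, that each Char is
reduced modulo 256 before it reaches the C library.

```haskell
import Data.Char (chr, ord)

-- Hypothetical helper (not from any regex package): keeps only the low
-- 8 bits of a Char, as assumed for the byte-oriented POSIX backend.
truncateToByte :: Char -> Char
truncateToByte c = chr (ord c `mod` 256)

main :: IO ()
main = do
  print (truncateToByte '\350')           -- 350 mod 256 = 94, prints '^'
  print (truncateToByte '\351')           -- 351 mod 256 = 95, prints '_'
  print (map truncateToByte "he\350llo")  -- prints "he^llo"
```

Under that assumption the backend effectively sees "he^llo", which is why the
pattern "\\^" finds a match in the first example and not in the \351 one.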
For a regex backend that matches on full Char values you should get regex-parsec.
The regex-dfa backend has problems for which I have not yet uploaded the fix.
The regex-pcre backend ought to handle UTF8 -- but you have to handle the
conversion to UTF8, for which Data.ByteString will come in handy.
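
As a rough sketch of that conversion step only (no regex call is made here), the
following encodes a String to UTF-8 bytes. It assumes the text package
(Data.Text and Data.Text.Encoding) is available alongside bytestring; that
choice is mine for illustration, and toUtf8 is a hypothetical helper name.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical helper: encode a String as a UTF-8 ByteString.
-- Assumes the text package; any other UTF-8 encoder would do as well.
toUtf8 :: String -> BS.ByteString
toUtf8 = TE.encodeUtf8 . T.pack

main :: IO ()
main = do
  -- '\350' (code point 350, hex 15E) becomes the two bytes 197 and 158,
  -- so no single byte of the encoded string equals 94 ('^').
  print (BS.unpack (toUtf8 "he\350llo"))  -- prints [104,101,197,158,108,108,111]
```

The resulting ByteString is what would be handed to a UTF-8 aware backend; the
match results then refer to byte offsets, not Char offsets.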
The unstable library regex-tdfa is much faster than regex-parsec and is more
POSIX compliant than regex-posix. It should go stable within a week.