[Haskell-cafe] regex and Unicode

Peter Simons simons at nospf.cryp.to
Thu Sep 8 06:42:10 UTC 2016


Hi Brian,

 > I tried to write a program using Text.Regex.PCRE to search through a
 > UTF8-encoded document. It appears that the presence of non-breaking-space
 > characters (code point 160) triggers some weird behavior in my program.

I seem to recall that regex-pcre simply binds to the system's pcre
library and effectively lets that library do all the work. Now, libpcre
has full Unicode support, but only if it was enabled when the library
itself was built. I believe "--enable-unicode-properties" is the
appropriate configure flag, but I don't know for sure. Anyway, my point
is that your system's libpcre may or may not have that feature enabled.
If it does not, then regex-pcre won't be able to deal with Unicode
characters properly and that issue should be reported to Debian. If your
system library *has* Unicode support, then this issue might be caused
by a bug in regex-pcre (unlikely) or in your code that uses it (more
likely).
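
To illustrate why a libpcre without Unicode support would stumble over
code point 160: in UTF-8 that character is encoded as the two bytes
0xC2 0xA0, and a matcher that works byte by byte sees two separate
"characters" instead of one. Here is a minimal sketch using only base.
Note that encodeUtf8 is a hand-written helper I made up for this
example; it is not part of regex-pcre, and real code would use a
library such as text or utf8-string instead.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)
import Numeric (showHex)

-- Hypothetical helper for illustration only: encode one character
-- as its UTF-8 byte sequence.
encodeUtf8 :: Char -> [Word8]
encodeUtf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- Leading bits of the code point, placed in the first byte.
    hi s   = fromIntegral (n `shiftR` s)
    -- A continuation byte: 10xxxxxx carrying six payload bits.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

main :: IO ()
main = do
  let nbsp = chr 160  -- the non-breaking space from the original report
  -- Print the encoded bytes in hex.
  putStrLn (unwords (map (`showHex` "") (encodeUtf8 nbsp)))
  -- prints "c2 a0"
```

So a pattern that is meant to match the single character U+00A0 has to
be interpreted in UTF-8 mode by the library, otherwise it is compared
against those two bytes individually.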

I hope this helps,
Peter


