[Haskell-cafe] regex-pcre is not working with UTF-8
José Romildo Malaquias
j.romildo at gmail.com
Wed Aug 22 16:56:58 CEST 2012
On Tue, Aug 21, 2012 at 05:50:44PM -0300, José Romildo Malaquias wrote:
> On Tue, Aug 21, 2012 at 04:05:28PM +0100, Chris Kuklewicz wrote:
> > I do not have time to test this myself right now. But I will unravel my code a
> > bit for you.
> >
> > > By November 2011 it worked without problems in my application. Now that
> > > I have resumed developing the application, I have been faced with this
> > > behaviour. As it used to work before, I believe it is a bug in
> > > regex-pcre or libpcre.
> >
> > I believe it may be a problem in String <-> ByteString conversion. The "base"
> > library may have changed and your LOCALE information may be different or may be
> > being used differently by "base".
> >
> > > The (temporary) workaround I found is to convert the strings to
> > > byte-strings before matching, and then convert the results back to
> > > strings. With byte-strings it works well.
> >
> > That is an excellent sign that it is your LOCALE settings being picked up by
> > GHC's "base" package, see explanation below.
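A minimal sketch of the byte-string workaround described above, using only
base and bytestring. The names toUtf8 and fromUtf8 are hypothetical helpers,
not part of regex-pcre. They pin the encoding to UTF-8 through GHC.Foreign
instead of relying on the locale, which assumes a base that ships that module.
The match itself would be run on the ByteString and is not shown.

```haskell
import qualified Data.ByteString as B
import qualified GHC.Foreign as GF
import GHC.IO.Encoding (utf8)

-- Hypothetical helper: encode a String as UTF-8 bytes, ignoring the locale.
toUtf8 :: String -> IO B.ByteString
toUtf8 s = GF.withCStringLen utf8 s B.packCStringLen

-- Hypothetical helper: decode UTF-8 bytes back into a String.
fromUtf8 :: B.ByteString -> IO String
fromUtf8 bs = B.useAsCStringLen bs (GF.peekCStringLen utf8)

main :: IO ()
main = do
  bytes <- toUtf8 "pa\237s:Brasil"
  print (B.unpack bytes)   -- the í is encoded as the two bytes 195,173
  -- A regex-pcre match on the ByteString would go here (not shown).
  text <- fromUtf8 bytes
  print (text == "pa\237s:Brasil")
```

Because both directions name utf8 explicitly, the result should not depend on
LANG, unlike a conversion that goes through the locale encoding.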
[...]
> I have written an application to test those things. There are 2 source
> files: test.hs and seestr.c, which are attached.
>
> The test does the following:
>
> 1. shows the getForeignEncoding
>
> 2. uses a C function to show the characters from a String (using
> withCString) and from a ByteString (using useAsCString)
>
> 3. matches a PCRE regular expression using String and ByteString
>
> The test is run twice, with different LANG settings, and its output
> follows.
[...]
> As can be seen, regular expression matching does not work with
> en_US.UTF-8. But it works with en_US.ISO-8859-1.
>
> The test shows that withCString is working as expected too. This
> may suggest the problem is really with regex-pcre.
The previous tests were run on a Gentoo Linux system with ghc-7.4.1.
I have also run the tests on Fedora 17 with ghc-7.0.4, which does not
have the bug. The sources are attached. The test output follows:
$ LANG=en_US.ISO-8859-1 && ./test
testing with String
code: 70, char: p
code: 61, char: a
code: ffffffed, char:
code: 73, char: s
result: 4
testing with ByteString
code: 70, char: p
code: 61, char: a
code: ffffffed, char:
code: 73, char: s
result: 4
regex : país:(.*)
text : país:Brasil
String match : [["pa\237s:Brasil","Brasil"]]
ByteString match : [["pa\237s:Brasil","Brasil"]]
$ LANG=en_US.UTF-8 && ./test
testing with String
code: 70, char: p
code: 61, char: a
code: ffffffed, char:
code: 73, char: s
result: 4
testing with ByteString
code: 70, char: p
code: 61, char: a
code: ffffffed, char:
code: 73, char: s
result: 4
regex : país:(.*)
text : país:Brasil
String match : [["pa\237s:Brasil","Brasil"]]
ByteString match : [["pa\237s:Brasil","Brasil"]]
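The value ffffffed in the output above is most likely the single byte 0xED
(í in Latin-1) after sign extension. This assumes that seestr.c prints a
plain signed char with %x, which cannot be confirmed here because the
attachment is not shown. The sketch below only reproduces that arithmetic;
asPrintedByC is a hypothetical name.

```haskell
import Data.Int (Int8, Int32)
import Data.Word (Word8, Word32)
import Numeric (showHex)

-- Reinterpret a byte as a signed 8-bit value, widen it to 32 bits the way
-- C's integer promotion would, then view it as unsigned as %x prints it.
asPrintedByC :: Word8 -> String
asPrintedByC b = showHex unsigned32 ""
  where
    signed8    = fromIntegral b :: Int8
    signed32   = fromIntegral signed8 :: Int32
    unsigned32 = fromIntegral signed32 :: Word32

main :: IO ()
main = mapM_ (putStrLn . ("code: " ++) . asPrintedByC) [0x70, 0x61, 0xed, 0x73]
-- prints:
-- code: 70
-- code: 61
-- code: ffffffed
-- code: 73
```

Read this way, each run above passes four bytes to C, with í as one byte.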
Clearly withCString has changed from ghc-7.0.4 to ghc-7.4.1. It seems
that with ghc-7.0.4 withCString does not obey the UTF-8 locale and
generates a Latin-1 C string: even under en_US.UTF-8 the í is passed
as the single byte 0xED.
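To make the difference concrete, the sketch below marshals the same String
with two explicit encodings through GHC.Foreign.withCStringLen, which takes
the encoding as an argument instead of reading it from the locale. encodeWith
and hexBytes are hypothetical helpers, and this assumes a base that ships
GHC.Foreign. The Latin-1 result has the four bytes seen in the test output;
the UTF-8 result has five.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Array (peekArray)
import Foreign.Ptr (castPtr)
import qualified GHC.Foreign as GF
import GHC.IO.Encoding (TextEncoding, latin1, utf8)
import Numeric (showHex)

-- Hypothetical helper: the bytes a C function would see for this String
-- when it is marshalled with the given encoding.
encodeWith :: TextEncoding -> String -> IO [Word8]
encodeWith enc s =
  GF.withCStringLen enc s $ \(ptr, len) -> peekArray len (castPtr ptr)

-- Hypothetical helper: render bytes as space-separated hex.
hexBytes :: [Word8] -> String
hexBytes = unwords . map (\b -> showHex b "")

main :: IO ()
main = do
  l <- encodeWith latin1 "pa\237s"
  u <- encodeWith utf8 "pa\237s"
  putStrLn ("latin1: " ++ hexBytes l)  -- latin1: 70 61 ed 73
  putStrLn ("utf8:   " ++ hexBytes u)  -- utf8:   70 61 c3 ad 73
```

If libpcre is compiled in UTF-8 mode, it would presumably expect the second
form, so which of the two withCString produces matters for the String match.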
Regards,
Romildo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.hs
Type: text/x-haskell
Size: 1551 bytes
Desc: not available
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20120822/4d9bb102/attachment.hs>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: seestr.c
Type: text/x-c
Size: 202 bytes
Desc: not available
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20120822/4d9bb102/attachment.bin>