[Haskell-cafe] regex-pcre and ghc-7.4.2 is not working with UTF-8

José Romildo Malaquias j.romildo at gmail.com
Thu Aug 23 13:59:52 CEST 2012


Hello.

I think I have an explanation for the problem with regex-pcre, ghc-7.4.2
and UTF-8 strings.

The Text.Regex.PCRE.String module uses withCString and withCStringLen
from the module Foreign.C.String to pass Haskell strings to the pcre C
library functions that compile regular expressions and execute them to
match some text.

Recent versions of ghc have withCString and withCStringLen definitions
that use the current system locale to determine how a Haskell string is
marshalled into a NUL terminated C string using temporary storage.

With a UTF-8 locale the length of the C string will be greater than the
length of the corresponding Haskell string in the presence of
characters outside of the ASCII range. Therefore positions of
corresponding characters in both strings do not match.
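
To make the length difference concrete, here is a small sketch that
marshals the same string with an explicit encoding instead of the
locale one. It assumes the GHC.Foreign and GHC.IO.Encoding modules from
base; the names marshalledLength and sample are mine, for illustration
only.

```haskell
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (TextEncoding, latin1, utf8)

-- Number of bytes the string occupies once marshalled to a C buffer
-- with the given encoding (hypothetical helper, not part of regex-pcre).
marshalledLength :: TextEncoding -> String -> IO Int
marshalledLength enc s = GHC.withCStringLen enc s (return . snd)

-- The text from the test program, written with escapes so that the
-- source file encoding does not matter.
sample :: String
sample = "pa\237s:Bras\237lia:Brasil"
```

In GHCi, length sample gives the number of Haskell characters, while
marshalledLength utf8 sample and marshalledLength latin1 sample give
the byte counts for each encoding. Only the UTF-8 count should exceed
the character count, by one byte for each of the two accented letters.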

In order to compute matching positions, regex-pcre functions use C
strings. But to compute matching strings they use those positions with
Haskell strings.

That gives the mismatch shown earlier and repeated here with the
attached program run on a system with a UTF-8 locale:


   $ LANG=en_US.UTF-8 && ./test1
   getForeignEncoding: UTF-8

   regex            : país:(.*):(.*)
   text             : país:Brasília:Brasil
   String matchOnce : Just (array (0,2) [(0,(0,22)),(1,(6,9)),(2,(16,6))])
   String match     : [["pa\237s:Bras\237lia:Brasil","ras\237lia:B","asil"]]

   $ LANG=en_US.ISO-8859-1 && ./test1
   getForeignEncoding: ISO-8859-1

   regex            : país:(.*):(.*)
   text             : país:Brasília:Brasil
   String matchOnce : Just (array (0,2) [(0,(0,20)),(1,(5,8)),(2,(14,6))])
   String match     : [["pa\237s:Bras\237lia:Brasil","Bras\237lia","Brasil"]]
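
The wrong substrings can be reproduced without pcre at all. The sketch
below is not the regex-pcre code; it only assumes that the library
slices the Haskell string with take and drop using the (offset, length)
pairs shown in the matchOnce output. All names in it are hypothetical.

```haskell
-- Slice a String by an (offset, length) pair, counting Haskell
-- characters (hypothetical helper for illustration).
sliceChars :: (Int, Int) -> String -> String
sliceChars (off, len) = take len . drop off

text :: String
text = "pa\237s:Bras\237lia:Brasil"

-- Groups as reported under the UTF-8 locale: these count bytes.
utf8Groups :: [(Int, Int)]
utf8Groups = [(0, 22), (6, 9), (16, 6)]

-- Groups as reported under the ISO-8859-1 locale: one byte per
-- character, so these also count characters.
latin1Groups :: [(Int, Int)]
latin1Groups = [(0, 20), (5, 8), (14, 6)]

-- Applying byte offsets to the String gives the shifted substrings;
-- applying character offsets gives the expected ones.
shifted, expected :: [String]
shifted  = map (`sliceChars` text) utf8Groups
expected = map (`sliceChars` text) latin1Groups
```

Evaluating shifted and expected in GHCi should give the same two lists
of substrings as the String match lines of the two runs.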


I see two ways of fixing this bug:

1. make the matching functions compute the text using the C string and
   the positions calculated by the C function, and convert the text back
   to a Haskell string.

2. map the positions in the C string (if possible) to the corresponding
   positions in the Haskell string; this way the current definitions of
   the matching functions returning text will just work.
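
Both options can be sketched in a few lines. This is only an
illustration under the assumption that the C string is UTF-8 encoded;
it is not a patch for regex-pcre, and the names sliceBytes, utf8Width,
byteToCharOffset and byteSpanToCharSpan are hypothetical.

```haskell
import Data.Char (ord)
import Foreign.Ptr (plusPtr)
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (utf8)

-- Option 1: marshal the string, cut the match out of the C buffer
-- using the byte offsets, and decode that piece back to a String.
-- Assumes the span starts and ends on character boundaries; a span
-- that splits a character would make the decoder fail.
sliceBytes :: String -> (Int, Int) -> IO (Maybe String)
sliceBytes s (off, len) =
  GHC.withCStringLen utf8 s $ \(ptr, total) ->
    if off < 0 || len < 0 || off + len > total
      then return Nothing
      else Just <$> GHC.peekCStringLen utf8 (ptr `plusPtr` off, len)

-- Number of bytes a character takes in UTF-8.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- Option 2: translate a byte offset into a character offset.
-- Nothing means the offset is negative, past the end, or falls in
-- the middle of a multi-byte character.
byteToCharOffset :: String -> Int -> Maybe Int
byteToCharOffset s target = go 0 0 s
  where
    go charIx byteIx rest
      | byteIx == target = Just charIx
      | byteIx > target  = Nothing
      | otherwise        = case rest of
          []       -> Nothing
          (c : cs) -> go (charIx + 1) (byteIx + utf8Width c) cs

-- Translate a byte-based (offset, length) pair into a character-based one.
byteSpanToCharSpan :: String -> (Int, Int) -> Maybe (Int, Int)
byteSpanToCharSpan s (off, len) = do
  start <- byteToCharOffset s off
  end   <- byteToCharOffset s (off + len)
  return (start, end - start)
```

With the text from the test program, byteSpanToCharSpan should turn the
UTF-8 pairs (6,9) and (16,6) into the pairs (5,8) and (14,6) seen in
the ISO-8859-1 run. Option 2 walks the string once per offset, so a
real fix would probably want to convert all groups in a single pass.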

I hope this helps in fixing the issue.


Regards,

Romildo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test1.hs
Type: text/x-haskell
Size: 726 bytes
Desc: not available
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20120823/f3a06a0b/attachment.hs>

