[Haskell-cafe] Re: Bug in Text.Regex.PCRE - do not accept national symbol in pattern

Mon Jan 19 08:09:21 EST 2009

Alexandr,

   Thanks for sending me this question about unicode and regex-pcre.  I will 
share with the mailing list. This is an encoding issue.

 From the haddock documentation for regex-pcre:

http://hackage.haskell.org/packages/archive/regex-pcre/0.94.1/doc/html/Text-Regex-PCRE.html

"Using the provided CompOption and ExecOption values and if configUTF8 is
True, then you might be able to send UTF8 encoded ByteStrings to PCRE and get
sensible results. This is currently untested."

This is a literate Haskell post so you can save with file extension
".lhs" and pass this to ghci.

The answer is a combination of "man 3 pcre" and the haddock
documentation for haskell-pcre and using makeRegexOpts.  I show one
possible way to use utf8 below, via the 'utf8-string' package from
hackage.  There are other ways to use the same package and other
packages available.

 > {-# LANGUAGE FlexibleContexts #-}
 > import Text.Regex.PCRE hiding ((=~))
 > --import Text.Regex.PCRE.Wrap(configUtf8)
 > import qualified Data.ByteString.UTF8 as U
 > import qualified System.IO.UTF8 as U
 > import Data.Bits((.|.))
 >

Here I copied the original source for (=~) from
http://hackage.haskell.org/packages/archive/regex-pcre/0.94.1/doc/html/Text-Regex-PCRE-Wrap.html#v%3A%3D~
I then editied it to create a custom (=~) that defines its own
options.  You can add compNoUTF8Check for performance/safety tradeoff
(see man 3 pcre).

 > makeRegexUtf8 :: (RegexMaker Regex CompOption ExecOption source) => source -> 
Regex
 > makeRegexUtf8 r = let co = defaultCompOpt .|. compUTF8 -- need compUTF8 flag 
when using makeRegexOpts
 >                       -- co = defaultCompOpt .|. compUTF8 .|. compNoUTF8Check  --
 >                   in makeRegexOpts co defaultExecOpt r

 > (=~)  :: (RegexMaker Regex CompOption ExecOption source,RegexContext Regex 
source1 target)
 >       => source1 -> source -> target
 > (=~) x r = let q = makeRegexUtf8 r
 >            in match q x

If you are going to use the same pattern against many different texts
then you should NOT use (=~).  Instead you should call makeRegexUtf8
and reuse the resulting Regex value.  Otherwise you have to recompile
the pattern for each match performed.

Below, 're_test' was changed internally to convert the [Char] into a
ByteString holding a utf8 encoded representation.  The 'makeRegexOpts'
and 'match' calls will then run the libpcre routines directly on the
the memory that backs the ByteString.  This is an optimal was to use
the library.

 > re_test :: String -> String -> Bool
 > re_test re str = (U.fromString str) =~ (U.fromString re)
 >
 > -- test for national symbols
 > main = do
 >   putStrLn $ "If this line ends with True then your libpcre has UTF8 support: 
" ++ show configUTF8
 >   let pattern1,pattern2,pattern3,text :: String
 >       pattern1 = "^п.*"
 >       pattern2 = "^..ив.*"
 >       pattern3 = "^......$"
 >       text = "привет"
 >   U.putStrLn $ "The 3 patterns are: " ++ pattern1 ++ ", " ++ pattern2 ++ ", 
and "++pattern3
 >   U.putStrLn $ "The text to be matched is " ++ text
 >   putStrLn $ "The length of the text to be matched is "++show (length text)
 >   putStrLn "All three lines below should print True"
 >   print $ re_test pattern1 text
 >   print $ re_test pattern2 text
 >   print $ re_test pattern3 text

The output when I run this on my machine is

If this line ends with True then your libpcre has UTF8 support: True
The 3 patterns are: ^п.*, ^..ив.*, and ^......$
The text to be matched is привет
The length of the text to be matched is 6
All three lines below should print True
True
True
True