[Haskell-cafe] A weird bug of regex-pcre

Tue Dec 18 10:55:44 CET 2012

I had similar issues a while ago. It had to do with UTF-8 encoding as far
as I can recall.

I wanted to "wrap" a multiline string (code listings) within some pandoc
generated HTML of a hakyll page with a container "div". The text to wrap
would be determined using a PCRE regex.

Here the (probably inefficient) implementation:

module Transformations where

import Hakyll
import qualified Text.Regex.PCRE as RE
import qualified Data.ByteString.UTF8 as BSU
import qualified Data.ByteString as BS

-- Wraps numbered code listings within the page body with a div
-- in order to be able to apply some more specific styling.
wrapNumberedCodelistings (Page meta body) =
    Page meta newBody
    where
        newBody = regexReplace' regex wrap body
        regex = "<table\\s+class=\"sourceCode[^>]+>.*?</table>"-
        wrap x = "<div class=\"sourceCodeWrap\">" ++ x ++ "</div>"

-- Replace the whole string matched by the given
-- regex using the given replacement function (hopefully UTF8-aware)
regexReplace' :: String -> (String -> String) -> String -> String
regexReplace' pattern replace text = BSU.toString $ go textUTF8
    where
        patternUTF8 = BSU.fromString pattern
        textUTF8 = BSU.fromString text
        replaceUTF8 x = BSU.fromString $ replace $ BSU.toString x
        regex = RE.makeRegexOpts compOpts RE.defaultExecOpt $
BSU.fromString pattern
        compOpts = RE.compMultiline + RE.compDotAll + RE.compUTF8 +
RE.compNoUTF8Check
        go part = case RE.matchM regex part of
            Just (before, match, after) ->
                BS.concat [before, replaceUTF8 match, go after]
            _ -> part

The discussion back then was
http://www.haskell.org/pipermail/beginners/2012-June/010064.html

Hope this helps.

Best regards,

Rico Moorman

P.S. Sorry for the double email Magicloud ... didn't hit reply all at first

On Tue, Dec 18, 2012 at 10:43 AM, José Romildo Malaquias <
j.romildo at gmail.com> wrote:

> On Tue, Dec 18, 2012 at 02:28:26PM +0800, Magicloud Magiclouds wrote:
> > Attachment is the test text file.
> > And I tested my regexp as this:
> >
> > Prelude> :m + Text.Regex.PCRE
> > Prelude Text.Regex.PCRE> z <- readFile "test.html"
> > Prelude Text.Regex.PCRE> let (b, m ,a, ss) = z =~ "<a
> > href=\"(.*?)\">.*?<img class=\"article-image\"" :: (String, String,
> String,
> > [String])
> > Prelude Text.Regex.PCRE> b
> > ...
> > n of the Triumvirate</td>\r\n    <td class=\"small\">David
> Rapoza</td>\r\n
> >    <td class=\"small\">\r\n      <i>Return to Ravnica</i>\r\n
>  </td>\r\n
> >    <td class=\"small\">10/31/2012</td>\r\n  </tr><tr>\r\n  <td
> > class=\"small\"><"
> > Prelude Text.Regex.PCRE> m
> > "a href=\"/magic/magazine/article.aspx?x=mtg/daily/activity/1088\"><img
> > class=\"article-image\" "
> >
> > >From the value of b and m, it was weird that the matching was moved
> forward
> > by 1 char ( the ss (sub matching) was even worse, 2 chars ). Rematch to a
> > and so on gave correct results. It was only the first matching that was
> > broken.
> > Tested with regex-posix (with modified regexp), everything is OK.
>
> I have a similar issue with non-ascii strings. It seems that the
> internal representation used by Haskell and pcre are different and one
> of them is counting bytes and the other is counting code points. So they
> diverge when a multi-byte representation (like utf8) is used.
>
> It has been reported previously. See these threads:
>
>
> http://www.haskell.org/pipermail/haskell-cafe/2012-August/thread.html#102959
>
> http://www.haskell.org/pipermail/haskell-cafe/2012-August/thread.html#103029
>
> I am still waiting for a new release of regex-pcre that fixes this
> issue.
>
> Romildo
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20121218/44718a83/attachment.htm>