[Haskell-cafe] Haskell performance when it comes to regex?

Mon May 29 19:53:56 UTC 2017

Thanks Bram,

There is correction in the comments to that Bryan O’Sullivan regex blog
post by Chris Kuklewicz, the author of the regex packages:

> You wrote:
> > Thereâ€™s normally no need to compile regular expression
> > patterns. A pattern will be compiled the first time itâ€™s used,
> > and your Haskell runtime should memoise the compiled
> > representation for you.
>
> Which is wrong. If you use =~ then the regexp is compiled with makeRegex and probably not cached. If you need to cache the compiled form then you need to use makeRegex and hold > onto the result and then use match separately (and matchM for =~~).

(http://www.serpentine.com/blog/2007/02/27/a-haskell-regular-expression-tutorial/comment-page-1/#comment-22241)

Pure library function are rarely memoised in Haskell because memorisation
requires side effects to achieve transparently.

I made four main things to optimise ‘hanon’ in my own ‘annie’ experimental
Fork after profiling it:

  * I pre-compiled the REs (this accounted for most of the improvement);
  * as working with Strings are generally slow and ByteStrings just as
    convenient here, I converted everything to work with ByteString 
    (though given your UTF8, requirements, Text might
    be more appropriate);
  * I switched from Text.Regex.TDFA to Text.Regex.PCRE (as I think you
    observed, judging by your commit messages), it seems to be somewhat
    faster, at lease for these relatively small data sets;
  * I combined all of the highlighter REs into a single RE of 
    alternatives (using a function, leaving the ‘inputPaths’ organisation
    intact) as the multi-pass approach looked like it would be slower but
    also because it seemed to avoid some problems with overlapping REs.

With these changes I saw a ~20x speedup over the ‘hanon’ source code
(committed to master on 2017-05-16). The python code is ~4x faster than this
on your sample ‘hello’ data set (which lines up with your reported 80x differential),
though it only appears to identify the things to be anonymised in the input file,
without carrying out any anonymisation.

You can see my WIP here: https://github.com/cdornan/annie. (A warning, the
main application wit all of the above changes has been rewritten, though
there are variants of the original with the REs being precompiled and
the ability to switch between String and ByteString for comparison
purposes.  It hasn’t been properly documented yet and remains WIP.)

Chris

On 29/05/2017, 15:40, "Bram Neijt" <bneijt at gmail.com> wrote:

    Hi Chris,

    Thank you for looking into this and thank you for your pull-request.

    I moved the "=~" outside of the map and that makes the whole thing a
    huge amount faster.

    Seems my assumption that =~ would memoise the regex creation (I read
    that in a post on regex in Haskell[1])

    The 80% diff is now gone, the Python code was everything without the
    leveldb stuff (but still, compiling the regexes every time, so it
    seemed like a valid comparison at the time), see attachment for code.

    Thank you all for your help!

    Bram

    [1] http://www.serpentine.com/blog/2007/02/27/a-haskell-regular-expression-tutorial/

    On Sun, May 28, 2017 at 2:22 PM, Chris Dornan <chris at chrisdornan.com> wrote:
    > Hi Bram,
    >
    >
    >
    > Sorry for being a bit late to this -- I have been on the road.
    >
    >
    >
    > I have switched over you example to pre-compile the REs and use ByteString
    > and can see 13x speedup on scan and a 9x speedup on mapping. Curiously,
    > nearly all of that speedup seems to be gained by lifting the RE compilation
    > out of the loop but I am pretty sure there are gains to be had from
    > re-writing the loops.
    >
    >
    >
    > Do you have the Python code that was performing 80x better?
    >
    >
    >
    > Chris
    >
    >
    >
    >
    >
    > From: Alfredo Di Napoli <alfredo.dinapoli at gmail.com>
    > Date: Monday, 22 May 2017 at 08:48
    > To: Bram Neijt <bneijt at gmail.com>
    > Cc: Станислав Черничкин <schernichkin at gmail.com>, haskell-cafe
    > <haskell-cafe at haskell.org>, Chris Dornan <chris at chrisdornan.com>
    > Subject: Re: [Haskell-cafe] Haskell performance when it comes to regex?
    >
    >
    >
    > Hi Bram,
    >
    >
    >
    > you might be interested in the “regex” package from my colleague Chris
    > Dornan:
    >
    >
    >
    > http://regex.uk/
    >
    >
    >
    > I know some proper performance work still needs to be done, but I would be
    > curious to hear your experience report ;)
    >
    >
    >
    > Alfredo
    >
    >
    >
    > On 19 May 2017 at 18:52, Bram Neijt <bneijt at gmail.com> wrote:
    >
    > Thank you!
    >
    > I already changed to Text instead, but I thought the regex was already
    > memoized by GHC, so that should not be a problem.
    >
    > I'm trying regex-applicative now, maybe that will help, but it takes
    > some time to figure out the syntax. I'll also try to see if
    > precompilation helps.
    >
    > Greetings,
    >
    > Bram
    >
    >
    >
    >
    > On Fri, May 19, 2017 at 1:17 PM, Станислав Черничкин
    > <schernichkin at gmail.com> wrote:
    >> Try to use Text or ByteString instead of strings. Try to use compile and
    >> execute methods
    >>
    >> (http://hackage.haskell.org/package/regex-tdfa-1.2.1/docs/Text-Regex-TDFA-ByteString.html),
    >> make sure regex get compiled once.
    >>
    >> 2017-05-16 12:12 GMT+03:00 Bram Neijt <bneijt at gmail.com>:
    >>>
    >>> Dear reader,
    >>>
    >>> I decided to do a little project which is a simple search and replace
    >>> program for large text files.
    >>>
    >>> Written in Haskell, it does a few different regex matches on each line
    >>> and stores them in a leveldb key-value store to create a
    >>> consistent/reviewable search-replace index. It should provide for some
    >>> simple/brute-force anonymization of data and therefore I called it
    >>> hanon (sorry, could not think of a better name).
    >>>
    >>> https://github.com/BigDataRepublic/hanon
    >>>
    >>> The code works, but I've done some benchmarking to compare it with
    >>> Python and the code is about 80x slower then doing the same thing in
    >>> Python, making it useless for larger data files.
    >>>
    >>> I'm obviously doing something wrong.
    >>>
    >>> Could you give me tips on improving the performance of this code?
    >>> Probably mainly looking at
    >>>
    >>> https://github.com/BigDataRepublic/hanon/blob/master/src/Mapper.hs
    >>>
    >>> where the regex code lives?
    >>>
    >>> Greetings,
    >>>
    >>> Bram
    >>> _______________________________________________
    >>> Haskell-Cafe mailing list
    >>> To (un)subscribe, modify options or view archives go to:
    >>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
    >>> Only members subscribed via the mailman list are allowed to post.
    >>
    >>
    >>
    >>
    >> --
    >> Sincerely, Stanislav Chernichkin.
    > _______________________________________________
    > Haskell-Cafe mailing list
    > To (un)subscribe, modify options or view archives go to:
    > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
    > Only members subscribed via the mailman list are allowed to post.
    >
    >