[Haskell-cafe] Haskell performance when it comes to regex?

Bram Neijt bneijt at gmail.com
Mon May 29 14:40:01 UTC 2017


Hi Chris,

Thank you for looking into this and thank you for your pull-request.

I moved the "=~" outside of the map and that makes the whole thing a
huge amount faster.

It seems my assumption that =~ would memoise the regex creation was
wrong (I read that in a post on regex in Haskell[1]).
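
For the archives, a minimal sketch of the difference, assuming the
regex-tdfa package and String input. The pattern and the names
emailPattern, slowMatches and fastMatches are made up for illustration
and are not taken from hanon:

```haskell
import Text.Regex.TDFA (Regex, makeRegex, matchTest, (=~))

-- Hypothetical pattern, for illustration only (not from hanon).
emailPattern :: String
emailPattern = "[a-z0-9._]+@[a-z0-9.]+"

-- (=~) receives the pattern as plain text, so it has to build a Regex
-- from it on each call. GHC does not promise to share that work
-- between the elements of the list.
slowMatches :: [String] -> [Bool]
slowMatches = map (\line -> line =~ emailPattern)

-- Build the Regex once, outside the map, and reuse the compiled value
-- for every line.
fastMatches :: [String] -> [Bool]
fastMatches ls = map (matchTest re) ls
  where
    re :: Regex
    re = makeRegex emailPattern
```

Both functions return the same answers; only the number of times the
pattern may get compiled differs. In GHCi,
fastMatches ["contact: bram@example.org", "no address here"] should
give [True,False].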

The 80x difference is now gone. The Python code did everything except
the leveldb part (but it still compiled the regexes every time, so it
seemed like a valid comparison at the time); see the attachment for
the code.

Thank you all for your help!

Bram

[1] http://www.serpentine.com/blog/2007/02/27/a-haskell-regular-expression-tutorial/

On Sun, May 28, 2017 at 2:22 PM, Chris Dornan <chris at chrisdornan.com> wrote:
> Hi Bram,
>
>
>
> Sorry for being a bit late to this -- I have been on the road.
>
>
>
> I have switched your example over to pre-compile the REs and use ByteString
> and can see a 13x speedup on scan and a 9x speedup on mapping. Curiously,
> nearly all of that speedup seems to be gained by lifting the RE compilation
> out of the loop, but I am pretty sure there are gains to be had from
> re-writing the loops.
>
>
>
> Do you have the Python code that was performing 80x better?
>
>
>
> Chris
>
>
>
>
>
> From: Alfredo Di Napoli <alfredo.dinapoli at gmail.com>
> Date: Monday, 22 May 2017 at 08:48
> To: Bram Neijt <bneijt at gmail.com>
> Cc: Станислав Черничкин <schernichkin at gmail.com>, haskell-cafe
> <haskell-cafe at haskell.org>, Chris Dornan <chris at chrisdornan.com>
> Subject: Re: [Haskell-cafe] Haskell performance when it comes to regex?
>
>
>
> Hi Bram,
>
>
>
> you might be interested in the “regex” package from my colleague Chris
> Dornan:
>
>
>
> http://regex.uk/
>
>
>
> I know some proper performance work still needs to be done, but I would be
> curious to hear your experience report ;)
>
>
>
> Alfredo
>
>
>
> On 19 May 2017 at 18:52, Bram Neijt <bneijt at gmail.com> wrote:
>
> Thank you!
>
> I already changed to Text instead, but I thought the regex was already
> memoized by GHC, so that should not be a problem.
>
> I'm trying regex-applicative now, maybe that will help, but it takes
> some time to figure out the syntax. I'll also try to see if
> precompilation helps.
>
> Greetings,
>
> Bram
>
>
>
>
> On Fri, May 19, 2017 at 1:17 PM, Станислав Черничкин
> <schernichkin at gmail.com> wrote:
>> Try using Text or ByteString instead of String. Try the compile and
>> execute functions
>>
>> (http://hackage.haskell.org/package/regex-tdfa-1.2.1/docs/Text-Regex-TDFA-ByteString.html),
>> and make sure the regex gets compiled only once.
>>
>> 2017-05-16 12:12 GMT+03:00 Bram Neijt <bneijt at gmail.com>:
>>>
>>> Dear reader,
>>>
>>> I decided to do a little project which is a simple search and replace
>>> program for large text files.
>>>
>>> Written in Haskell, it does a few different regex matches on each line
>>> and stores them in a leveldb key-value store to create a
>>> consistent/reviewable search-replace index. It should provide for some
>>> simple/brute-force anonymization of data and therefore I called it
>>> hanon (sorry, could not think of a better name).
>>>
>>> https://github.com/BigDataRepublic/hanon
>>>
>>> The code works, but I've done some benchmarking to compare it with
>>> Python and the code is about 80x slower than doing the same thing in
>>> Python, making it useless for larger data files.
>>>
>>> I'm obviously doing something wrong.
>>>
>>> Could you give me tips on improving the performance of this code?
>>> Probably mainly looking at
>>>
>>> https://github.com/BigDataRepublic/hanon/blob/master/src/Mapper.hs
>>>
>>> where the regex code lives?
>>>
>>> Greetings,
>>>
>>> Bram
>>> _______________________________________________
>>> Haskell-Cafe mailing list
>>> To (un)subscribe, modify options or view archives go to:
>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>>> Only members subscribed via the mailman list are allowed to post.
>>
>>
>>
>>
>> --
>> Sincerely, Stanislav Chernichkin.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hanon.py
Type: text/x-python
Size: 1107 bytes
Desc: not available
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20170529/b41a12dc/attachment.py>

