[Haskell-cafe] Re: ANN: weighted-regexp-0.1.0.0

Benedikt Huber benjovi at gmx.net
Thu Jul 29 21:07:34 EDT 2010


Sebastian Fischer schrieb:
> 
> On Jul 29, 2010, at 12:47 AM, Benedikt Huber wrote:
> 
>> Taking a quick look at the PyPy blog post on JIT code generation for 
>> regular expressions, I thought it would be fun to implement a 
>> generator using the excellent LLVM bindings for haskell.
> 
> Interesting. Would you mind elaborating on your code? Maybe even write a 
> blog about how it works?
Hi,
so the current implementation (a little bit cleaned up at
   http://github.com/visq/llvm-regexp
), works as follows:

It generates an LLVM function which matches a bytestring against a given 
  regular expression. The state of the 'automaton' consists of one bit for
each leaf of the regexp AST, corresponding to the marks in the article's 
figure. It then generates straight-line code updating the state given an 
input character (generateRegexpCode and matchCharSet in Text.RegExp.LLVM).

For example, for matching '.a' , the generated code looks like this in a 
simplified pseudo code notation:

 > ... next ~ first character matched
 > ... ch   ~ input character
 >   next1 = bitmask[0]              -- was '.' marked ?
 >   bitmask[0] = next               -- mark '.' if initial
 >   next2 = bitmask[1]              -- was 'a' marked ?
 >   bitmask[1] = ch == 'a' && next1 -- mark 'a' if '.' was marked 

                                     -- and input is 'a'

At the end of the string, code is generated to check whether the 
automaton is in a final state (genFinalStateCheck). In the example 
above, this corresponds to

 >   final = bitmask[1]

The rest is either LLVM boilerplate (regexMatcher) or adaptions from 
weighed-regexp. Additionally, I've adapted the Parser.y from 
weighted-regexp, but some things (e.g. character classes like \w) are 
not implemented.

Generating one big basic block for the whole regular expressions does 
not work well when there are more than a few thousand nodes in the AST. 
Using functions for large regular expressions and loop for repititions 
would be one solution. The other problem is that the matcher only 
operates on Word8s at the moment.

Trying it out (if you have llvm+haskell bindings) is also easy (do not 
cabal-easy):

 > $ git clone git at github.com:visq/llvm-regexp.git && cd llvm-regexp
 > $ make
 > $ ./Grep '.*spotlight.*' < /etc/passwd
< Line 45 matches

cheers, benedikt


More information about the Haskell-Cafe mailing list