[Haskell-cafe] Re: ANN: weighted-regexp-0.1.0.0
benjovi at gmx.net
Thu Jul 29 21:07:34 EDT 2010
Sebastian Fischer schrieb:
> On Jul 29, 2010, at 12:47 AM, Benedikt Huber wrote:
>> Taking a quick look at the PyPy blog post on JIT code generation for
>> regular expressions, I thought it would be fun to implement a
>> generator using the excellent LLVM bindings for haskell.
> Interesting. Would you mind elaborating on your code? Maybe even write a
> blog about how it works?
so the current implementation (a little bit cleaned up at
), works as follows:
It generates an LLVM function which matches a bytestring against a given
regular expression. The state of the 'automaton' consists of one bit for
each leaf of the regexp AST, corresponding to the marks in the article's
figure. It then generates straight-line code updating the state given an
input character (generateRegexpCode and matchCharSet in Text.RegExp.LLVM).
For example, for matching '.a' , the generated code looks like this in a
simplified pseudo code notation:
> ... next ~ first character matched
> ... ch ~ input character
> next1 = bitmask -- was '.' marked ?
> bitmask = next -- mark '.' if initial
> next2 = bitmask -- was 'a' marked ?
> bitmask = ch == 'a' && next1 -- mark 'a' if '.' was marked
-- and input is 'a'
At the end of the string, code is generated to check whether the
automaton is in a final state (genFinalStateCheck). In the example
above, this corresponds to
> final = bitmask
The rest is either LLVM boilerplate (regexMatcher) or adaptions from
weighed-regexp. Additionally, I've adapted the Parser.y from
weighted-regexp, but some things (e.g. character classes like \w) are
Generating one big basic block for the whole regular expressions does
not work well when there are more than a few thousand nodes in the AST.
Using functions for large regular expressions and loop for repititions
would be one solution. The other problem is that the matcher only
operates on Word8s at the moment.
Trying it out (if you have llvm+haskell bindings) is also easy (do not
> $ git clone git at github.com:visq/llvm-regexp.git && cd llvm-regexp
> $ make
> $ ./Grep '.*spotlight.*' < /etc/passwd
< Line 45 matches
More information about the Haskell-Cafe