[Haskell-cafe] JRegex on "large" input sizes

Chris Kuklewicz haskell at list.mightyreason.com
Sat Jul 1 10:58:56 EDT 2006


David House wrote:
 > Hi all. I need a decent regex library and JRegex seems the perfect
 > choice: simple API, yet well-featured, as well as PCRE support.

I "maintain" Text.Regex.Lazy ( http://sourceforge.net/projects/lazy-regex ), so I 
should mention that it does not have full PCRE support.  The module's documentation ( 
summarized here: http://sourceforge.net/forum/forum.php?forum_id=554104 ) explains 
what it does have.  A summary of that summary:

For simple Regex usage (with capture) the Text.Regex.Lazy.Compat module replaces 
Text.Regex with a better implementation.

For simple expressions where a DFA works, the CompatDFA is fastest.

For fancier regexes (such as lazy patterns using ??, *? and +?), 
Text.Regex.Lazy.Full extends Text.Regex.Lazy.Compat.
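
To illustrate what "lazy" means for *?, here is a minimal sketch in plain 
Haskell.  It is not how Text.Regex.Lazy is implemented; the names Greed and 
dotStarThen are made up for this example, and it treats "." as matching any 
character, including newline.  It shows the text consumed by ".*" versus ".*?" 
when the pattern continues with a literal string:

```haskell
import Data.List (inits, isPrefixOf, tails)
import Data.Maybe (listToMaybe)

-- Hypothetical type for this sketch only.
data Greed = Greedy | Lazy

-- Text consumed by ".*" (Greedy) or ".*?" (Lazy) when the rest of the
-- pattern is the literal string `lit`.  Nothing means no match.
dotStarThen :: Greed -> String -> String -> Maybe String
dotStarThen g lit input = listToMaybe (order candidates)
  where
    -- Every prefix after which the literal matches, shortest first.
    candidates =
      [ pre
      | (pre, rest) <- zip (inits input) (tails input)
      , lit `isPrefixOf` rest
      ]
    -- Lazy takes the shortest candidate, Greedy the longest.
    order = case g of
      Lazy   -> id
      Greedy -> reverse
```

On the input "XbXb" with the literal "b", the Greedy variant consumes "XbX" 
and the Lazy variant consumes only "X".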

For much fancier regular expressions (e.g. PCRE) you would need to add three 
hopefully simple pieces:
(1) Extend the Parsec code used to comprehend the meaning of the regex string.
(2) Extend the code that produces the Parsec parser that implements the desired 
matching semantics.
(3) Add test cases for the expanded syntax and semantics.
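
To give a feel for piece (1) only, here is a minimal Parsec sketch that parses 
a tiny regex subset (literal characters, ".", and the quantifiers * + ? with an 
optional trailing ? for the lazy form).  The types and names (Greed, Atom, 
Piece, parseRegex) are invented for this example and are not the ones used in 
Text.Regex.Lazy; grouping, alternation, character classes and escapes are left 
out.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical AST for a tiny regex subset.
data Greed = Greedy | Lazy deriving (Eq, Show)

data Atom = Lit Char | AnyChar deriving (Eq, Show)

data Piece
  = One Atom         -- a plain atom
  | Opt Greed Atom   -- ?  or ??
  | Star Greed Atom  -- *  or *?
  | Plus Greed Atom  -- +  or +?
  deriving (Eq, Show)

-- "." or any character that is not one of the metacharacters handled here.
atom :: Parser Atom
atom = (AnyChar <$ char '.') <|> (Lit <$> noneOf ".*+?")

-- A '?' directly after a quantifier marks it as lazy.
greed :: Parser Greed
greed = option Greedy (Lazy <$ char '?')

-- An atom followed by an optional quantifier.
piece :: Parser Piece
piece = do
  a <- atom
  option (One a) $ choice
    [ char '*' *> ((\g -> Star g a) <$> greed)
    , char '+' *> ((\g -> Plus g a) <$> greed)
    , char '?' *> ((\g -> Opt g a) <$> greed)
    ]

regex :: Parser [Piece]
regex = many piece <* eof

parseRegex :: String -> Either ParseError [Piece]
parseRegex = parse regex "regex"
```

For example, parseRegex "a*?b" yields Right [Star Lazy (Lit 'a'),One (Lit 'b')], 
while parseRegex "*a" is rejected with a Left because a quantifier has no atom 
before it.  Supporting more syntax means adding constructors and parser 
alternatives here, and then teaching piece (2) what they mean.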

Note that Text.Regex.Lazy is an all-Haskell solution.  There are other Haskell 
projects that wrap the standard regex/pcre libraries.  The problem is that 
marshaling [Char] to C strings is quite slow and cannot be lazy, so you may want 
to use the new Fast Packed String (now ByteString) library with foreign 
functions to call the PCRE C library.
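
A rough sketch of that route, assuming the present-day bytestring package and 
GHC's FFI.  The C function strlen is only a stand-in for a real PCRE entry 
point, and cLength is a hypothetical name; the point is that the packed bytes 
are handed to C as one buffer instead of being marshaled character by 
character from a [Char]:

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import qualified Data.ByteString.Char8 as B
import Foreign.C.String (CString)
import Foreign.C.Types (CSize (..))

-- strlen from the C standard library, standing in for a regex library call.
foreign import ccall unsafe "string.h strlen"
  c_strlen :: CString -> IO CSize

-- Pass a ByteString to C.  useAsCString makes a NUL-terminated copy of the
-- bytes that is only valid inside the callback.
cLength :: B.ByteString -> IO Int
cLength bs = B.useAsCString bs (fmap fromIntegral . c_strlen)
```

For instance, cLength (B.pack "hello") returns 5.  Since strlen stops at the 
first NUL byte, a real binding would more likely pass a pointer together with 
an explicit length (B.useAsCStringLen).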

 > I want
 > to use it on a simple project which involves input files a little
 > larger than typical -- between 100KB and 500KB -- but still small
 > enough so as to not present a problem.
 >
 > However, and I'm fairly sure JRegex is at fault here, my program
 > segfaults on an input of ~230KB. Has anyone used JRegex successfully
 > in this way before? If so, what tactics did you use?
 >
 > Thanks in advance.
 >



