[Haskell-cafe] JRegex on "large" input sizes

Chris Kuklewicz haskell at list.mightyreason.com
Sat Jul 1 17:22:33 EDT 2006


Gregory Woodhouse wrote:
> 
> On Jul 1, 2006, at 6:27 AM, David House wrote:
> 
>> Hi all. I need a decent regex library and JRegex seems the perfect
>>
>> choice: simple API, yet well-featured, as well as PCRE support. I want
>>
>> to use it on a simple project which involves input files a little
>>
>> larger than typical -- between 100KB and 500KB -- but still small
>>
>> enough so as to not present a problem.
>>
>>
>> However, and I'm fairly sure JRegex is at fault here, my program
>>
>> segfaults on an input of ~230KB. Has anyone used JRegex successfully
>>
>> in this way before? If so, what tactics did you use?
>>
>>
>> Thanks in advance.
>>
> 
> I have no idea how JRegex works, but regular expressions are normally 
> implemented using finite automata, so the input size should be immaterial.
> 
> Gregory Woodhouse
> gregory.woodhouse at sbcglobal.net <mailto:gregory.woodhouse at sbcglobal.net>
> 
> "Judge a man by his questions not 
> his answers."   --Voltaire
> 

It depends on the advanced features that you need.  If you use sub-expression 
capturing then it often changes to a backtracking algorithm. This use of 
backtracking is true of the Text.Regex.Lazy.Full code, which creates a Parsec 
parser instead of a DFA.  The CompatDFA module uses an automata (derived from 
the CTKLight code, http://www.cse.unsw.edu.au/~chak/haskell/ctk/ ), and as a 
bonus, the DFA itself is build in a Lazy fashion, so unexplored parts of the 
automata don't get built.

I found creating a Parsec parser to do regex matching was possible, but a bit 
odd in places.  Like CTK, it is more aimed at writing a Lexer/Parser than a 
simple matcher.

At some point, the FastPackedString version of Text.Regex.Lazy should be changed 
to the new ByteString library.  Perhaps there is even room to build in on (or 
present it as) a type class.

-- 
Chris

"I'm here tonight to present the award everyone's been waiting for: best movie. 
This award holds a special place in my heart because next year I'll be winning 
it for Snakes on a Plane." -- June 3, 2006, Samuel L. Jackson


More information about the Haskell-Cafe mailing list