[Haskell-cafe] JRegex on "large" input sizes
haskell at list.mightyreason.com
Sat Jul 1 17:22:33 EDT 2006
Gregory Woodhouse wrote:
> On Jul 1, 2006, at 6:27 AM, David House wrote:
>> Hi all. I need a decent regex library and JRegex seems the perfect
>> choice: simple API, yet well-featured, as well as PCRE support. I want
>> to use it on a simple project which involves input files a little
>> larger than typical -- between 100KB and 500KB -- but still small
>> enough so as to not present a problem.
>> However, and I'm fairly sure JRegex is at fault here, my program
>> segfaults on an input of ~230KB. Has anyone used JRegex successfully
>> in this way before? If so, what tactics did you use?
>> Thanks in advance.
> I have no idea how JRegex works, but regular expressions are normally
> implemented using finite automata, so the input size should be immaterial.
> Gregory Woodhouse
> gregory.woodhouse at sbcglobal.net <mailto:gregory.woodhouse at sbcglobal.net>
> "Judge a man by his questions not
> his answers." --Voltaire
It depends on the advanced features that you need. If you use sub-expression
capturing then it often changes to a backtracking algorithm. This use of
backtracking is true of the Text.Regex.Lazy.Full code, which creates a Parsec
parser instead of a DFA. The CompatDFA module uses an automata (derived from
the CTKLight code, http://www.cse.unsw.edu.au/~chak/haskell/ctk/ ), and as a
bonus, the DFA itself is build in a Lazy fashion, so unexplored parts of the
automata don't get built.
I found creating a Parsec parser to do regex matching was possible, but a bit
odd in places. Like CTK, it is more aimed at writing a Lexer/Parser than a
At some point, the FastPackedString version of Text.Regex.Lazy should be changed
to the new ByteString library. Perhaps there is even room to build in on (or
present it as) a type class.
"I'm here tonight to present the award everyone's been waiting for: best movie.
This award holds a special place in my heart because next year I'll be winning
it for Snakes on a Plane." -- June 3, 2006, Samuel L. Jackson
More information about the Haskell-Cafe