[Haskell] regular expression syntax - perl ain't got nothin on haskell

Glynn Clements glynn.clements at virgin.net
Tue Feb 24 18:24:39 EST 2004


Hal Daume III wrote:

> p.s., certainly this is at least somewhat unique to me, but almost all of 
> the data i work with is unstructured text for two reasons.  first, that's 
> how it naturally comes.  second, to throw xml or some other scheme on to 
> it will balloon the data sizes to unmanageable amounts, with little gain.

There's a pretty big gap between *unstructured* text and e.g. XML. 
Most of what fits into that gap is essentially structured text.

If you're performing some kind of processing on the text, the odds are
that it does actually have some degree of structure to it.

My experience of code which does ad-hoc text processing using regexps
or similar is that a lot of it only handles a subset of what it ought
to, and that subset is typically defined by the nature of the
technique. Some examples of this issue are code which attempts to:

+ match C-style string literals, but falls down on an embedded \"
sequence;

+ match code tokens, but matches the same sequence of characters when
they occur inside string literals;

+ process email headers, but falls down on folded headers;

+ process HTML, but falls down in more ways than I could possibly
list.
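
To make the folded-header case concrete, here is a minimal Haskell
sketch. It assumes the headers have already been split into lines, and
that a line beginning with a space or tab continues the previous header
(the usual RFC 2822 folding rule). The names unfoldHeaders and
splitHeader are mine, purely for illustration, not from any library.

```haskell
-- | Join folded header lines. A line that starts with a space or tab is
-- treated as a continuation of the line before it; unfolding only drops
-- the line break and keeps the whitespace.
unfoldHeaders :: [String] -> [String]
unfoldHeaders = foldr step []
  where
    -- 'next' is the already-unfolded line that follows 'l'.
    step l (next@(c : _) : rest)
      | c == ' ' || c == '\t' = (l ++ next) : rest
    step l acc = l : acc

-- | Split an unfolded header into (name, value) at the first colon.
-- Returns Nothing for lines without a colon.
splitHeader :: String -> Maybe (String, String)
splitHeader l = case break (== ':') l of
  (name, ':' : value) -> Just (name, dropWhile (`elem` " \t") value)
  _                   -> Nothing
```

A line-at-a-time regexp only ever sees the first physical line of a
folded header, which is exactly the failure described above; unfolding
first means everything downstream works on whole headers.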

Except in the most trivial cases, to process text *reliably* you
usually need to at least tokenise it and process the token stream. And
anything which has a more complex structure usually needs to operate
(at least conceptually) on a parse tree.
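
As a rough illustration of "tokenise, then process the token stream",
the Haskell sketch below renames an identifier without touching string
literals, and copes with an embedded \" inside a literal. The Token
type and all function names are hypothetical, and the lexer is
deliberately tiny: it ignores comments, character literals, numbers and
so on, so it is a sketch of the idea rather than a C lexer.

```haskell
import Data.Char (isAlpha, isAlphaNum)

-- | A deliberately small token type (illustrative only).
data Token
  = StrLit String   -- string literal, verbatim, including its quotes
  | Ident String    -- identifier
  | Other Char      -- any other single character
  deriving (Eq, Show)

tokenise :: String -> [Token]
tokenise [] = []
tokenise ('"' : cs) =
  let (body, rest) = stringBody cs
  in StrLit ('"' : body) : tokenise rest
tokenise s@(c : cs)
  | isAlpha c || c == '_' =
      let (name, rest) = span (\x -> isAlphaNum x || x == '_') s
      in Ident name : tokenise rest
  | otherwise = Other c : tokenise cs

-- | Consume up to and including the closing quote. A backslash always
-- takes the next character with it, so \" does not end the literal.
stringBody :: String -> (String, String)
stringBody ('\\' : c : cs) = let (b, r) = stringBody cs in ('\\' : c : b, r)
stringBody ('"' : cs)      = ("\"", cs)
stringBody (c : cs)        = let (b, r) = stringBody cs in (c : b, r)
stringBody []              = ([], [])  -- unterminated literal: keep the rest

-- | Turn tokens back into text; render . tokenise is the identity.
render :: [Token] -> String
render = concatMap go
  where
    go (StrLit s) = s
    go (Ident s)  = s
    go (Other c)  = [c]

-- | Rename identifier 'old' to 'new', leaving string literals alone.
renameIdent :: String -> String -> String -> String
renameIdent old new = render . map swap . tokenise
  where
    swap (Ident s) | s == old = Ident new
    swap t = t
```

For example, renameIdent "foo" "bar" applied to the text
foo("foo \" foo", foo); changes only the first and last foo, because
the occurrences inside the literal never become Ident tokens.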

Regexps certainly have their place, although that's primarily in
writing tokenisers. IMHO, trying to do everything (or, at least, too
much) using s/pattern/replacement/ constructs seems to be a favourite
recipe for buggy code.

Case in point: the regular occurrence of cross-site scripting, SQL
injection, printf() format-string and similar issues on lists such as
BugTraq.

-- 
Glynn Clements <glynn.clements at virgin.net>

