[Haskell-cafe] Re: Missing join and split

ChrisK haskell at list.mightyreason.com
Sat Dec 29 08:58:16 EST 2007


Mitar wrote:
> Hi!
>
> On Dec 28, 2007 5:51 PM, Lihn, Steve <horng_twu_lihn at merck.com> wrote:
>> Since regex is involved, it is specific to (Byte)String, not a generic
>> list.
>
> Oh, this gives me an interesting idea: making regular expressions more generic.
>

The new regex-base API is fairly generic.

If you look at the classes in regex-base's Text.Regex.RegexLike:

class Extract source => RegexLike regex source where
 matchAll :: regex -> source -> [MatchArray]
 matchOnce :: regex -> source -> Maybe MatchArray
 matchCount :: regex -> source -> Int
 matchTest :: regex -> source -> Bool
 matchAllText :: regex -> source -> [MatchText source]
 matchOnceText :: regex -> source -> Maybe (source, MatchText source, source)

you can see that the "regex" type parameter is fully abstract, and that the
"source" being searched is also fully abstract.  The reason for having those
specific class methods is to allow for the instance to expose the most efficient
way to do each operation.

You could make an instance for string seaching (e.g. KMP or BM searching).
Pretty much any "search" or "find" operation could be made into an instance of
RegexLike.   The main constraint is that the MatchArray/MatchText use Int
indexing and the Extract instance wants to be able to do lookup with this:

type MatchOffset = Int
type MatchLength = Int
type MatchArray = Array Int (MatchOffset, MatchLength)
type MatchText source = Array Int (source, (MatchOffset, MatchLength))

class Extract source where
 before :: Int -> source -> source
 after :: Int -> source -> source
 empty :: source
 extract :: (Int, Int) -> source -> source

One benefit is that all the RegexContext instances are implemented by using just
the above class methods, so all the polymorphic match/matchM will immediately work.

If there is ever a strong need for going beyond the range of Int indexing, then
one could either make new variants of the classes, or add methods to the
existing ones.  But if you are searching over 2GB of something, then perhaps
have this generic type class API is not the top priority.

> Would not it be interesting and useful (but not really efficient) to
> have patterns something like:
> 
> foo :: Eq a => a -> ...
> foo (_{4}'b') = ...
> 
> which would match a list with four elements ending with an element 'b'. Or:
> 
> foo (_+';'_+';'_) = ...
> 
> which would match a list with embedded two ';' elements. (Last _
> matched the remaining of the list.)
> 
> OK, maybe guards are not the proper place to implement this as would
> add a possibility to make a really messy Haskell programs. But
> extending regular expressions to work on any list of elements with
> type implementing Eq would be realy powerfull. And then we could use
> split in many other than just text processing contexts.
> 
> Of course, the problem in both cases is implementing something like
> regular expressions efficiently, especially on lists, but this is why
> there are smart people around. :-)
> 
> 
> Mitar



More information about the Haskell-Cafe mailing list