[Haskell-cafe] Re: ANN: weighted-regexp-0.1.0.0

Tue Jul 27 00:57:24 EDT 2010

Hi Sjoerd,

> It seems that only shift needs the reg field of the RegW datatype.  
> So you can also replace the reg field with a shift field. This makes  
> the regexp parser extensible, as there is no longer a dependence on  
> the (closed) datatype Reg:
>
>> data RegW w c = RegW { active :: !Bool,
>>                       empty :: !w,
>>                       final_ :: !w,
>>                       shift :: w -> c -> RegW w c }

Interesting observation. However, such an encoding would prevent the  
definition of some other functions on RegExp. More specifically, there  
are Show and Eq instances for the QuickCheck tests.

> For example it is then easy to define the parser that matches  
> nothing, which is the identity element of alt:
>
>> noMatch :: RegExp c
>> noMatch = RegExp noMatchW
>>
>> noMatchW :: Semiring w => RegW w c
>> noMatchW = RegW False zero zero $ \_ _ -> noMatchW

Note that you can also define it with the current interface:

     noMatch :: RegExp c
     noMatch = psym "(Big Lambda)" (const False)

Maybe I'll add it to the next version. I only need a better string  
representation ;)

> But otherwise I do wonder if the parser needs to be extensible.

I have some ideas for extending the matcher. For example /a{2,5}/ is  
currently translated into /aa(a(a(a)?)?)?/ but it may be possible to  
handle it without such blowup. I also want to add substring matching,  
i.e., the possibility to find out against which strings parenthesized  
parts of the regexp were matched.

But as the closed Reg type is not exported I can freely change it  
along with any matcher extension.

> For example some XML Schema implementations that are based on finite  
> automata have special cases for the xs:all construct, which matches  
> a list of elements, each occurring once in any order. But I tried a  
> straightforward implementation and it works fine:
>
>> eachOnce :: [RegExp c] -> RegExp c
>> eachOnce [] = eps
>> eachOnce ps = eachOnce' ps [] where
>>  eachOnce' [] _ = noMatch
>>  eachOnce' (p:ps) qs = (p `seq_` eachOnce (ps ++ qs)) `alt`  
>> eachOnce' ps (p:qs)

Neat! That's also worth adding. I find

     eachOnce :: [RegExp c] -> RegExp c
     eachOnce = foldr alt noMatch . map (foldr seq_ eps) . permutations

even clearer but your version is *much* better as it uses nesting to  
combine all alternatives that start with the same regexp.

Thanks!
Sebastian

-- 
Underestimating the novelty of the future is a time-honored tradition.
(D.G.)