Alex unicode trick

Mateusz Kowalczyk fuuzetsu at fuuzetsu.co.uk
Tue Jan 7 18:18:33 UTC 2014


On 07/01/14 14:38, Simon Marlow wrote:
> Krasimir is right, it would be hard to use Alex's built-in Unicode
> support because we have to automatically generate the character classes
> from the Unicode spec somehow.  Probably Alex ought to include these as
> built-in macros, but right now it doesn't.
>
> Even if we did have access to the right regular expressions, I'm
> slightly concerned that the generated state machine might be enormous.
>
> Cheers,
> 	Simon
>
> On 07/01/2014 08:26, Krasimir Angelov wrote:
>> Hi,
>>
>> I was recently looking at this code to see how the lexer decides that a
>> character is a letter, space, etc. The problem is that with Unicode
>> there are hundreds of thousands of characters that are declared to be
>> alphanumeric. Even if they are compressed into a regular expression
>> with a list of ranges, there will still be ~390 ranges. The GHC lexer
>> avoids hardcoding these ranges by calling isSpace, isAlpha, etc. and
>> then converting the result to a code. Ideally, Alex would have
>> predefined macros corresponding to the Unicode categories, but for
>> now you have to either hard-code the ranges with huge regular
>> expressions or use the workaround used in GHC. Is there any other
>> solution?
>>
>> Regards,
>>    Krasimir
>>
>>

Ah, I think I understand now. If this is the case, then at least
‘alexGetChar’ could be removed, right? Is Alex 2.x compatibility
necessary for any reason whatsoever?
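
To check my understanding of the workaround Krasimir describes, here is a
minimal sketch. It is not the actual GHC code: the name `classify` and the
specific placeholder codes are assumptions made up for illustration. The idea
is that ASCII passes through untouched, while every other character is
collapsed to one reserved low code point per class, so the lexer spec only has
to mention a handful of characters instead of ~390 ranges.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isSpace)

-- Hypothetical placeholder codes, one per character class. Each reserved
-- low code point stands in for a whole Unicode class.
upperCode, lowerCode, digitCode, symbolCode, spaceCode, otherCode :: Char
upperCode  = '\x01'
lowerCode  = '\x02'
digitCode  = '\x03'
symbolCode = '\x04'
spaceCode  = '\x05'
otherCode  = '\x06'

-- Map a character to the one the generated state machine will see.
classify :: Char -> Char
classify c
  | c <= '\x06' = otherCode  -- never let a real char collide with a code
  | c <= '\x7f' = c          -- plain ASCII is matched directly by the spec
  | isSpace c   = spaceCode
  | otherwise   = case generalCategory c of
      UppercaseLetter -> upperCode
      TitlecaseLetter -> upperCode
      LowercaseLetter -> lowerCode
      OtherLetter     -> lowerCode
      DecimalNumber   -> digitCode
      MathSymbol      -> symbolCode
      CurrencySymbol  -> symbolCode
      ModifierSymbol  -> symbolCode
      OtherSymbol     -> symbolCode
      _               -> otherCode
```

The lexer spec would then, presumably, include the placeholder in the relevant
character set (for example, a set for uppercase letters containing A-Z plus
\x01), and the input-reading function would hand Alex `classify c` while the
original character is still kept in the input for building the token.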

--
Mateusz K.

