Unicode source files

Simon Marlow simonmar at microsoft.com
Tue May 17 09:30:06 EDT 2005

On 13 May 2005 11:37, Bulat Ziganshin wrote:

> Thursday, May 05, 2005, 1:56:12 PM, you wrote:
>>> is it true that to support Unicode source files only the StringBuffer
>>> implementation must be changed?
>> It depends whether you want to support several different encodings,
>> or just UTF-8.  If we only want to support UTF-8, then we can keep
>> the StringBuffer in UTF-8 and also FastStrings.  (or you could
>> re-encode the other encodings into UTF-8).
> srcParseErr contains a call to "stepOnBy (-len)", and doing this will
> be hard with UTF-8.  Although we could save pointers to the positions
> of previous chars, or even just reparse the entire buffer from scratch
> - printing source errors is not such a frequent task.  Of course, it
> would be great to just save this position for us :)

I don't think that's a problem - instead of storing last_len in the
lexer state, we just store the actual number of bytes (or the
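Keeping the buffer in UTF-8 means each lexer step decodes one code point
and records how many bytes it spanned.  A minimal sketch of what that
decode step could look like - decodeChar is a hypothetical helper, not
GHC's actual StringBuffer code:

```haskell
import Data.Word (Word8)
import Data.Bits ((.&.), (.|.), shiftL)
import Data.Char (chr)

-- Decode one UTF-8 code point from a byte buffer, returning the Char
-- and the number of bytes consumed (so error recovery can step back
-- by bytes rather than by characters).
decodeChar :: [Word8] -> (Char, Int)
decodeChar (b:bs)
  | b < 0x80  = (chr (fromIntegral b), 1)          -- 1-byte ASCII
  | b < 0xe0  = multi 1 (b .&. 0x1f) bs            -- 2-byte sequence
  | b < 0xf0  = multi 2 (b .&. 0x0f) bs            -- 3-byte sequence
  | otherwise = multi 3 (b .&. 0x07) bs            -- 4-byte sequence
  where
    -- Fold n continuation bytes (6 payload bits each) into the value.
    multi n acc rest =
      let val = foldl (\a c -> (a `shiftL` 6) .|. (fromIntegral c .&. 0x3f))
                      (fromIntegral acc) (take n rest)
      in (chr val, n + 1)
decodeChar [] = error "decodeChar: empty buffer"
```

With this, "stepping on" always advances by the returned byte count, and
the previous position is just the current offset minus that count.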

>> The question is what Alex should see for a unicode character: Alex
>> currently assumes that characters are in the range 0-255 (you need a
>> fixed range in order to generate the lexer tables).  One possibility
>> is to map all Unicode upper-case characters to a single character
>> code for Alex, and similarly for the other classes of character.
> i don't know anything about Alex internals, and can only say that any
> solution is better done INSIDE Alex, so other programs using it will
> also get Unicode support

The right thing to do as far as Alex is concerned is to collapse the
full Char range onto a smaller number of character classes which are
then lexed using the standard DFA lexer.  Alex could figure out the
required character classes automatically.

However, a simpler solution for GHC would be to essentially do this by
hand, since we already know what the character classes for Haskell are
(upper case, lower case, digit etc.), and we already have some code that
determines character classes for Unicode characters (GHC.Unicode).  So
so, for example, you map upper-case Unicode characters onto 0xfe,
lower-case onto 0xfd, and so on.
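That hand-rolled collapse could be sketched as follows.  The function
name charClass and the 0xfc/0xff codes are illustrative; only the
0xfe/0xfd codes come from the text above:

```haskell
import Data.Char (isUpper, isLower, isDigit, ord)
import Data.Word (Word8)

-- Collapse the full Char range onto single byte codes for the DFA:
-- ASCII passes through unchanged, everything else is reduced to a
-- handful of class codes.
charClass :: Char -> Word8
charClass c
  | ord c < 0x80 = fromIntegral (ord c)  -- ASCII: keep the byte as-is
  | isUpper c    = 0xfe                  -- any upper-case letter
  | isLower c    = 0xfd                  -- any lower-case letter
  | isDigit c    = 0xfc                  -- any digit
  | otherwise    = 0xff                  -- one catch-all code for the rest
```

The lexer tables then only ever see bytes, so the 0-255 assumption in
the generated DFA is preserved.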

> ... if this problem is just about changing charType in Ctype.lhs - we
> can use some sort of hack. for example, use the current scheme until
> there is some char greater than (chr 255); at that moment we create an
> array for classification of chars 256-65535. all chars greater than
> (chr 65535) are better recognized with calls to the appropriate
> functions, i think
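Bulat's table scheme might look roughly like this - CharKind, kindOf
and classify are illustrative names, not GHC's actual types:

```haskell
import Data.Array (Array, listArray, (!))
import Data.Char (ord, chr, isUpper, isLower)

-- A tiny stand-in for the character classification GHC needs.
data CharKind = UpperK | LowerK | OtherK deriving (Eq, Show)

-- Classify via direct function calls (what GHC.Unicode provides).
kindOf :: Char -> CharKind
kindOf c
  | isUpper c = UpperK
  | isLower c = LowerK
  | otherwise = OtherK

-- Precomputed table for chars 256-65535, built once on demand
-- (Haskell's laziness gives the "create it when first needed" part
-- for free).
bmpTable :: Array Int CharKind
bmpTable = listArray (256, 65535) [ kindOf (chr i) | i <- [256 .. 65535] ]

-- Dispatch: existing scheme below 256, table lookup in the BMP,
-- direct calls above it.
classify :: Char -> CharKind
classify c
  | n < 256    = kindOf c       -- current Latin-1 scheme
  | n <= 65535 = bmpTable ! n   -- array lookup
  | otherwise  = kindOf c       -- rare astral chars: just call the function
  where n = ord c
```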

We could use a combination of Ctype and GHC.Unicode to find character
classes.

> btw, Ruby supports writing numbers in the form 1_200_000. how about
> adding this feature to GHC? ;)

I'm not keen on that.  We don't tend to introduce features that break
Haskell 98 compatibility unless they're quite compelling - and this is
only a small change.  It would introduce another way that code written
for GHC would gratuitously fail to compile with another compiler.  This
kind of change is best left until the next major revision of the
language.

More information about the Glasgow-haskell-users mailing list