Fwd: Re[2]: Unicode source files

Bulat Ziganshin bulatz at HotPOP.com
Fri May 13 06:37:00 EDT 2005


Sorry, Simon, have you received this message?


This is a forwarded message
From: Bulat Ziganshin <bulatz at HotPOP.com>
To: "Simon Marlow" <simonmar at microsoft.com>
Date: Thursday, May 05, 2005, 10:13:37 PM
Subject: Unicode source files

===8<==============Original message text===============
Hello Simon,

Thursday, May 05, 2005, 1:56:12 PM, you wrote:

>> is it true that to support Unicode source files only the StringBuffer
>> implementation must be changed?

SM> It depends whether you want to support several different encodings, or
SM> just UTF-8.  If we only want to support UTF-8, then we can keep the
SM> StringBuffer in UTF-8 and also FastStrings.  (or you could re-encode the
SM> other encodings into UTF-8).

srcParseErr contains a call to "stepOnBy (-len)", and doing this will be
hard with UTF-8. Although we could save pointer(s) to the positions of
the previous chars, or even just reparse the entire buffer from scratch -
printing source errors is not such a frequent task. Of course, it would
be best to simply save this position in the first place :)
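Stepping backwards in a UTF-8 buffer is actually doable without saved
positions: continuation bytes always have the form 10xxxxxx, so one can
scan backwards past them to find the start of the previous character. A
minimal standalone sketch (using a plain [Word8] in place of
StringBuffer, which is an assumption on my part):

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- UTF-8 continuation bytes always have the form 10xxxxxx.
isContinuation :: Word8 -> Bool
isContinuation b = b .&. 0xC0 == 0x80

-- Byte offset of the start of the character preceding offset i.
prevCharOffset :: [Word8] -> Int -> Int
prevCharOffset bs i
  | j > 0 && isContinuation (bs !! j) = prevCharOffset bs j
  | otherwise                         = j
  where j = i - 1

-- Step back n characters: a UTF-8 analogue of stepOnBy (-n).
stepBackBy :: [Word8] -> Int -> Int -> Int
stepBackBy _  i 0 = i
stepBackBy bs i n = stepBackBy bs (prevCharOffset bs i) (n - 1)
```

This is linear in the number of bytes stepped over, but since it only
runs on the error path that should not matter.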

Making FastString UTF-8-enabled would be great. It needs changes in
lengthFS, indexFS and maybe cmpFS (can UTF-8 strings be compared
with just memcmp?). I also don't know about hPutFS - the Win32 console
works in either an OEM or an ANSI 8-bit encoding.
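On the memcmp question: UTF-8 was designed to be order-preserving, i.e.
bytewise lexicographic comparison of two valid UTF-8 strings agrees with
comparison of their code points, so memcmp should work for both equality
and ordering. A small sketch checking this property with a hand-rolled
(hypothetical) encoder:

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Minimal UTF-8 encoder (code points up to 0x10FFFF), for checking
-- that bytewise ordering matches code-point ordering.
utf8 :: Char -> [Word8]
utf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi (n `shiftR` 6), cont n]
  | n < 0x10000 = [0xE0 .|. hi (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  | otherwise   = [0xF0 .|. hi (n `shiftR` 18), cont (n `shiftR` 12),
                   cont (n `shiftR` 6), cont n]
  where
    n      = ord c
    hi     = fromIntegral
    cont x = 0x80 .|. fromIntegral (x .&. 0x3F)

encode :: String -> [Word8]
encode = concatMap utf8
```

Note this only holds for valid encodings; overlong sequences would break
both the ordering and memcmp equality.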

SM> The question is what Alex should see for a unicode character: Alex
SM> currently assumes that characters are in the range 0-255 (you need a
SM> fixed range in order to generate the lexer tables).  One possibility is
SM> to map all Unicode upper-case characters to a single character code for
SM> Alex, and similarly for the other classes of character.

I don't know anything about Alex's internals, and can only say that any
solution is better done INSIDE Alex, so that other programs using it
will also get Unicode support.

If this problem is just about changing charType in Ctype.lhs, we could
use some sort of hack: for example, use the current scheme until a char
greater than (chr 255) appears, and at that moment create an array for
classifying chars 256-65535. All chars greater than (chr 65535) are
better recognized with calls to the appropriate functions, I think.
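The classification hack Simon describes could look roughly like this (a
sketch only; the representative characters 'A', 'a', '0' and '\xFF' are
placeholders I made up, and the Unicode-aware predicates come from
Data.Char):

```haskell
import Data.Char (chr, isLower, isNumber, isUpper, ord)

-- Hypothetical sketch: characters in 0-255 pass through unchanged,
-- and everything above is collapsed to one representative code per
-- character class, so Alex's tables can keep a fixed 0-255 range.
classify :: Char -> Char
classify c
  | ord c <= 255 = c       -- handled by the existing Latin-1 tables
  | isUpper  c   = 'A'     -- any other upper-case letter
  | isLower  c   = 'a'     -- any other lower-case letter
  | isNumber c   = '0'     -- any other digit-like character
  | otherwise    = '\xFF'  -- everything else (symbols, etc.)
```

The lexer would then feed (classify c) to the Alex tables while keeping
the real character for building the token text.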



By the way, Ruby supports writing numbers in the form 1_200_000. How
about adding this feature to GHC? ;)

Lexer.x:

@decimal     = $digit [$digit \_]*
@octal       = [$octit \_]+
@hexadecimal = [$hexit \_]+

StringBuffer.lhs:

parseInteger :: StringBuffer -> Int -> Integer -> (Char->Int) -> Integer
parseInteger buf len radix to_int
  = go 0 0
  where go i x | i == len  = x
               | otherwise = case lookAhead buf i of
                               '_' -> go (i+1) x  -- skip underscore separators
                               c   -> go (i+1) (x * radix + toInteger (to_int c))
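For illustration, here is the same underscore-skipping logic on a plain
String (a hypothetical standalone adaptation, since lookAhead and
StringBuffer live inside GHC):

```haskell
import Data.Char (digitToInt)

-- Standalone version of the underscore-skipping parser above, on a
-- plain String instead of a StringBuffer.
parseIntegerStr :: String -> Integer -> (Char -> Int) -> Integer
parseIntegerStr s radix to_int = go s 0
  where
    go []       x = x
    go ('_':cs) x = go cs x  -- skip underscore separators
    go (c:cs)   x = go cs (x * radix + toInteger (to_int c))
```

With this, parseIntegerStr "1_200_000" 10 digitToInt gives 1200000.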



-- 
Best regards,
 Bulat                            mailto:bulatz at HotPOP.com

===8<===========End of original message text===========



-- 
Best regards,
 Bulat                            mailto:bulatz at HotPOP.com





More information about the Glasgow-haskell-users mailing list