Unicode source files

Simon Marlow simonmar at microsoft.com
Thu May 5 05:56:12 EDT 2005

On 04 May 2005 15:57, Bulat Ziganshin wrote:

> is it true that to support Unicode source files only the StringBuffer
> implementation must be changed?

It depends on whether you want to support several different encodings,
or just UTF-8.  If we only want to support UTF-8, then we can keep the
StringBuffer in UTF-8 and also the FastStrings (or you could re-encode
the other encodings into UTF-8).
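Keeping the buffer in UTF-8 means the lexer reads variable-width characters directly from the bytes. A rough sketch of what that decoding step involves (the `decodeChar` helper and its list-of-bytes representation are made up for illustration, not GHC's actual StringBuffer code, and validation of malformed sequences is omitted):

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Illustrative sketch: decode one UTF-8 code point from a byte
-- sequence, returning the Char and the remaining bytes.  A UTF-8
-- StringBuffer would do something similar, advancing the cursor by
-- a variable number of bytes per character.
decodeChar :: [Word8] -> (Char, [Word8])
decodeChar (b:bs)
  | b < 0x80  = (chr (fromIntegral b), bs)  -- 1-byte (ASCII)
  | b < 0xE0  = multi 1 (b .&. 0x1F) bs     -- 2-byte sequence
  | b < 0xF0  = multi 2 (b .&. 0x0F) bs     -- 3-byte sequence
  | otherwise = multi 3 (b .&. 0x07) bs     -- 4-byte sequence
  where
    multi n acc rest =
      let (cont, rest') = splitAt n rest
          -- fold in 6 payload bits from each continuation byte
          code = foldl (\a c -> (a `shiftL` 6) .|. fromIntegral (c .&. 0x3F))
                       (fromIntegral acc) cont
      in (chr code, rest')
decodeChar [] = error "decodeChar: empty buffer"
```

The point is that indexing is no longer constant-width: the lexer advances by one to four bytes per character instead of exactly one.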

> if so, then the task can be simplified by converting any file read by
> hGetStringBuffer to a UTF-32 (PackedString) representation and keeping
> it in memory in that form. After this, we must change the indexing of
> ByteArray to indexing of Array Int Char, and replace the call to
> mkFastSubStringBA# somehow.

This is the other alternative.  It uses rather more memory, but that
might not be an issue.

The other thing that needs to be changed is the lexer, so that it can
recognise classes of Unicode characters (e.g. upper/lower case for
identifiers, symbol characters, etc.).  The code recently added to the
libraries can be used for this, I believe.
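For instance, the Unicode-aware predicates now in Data.Char classify characters across the whole range, not just ASCII. The identifier/operator predicates below are illustrative (their names are invented here, not taken from GHC's lexer):

```haskell
import Data.Char (isLower, isSymbol, isUpper)

-- Illustrative sketch: a lexer can lean on Data.Char's Unicode-aware
-- classification functions to recognise identifier and operator
-- characters beyond ASCII.
isConStart, isVarStart, isSymbolChar :: Char -> Bool
isConStart c   = isUpper c              -- constructor identifiers
isVarStart c   = isLower c || c == '_'  -- variable identifiers
isSymbolChar c = isSymbol c             -- operator symbols
```

With these, a Greek capital letter starts a constructor name just as 'A' does.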

The question is what Alex should see for a Unicode character: Alex
currently assumes that characters are in the range 0-255 (you need a
fixed range in order to generate the lexer tables).  One possibility is
to map all Unicode upper-case characters to a single character code for
Alex, and similarly for the other classes of character.
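A sketch of that mapping idea (the function name and the particular class codes are invented for illustration; reusing low control codes assumes those bytes never occur literally in source text):

```haskell
import Data.Char (isDigit, isLower, isSymbol, isUpper, ord)
import Data.Word (Word8)

-- Hypothetical sketch: collapse the full Unicode range to byte codes
-- that Alex's 0-255 tables can index.  Latin-1 characters pass through
-- unchanged; anything above 255 is replaced by a single representative
-- code per character class.
alexInputChar :: Char -> Word8
alexInputChar c
  | ord c < 256 = fromIntegral (ord c)  -- Latin-1: pass through
  | isUpper c   = 0x01  -- any other upper-case letter
  | isLower c   = 0x02  -- any other lower-case letter
  | isDigit c   = 0x03  -- any other digit
  | isSymbol c  = 0x04  -- any other symbol character
  | otherwise   = 0x05  -- everything else
```

Alex's generated tables then only ever see a 0-255 alphabet, while the lexer keeps the real Char alongside for building identifiers and strings.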

> btw, why are Unicode strings in the FastString module saved as [Int],
> not as String itself?

Probably for reasons that are no longer relevant.  When we changed Char
from 8 to 32 bits, we still had to compile GHC with older versions of
itself that only supported 8-bit Chars.


More information about the Glasgow-haskell-users mailing list