H98 Report: Unicode (was: Re: H98 Report: input functions)

Simon Peyton-Jones simonpj@microsoft.com
Wed, 11 Sep 2002 11:25:14 +0100


Ketil says:

| While we're at it, are there any plans to remove this paragraph from
| section 2.1:
|
| | Haskell uses a pre-processor to convert non-Unicode character sets
| | into Unicode. This pre-processor converts all characters to Unicode
| | and uses the escape sequence \uhhhh, where the "h" are hex digits,
| | to denote escaped Unicode characters. Since this translation occurs
| | before the program is compiled, escaped Unicode characters may
| | appear in identifiers and any other place in the program.

I agree!  This para should go.  No implementation does it, and the
paragraph is inconsistent with the lexical syntax of identifiers (which
can't contain escapes like \uhhhh).  I therefore propose to delete this
para altogether.
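
To illustrate the inconsistency, here is a minimal sketch, assuming a
compiler such as GHC that reads source files as UTF-8.  The names `αβ`
and `lambdaStr` are made up for the example.  A Unicode letter can be
written directly in an identifier, whereas escapes are part of the
lexical syntax of character and string literals only:

```haskell
-- A Unicode letter written directly in an identifier
-- (assumes the compiler accepts UTF-8 source, as GHC does).
αβ :: Int
αβ = 2

-- An escape is accepted inside a string literal; "\x3bb" is the
-- hexadecimal escape for U+03BB (Greek small letter lambda).
lambdaStr :: String
lambdaStr = "\x3bb"
```

There is no corresponding way to spell `αβ` with escapes, which is why
the paragraph does not match the lexical syntax.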

| Note that the 16-bits remark should probably be removed, Unicode code
| points extend beyond that nowadays.

It was removed some while ago.

| Also, if a provision is made for escaped Unicode in identifiers, it
| would be nice if the section on layout (2.7) discouraged layout rules
| where the indentation level depended on the width of non-space
| characters.  (Ideally, this would result in a compiler warning.)
| In fact, this might always be useful, since some Unicode characters
| are defined as double width.

OK, here's what I suggest.  The Report currently says:
"For the purposes of the layout rule, Unicode characters in a source
program are considered to be of the same, fixed, width as an ASCII
character."

I do not propose to change this statement now.  Doubtless it could be
cleverer, but it's too late for new cleverness.  I would, however, be
willing to add:

"To avoid visual confusion, programmers should avoid writing programs in
which the meaning of implicit layout depends on the width of non-space
characters."
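
A minimal sketch of the kind of program the sentence warns about,
assuming UTF-8 source and an editor that may draw some characters at
double width.  The names `fragile` and `robust` are made up for the
example.  In `fragile` the implicit block opens to the right of a
non-ASCII character, so whether `b` appears to line up with `a` on
screen depends on how wide that character is drawn.  In `robust` the
block starts on its own line, so only spaces precede the bindings:

```haskell
-- Fragile: `b` lines up with `a` only when every character before `a`
-- on the first line is counted as one column, as the Report specifies.
-- An editor that draws the character in the string at double width
-- will show `b` one column to the left of `a`.
fragile :: Int
fragile = length "語" + (let a = 1
                            b = 2 in a + b)

-- Robust: the layout block begins on a fresh line, so its indentation
-- is made of spaces only and does not depend on character widths.
robust :: Int
robust = length "語" + total
  where
    total = a + b
    a = 1
    b = 2
```

Both definitions evaluate to 4 under the fixed-width rule; the
difference is only in how easily a reader (or an editor) can be misled
about the alignment.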

That's more or less what Ketil suggested, and I think it is
non-controversial.  Any comments?

Simon