[Haskell-cafe] Unicode case (in)stability and Haskell identifiers.

Fri Nov 2 09:28:06 CET 2012

I've been putting together a proposal for Unicode identifiers
in Erlang (it's EEP 40 if anyone wants to look it up).  In
the course of this, it has turned out that there is a technical
problem for languages with case-significant identifiers.

Haskell 2010 report, chapter 2.
http://www.haskell.org/onlinereport/haskell2010/haskellch2.html

varid → (small {small | large | digit | ' })\⟨reservedid⟩
conid → large {small | large | digit | ' }

small    → ascSmall | uniSmall | _
ascSmall → a | b | … | z
uniSmall → any Unicode lowercase letter

large    → ascLarge | uniLarge
ascLarge → A | B | … | Z
uniLarge → any uppercase or titlecase Unicode letter

This is actually ambiguous: any ascSmall is also a uniSmall
and any ascLarge is also a uniLarge.  I take it that this
is intended to mean "any Unicode xxx letter other than an ASCII one"
in each case.

That's not the problem.  The definition currently bans Hebrew,
Arabic, Chinese, Japanese, all the Indic scripts, and basically
only allows Latin, Greek, Coptic, Cyrillic, Glagolitic,
Armenian, arguably Georgian, and Deseret (but not Shavian).
That's not the problem either.

The problem is that being a Unicode lower case, upper case,
or title case letter is not a stable property.
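To make the dependency concrete, here is a minimal sketch of the varid/conid
split from the grammar above, assuming the lexer consults Data.Char's
generalCategory (that is, whatever Unicode tables ship with the compiler's
base library). The names IdKind and classify are made up for illustration,
and the sketch ignores the reservedid exclusion.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Hypothetical result type: which production the token matches.
data IdKind = VarId | ConId
  deriving (Eq, Show)

-- small: underscore or any Unicode lowercase letter (Ll).
isSmall :: Char -> Bool
isSmall c = c == '_' || generalCategory c == LowercaseLetter

-- large: any uppercase (Lu) or titlecase (Lt) Unicode letter.
isLarge :: Char -> Bool
isLarge c = generalCategory c `elem` [UppercaseLetter, TitlecaseLetter]

-- digit: any Unicode decimal digit (Nd).
isIdDigit :: Char -> Bool
isIdDigit c = generalCategory c == DecimalNumber

-- Classify a token by its first character, after checking that every
-- following character is small, large, a digit, or an apostrophe.
-- Reserved words are NOT filtered out here.
classify :: String -> Maybe IdKind
classify [] = Nothing
classify (c : cs)
  | not (all isTail cs) = Nothing
  | isSmall c = Just VarId
  | isLarge c = Just ConId
  | otherwise = Nothing
  where
    isTail x = isSmall x || isLarge x || isIdDigit x || x == '\''
```

Every answer classify gives is a function of the General Category table, so
if a character's category changes between Unicode versions, the answer for
identifiers starting with that character changes with it. A token starting
with a caseless letter (category Lo, as in Hebrew) gets Nothing.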

Unicode annex UAX#31 guarantees that
       X is a well-formed case-insensitive identifier now
  =>   X will always be a well-formed case-insensitive
       identifier
and that
       X is a well-formed case-sensitive identifier now
   =>  X will always be a well-formed case-sensitive
       identifier

What it does NOT guarantee is that an identifier will continue
to begin with the same *case* or even that a letter will
continue to be classified as a letter.  So it is at least
technically possible for a valid Haskell 2010 varid
(conid) to turn into a conid (varid) or even cease to be
a legal Haskell identifier at all.  UAX#31 guarantees
stability of being-an-identifier by keeping an exceptional
set into which any letter that stops being a letter is
placed.  For example, there are
SCRIPT CAPITAL {B,E,F,H,I,L,M,P,R} characters, all of
which are capital letters except for SCRIPT CAPITAL P,
which is a symbol, but it's in the exception set so it's
still OK to use.  All of the SCRIPT CAPITAL letters were
in General Category So in Unicode 1.1.5 (the earliest for
which online data is available). In Unicode 2.1.8, all of
them were Lu except for SCRIPT CAPITAL P, which was Ll.
By Unicode 3.0.0, SCRIPT CAPITAL P was back to So.  Some
time later it switched over to Sm.  So we've had

SCRIPT CAPITAL P
	- not a letter (1.1.5)
	- is a lower case letter (2.1.8)
	- not a letter again (3.0.0)

at least according to the on-line UnicodeData-<version>.txt
files.  Putting ℘ into the exceptional set means that a
UAX#31 identifier may still contain it, but not so a Haskell one.
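Anyone who wants to see what their own implementation thinks of these
characters can ask it directly. This is a small sketch, assuming only
Data.Char from base. The names scriptCapitals and report are made up for
illustration, and the code points are the SCRIPT CAPITAL characters listed
above.

```haskell
import Data.Char (GeneralCategory, generalCategory)

-- The SCRIPT CAPITAL characters, keyed by their ASCII letter.
scriptCapitals :: [(Char, Char)]
scriptCapitals =
  [ ('B', '\x212C')
  , ('E', '\x2130')
  , ('F', '\x2131')
  , ('H', '\x210B')
  , ('I', '\x2110')
  , ('L', '\x2112')
  , ('M', '\x2133')
  , ('P', '\x2118')
  , ('R', '\x211B')
  ]

-- Pair each letter with the General Category that the local
-- Unicode tables assign to its SCRIPT CAPITAL form.
report :: [(Char, GeneralCategory)]
report = [(name, generalCategory c) | (name, c) <- scriptCapitals]
```

Evaluating `mapM_ print report` in GHCi lists one category per character.
The result is whatever the Unicode data behind the installed base library
says, which is exactly the version dependence described above.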

There are two aspects to this instability.

(1) Because Haskell hews its own line instead of tailoring
    UAX#31 the way Ada and Python do, Haskell cannot benefit
    from the UAX#31 stability guarantee.  There _has_ been a
    character that used to be legal in a Haskell identifier
    that is not now.  That's Haskell's problem, not Unicode's,
    and the Haskell community does not have to wait for anyone
    else to address it.

(2) Even if you adopt one of the UAX#31 definitions verbatim,
    the case distinction Haskell needs to make is not stable.

It appears that nobody who worked on UAX#31 was thinking about
languages like Prolog, Erlang, Clean, Haskell, F#, or Scala,
and that if the Unicode Consortium are told of the problem,
they will probably be happy to add some sort of "don't break
these languages" guideline.

Next week I intend to submit a proposal to the Unicode
Consortium to consider this issue.

Would anyone care to see and comment on the proposal
before I send it to Unicode.org?  Anyone got any suggestions
before I begin to write it?

For the sake of argument, suppose that we are going to
stick with Xid_Start Xid_Continue* for the union of
variables and atoms (which is pretty much what Ada and
Python do), and the sole issue of concern is that there
should be a stable way to classify such a token as
"beginning with default case" or "beginning with marked case".
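
To fix ideas, here is one possible reading of that classification, as a
sketch only and not the proposal itself. It assumes that "marked case" means
the first character is Lu or Lt according to the current tables, that
everything else is "default case", and that stability comes from a frozen
override table which pins any character whose category later changes. The
names StartCase, Overrides and startCase are hypothetical, and the token is
assumed to have been accepted as Xid_Start Xid_Continue* already.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Hypothetical two-way classification of an identifier's first character.
data StartCase = DefaultCase | MarkedCase
  deriving (Eq, Show)

-- Hypothetical frozen table: characters whose classification is pinned
-- so that a later change of General Category cannot flip it.
type Overrides = [(Char, StartCase)]

-- Classify a (non-empty, already lexed) token by its first character.
-- Pinned characters win; otherwise Lu and Lt are marked, the rest default.
startCase :: Overrides -> String -> Maybe StartCase
startCase _ [] = Nothing
startCase pinned (c : _) =
  case lookup c pinned of
    Just k -> Just k
    Nothing -> Just (fromCategory (generalCategory c))
  where
    fromCategory UppercaseLetter = MarkedCase
    fromCategory TitlecaseLetter = MarkedCase
    fromCategory _ = DefaultCase
```

With an empty override table this is just the General Category test. The
point of the table is that an entry, once added, is never removed, in the
same spirit as the UAX#31 exceptional set.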