[Haskell-cafe] Re: Richer (than ascii) notation for haskell source?

Thu May 15 18:39:51 EDT 2008

On 15 May 2008, at 8:33 pm, Yitzchak Gale wrote:
> The point is that it is always best to keep language syntax
> as simple as possible, for many reasons. In the case of Unicode,
> that means staying as close as possible to the spirit of Unicode and
> minimizing our own ad hoc rules.

In particular, Unicode has explicit guidance about what an
identifier should be, in UAX#31:
http://www.unicode.org/reports/tr31/tr31-9.html

I've only recently started slogging my way through the
capital-city-telephone-book-size Unicode 5.0 book.  (I was
tolerably current to 4.0.)  Imagine my stress levels on
discovering that Unicode 5.1 is already out, with another
"1,624 newly encoded characters", including a capital letter
version of "ß".  It is deeply ironic that one of the things
that keeps changing is the stability policy.  Another of the
things that has changed is UAX#31.

> Adding one more
> keyword is way simpler than adding a bunch of complex
> rules to the lexer.

Um, there's no way a Haskell lexer is going to comply with
the Unicode rules without a fair bit of complexity.  The
basic idea is simply <id start><id continue>*, but there
are rules about when ZWJ and ZWNJ are allowed.  The real
issue here is Unicode compliance, and the Unicode rules say
that a mixture of scripts is OK.  Er, it's not actually
that simple.  They do recommend that the scripts in table 4
_not_ be allowed in identifiers, so if you fancied writing
some of your identifiers in Shavian, you may or may not be
out of luck.  (Just why a Coptic priest who is also a
coder should be discouraged from using the Coptic script in
his programs escapes me.)
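
To make the basic shape concrete, here is a minimal sketch of the
<id start><id continue>* rule.  It is only an approximation: Data.Char
does not expose the real ID_Start and ID_Continue properties, so the
general categories stand in for them, and the names isIdStart,
isIdContinue and isDefaultIdentifier are made up for illustration.
The ZWJ/ZWNJ rules, the script restrictions, and Haskell's own
conventions (such as the prime) are all left out.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Approximates ID_Start: letters plus letter numbers.
isIdStart :: Char -> Bool
isIdStart c =
  generalCategory c `elem`
    [ UppercaseLetter, LowercaseLetter, TitlecaseLetter
    , ModifierLetter, OtherLetter, LetterNumber ]

-- Approximates ID_Continue: ID_Start plus combining marks,
-- decimal digits and connector punctuation (such as '_').
isIdContinue :: Char -> Bool
isIdContinue c =
  isIdStart c
    || generalCategory c `elem`
         [ NonSpacingMark, SpacingCombiningMark
         , DecimalNumber, ConnectorPunctuation ]

-- <id start><id continue>*, with no special cases at all.
isDefaultIdentifier :: String -> Bool
isDefaultIdentifier []       = False
isDefaultIdentifier (c : cs) = isIdStart c && all isIdContinue cs
```

Note that this sketch rejects any identifier containing ZWJ (U+200D),
because that is a format character.  Deciding when it should be let
through is exactly where the extra complexity comes in.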

> A lot less moving parts to break.
> Especially if those lexer rules are not so consistent with
> built-in Unicode concepts such as letter and symbol, glyph
> direction, etc.

UAX#31 definitely allows identifiers with any mixture of
left to right and right to left characters.  The *intent* is
that anything even remotely reasonable should be accepted,
and should keep on being accepted, but of course the devil
is in the details.

> So I think the best and simplest idea is to make
> the letter lambda a keyword.

The lambda that people actually *want* in Haskell is in fact
the *mathematical* small letter lambda, not the Greek letter.
UAX#31 explicitly envisages "mathematically oriented programming
languages that make distinctive use of the Mathematical Alphanumeric
Symbols".  I don't think there can be much argument about this
being the right way to encode the symbol used in typeset versions
of Haskell.  There are three arguments against using it routinely:
  (a) It is outside the 16-bit range that Java is happy with,
      making it hard to write Haskell tools in Java.  But then,
      about 40% of the characters in Unicode are now outside the
      16-bit range that Java is comfortable with, which is just too
      bad for Java.  Haskell tools should be written in Haskell,
      and should cope with 21-bit characters.  (Unicode 5 promises
      never to go beyond plane 16, i.e. U+10FFFF, which needs 21 bits.)
  (b) It is outside the range of characters currently available in
      fonts.  A character you cannot type or see isn't much use.
      Implementations *will* catch up, but what do we do now?
  (c) People *can* type a Greek small letter now, and will not be
      interested in making fine distinctions between characters that
      look pretty much the same.  So people will *expect* the Greek
      letter to work, even if a pedant like me says it's the wrong
      character.
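
Point (a) is easy to check from Haskell itself.  The sketch below
assumes that the mathematical lambda meant here is U+1D706
(MATHEMATICAL ITALIC SMALL LAMDA); the block has bold and sans-serif
variants too, so that choice is an assumption.  The helper names
fitsIn16Bits and utf16Units are invented for the example.

```haskell
import Data.Char (ord)

greekLambda :: Char
greekLambda = '\x03BB'   -- GREEK SMALL LETTER LAMDA

mathLambda :: Char
mathLambda = '\x1D706'   -- MATHEMATICAL ITALIC SMALL LAMDA (assumed variant)

-- True when the code point fits in a single 16-bit unit.
fitsIn16Bits :: Char -> Bool
fitsIn16Bits c = ord c <= 0xFFFF

-- Number of UTF-16 code units needed: code points above U+FFFF
-- take a surrogate pair.
utf16Units :: Char -> Int
utf16Units c = if fitsIn16Bits c then 1 else 2
```

The Greek letter fits in one 16-bit unit; the mathematical one needs
a surrogate pair, which is where tools built around 16-bit characters
start to have trouble.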

Of course, we could always take an upside down lambda and put some
bars through it and use ¥ for lambda.  (Pop quiz: why would some
people not be surprised to see this instead of \ ?)  [It's a joke.]

All of this seems to leave Greek small letter lambda as a keyword
as being the simplest solution, but it's easy to predict that it
will cause confusion.
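
One source of that confusion: as far as Unicode is concerned the
Greek lambda is an ordinary lowercase letter, so a lexer that scans
identifiers greedily will swallow it together with whatever letters
follow.  A minimal sketch, assuming a much simplified identifier
scanner (scanVarId is a made-up name, reserved words are ignored, and
this is not how any real Haskell lexer is written):

```haskell
import Data.Char (isAlphaNum, isLower)

lambda :: Char
lambda = '\x03BB'   -- GREEK SMALL LETTER LAMDA

-- Split off the longest prefix that looks like a variable name:
-- a lowercase letter or underscore, then letters, digits,
-- underscores and primes.
scanVarId :: String -> (String, String)
scanVarId s@(c : _)
  | isLower c || c == '_' = span isIdChar s
  where
    isIdChar x = isAlphaNum x || x == '_' || x == '\''
scanVarId s = ("", s)
```

Applied to the text "λx -> x" this returns the single name "λx",
while "λ x -> x" yields "λ" on its own, which a keyword table could
then recognise.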

> True, you need a space after it
> then. You already need spaces between the variables after the
> lambda, so anyway you might say that would be more consistent.

Who says there is more than one variable?
\(x,y,z)-> doesn't have any spaces.
\x -> \y -> \z -> needs spaces, but that's because
->\ is a single token, not because of the identifiers.
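
Both forms can be tried directly.  The definitions sum3 and add3 are
just illustrations, and leadingOperator is a made-up helper that
mimics the "longest run of symbol characters" rule using the ASCII
symbol set from the Haskell 2010 report; it is a sketch, not a real
lexer.

```haskell
-- A tuple pattern: no spaces needed anywhere in the lambda.
sum3 :: (Int, Int, Int) -> Int
sum3 = \(x,y,z)->x+y+z

-- Nested lambdas: the space between "->" and "\" is required.
add3 :: Int -> Int -> Int -> Int
add3 = \x -> \y -> \z -> x+y+z

-- ASCII symbol characters that may appear in an operator token.
ascSymbols :: String
ascSymbols = "!#$%&*+./<=>?@\\^|-~:"

-- The longest prefix made only of symbol characters.
leadingOperator :: String -> String
leadingOperator = takeWhile (`elem` ascSymbols)
```

With the space, the text after x starts with the token ->; without
it, the run of symbol characters is ->\ and the lambda is lost.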

--
"I don't want to discuss evidence." -- Richard Dawkins, in an
interview with Rupert Sheldrake.  (Fortean times 232, p55.)