[Haskell-cafe] Re: Richer (than ascii) notation for haskell source?
Richard A. O'Keefe
ok at cs.otago.ac.nz
Thu May 15 18:39:51 EDT 2008
On 15 May 2008, at 8:33 pm, Yitzchak Gale wrote:
> The point is that it is always best to keep language syntax
> as simple as possible, for many reasons. In the case of Unicode,
> that means staying as close as possible to the spirit of Unicode and
> minimizing our own ad hoc rules.
In particular, Unicode has explicit guidance about what an
identifier should be, in UAX#31:
http://www.unicode.org/reports/tr31/tr31-9.html
I've only recently started slogging my way through the
capital-city-telephone-book-size Unicode 5.0 book. (I was
tolerably current up to 4.0.) Imagine my stress levels on
discovering that Unicode 5.1 is already out, with another
"1,624 newly encoded characters", including a capital letter
version of "ß". It is deeply ironic that one of the things
that keeps changing is the stability policy. Another of the
things that has changed is UAX#31.
> Adding one more
> keyword is way simpler than adding a bunch of complex
> rules to the lexer.
Um, there's no way a Haskell lexer is going to comply with
the Unicode rules without a fair bit of complexity. The
basic idea is simply <id start><id continue>*, but there
are rules about when ZWJ and ZWNJ are allowed. The real
issue here is Unicode compliance, and the Unicode rules say
that a mixture of scripts is OK. Er, it's not actually
that simple. They do recommend that the scripts in table 4
_not_ be allowed in identifiers, so if you fancied writing
some of your identifiers in Shavian, you may or may not be
out of luck. (Just why a Coptic priest who is also a
coder should be discouraged from using the Coptic script in
his programs escapes me.)
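To make the <id start><id continue>* shape concrete, here is a minimal
sketch in Haskell. It is an approximation only: it derives both classes
from general categories via Data.Char, whereas the real UAX#31
properties also involve Other_ID_Start, Other_ID_Continue and the
Pattern_Syntax exclusions, and it ignores Haskell's own apostrophe and
case rules. The names isIdStart, isIdContinue and isIdentifier are mine,
not anything from the report or from GHC.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Rough stand-in for UAX#31 ID_Start: letters and letter numbers.
-- Assumption: general categories alone are close enough for a sketch.
isIdStart :: Char -> Bool
isIdStart c =
  generalCategory c `elem`
    [ UppercaseLetter, LowercaseLetter, TitlecaseLetter
    , ModifierLetter, OtherLetter, LetterNumber ]

-- Rough stand-in for ID_Continue: ID_Start plus combining marks,
-- decimal digits and connector punctuation (such as '_').
isIdContinue :: Char -> Bool
isIdContinue c =
  isIdStart c
    || generalCategory c `elem`
         [ NonSpacingMark, SpacingCombiningMark
         , DecimalNumber, ConnectorPunctuation ]

-- <id start><id continue>*, with no special handling of ZWJ or ZWNJ.
isIdentifier :: String -> Bool
isIdentifier []       = False
isIdentifier (c : cs) = isIdStart c && all isIdContinue cs
```

Because ZWJ (U+200D) and ZWNJ (U+200C) have general category Format,
this sketch rejects them everywhere. Admitting them only in the contexts
UAX#31 permits is where the "fair bit of complexity" comes from.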
> A lot less moving parts to break.
> Especially if those lexer rules are not so consistent with
> built-in Unicode concepts such as letter and symbol, glyph
> direction, etc.
UAX#31 definitely allows identifiers with any mixture of
left to right and right to left characters. The *intent* is
that anything even remotely reasonable should be accepted,
and should keep on being accepted, but of course the devil
is in the details.
>
> So I think the best and simplest idea is to make
> the letter lambda a keyword.
The lambda that people actually *want* in Haskell is in fact
the >mathematical< small letter lambda, not the Greek letter.
UAX#31 explicitly envisages "mathematically oriented programming
languages that make distinctive use of the Mathematical Alphanumeric
Symbols". I don't think there can be much argument about this
being the right way to encode the symbol used in typeset versions
of Haskell. There are three arguments against using it routinely:
(a) It is outside the 16-bit range that Java is happy with,
making it hard to write Haskell tools in Java. But then,
about 40% of the characters in Unicode are now outside the
16-bit range that Java is comfortable with, which is just too
bad for Java. Haskell tools should be written in Haskell,
and should cope with 21-bit characters. (Unicode 5 promises
never to go beyond 17 planes, so 21 bits will remain enough.)
(b) It is outside the range of characters currently available in
fonts. A character you cannot type or see isn't much use.
Implementations *will* catch up, but what do we do now?
(c) People *can* type a Greek small letter now, and will not be
interested in making fine distinctions between characters that
look pretty much the same. So people will *expect* the Greek
letter to work, even if a pedant like me says it's the wrong
character.
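Point (a) can be checked with a few lines of Haskell. This is a sketch
under one assumption: I take "mathematical small letter lambda" to mean
U+1D706 MATHEMATICAL ITALIC SMALL LAMDA, although the Mathematical
Alphanumeric Symbols block has several other lambdas (bold, bold italic,
and so on). The names outsideBmp and utf16Units are mine.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)

greekLambda, mathLambda :: Char
greekLambda = '\x03BB'   -- GREEK SMALL LETTER LAMDA
mathLambda  = '\x1D706'  -- MATHEMATICAL ITALIC SMALL LAMDA (assumed choice)

-- True when the code point does not fit in one 16-bit unit.
outsideBmp :: Char -> Bool
outsideBmp c = ord c > 0xFFFF

-- UTF-16 code units for a character: a single unit inside the 16-bit
-- range, a high/low surrogate pair outside it.
utf16Units :: Char -> [Int]
utf16Units c
  | outsideBmp c = [0xD800 + (v `shiftR` 10), 0xDC00 + (v .&. 0x3FF)]
  | otherwise    = [ord c]
  where
    v = ord c - 0x10000
```

The Greek letter comes out as the single unit 0x03BB, while the
mathematical one needs the pair 0xD835, 0xDF06. A Haskell Char holds
either code point directly, so a lexer written in Haskell does not have
to care.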
Of course, we could always take an upside down lambda and put some
bars through it and use ¥ for lambda. (Pop quiz: why would some
people not be surprised to see this instead of \ ?) [It's a joke.]
All of this seems to leave Greek small letter lambda as a keyword
as being the simplest solution, but it's easy to predict that it
will cause confusion.
> True, you need a space after it
> then. You already need spaces between the variables after the
> lambda, so anyway you might say that would be more consistent.
Who says there is more than one variable?
    \(x,y,z)->          doesn't have any spaces.
    \x -> \y -> \z ->   needs spaces, but that's because
                        ->\ is a single token, not because of the identifiers.
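The maximal-munch behaviour behind that can be shown with a toy
tokenizer. This is only an illustration, not how GHC's lexer is written:
toyTokens and isAscSymbol are hypothetical names, the symbol set is the
ASCII one from the report's ascSymbol production plus ':', and comments,
literals and Unicode symbols are ignored.

```haskell
import Data.Char (isAlphaNum, isSpace)

-- ASCII symbol characters that can appear in a Haskell operator token.
isAscSymbol :: Char -> Bool
isAscSymbol c = c `elem` "!#$%&*+./<=>?@\\^|-~:"

-- Toy maximal-munch tokenizer: a run of symbol characters is one token,
-- a run of identifier characters is one token, whitespace is skipped,
-- and anything else (parentheses, commas) is a one-character token.
toyTokens :: String -> [String]
toyTokens [] = []
toyTokens s@(c : cs)
  | isSpace c     = toyTokens cs
  | isAscSymbol c = let (t, rest) = span isAscSymbol s in t : toyTokens rest
  | isIdChar c    = let (t, rest) = span isIdChar s in t : toyTokens rest
  | otherwise     = [c] : toyTokens cs
  where
    isIdChar x = isAlphaNum x || x == '_' || x == '\''
```

With this, toyTokens "\\x->\\y->x" contains the token "->\\", which is
why the spaces are needed, whereas toyTokens "\\(x,y,z)->x" splits
cleanly because '(' ends the run of symbol characters after the
backslash.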
--
"I don't want to discuss evidence." -- Richard Dawkins, in an
interview with Rupert Sheldrake. (Fortean Times 232, p55.)