[Haskell-cafe] Unicode Haskell source -- Yippie!

ok at cs.otago.ac.nz ok at cs.otago.ac.nz
Wed Apr 30 12:38:18 UTC 2014


I wrote
>> If we turn to Unicode, how should we read
>>
>>         a ⊞ b ⟐ c
>>
>> Maybe someone has a principled way to tell.  I don't.

Rustom Mody wrote:
>
> Without claiming to cover all cases, this is a 'principle'
> If we have:
> (⊞) :: a -> a -> b
> (⟐) :: b -> b -> c
>
> then ⊞'s precedence should be higher than ⟐'s.

I always have trouble with "higher" and "lower" precedence,
because I've used languages where the operator with the bigger
number binds tighter and languages where the operator with the
bigger number gets to dominate the other.  Both are natural
enough, but with opposite meanings for "higher".

This principle does not explain why * binds tighter than +,
which means we need more than one principle.
It also means that if OP1 :: a -> a -> b and OP2 :: b -> b -> a
then OP1 should be higher than OP2 and OP2 should be higher
than OP1, which is a bit of a puzzler, unless perhaps you are
advocating a vaguely CGOL-ish asymmetric precedence scheme
where the precedence on the left and the precedence on the
right can be different.
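
A minimal Haskell sketch of that circularity, with invented operator
names: by the proposed principle each of these should bind tighter
than the other.

    -- Hypothetical operators: (<#>) produces the operand type of (<%>)
    -- and vice versa, so the type-based principle ranks each one
    -- above the other.
    (<#>) :: Int -> Int -> Bool
    x <#> y = x < y

    (<%>) :: Bool -> Bool -> Int
    p <%> q = fromEnum p + fromEnum q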

For the record, let me stipulate that I had in mind a situation
where OP1, OP2 :: a -> a -> a.  For example, APL uses the floor
and ceiling operators infix to stand for max and min.  This
principle offers us no help in ordering max and min.

Or consider APL again, whence I'll borrow (using ASCII because
this is webmail tonight)
    take, rotate :: Int -> Vector t -> Vector t
Haskell applies operator precedence before it does type
checking, so how would it know to parse
    n `take` m `rotate` v
as (n `take` (m `rotate` v))?
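
It cannot: the grouping is fixed entirely by fixity declarations
before the type checker ever runs.  A minimal sketch, with
hypothetical names standing in for the APL operators:

    -- Fixity, not type, decides how this chain is grouped.
    infixr 5 `takeV`, `rotateV`

    takeV :: Int -> [t] -> [t]
    takeV = take

    rotateV :: Int -> [t] -> [t]
    rotateV n v = drop n v ++ take n v

    example :: [Int]
    example = 3 `takeV` 1 `rotateV` [1 .. 10]
    -- With both operators declared infixr 5 this parses as
    -- 3 `takeV` (1 `rotateV` [1 .. 10]); with the default infixl 9
    -- it would group to the left and fail to type-check.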

I don't believe there was anything in my original example to
suggest that either operator had two operands of the same type,
so I must conclude that this principle fails to provide any
guidance in that case (like this one).


> This is what makes it natural to have the precedences of (+) (<) (&&) in
> decreasing order.
>
> This is also why the bitwise operators in C have the wrong precedence:

Oh, I agree with that!
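
For the record, the standard Prelude's fixity declarations do put
(+), the comparisons, and (&&) in exactly that decreasing order:

    infixl 6 +, -
    infix  4 ==, /=, <, <=, >, >=
    infixr 3 &&
    infixr 2 ||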

> The error comes (probably) from treating & as close to the logical
> operators like && whereas in fact it is more kin to arithmetic operators
> like +.

The error comes from BCPL where & and && were the same operator
(similarly | and ||).  At some point in the evolution of C from BCPL
the operators were split apart but the bitwise ones left in the wrong
place.
>
> There are of course other principles:
> Dijkstra argued vigorously that, boolean algebra being completely
> symmetric in (∨, True) and (∧, False), ∧ and ∨ should have the same
> precedence.
>
> Evidently not too many people agree with him!

Sadly, I am reading this in a web browser where the Unicode symbols
are completely garbled.  (More precisely, I think it's WebMail doing
it.)  Maybe Unicode isn't ready for prime time yet?

You might be interested to hear that in the Ada programming
language, you are not allowed to mix 'and' with 'or' (or
'and then' with 'or else') without using parentheses.  The
rationale is that the designers did not believe that enough
programmers understood the precedence of and/or.  The GNU C
compiler kvetches when you have p && q || r without otiose
parentheses.  Seems that there are plenty of designers out
there who agree with Dijkstra, not out of a taste for
well-engineered notation, but out of contempt for the
Average Programmer.

> When I studied C (nearly 30 years now!) we used gets as a matter of
> course.
> Today we don't.

Hmm.  I started with C in late 1979.  Ouch.  That's 34 and a half
years ago.  This was under Unix version 6+, with a slightly
"pre-classic" C.  A little later we got EUC Unix version 7, and a
'classic' C compiler that, oh joy, supported /\ (min) and \/ (max)
operators.  [With a bug in the code generator that I patched.]

> Are Kernighan and Ritchie wrong in teaching it?
> Are today's teachers wrong in proscribing it?
>
> I believe the only reasonable outlook is that truth changes with time: it
> was ok then; it's not today.

In this case, bull-dust!  gets() is rejected today because a
botch in its design makes it bug-prone.  Nothing has changed.
It was bug-prone 34 years ago.  It has ALWAYS been a bad idea
to use gets().  Amongst other things, the Unix manuals have
always presented the difference between gets() -- discards
the terminator -- and fgets() -- annoyingly retains the
terminator -- as a bug which they thought it was too late to
fix; after all, C had hundreds of users!  No, it was obvious
way back then:  you want to read a line?  Fine, WRITE YOUR OWN
FUNCTION, because there is NO C library function that does
quite what you want.  The great thing about C was that you
*could* write your own line-reading function without suffering.
Not only would your function do the right thing (whatever you
conceived that to be), it would be as fast, or nearly as fast,
as the built-in one.  Try doing *that* in PL/I!

No, in this case, *opinions* may have changed, people's
*estimation* of and *tolerance for* the risks may have
changed, but the truth has not changed.
>
> Likewise DOCTYPE-missing and charset-other-than-UTF-8.
> Random example  showing how right yesterday becomes wrong today:
> http://www.sitepoint.com/forums/showthread.php?660779-Content-type-iso-8859-1-or-utf-8

Well, "missing" DOCTYPE is where it starts to get a bit technical.
An SGML document is basically made up of three parts:
  - an SGML declaration (meta-meta-data) that tells the
    parser, amongst other things, what characters to use for
    delimiters, whether various things are case sensitive,
    what the numeric limits are, and whether various features
    are enabled.
  - a Document Type Declaration (meta-data) that conforms to
    the lexical rules set up by the SGML declaration and
    defines (a) the grammar rules and (b) a bunch of macros.
  - a document (data).
The SGML declaration can be supplied to a parser as data (and
yes, I've done that), or it can be stipulated by convention
(as the HTML standards do).  In the same way, the DTD can be
  - completely declared in-line
  - defined by reference with local amendments
  - defined solely by reference
  - known by convention.
If there is a convention that a document without a DTD uses
a particular DTD, SGML is fine with that.  (It's all part of
"entity management", one of the minor arcana of SGML.)

As for the link in question, it doesn't show right turning into
wrong.  A quick summary of the sensible part of that thread:

   - If you use a <meta> tag to specify the encoding of your
     file, it had better be *right*.

     This has been true ever since <meta> tags first existed.

   - If you have a document in Latin 1 and any characters
     outside that range are written as character entity references
     or numeric character references, there is no need to change.

     No change of right to wrong here!

   - If you want to use English punctuation marks like dashes and
     curly quotes, using UTF-8 will let you write these characters
     without character entities or NCRs.

     This is only half true.  It will let you do this conveniently
     IF your local environment has fonts that include the characters.
     (Annoyingly, in Mac OS 10.6, which I'm typing on,
     Edit|Special characters is not only geographically confused,
     listing Coptic as a *European* script -- last time I checked
     Egypt was still in Africa -- but it doesn't display any Coptic
     characters.  In the Mac OS 10.7 system I normally use,
     Edit|Special characters got dramatically worse as an interface,
     but no more competent with Coptic characters.  Just because a
     character is in Unicode doesn't mean it can be *used*,
     practically speaking.)

     Instead of saying that what is wrong has become or is becoming
     right, I'd prefer to say that what was impossible is becoming
     possible and what was broken (Unicode font support) is gradually
     getting fixed.

   - Some Unicode characters, indeed, some Latin 1 characters, are
     so easy to confuse with other characters that it is advisable
     to use character entities.

     Again, nothing about wrong turning into right.  This was good
     advice as soon as Latin 1 came out.

> Unicode vs ASCII in program source is similar (I believe).

Well, not really.  People using specification languages like Z
routinely used characters way outside the ASCII range; one way
was to use LaTeX.  Another way was to have GUI systems that
let you key in using LaTeX character names or menus but see the
intended characters.  Back in about 1984 I was able to use a
16-bit character set on the Xerox Lisp Machines.  I've still
got a manual for the XNS character set somewhere.  In one of
the founding documents for the ISO Prolog standard, I
recommended, in 1984, that the Prolog standard allow characters
beyond ASCII.  That's THREE
YEARS before Unicode was a gleam in its founders' eyes.

This is NOT new.  As soon as there were bit-mapped displays
and laser printers, there was pressure to allow a wider range
of characters in programs.  Let me repeat that: 30 years ago
I was able to use non-ASCII characters in computer programs.
*Easily*, via virtual keyboards.

In 1987, the company I was working at in California revamped
their system to handle 16-bit characters and we bought a
terminal that could handle Japanese characters.  Of course
this was because we wanted to sell our system in Japan.
But this was shortly before X11 came out; the MIT window
system of the day was X10 and the operating system we were
using the 16-bit characters on was VMS.  That's 27 years ago.

This is not new.

So what _is_ new?

* A single standard.

  Wait, we DON'T have a single standard.  We have a single
  standard *provider* issuing a rapid series of revisions
  of an increasingly complex standard, where entire features
  are first rejected outright, then introduced, and then
  deprecated again.  Unicode 6.3 came out last year with
  five new characters (bringing the total to 110,122),
  over a thousand new character *variants*, two new normative
  properties, and a new BIDI algorithm which I don't yet
  understand.  And Unicode 7.0 is due out in 3 months.

  Because of this
  - different people WILL have tools that understand different
    versions of Unicode.  In fact, different tools in the same
    environment may do this.
  - your beautiful character WILL show up as garbage or even
    blank on someone's screen UNLESS it is an old or extremely
    popular (can you say Emoji?  I knew you could.  Can you
    teach me how to say it?) one.
  - when proposing to exploit Unicode characters, it is VITAL
    to understand what the Unicode "stability" rules are and
    which characters have what stable properties.

* With large cheap discs, large fonts are looking like a lot less
  of a problem.  (I failed to learn to read the Armenian letters,
  but do have fonts for them.  I succeeded in learning to read the
  Coptic letters -- but not the language(s)! -- but don't have fonts
  for those.  Life is not fair.)

* We now have (a series of versions of) a standard character set
  containing a vast number of characters.  I very much doubt whether
  there is any one person who knows all the Unicode characters.

* Many of these characters are very similar.  I counted 64 "right
  arrow" characters before I gave up; this didn't include harpoons.
  Some of these are _very_ similar.  Some characters are visibly
  distinct, but normally regarded as mere stylistic differences.
  For example, <= has at least three variations (one bar, slanted;
  one bar, flat; two bars, flat) which people familiar with
  less than or equal have learned *not* to tell apart. But they
  are three different Unicode characters, from which we could
  make three different operators with different precedence or
  associativity, and of course type.
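
A minimal sketch of how that could (for better or worse) be done in
GHC today, using three real "less than or equal" code points; the
fixities chosen here are invented purely for illustration:

    -- U+2264 LESS-THAN OR EQUAL TO
    -- U+2A7D LESS-THAN OR SLANTED EQUAL TO
    -- U+2266 LESS-THAN OVER EQUAL TO
    infix  4 ≤
    infix  3 ⩽     -- deliberately different precedence
    infixr 4 ≦     -- deliberately different associativity

    (≤), (⩽), (≦) :: Ord a => a -> a -> Bool
    (≤) = (<=)
    (⩽) = (<=)
    (≦) = (<=)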

> My thoughts on this (of a philosophical nature) are:
> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html
>
> If we can get the broader agreements (disagreements!) out of the way to
> start with, we may then look at the details.

I think Haskell can tolerate an experimental phase where people
try out a lot of things as long as everyone understands that it
*IS* an experimental phase, and as long as experimental operators
are kept out of Hackage, certainly out of the Platform, or at
least segregated into areas with big flashing "danger" signs.

I think a *small* number of "pretty" operators can be added to
Haskell, without the sky falling, and I'll probably quite like
the result.  (Does anyone know how to get a copy of the
collected The Squiggolist?)  Let's face it, if a program is
full of Armenian identifiers or Ogham ones I'm not going to
have a clue what it's about anyway.  But keeping the "standard"
-- as in used in core modules -- letter and operator sets smallish
is probably a good idea.
