[Haskell-i18n] unicode notation \uhhhh implementation

Sven Moritz Hallberg pesco@gmx.de
15 Aug 2002 19:02:45 +0200


On Thu, 2002-08-15 at 16:10, Simon Marlow wrote:
> > -----Original Message-----
> > From: Martin Norbäck [mailto:d95mback@dtek.chalmers.se]
> > Sent: 15 August 2002 14:40
> > To: haskell-i18n@haskell.org
> > Subject: [Haskell-i18n] unicode notation \uhhhh implementation
> >
> >
> > Does anyone know the status of the implementation of unicode escape
> > sequences \uhhhh as per 2.1 in the Haskell 98 standard?
> >
> > When implemented, do they count as one character or five?
> >
> > Or, is UTF-8 (or locale specified encoding) to be used for Haskell
> > source code? If yes, when?
>
> I wasn't aware of that paragraph in the report until recently, and as
> far as I know none of the current Haskell implementations implement
> the '\uhhhh' escape sequences.
>
> One reason to use this approach would be if there already existed a
> preprocessor to do the job - does anyone know of one?  If not, I think
> the paragraph could be deleted in favour of using appropriate encodings
> for source files (I'd planned to implement at least UTF-8 in GHC at
> some point).

What reason would one have to use Unicode characters outside of string
and character literals? The only one I can see is prettier code. I'm
personally itching to use real Greek letters in my code, even if, at
the moment, only a handful of other people could read the file.
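
For instance (assuming a compiler that accepts the Unicode identifiers
the report's lexical syntax already allows), something like:

    -- Pretty, if your editor can display it:
    φ :: Double
    φ = (1 + sqrt 5) / 2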

But what's the point if you can't deal with the characters yourself and
have to fall back on escape sequences? The compiler surely couldn't
care less either way.

One reason to use such a Unicode-to-ASCII conversion would be to let
people whose editors and other tools don't support the encoding read
your code. They'd need a tool to produce the escaped form, though. But
after that, it might indeed be convenient if the compiler expanded the
escapes automatically. The only question that remains is: how likely
are things like these:

    \u1234 -> u1234 * 5
    "Hey\    \u1234!"
    "C:\\udef2\\"

This is either ambiguous or error-prone, and I think I'd personally
vote against allowing it in the language. The problem is that the
backslash appears in two unrelated "layers", so one can never be quite
sure which layer a given occurrence was intended for. Of course a good
definition can be made, and the current behaviour is already
well-defined: the first example is a lambda binding the variable u1234,
and the other two are strings with somewhat odd contents (in the
second, the gap swallows the backslash before u1234). But what would be
the cost of changing what these mean? The first case would harm code
generators, which may well emit lambdas with names like u1234. The
second one looks artificial. But number three doesn't appear all that
strange, and a lot of people would stumble over it, not knowing what's
up, I'm sure.
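
To make the two-layer problem concrete, here is what the naive sketch
from above would do to the three snippets (writing <U+hhhh> for the
expanded character):

    \u1234 -> u1234 * 5     becomes    <U+1234> -> u1234 * 5
        -- the lambda's backslash is gone; at best the line
        -- now means something entirely different
    "Hey\    \u1234!"       becomes    "Hey\    <U+1234>!"
        -- the backslash that should close the string gap is
        -- consumed, leaving the gap unterminated
    "C:\\udef2\\"           becomes    "C:\<U+DEF2>\\"
        -- half of an escaped backslash is taken as the start
        -- of \udef2, so the innocent path string no longer lexes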

Having shared the relevant contents of my mind to feed the process, I
leave the final decision up to people with more experience.


Regards,
Sven Moritz