RFC: Unicode primes and super/subscript characters in GHC

Wed Jun 25 19:54:57 UTC 2014

Isn't it weird that you can't write `a₁'`? I was considering proposing

varid -> (small { small | large | digit | ' | primes } { subsup | primes }) (EXCEPT reservedid)

but felt that it would be odd to allow primes in the middle of an 
identifier but not super/subscripts. I wish we could just abandon things 
like `a'bc'd` altogether...
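
To make the rule I was considering concrete, here is a minimal sketch of it as a plain Haskell predicate. It is not GHC's lexer and not a patch. The names (`isPrime`, `isSubsup`, `isVarid`) are made up for illustration, `subsup` covers only an assumed subset of the sub/superscript characters, and `digit` is simplified to ASCII digits.

```haskell
module Main where

import Data.Char (isDigit, isLower, isUpper)

-- primes -> U+2032 | U+2033 | U+2034 | U+2057 (taken from the proposal).
isPrime :: Char -> Bool
isPrime c = c `elem` "\x2032\x2033\x2034\x2057"

-- ASSUMPTION: only a subset of "subsup" is modelled here: the
-- Superscripts and Subscripts block, the Latin-1 superscripts 1/2/3,
-- and the subscript letters U+1D62..U+1D6A.
isSubsup :: Char -> Bool
isSubsup c =
     c `elem` "\xB2\xB3\xB9"
  || (c >= '\x2070' && c <= '\x2071')
  || (c >= '\x2074' && c <= '\x208E')
  || (c >= '\x2090' && c <= '\x209C')
  || (c >= '\x1D62' && c <= '\x1D6A')

-- small/large follow the proposal in excluding subsup characters.
isSmall, isLarge :: Char -> Bool
isSmall c = (isLower c || c == '_') && not (isSubsup c)
isLarge c = isUpper c && not (isSubsup c)

-- Characters allowed in the middle part: small | large | digit | ' | primes.
-- SIMPLIFICATION: digit is ASCII-only here (Data.Char.isDigit).
isBody :: Char -> Bool
isBody c = isSmall c || isLarge c || isDigit c || c == '\'' || isPrime c

-- Characters allowed in the trailing part: subsup | primes.
isTrailing :: Char -> Bool
isTrailing c = isSubsup c || isPrime c

-- reservedid from the Haskell 2010 report.
reservedIds :: [String]
reservedIds =
  [ "case", "class", "data", "default", "deriving", "do", "else"
  , "foreign", "if", "import", "in", "infix", "infixl", "infixr"
  , "instance", "let", "module", "newtype", "of", "then", "type"
  , "where", "_" ]

-- varid -> (small { body } { trailing }) (EXCEPT reservedid)
isVarid :: String -> Bool
isVarid [] = False
isVarid s@(c : cs) =
  isSmall c && all isTrailing rest && s `notElem` reservedIds
  where
    -- Take the longest run of body characters; what is left must
    -- consist of trailing characters only.
    (_, rest) = span isBody cs

main :: IO ()
main = mapM_ (\s -> print (s, isVarid s))
  ["a'bc'd", "x₁₂", "a₁′", "a₁'", "a₁b", "ᵤxᵤy", "let"]
```

Read literally, the rule lets a Unicode prime follow a subscript but not an ASCII apostrophe, and it rejects a subscript in the middle of a name while still accepting `a'bc'd`, which is exactly the asymmetry that bothers me.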

On 06/15/2014 03:58 AM, John Meacham wrote:
> I have this feature in jhc, where I have a 'trailing' character class
> that can appear at the end of both symbols and ids.
>
> currently it consists of
>
>   $trailing = [₀₁₂₃₄₅₆₇₈₉⁰¹²³⁴⁵⁶⁷⁸⁹₍₎⁽⁾₊₋]
>
>   John
>
> On Sat, Jun 14, 2014 at 7:48 AM, Mikhail Vorozhtsov
> <mikhail.vorozhtsov at gmail.com> wrote:
>> Hello lists,
>>
>> As some of you may know, GHC's support for Unicode characters in lexemes is
>> rather crude and hence prone to inconsistencies in their handling versus the
>> ASCII counterparts. For example, APOSTROPHE is treated differently from
>> PRIME:
>>
>> λ> data a +' b = Plus a b
>> <interactive>:3:9:
>>      Unexpected type ‘b’
>>      In the data declaration for ‘+’
>>      A data declaration should have form
>>        data + a b c = ...
>> λ> data a +′ b = Plus a b
>>
>> λ> let a' = 1
>> λ> let a′ = 1
>> <interactive>:10:8: parse error on input ‘=’
>>
>> Also some rather bizarre looking things are accepted:
>>
>> λ> let ᵤxᵤy = 1
>>
>> In the spirit of improving things little by little I would like to propose:
>>
>> 1. Handle single/double/triple/quadruple Unicode PRIMEs the same way as
>> APOSTROPHE, meaning the following alterations to the lexer:
>>
>> primes -> U+2032 | U+2033 | U+2034 | U+2057
>> symbol -> ascSymbol | uniSymbol (EXCEPT special | _ | " | ' | primes)
>> graphic -> small | large | symbol | digit | special | " | ' | primes
>> varid -> (small { small | large | digit | ' | primes }) (EXCEPT reservedid)
>> conid -> large { small | large | digit | ' | primes }
>>
>> 2. Introduce a new lexer nonterminal "subsup" that would include the Unicode
>> sub/superscript[1] versions of numbers, "-", "+", "=", "(", ")", Latin and
>> Greek letters. And allow these characters to be used in names and operators:
>>
>> symbol -> ascSymbol | uniSymbol (EXCEPT special | _ | " | ' | primes | subsup)
>> digit -> ascDigit | uniDigit (EXCEPT subsup)
>> small -> ascSmall | uniSmall (EXCEPT subsup) | _
>> large -> ascLarge | uniLarge (EXCEPT subsup)
>> graphic -> small | large | symbol | digit | special | " | ' | primes | subsup
>> varid -> (small { small | large | digit | ' | primes | subsup }) (EXCEPT reservedid)
>> conid -> large { small | large | digit | ' | primes | subsup }
>> varsym -> (symbol (EXCEPT :) {symbol | subsup}) (EXCEPT reservedop | dashes)
>> consym -> (: {symbol | subsup}) (EXCEPT reservedop)
>>
>> If this proposal is received favorably, I'll write a patch for GHC based on
>> my previous stab at the problem[2].
>>
>> P.S. I'm CC-ing Cafe for extra attention, but please keep the discussion to
>> the GHC users list.
>>
>> [1] https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts
>> [2] https://ghc.haskell.org/trac/ghc/ticket/5108