What is a punctuation character?
Iavor Diatchki
iavor.diatchki at gmail.com
Tue Mar 20 23:37:40 CET 2012
Hello,
So I looked at what GHC does with Unicode, and to me it seems quite
reasonable:
* The alphabet is Unicode code points, so a valid Haskell program is
simply a list of those.
* Combining characters are not allowed in identifiers, so no need for
complex normalization rules: programs should always use the "short"
version of a character, or be rejected.
* Combining characters may appear in string literals, and there they
are left "as is" without any modification (so some string literals may
be longer than what's displayed in a text editor.)
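
To make the last point concrete, here is a small sketch using only
Data.Char from base. The names (precomposed, decomposed,
hasCombiningMark) are made up for this illustration, and I write the
code points with escapes so that the example does not depend on how an
editor or mail client normalizes the text. It shows the two spellings
of "ñ" having different lengths and comparing as unequal, since no
normalization is applied:

```haskell
import Data.Char (generalCategory, isMark)

-- "ñ" as a single precomposed code point, U+00F1.
precomposed :: String
precomposed = "\x00F1"

-- "ñ" as 'n' followed by U+0303 COMBINING TILDE.
decomposed :: String
decomposed = "n\x0303"

-- Hypothetical helper: does the string contain any combining mark
-- (general categories Mn, Mc, Me)?
hasCombiningMark :: String -> Bool
hasCombiningMark = any isMark

main :: IO ()
main = do
  print (length precomposed)           -- 1
  print (length decomposed)            -- 2
  print (precomposed == decomposed)    -- False
  print (generalCategory '\x0303')     -- NonSpacingMark
  print (hasCombiningMark decomposed)  -- True
```

A check like hasCombiningMark is also roughly what a lexer would need
in order to reject combining characters in identifiers, although I have
not looked at how GHC actually implements this.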
Perhaps this is simply what the report already states (I haven't
checked, for which I apologize) but, if not, perhaps we should clarify
things.
-Iavor
PS: I don't think that there is any need to specify a particular
representation for the Unicode code points (e.g., UTF-8) in the
language standard.
On Fri, Mar 16, 2012 at 6:23 PM, Iavor Diatchki
<iavor.diatchki at gmail.com> wrote:
> Hello,
> I am also not an expert but I got curious and did a bit of Wikipedia
> reading. Based on what I understood, here are two (related) questions
> that it might be nice to clarify in a future version of the report:
>
> 1. What is the alphabet used by the grammar in the Haskell report? My
> understanding is that the intention is that the alphabet is unicode
> codepoints (sometimes referred to as unicode characters). There is no
> way to refer to specific code-points by escaping as in Java (the link
> that Gaby shared), you just have to write the code-points directly
> (and there are plenty of encodings for doing that, e.g. UTF-8 etc.)
>
> 2. Do we respect "unicode equivalence"
> (http://en.wikipedia.org/wiki/Canonical_equivalence) in Haskell source
> code? The issue here is that, apparently, some sequences of unicode
> code points/characters are supposed to be morally the same. For
> example, it would appear that there are two different ways to write
> the Spanish letter ñ: it has its own number, but it can also be made
> by writing "n" followed by a modifier to put the wavy sign on top.
>
> I would guess that implementing "unicode equivalence" would not be
> too hard---supposedly the unicode standard specifies a "text
> normalization procedure". However, this would complicate the report
> specification, because now the alphabet becomes not just unicode
> code-points, but equivalence classes of code points.
>
> Thoughts?
>
> -Iavor
>
> On Fri, Mar 16, 2012 at 4:49 PM, Ian Lynagh <igloo at earth.li> wrote:
>>
>> Hi Gaby,
>>
>> On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
>>>
>>> OK, thanks! I guess a take away from this discussion is that what
>>> is a punctuation is far less well defined than it appears...
>>
>> I'm not really sure what you're asking. Haskell's uniSymbol includes all
>> Unicode characters (should that be codepoints? I'm not a Unicode expert)
>> in the punctuation category; I'm not sure what the best reference is,
>> but e.g. table 12 in
>> http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values
>> lists a number of Px categories, and a meta-category P "Punctuation".
>>
>>
>> Thanks
>> Ian
>>
>>
>> _______________________________________________
>> Haskell-prime mailing list
>> Haskell-prime at haskell.org
>> http://www.haskell.org/mailman/listinfo/haskell-prime
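
For reference, the Px categories that Ian mentions in the quoted
message can be inspected from Haskell itself. The sketch below uses
only Data.Char from base; inCategoryP and punctuationCategories are
names made up for this illustration, and the mapping from constructor
names to the two-letter Unicode abbreviations is given in the comments.
It is meant only to show which code points fall in the meta-category P,
not to restate what the report's uniSymbol covers:

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isPunctuation)

-- The seven Unicode punctuation categories, as named in Data.Char.
punctuationCategories :: [GeneralCategory]
punctuationCategories =
  [ ConnectorPunctuation  -- Pc
  , DashPunctuation       -- Pd
  , OpenPunctuation       -- Ps
  , ClosePunctuation      -- Pe
  , InitialQuote          -- Pi
  , FinalQuote            -- Pf
  , OtherPunctuation      -- Po
  ]

-- Hypothetical helper: is the code point in the meta-category P?
inCategoryP :: Char -> Bool
inCategoryP c = generalCategory c `elem` punctuationCategories

main :: IO ()
main = mapM_ report "!_(+a"
  where
    -- Print each character with its general category and whether it is in P.
    report c =
      putStrLn (show c ++ ": " ++ show (generalCategory c)
                ++ ", in P: " ++ show (inCategoryP c))
```

Data.Char.isPunctuation should agree with inCategoryP on every code
point; note that '+' is reported as MathSymbol, so it is a symbol
rather than punctuation.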