What is a punctuation character?

Mon Mar 19 10:56:58 CET 2012

On Mon, Mar 19, 2012 at 4:34 AM, Simon Marlow <simonmar at microsoft.com> wrote:
>> On Fri, Mar 16, 2012 at 6:49 PM, Ian Lynagh <igloo at earth.li> wrote:
>> > Hi Gaby,
>> >
>> > On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
>> >>
>> >> OK, thanks!  I guess a takeaway from this discussion is that what counts
>> >> as punctuation is far less well defined than it appears...
>> >
>> > I'm not really sure what you're asking. Haskell's uniSymbol includes
>> > all Unicode characters (should that be codepoints? I'm not a Unicode
>> > expert) in the punctuation category; I'm not sure what the best
>> > reference is, but e.g. table 12 in
>> >    http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values
>> > lists a number of Px categories, and a meta-category P "Punctuation".
>> >
>> >
>> > Thanks
>> > Ian
>> >
>>
>> Hi Ian,
>>
>> I guess what I was asking is partly summarized in Iavor's message.
>>
>> For me, the issue started with bullet number 4 in section 1.1
>>
>>      http://www.haskell.org/onlinereport/intro.html#sect1.1
>>
>> which states that:
>>
>>        The lexical structure captures the concrete representation
>>        of Haskell programs in text files.
>>
>> That combined with the opening section 2.1 (e.g. example of terminal
>> syntax) and the fact that the grammar routinely describes two non-terminals,
>> ascXXX (for ASCII characters) and uniXXX (for Unicode characters),
>> suggested that the concrete syntax of Haskell programs in text files is in
>> ASCII charset.  Note this does not conflict with the general statement
>> that Haskell programs use the Unicode character set because the uniXXX could
>> use the ASCII charset to introduce Unicode characters -- this is not
>> uncommon practice for programming languages using Unicode characters; see
>> the link I gave earlier.
>>
>> However, if I understand Malcolm's message correctly, this is not the case.
>> Contrary to what I quoted above, Chapter 2 does NOT specify the concrete
>> representation of Haskell programs in text files.  What it does is to
>> capture the structure of what is obtained from interpreting, *in some
>> unspecified encoding or unspecified alphabet*,  the concrete
>> representation of Haskell programs in text files.  This conclusion is
>> unfortunate, but I believe it is correct.
>> Since the encoding or the alphabet is unspecified, it is no longer
>> necessarily the case that two Haskell implementations would agree on the
>> same lexical interpretation when presented with the same exact text file
>> containing  a Haskell program.
>>
>> In its current form, you are correct that the Report should say "codepoint"
>> instead of "character".
>>
>> I join Iavor's request in clarifying the alphabet used in the grammar.
>
> The report gives meaning to a sequence of codepoints only, it says nothing about how that sequence of codepoints is represented as a string of bytes in a file, nor does it say anything about what those files are called, or even whether there are files at all.

Thanks, Simon.

The fact that the Report is silent about the encoding used to
represent concrete Haskell programs in text files adds
a certain level of non-portability (and confusion).  I found
last night that a proposal has been made to add some
support for encoding specification

    http://hackage.haskell.org/trac/haskell-prime/wiki/UnicodeInHaskellSource

I believe that is a good start.  What are the odds of it being considered
for Haskell 2012?  I suspect the pragma proposal works only if something
is said about the position of that pragma in the source file (e.g. it
must be the first line, or within the first N bytes of the source file);
otherwise we have an infinite descent.
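
To make the descent concrete: a reader cannot decode the pragma until it
knows the encoding, so the pragma has to be recognizable before any encoding
has been chosen.  Here is a minimal Haskell sketch under assumptions of my
own, none of which come from the proposal: the pragma spelling
"ENCODING name" is hypothetical, it must occupy the first line, and that
line must be pure ASCII.  The function name declaredEncoding is made up for
illustration.

```haskell
import Control.Monad (guard)
import Data.Char (chr)
import Data.List (stripPrefix)
import Data.Word (Word8)

-- | Look for a (hypothetical) encoding pragma on the first line of a file.
-- Only the raw bytes before the first LF are inspected, and they must all
-- be ASCII, so no encoding has to be known in order to read them.
declaredEncoding :: [Word8] -> Maybe String
declaredEncoding bytes = do
  let firstLine = takeWhile (/= 10) bytes          -- stop at LF (byte 10)
  guard (all (< 128) firstLine)                    -- reject non-ASCII bytes
  rest <- stripPrefix "{-# ENCODING " (map (chr . fromIntegral) firstLine)
  case words rest of                               -- 'words' also drops a trailing CR
    [name, "#-}"] -> Just name
    _             -> Nothing
```

This only works for encodings that agree with ASCII on those first bytes;
something like UTF-16 would need a different rule (a byte-order mark, say),
which is exactly why the position and form of the pragma need to be pinned
down.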

>
> Perhaps some clarification is in order in a future revision, and we should use the correct terminology where appropriate.  We should also clarify that "punctuation" means exactly the Punctuation class.

That would be great.  Do you have any comment about the
UnicodeInHaskellSource proposal?
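
For what it is worth, here is how I would read "exactly the Punctuation
class" in code: a small sketch using Data.Char from base, which accepts a
code point when its general category is one of the seven P* categories.
The name isUnicodePunctuation is mine; as far as I can tell
Data.Char.isPunctuation performs the same test.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- | True when the code point is in the Unicode meta-category P,
-- i.e. one of Pc, Pd, Ps, Pe, Pi, Pf, Po.
isUnicodePunctuation :: Char -> Bool
isUnicodePunctuation c = case generalCategory c of
  ConnectorPunctuation -> True   -- Pc, e.g. '_'
  DashPunctuation      -> True   -- Pd, e.g. '-'
  OpenPunctuation      -> True   -- Ps, e.g. '('
  ClosePunctuation     -> True   -- Pe, e.g. ')'
  InitialQuote         -> True   -- Pi
  FinalQuote           -> True   -- Pf
  OtherPunctuation     -> True   -- Po, e.g. '!'
  _                    -> False
```

In GHCi, map isUnicodePunctuation "!_-+$" gives
[True,True,True,False,False]: '+' and '$' are Unicode symbols (Sm and Sc),
not punctuation, which is part of why the informal word was confusing me.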

> With regards to normalisation and equivalence, my understanding is that Haskell does not support either: two identifiers are equal if and only if they are represented by the same sequence of codepoints.  Again, we could add a clarifying sentence to the report.
>

Ugh.
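
A small illustration of why, assuming identifiers are compared as plain
code point sequences as described above.  The two strings below render
identically on most displays but differ as code point sequences, so under
that rule they would name different identifiers.  The names precomposed,
decomposed and sameIdentifier are made up for the example.

```haskell
-- | "café" with a single precomposed code point, U+00E9.
precomposed :: String
precomposed = "caf\x00E9"

-- | "café" as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed :: String
decomposed = "cafe\x0301"

-- | Identifier equality with no normalisation: plain code point equality.
sameIdentifier :: String -> String -> Bool
sameIdentifier = (==)
```

Here sameIdentifier precomposed decomposed is False, and the two strings
have lengths 4 and 5 respectively.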

Writing a parser for Haskell was an interesting exercise :-)

-- Gaby