FW: Haskell 98 lexical syntax again
Ashley Yakeley
ashley@semantic.org
Thu, 28 Feb 2002 19:13:32 -0800
At 2002-02-28 07:18, Simon Peyton-Jones wrote:
> whitechar -> newline | vertab | space | tab | uniWhite
> newline -> return linefeed | return | linefeed | formfeed
> return -> a carriage return
> linefeed -> a line feed
>
>This means that CR, LF, or CRLF, are all valid 'newline' separators,
>and the same sequence of characters should therefore work on any
>Haskell implementation.
Good.
While you're fiddling with it, I recommend this:
newline -> return linefeed | return | linefeed | formfeed
         | uniLineSep | uniParaSep
uniLineSep -> any char of General Category Zl
uniParaSep -> any char of General Category Zp
Unicode defines two codepoints that unambiguously mean 'line separator'
(\u2028) and 'paragraph separator' (\u2029). As it happens, they are the
only codepoints in General Categories Zl and Zp. There are other
paragraph separators (e.g. Georgian and Urdu), but they are visible marks
rather than whitespace, much like the pilcrow, and so are not in GC Zp.
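
For illustration, here is a minimal sketch of how a lexer might recognise the
extended 'newline' production. It assumes a Data.Char that exposes
generalCategory, as GHC's does. The function names (newline, isUniLineSep,
isUniParaSep) are made up for this example and are not part of any
report or implementation.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- | Try to consume one 'newline' from the front of the input.
-- On success, return the remaining input. (Hypothetical helper.)
newline :: String -> Maybe String
newline ('\r' : '\n' : rest) = Just rest   -- return linefeed
newline ('\r' : rest)        = Just rest   -- return
newline ('\n' : rest)        = Just rest   -- linefeed
newline ('\f' : rest)        = Just rest   -- formfeed
newline (c : rest)
  | isUniLineSep c || isUniParaSep c = Just rest
newline _ = Nothing

-- | General Category Zl (in practice only U+2028).
isUniLineSep :: Char -> Bool
isUniLineSep c = generalCategory c == LineSeparator

-- | General Category Zp (in practice only U+2029).
isUniParaSep :: Char -> Bool
isUniParaSep c = generalCategory c == ParagraphSeparator
```

The CRLF case is listed first so that the pair is consumed as a single
newline rather than two.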
> uniWhite -> any UNIcode character defined as whitespace
This is fine. But note that whitespace is an 'extended property'; it
can't be derived from General Category:
<http://www.unicode.org/Public/3.1-Update1/PropList-3.1.1.html>
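
A small sketch of why General Category alone is not enough, again assuming
GHC's Data.Char. Tab (U+0009) is whitespace and NUL (U+0000) is not, yet
both are in General Category Cc, so no test that looks only at the category
can tell them apart. The names below are illustrative only.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- | Tab is a whitespace character; its General Category is Cc (Control).
tabCategory :: GeneralCategory
tabCategory = generalCategory '\t'

-- | NUL is not whitespace, but it is also Cc (Control).
nulCategory :: GeneralCategory
nulCategory = generalCategory '\NUL'

-- | True when the two characters share a General Category, i.e. when the
-- category cannot be used to separate whitespace from non-whitespace here.
categoriesCollide :: Bool
categoriesCollide = tabCategory == nulCategory
```

An implementation of uniWhite would therefore need its own table taken from
the property list, rather than a predicate built on generalCategory.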
--
Ashley Yakeley, Seattle WA