FW: Haskell 98 lexical syntax again

Ashley Yakeley ashley@semantic.org
Thu, 28 Feb 2002 19:13:32 -0800


At 2002-02-28 07:18, Simon Peyton-Jones wrote:

>  whitechar -> newline | vertab | space | tab | uniWhite
>  newline   -> return linefeed | return | linefeed | formfeed
>  return    -> a carriage return
>  linefeed  -> a line feed
>
>This means that CR, LF, or CRLF, are all valid 'newline' separators,
>and the same sequence of characters should therefore work on any
>Haskell implementation.

Good.

While you're fiddling with it, I recommend this:

  newline    -> return linefeed | return | linefeed | formfeed
              | uniLineSep | uniParaSep
  uniLineSep -> any char of General Category Zl
  uniParaSep -> any char of General Category Zp

Unicode defines two codepoints that unambiguously mean 'line separator' 
(\u2028) and 'paragraph separator' (\u2029). As it happens, they are the 
only codepoints in General Categories Zl and Zp. There are other 
paragraph separators (e.g. Georgian and Urdu), but they are visible 
marks rather than whitespace, and are not in GC Zp -- much like the 
pilcrow.
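
To make the proposed production concrete, here is a minimal sketch of a 
recogniser for it. It assumes the Data.Char module from GHC's base 
library, whose generalCategory function reports LineSeparator for Zl and 
ParagraphSeparator for Zp. The names isUniLineSep, isUniParaSep and 
stripNewline are made up for illustration and are not part of any 
report or library.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Hypothetical helpers mirroring the proposed uniLineSep / uniParaSep.
isUniLineSep, isUniParaSep :: Char -> Bool
isUniLineSep c = generalCategory c == LineSeparator       -- GC Zl
isUniParaSep c = generalCategory c == ParagraphSeparator  -- GC Zp

-- Strip one 'newline' from the front of the input, if there is one.
-- The CR LF clause comes first so that the pair counts as one newline.
stripNewline :: String -> Maybe String
stripNewline ('\r' : '\n' : rest) = Just rest
stripNewline (c : rest)
  | c `elem` "\r\n\f"                = Just rest  -- return | linefeed | formfeed
  | isUniLineSep c || isUniParaSep c = Just rest  -- uniLineSep | uniParaSep
stripNewline _ = Nothing
```

Matching on category rather than on the two literal codepoints keeps the 
code in step with the grammar as written, at the cost of a table lookup 
per character.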

  uniWhite   -> any Unicode character defined as whitespace

This is fine. But note that whitespace is an 'extended property'; it 
can't be derived from General Category:
<http://www.unicode.org/Public/3.1-Update1/PropList-3.1.1.html>
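
A small sketch of why General Category is not enough: tab and U+0001 are 
both in category Cc, yet only tab is whitespace, so an implementation 
needs an explicit table. The ranges below are an assumption, transcribed 
from a later revision of the property list than the one linked above, 
and may not match it exactly. The names whiteSpaceRanges, isUniWhite and 
categoryIsNotEnough are hypothetical.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, ord)

-- Assumed table of whitespace codepoint ranges (inclusive); illustrative
-- only, and possibly different from the Unicode 3.1.1 list.
whiteSpaceRanges :: [(Int, Int)]
whiteSpaceRanges =
  [ (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0)
  , (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F)
  , (0x205F, 0x205F), (0x3000, 0x3000)
  ]

-- Table lookup standing in for the 'uniWhite' production.
isUniWhite :: Char -> Bool
isUniWhite c = any (\(lo, hi) -> lo <= n && n <= hi) whiteSpaceRanges
  where
    n = ord c

-- Tab and U+0001 share a General Category (Cc) but differ on whitespace.
categoryIsNotEnough :: Bool
categoryIsNotEnough =
  generalCategory '\t' == generalCategory '\x0001'
    && isUniWhite '\t'
    && not (isUniWhite '\x0001')
```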


-- 
Ashley Yakeley, Seattle WA