Comment Syntax

Bulat Ziganshin bulatz at HotPOP.com
Fri Feb 3 04:53:41 EST 2006


Hello John,

Friday, February 03, 2006, 3:39:38 AM, you wrote:
>> Got a unicode-compliant compiler?

JM> sure do :)

JM> but it currently doesn't recognize any unicode characters as possible
JM> operators.

have you read this? :)

>   Log:
>   Add support for UTF-8 source files
> 
>   GHC finally has support for full Unicode in source files.  Source
>   files are now assumed to be UTF-8 encoded, and the full range of
>   Unicode characters can be used, with classifications recognised
>   using the implementation from Data.Char.  This incidentally means
>   that only the stage2 compiler will recognise Unicode in source
>   files, because I was too lazy to port the unicode classifier code
>   into libcompat.
> 
>   Additionally, the following synonyms for keywords are now
>   recognised:
> 
>     forall symbol         (U+2200)    forall
>     right arrow           (U+2192)    ->
>     left arrow            (U+2190)    <-
>     horizontal ellipsis   (U+22EF)    ..
> 
>   there are probably more things we could add here.
> 
>   This will break some source files if Latin-1 characters are being
>   used.  In most cases this should result in a UTF-8 decoding error.
>   Later on if we want to support more encodings (perhaps with a
>   pragma to specify the encoding), I plan to do it by recoding into
>   UTF-8 before parsing.
> 
>   Internally, there were some pretty big changes:
> 
>     - FastStrings are now stored in UTF-8
> 
>     - Z-encoding has been moved right to the back end.  Previously we
>       used to Z-encode every identifier on the way in for simplicity,
>       and only decode when we needed to show something to the user.
>       Instead, we now keep every string in its UTF-8 encoding, and
>       Z-encode right before printing it out.  To avoid Z-encoding the
>       same string multiple times, the Z-encoding is cached inside the
>       FastString the first time it is requested.
> 
>       This speeds up the compiler - I've measured some definite
>       improvement in parsing at least, and I expect compilations
>       overall to be faster too.  It also cleans up a lot of cruft
>       from the OccName interface.  Z-encoding is nicely hidden inside
>       the Outputable instance for Names & OccNames now.
> 
>     - StringBuffers are UTF-8 too, and are now represented as
>       ForeignPtrs.
> 
>     - I've put together some test cases, not by any means exhaustive,
>       but there are some interesting UTF-8 decoding error cases that
>       aren't obvious.  Also, take a look at unicode001.hs for a demo.
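
To make the keyword synonyms from the log concrete, here is a minimal sketch of a source file that uses three of them. It rests on assumptions that are not in the log: the file is saved as UTF-8, and it is compiled with a current GHC, where these synonyms are gated behind the UnicodeSyntax extension (with ExplicitForAll for the forall symbol). The ellipsis synonym is left out because I am not sure current compilers still accept it. The names twice and evens are made up for illustration.

```haskell
{-# LANGUAGE UnicodeSyntax, ExplicitForAll #-}
module Main where

-- U+2200 stands in for "forall", U+2192 for "->".
twice :: ∀ a. (a → a) → a → a
twice f = f . f

-- U+2190 stands in for "<-" in comprehensions and do-notation.
evens :: [Int] → [Int]
evens xs = [x | x ← xs, even x]

main :: IO ()
main = do
  n ← pure (twice (+ 1) 0)
  print n                 -- prints 2
  print (evens [1 .. 6])  -- prints [2,4,6]
```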

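On the Latin-1 breakage mentioned in the log: a rough sketch of why such files usually fail to decode. A Latin-1 accented letter is a single byte at or above 0x80, which UTF-8 reads as the start of a multi-byte sequence, and the bytes that follow are rarely valid continuation bytes. The function validUtf8 below is a hypothetical, structural check only. It is not GHC's decoder, and it ignores overlong forms and surrogates, which are among the less obvious error cases.

```haskell
module Main where

import Data.Bits ((.&.))
import Data.Word (Word8)

-- A continuation byte has the bit pattern 10xxxxxx.
isCont :: Word8 -> Bool
isCont b = b .&. 0xC0 == 0x80

-- Structural check: every lead byte must be followed by the right
-- number of continuation bytes. Overlong encodings and surrogates
-- are NOT rejected here.
validUtf8 :: [Word8] -> Bool
validUtf8 [] = True
validUtf8 (b : bs)
  | b < 0x80           = validUtf8 bs  -- plain ASCII
  | b .&. 0xE0 == 0xC0 = go 1          -- 110xxxxx: two-byte sequence
  | b .&. 0xF0 == 0xE0 = go 2          -- 1110xxxx: three-byte sequence
  | b .&. 0xF8 == 0xF0 = go 3          -- 11110xxx: four-byte sequence
  | otherwise          = False         -- stray continuation or invalid lead
  where
    go n = case splitAt n bs of
      (cs, rest) -> length cs == n && all isCont cs && validUtf8 rest

main :: IO ()
main = do
  -- "café" in Latin-1: the e-acute is the single byte 0xE9.
  print (validUtf8 [0x63, 0x61, 0x66, 0xE9])        -- prints False
  -- "café" in UTF-8: the e-acute is the pair 0xC3 0xA9.
  print (validUtf8 [0x63, 0x61, 0x66, 0xC3, 0xA9])  -- prints True
```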

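The "Z-encode late and cache it" idea from the log can be sketched roughly as follows. This is not GHC's FastString: Name, mkName and zEncodeChar are hypothetical, the escape table is only a small subset of the real one, and the cache is simply a lazy record field, which Haskell evaluates at most once per value and then shares.

```haskell
module Main where

import Data.Char (isAsciiLower, isAsciiUpper, isDigit, ord)
import Numeric (showHex)

-- Hypothetical stand-in for a FastString: the original text plus its
-- Z-encoding. The second field is lazy, so it is computed the first
-- time it is demanded and reused afterwards.
data Name = Name
  { nameText :: String
  , nameZEnc :: String
  }

mkName :: String -> Name
mkName s = Name { nameText = s, nameZEnc = concatMap zEncodeChar s }

-- Simplified Z-encoding of one character (subset of the real table).
zEncodeChar :: Char -> String
zEncodeChar c
  | c == 'z' = "zz"
  | c == 'Z' = "ZZ"
  | isAsciiLower c || isAsciiUpper c || isDigit c = [c]
  | otherwise = case lookup c table of
      Just code -> code
      Nothing   -> 'z' : hexOf c ++ "U"  -- generic escape: z<hex>U
  where
    table =
      [ ('.', "zi"), ('_', "zu"), ('+', "zp"), ('-', "zm")
      , ('>', "zg"), ('<', "zl"), ('=', "ze"), ('$', "zd") ]
    -- The hex digits must start with a decimal digit, so pad with 0.
    hexOf ch = case showHex (ord ch) "" of
      h@(d : _) | isDigit d -> h
      h                     -> '0' : h

main :: IO ()
main = do
  putStrLn (nameZEnc (mkName ">>="))    -- prints zgzgze
  putStrLn (nameZEnc (mkName "\8704"))  -- U+2200, prints z2200U
```

The point of the design is that nameText stays readable for error messages, while nameZEnc costs nothing for names that are never printed to the back end.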
-- 
Best regards,
 Bulat                            mailto:bulatz at HotPOP.com





More information about the Haskell-prime mailing list