[Haskell-cafe] Type system madness

Hugh Perkins hughperkins at gmail.com
Tue Jul 10 16:24:52 EDT 2007


We can consider three "families" of character sets:
- ASCII: 128 characters (codes 0-127), some of which are control codes
like "bell" etc
- regional encodings: China uses GB2312, western Europe uses ISO-8859-1,
America uses ... something
- Unicode: a single character set, with encodings such as UTF-8 and UTF-16

The regional encodings are optimized for their region, and they only support
characters from their own region, so the Chinese character set (GB2312)
contains the common Chinese characters, and the English letters, but it
doesn't contain, for example, French characters like ç.

Similarly ISO-8859-1 contains the characters for most western European
languages (I think), but it doesn't contain the Chinese characters.
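
To make the coverage point concrete, here is a small sketch in Python (chosen
only because its standard library ships codecs for all of these encodings).
The helper name can_encode is made up for this example; it just asks a codec
whether it has any byte representation for a character.

```python
def can_encode(ch: str, codec: str) -> bool:
    """Return True if the named codec can represent the character ch."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# One English letter, one French letter, one Chinese character.
samples = ["A", "\u00e7", "\u4e2d"]  # A, c-cedilla, zhong

for codec in ("ascii", "gb2312", "iso-8859-1", "utf-8"):
    coverage = {ch: can_encode(ch, codec) for ch in samples}
    print(codec, coverage)
```

Each regional codec accepts the characters of its own region and rejects the
other region's; only the Unicode encoding accepts all three.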

Unicode contains the characters from *all* the world's languages combined.
UTF-16 encodes each character using 2 or 4 bytes.  UTF-8 encodes each
character using 1 to 4 bytes.
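
A quick way to see those byte counts, again as a Python sketch. The function
name encoded_sizes is my own, and the four sample characters are just ones I
picked to hit each size.

```python
def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-le" is used rather than "utf-16" so that Python does not
    # prepend a 2-byte byte-order mark to the result.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


samples = [
    "A",           # U+0041, plain ASCII letter
    "\u00e9",      # U+00E9, e with acute accent
    "\u4e2d",      # U+4E2D, a Chinese character
    "\U0001D11E",  # U+1D11E, musical G clef (outside the first 65536 codes)
]

for ch in samples:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```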

Basically the characters 0-127 are identical between ASCII and UTF-8.  A
byte of 128 or above is a flag saying that the character spans more than one
byte: the first byte of the sequence tells you how many bytes to read (2, 3
or 4), and each following byte is marked as a continuation byte.
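
The flag scheme can be sketched roughly like this in Python. The function
utf8_sequence_length is a hypothetical helper, not a library call, and it only
inspects the leading bits; a real decoder also rejects a few lead bytes (such
as 0xC0 and 0xC1) that can never appear in well-formed UTF-8.

```python
def utf8_sequence_length(lead: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with byte `lead`."""
    if lead < 0x80:   # 0xxxxxxx: same as ASCII, a single byte
        return 1
    if lead < 0xC0:   # 10xxxxxx: continuation byte, cannot start a sequence
        raise ValueError("continuation byte cannot start a sequence")
    if lead < 0xE0:   # 110xxxxx: one continuation byte follows
        return 2
    if lead < 0xF0:   # 1110xxxx: two continuation bytes follow
        return 3
    if lead < 0xF8:   # 11110xxx: three continuation bytes follow
        return 4
    raise ValueError("not a UTF-8 lead byte")


# Compare the prediction from the first byte with what the codec produced.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001D11E"]:
    encoded = ch.encode("utf-8")
    print(encoded.hex(" "), "->", utf8_sequence_length(encoded[0]), "byte(s)")
```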

UTF-16 kind of sucks because it's not compatible with ASCII, and it uses
twice as many bytes for English characters.  On the other hand it's what
Windows NT uses.  UTF-8 is compatible with ASCII, but it can use more bytes
than UTF-16 for certain non-English characters (Chinese characters typically
take 3 bytes in UTF-8 versus 2 in UTF-16).
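
Here is what "compatible with ASCII" means in practice, as a Python sketch.
The helper as_seen_by_ascii_tool is an invented name; it encodes some text and
then reads the bytes back the way an ASCII-only program would.

```python
def as_seen_by_ascii_tool(text: str, codec: str) -> str:
    """Encode text with codec, then read the bytes back as if they were ASCII."""
    return text.encode(codec).decode("ascii")


english = "Hi"
chinese = "\u4e2d\u6587"  # the two characters of "Chinese (language)"

# UTF-8 bytes for English text are exactly the ASCII bytes.
print(repr(as_seen_by_ascii_tool(english, "utf-8")))

# UTF-16 puts a zero byte next to every English letter, so an ASCII-only
# program sees NUL characters interleaved with the text.
print(repr(as_seen_by_ascii_tool(english, "utf-16-le")))

# For Chinese text the size advantage flips in favour of UTF-16.
print(len(chinese.encode("utf-8")), len(chinese.encode("utf-16-le")))
```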

On 7/10/07, Andrew Coppin <andrewcoppin at btinternet.com> wrote:
>
> (BTW, I always wondered how the Asian and Chinese people do any work
> with computers, given that the ASCII character set doesn't even include
> any characters in their alphabet...)
>