[newbie] UTF-8

Wolfgang Jeltsch wolfgang@jeltsch.net
Mon, 11 Aug 2003 00:49:05 +0200


On Sunday, 2003-08-10, 19:27, CEST, Danon'. wrote:
> Hi,
>
> We try to make a program which write on stdout the UTF-8 character
> corresponding to an input unicode value.

UTF-8 encodes each Unicode value as a sequence of octets. So there are two 
mistakes in your sentence above:
    1. You want to output octets (i.e., 8-bit words), not characters. (In
       Haskell 98, a character is always a Unicode code value, although, in
       practice, not all Haskell systems support Unicode.)
    2. One character (i.e., Unicode code value) is not always converted to a
       single octet but often to a sequence of octets.
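
To make both points concrete, here is a minimal sketch of the mapping from one
character to its octets. The names encodeChar and encodeString are my own, and
octets are represented as plain Int values in the range 0 to 255. It does not
reject surrogate code values, so take it as an illustration rather than a
finished encoder.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)

-- | Encode one character as its UTF-8 octets (each Int is in 0..255).
-- Hypothetical helper; no check for surrogates (U+D800..U+DFFF).
encodeChar :: Char -> [Int]
encodeChar c
  | n < 0x80    = [n]                                   -- 0xxxxxxx
  | n < 0x800   = [0xC0 .|. (n `shiftR` 6), cont n]     -- 110xxxxx 10xxxxxx
  | n < 0x10000 = [ 0xE0 .|. (n `shiftR` 12)            -- 1110xxxx + 2 more
                  , cont (n `shiftR` 6), cont n ]
  | otherwise   = [ 0xF0 .|. (n `shiftR` 18)            -- 11110xxx + 3 more
                  , cont (n `shiftR` 12), cont (n `shiftR` 6), cont n ]
  where
    n = ord c
    -- Continuation octet: 10xxxxxx carrying the low six bits of x.
    cont x = 0x80 .|. (x .&. 0x3F)

-- | Encode a whole string by concatenating the per-character octets.
encodeString :: String -> [Int]
encodeString = concatMap encodeChar
```

For example, encodeChar 'A' is the single octet [0x41], while
encodeChar '\xE9' is the two octets [0xC3, 0xA9].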

> [...]

The main problem is that you need binary I/O. Haskell 98 only provides text 
I/O.

Text I/O involves the use of an encoding which maps between the octets of the 
actual I/O stream and the characters Haskell sends or receives. Hugs and GHC, 
at least, seem to use Latin-1 as the encoding, which just means that they 
map the octets 0 to 255 to the characters with Unicode codes 0 to 255.

The other point with text I/O is that, under Windows, the EOF character ^Z is 
treated specially and a conversion between Windows EOLs (^M^J) and Haskell 
EOLs (^J) takes place. Hugs and GHC provide the function openFileEx, which 
allows you to turn all these Windows-specific things off.

So an easy way to read or write octets from/to a file might be to open the 
file via openFileEx and convert characters to octets via Char.ord or octets 
to characters via Char.chr, respectively.
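
A rough sketch of that approach follows. Note the assumption: instead of
openFileEx it uses openBinaryFile and hSetBinaryMode from System.IO, which are
not mentioned above and may not exist in every Haskell system; where they are
missing, openFileEx in binary mode would take their place. The names
writeOctets, readOctets and putOctets are made up for this example, and octets
are expected to be Int values from 0 to 255.

```haskell
import Data.Char (chr, ord)
import System.IO

-- | Write octets (values 0..255) to a file with no EOL or EOF translation.
-- Assumes System.IO.openBinaryFile is available.
writeOctets :: FilePath -> [Int] -> IO ()
writeOctets path octets = do
  h <- openBinaryFile path WriteMode
  hPutStr h (map chr octets)   -- each character 0..255 becomes one octet
  hClose h

-- | Read a whole file back as a list of octets.
readOctets :: FilePath -> IO [Int]
readOctets path = do
  h <- openBinaryFile path ReadMode
  s <- hGetContents h
  -- hGetContents is lazy, so force the string before closing the handle.
  length s `seq` hClose h
  return (map ord s)

-- | Write octets to stdout after switching it to binary mode.
putOctets :: [Int] -> IO ()
putOctets octets = do
  hSetBinaryMode stdout True
  hPutStr stdout (map chr octets)
```

With these, putOctets [0xC3, 0xA9] should send exactly two octets to stdout,
regardless of the platform's text-mode conventions.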

The conversion between characters and their UTF-8 encodings shouldn't be too 
difficult for you to implement yourself. Alternatively, you might want to 
look at http://sourceforge.net/projects/haskell-i18n/.
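
If you do write it yourself, the decoding direction (octets back to
characters) could look roughly like this. It is only a sketch under a few
assumptions: the name decodeOctets is my own, input octets are Int values from
0 to 255, malformed input yields Nothing, and overlong encodings and
surrogates are not rejected as a strict decoder would have to do.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)

-- | Decode a list of UTF-8 octets into a string.
-- Returns Nothing on a bad lead octet, a missing or invalid continuation
-- octet, or a value above U+10FFFF. Hypothetical helper, not a strict decoder.
decodeOctets :: [Int] -> Maybe String
decodeOctets [] = Just []
decodeOctets (b : bs)
  | b >= 0 && b < 0x80 = fmap (chr b :) (decodeOctets bs)  -- 0xxxxxxx
  | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F)              -- 110xxxxx
  | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F)              -- 1110xxxx
  | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07)              -- 11110xxx
  | otherwise          = Nothing
  where
    -- Consume k continuation octets and append their six payload bits each.
    multi k lead =
      let (conts, rest) = splitAt k bs
          n = foldl (\acc x -> (acc `shiftL` 6) .|. (x .&. 0x3F)) lead conts
      in if length conts == k && all isCont conts && n <= 0x10FFFF
           then fmap (chr n :) (decodeOctets rest)
           else Nothing
    -- Continuation octets have the form 10xxxxxx.
    isCont x = x .&. 0xC0 == 0x80
```

For instance, decodeOctets [0xC3, 0xA9] gives Just "\233", and a truncated
sequence such as [0xC3] gives Nothing.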

> Niko.

> [...]

Wolfgang