[newbie] UTF-8
Wolfgang Jeltsch
wolfgang@jeltsch.net
Mon, 11 Aug 2003 00:49:05 +0200
On Sunday, 2003-08-10, 19:27, CEST, Danon'. wrote:
> Hi,
>
> We try to make a program which write on stdout the UTF-8 character
> corresponding to an input unicode value.
UTF-8 encodes each Unicode code value as a sequence of octets. So there are
two mistakes in your sentence above:
1. You want to output octets (i.e., 8-bit words), not characters. (In
Haskell 98, a character is always a Unicode code value, although, in
practice, not all Haskell systems support Unicode.)
2. One character (i.e., Unicode code value) is not always converted to a
single octet but often to a sequence of octets.
> [...]
The main problem is that you need binary I/O. Haskell 98 only provides text
I/O.
Text I/O involves the use of an encoding which maps between the octets of the
actual I/O stream and the characters Haskell sends or receives. Hugs and GHC,
at least, seem to use Latin-1 as the encoding, which just means that they map
the octets 0 to 255 to the characters with Unicode codes 0 to 255.
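To make the Latin-1 mapping concrete, here is a minimal sketch. The names
latin1Decode and latin1Encode are made up for illustration and are not part
of Hugs, GHC or any library; octets are represented as plain Ints.

```haskell
import Data.Char (chr, ord)

-- Hypothetical helper: decode Latin-1 octets (each expected to be in
-- 0..255) to characters. Octet n simply becomes the character with code n.
latin1Decode :: [Int] -> String
latin1Decode = map chr

-- Hypothetical helper: encode characters as Latin-1 octets. Characters
-- with a code above 255 have no Latin-1 representation, so the whole
-- conversion yields Nothing if one occurs.
latin1Encode :: String -> Maybe [Int]
latin1Encode = mapM enc
  where
    enc c
      | ord c <= 255 = Just (ord c)
      | otherwise    = Nothing
```

For instance, latin1Encode "\233" gives Just [233], while
latin1Encode "\8364" (the euro sign) gives Nothing.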
The other point with text I/O is that under Windows the EOF character ^Z is
treated specially and a conversion between Windows EOLs (^M^J) and Haskell
EOLs (^J) takes place. Hugs and GHC provide the function openFileEx which
allows you to turn all these Windows-specific things off.
So an easy way to read or write octets from/to a file might be to open the
file via openFileEx and convert characters to octets via Char.ord or octets
to characters via Char.chr, respectively.
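A rough sketch of this approach follows. Note the assumptions: it uses
withBinaryFile from the hierarchical System.IO module of newer GHC versions
as a stand-in for openFileEx, and Data.Char instead of the Haskell 98 Char
module. The function names writeOctets and readOctets are made up for
illustration.

```haskell
import Data.Char (chr, ord)
import System.IO

-- Hypothetical helper: write octets to a file, one character per octet.
-- The caller must make sure every value is in the range 0..255.
-- Assumption: withBinaryFile plays the role of openFileEx here; it opens
-- the handle without EOL conversion or special treatment of ^Z.
writeOctets :: FilePath -> [Int] -> IO ()
writeOctets path octets =
  withBinaryFile path WriteMode $ \h ->
    hPutStr h (map chr octets)

-- Hypothetical helper: read a whole file as a list of octets.
readOctets :: FilePath -> IO [Int]
readOctets path =
  withBinaryFile path ReadMode $ \h -> do
    s <- hGetContents h
    let octets = map ord s
    -- hGetContents is lazy, so force the whole list before the handle
    -- is closed on leaving withBinaryFile.
    length octets `seq` return octets
```

With these, writeOctets "out.bin" [13, 10, 26, 255] followed by
readOctets "out.bin" should give back the same four values, including
under Windows.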
The conversion between characters and their UTF-8 encodings shouldn't be too
difficult for you to implement yourself. Alternatively, you might want to
look at http://sourceforge.net/projects/haskell-i18n/.
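In case it helps, here is a minimal sketch of the encoding direction. It is
not taken from the haskell-i18n project; the names encodeChar and
encodeString are made up, octets are represented as Ints, and Data.Bits from
the hierarchical libraries is assumed to be available. It covers code values
up to 0x10FFFF with at most four octets and does not reject surrogate code
values.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)

-- Hypothetical helper: encode one character as its UTF-8 octets.
-- The number of octets depends on the size of the code value.
encodeChar :: Char -> [Int]
encodeChar c
  | n < 0x80    = [n]
  | n < 0x800   = [0xC0 .|. (n `shiftR` 6), cont n]
  | n < 0x10000 = [0xE0 .|. (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  | otherwise   = [ 0xF0 .|. (n `shiftR` 18)
                  , cont (n `shiftR` 12)
                  , cont (n `shiftR` 6)
                  , cont n
                  ]
  where
    n = ord c
    -- A continuation octet carries the low six bits of its argument
    -- behind the marker bits 10.
    cont x = 0x80 .|. (x .&. 0x3F)

-- Hypothetical helper: encode a whole string by concatenating the
-- octet sequences of its characters.
encodeString :: String -> [Int]
encodeString = concatMap encodeChar
```

For example, encodeChar '\233' (e with acute accent) yields [195,169] and
encodeString "A\8364" (letter A followed by the euro sign) yields
[65,226,130,172]. The resulting octets can then be written to a binary
handle by converting each one to a character via Char.chr.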
> Niko.
> [...]
Wolfgang