UTF-8 library

Martin Norbäck d95mback@dtek.chalmers.se
08 Aug 2002 14:58:42 +0200


On Thu, 2002-08-08 at 14:18, Manuel M T Chakravarty wrote:
> Ashley Yakeley <ashley@semantic.org> wrote,
>
> > At 2002-08-08 02:28, Manuel M T Chakravarty wrote:
> >
> > >ANSI C guarantees that char is 1 byte (more precisely that
> > >"sizeof (char)" == 1).
> >
> > That's also what the C++ ARM says (which I have to hand). Unfortunately,
> >
> >     "a byte is undefined by the language except in terms of
> >     sizeof; sizeof(char) is 1." [sec. 5.3.2]
> >
> > Maybe ANSI C is different?
>
> As I understand it, in ANSI C, the only freedom that an
> implementation has in choosing a concrete representation for
> "char" is to decide whether it is signed or unsigned.  In
> any case, it is going to be an 8 bit entity.

No, ANSI C just says that sizeof measures other things in chars. So
sizeof(char) is always 1, but 1 could mean 8, 9, 16 or 17 bits depending
on the architecture.

However, I've yet to see an architecture where a C char is not 8 bits,
and I doubt that there ever will be. So assuming char = 8 bits is not
going to make things any worse, since it's already implicitly assumed in
many places.

Anyway, UTF-8 is, as stated before, an octet stream, so the natural
choice would be to represent UTF-8 encoded text as [Word8].
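
To make the [Word8] representation concrete, here is a minimal sketch of
an encoder from Char to UTF-8 octets. The names encodeChar and
encodeString are hypothetical, not part of any existing library, and the
sketch assumes the input contains no surrogate code points (it does not
reject them).

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Encode one Char as 1 to 4 UTF-8 octets.
-- Assumption: surrogate code points (U+D800..U+DFFF) are not rejected.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6,  cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6,  cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c

    hi, cont :: Int -> Word8
    -- The leading bits of the code point, shifted down by k.
    hi k = fromIntegral (n `shiftR` k)
    -- A continuation octet (10xxxxxx) carrying six bits starting at bit k.
    cont k = 0x80 .|. fromIntegral ((n `shiftR` k) .&. 0x3F)

-- | Encode a whole String as a UTF-8 octet stream.
encodeString :: String -> [Word8]
encodeString = concatMap encodeChar
```

With these definitions, encodeString "\xE4" gives the two octets
[0xC3, 0xA4], whereas the same character in ISO-8859-1 is the single
octet 0xE4.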

putChar (and putStr) should output UTF-8 text if the locale is UTF-8,
and getChar (and getLine) should input UTF-8 text if the locale is
UTF-8.

This is the only inference you can draw from the fact that a Char
is a Unicode character (not ISO-8859-1, not ASCII).

There should rarely be a need to handle UTF-8 text internally in
Haskell, but for the FFI it would be necessary. Using the locale
automatically there is wrong, since gtk2 always uses UTF-8 and other
interfaces always use ISO-8859-1. Still, having some conversion
functions could never hurt, but they need not be in the FFI.
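
As a rough illustration of what such conversion functions might look
like, here is a sketch that picks the encoding explicitly per interface
instead of consulting the locale. All names (latin1ToString,
stringToLatin1, utf8ToString) are hypothetical. The UTF-8 decoder is
simplified: it returns Nothing on truncated or malformed sequences but
does not reject overlong forms or surrogates.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | ISO-8859-1 octets map one-to-one onto the first 256 code points.
latin1ToString :: [Word8] -> String
latin1ToString = map (chr . fromIntegral)

-- | Nothing if any character lies outside ISO-8859-1.
stringToLatin1 :: String -> Maybe [Word8]
stringToLatin1 = mapM conv
  where
    conv c
      | ord c < 0x100 = Just (fromIntegral (ord c))
      | otherwise     = Nothing

-- | Decode a UTF-8 octet stream; Nothing on malformed input.
-- Simplification: overlong forms and surrogates are not rejected.
utf8ToString :: [Word8] -> Maybe String
utf8ToString [] = Just []
utf8ToString (b:bs)
  | b < 0x80           = (chr (fromIntegral b) :) <$> utf8ToString bs
  | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F)
  | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F)
  | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07)
  | otherwise          = Nothing
  where
    -- Consume k continuation octets following a lead octet.
    multi :: Int -> Word8 -> Maybe String
    multi k lead
      | length conts == k && all isCont conts && n <= 0x10FFFF
                  = (chr n :) <$> utf8ToString rest
      | otherwise = Nothing
      where
        (conts, rest) = splitAt k bs
        n = foldl step (fromIntegral lead) conts

    -- Append the six payload bits of a continuation octet.
    step :: Int -> Word8 -> Int
    step acc x = (acc `shiftL` 6) .|. fromIntegral (x .&. 0x3F)

    isCont :: Word8 -> Bool
    isCont x = x .&. 0xC0 == 0x80
```

A binding to gtk2 would then go through the UTF-8 functions, and a
binding to an ISO-8859-1 interface through the Latin-1 ones, regardless
of what the locale says.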

Regards,

	Martin

--
Martin Norbäck          d95mback@dtek.chalmers.se
Kapplandsgatan 40       +46 (0)708 26 33 60
S-414 78  GÖTEBORG      http://www.dtek.chalmers.se/~d95mback/
SWEDEN                  OpenPGP ID: 3FA8580B
