UTF-8 library
Axel Simon
A.Simon@ukc.ac.uk
Wed, 7 Aug 2002 10:28:08 +0100
On Tue, Aug 06, 2002 at 07:57:50PM +0200, George Russell wrote:
> Axel Simon wrote:
> >
> > On Tue, Aug 06, 2002 at 06:11:04PM +0200, George Russell wrote:
> > [snip]
> > > Converting CStrings to [Word8] is probably a bad idea anyway, since there is
> > > absolutely no reason to assume a C character will be only 8 bits long, and
> > > under some implementations it isn't.
> > But the interface should be practical. I do not really want to write
> > Haskell programs for architectures where the smallest addressable memory
> > entity (i.e. C's char) is something other than 8 bits.
> Yes, but if all we are talking about is practicality, I *do* want to convert
> between CStrings and ordinary Strings in the conventional way, and I bet lots
> of other people do.
Let's stick to CChar and provide conversion functions then! (See below)
> I suggest that the only requirements on the CChar to Char conversion be that
> it is a total function, that the printable ASCII characters map in the
> obvious way, and that the results are consistent during the course of
> a Haskell program. However, I see no reason why the conversion should not take
> account of the locale when the program starts and so translate (say) KOI-8 encoded
> Cyrillic characters to their Unicode equivalents.
>
> For Char to CChar it would surely be easiest to produce an IO error if the character
> doesn't fit.
For safety reasons I think the user should be aware of what he is doing.
Just using withCString doesn't make the user aware of possible problems. I
guess we need:
encodeISO_8859_1 :: String -> [CChar]
encodeISO_8859_1 = map castCharToCChar  -- truncates code points above 255
encodeUTF_8 = ...
encodeDefault = case currentCodeset of
  ISO_8859_1 -> encodeISO_8859_1
  UTF_8      -> encodeUTF_8
writeFile fname str = writeBinaryFile fname (encodeDefault str)
withCString str = withArray0 0 (encodeDefault str)
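To make the encoding half concrete, here is a minimal sketch of what the two encoders and the dispatch could look like. The names Codeset, Latin1, Utf8, encodeLatin1, encodeUtf8 and encodeWith are hypothetical and only stand in for the functions outlined above; the Latin-1 encoder here is assumed to report characters that do not fit instead of truncating them, and the UTF-8 encoder does not reject surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Foreign.C.Types (CChar)

-- Hypothetical codeset tag, standing in for whatever currentCodeset returns.
data Codeset = Latin1 | Utf8

-- ISO-8859-1: every code point up to 0xFF is its own byte.
-- Returns Nothing if any character does not fit.
encodeLatin1 :: String -> Maybe [CChar]
encodeLatin1 = mapM enc
  where
    enc c
      | ord c <= 0xFF = Just (fromIntegral (ord c))
      | otherwise     = Nothing

-- UTF-8: one to four bytes per code point.
-- Assumption: surrogate code points are encoded like any other value.
encodeUtf8 :: String -> [CChar]
encodeUtf8 = map fromIntegral . concatMap (bytes . ord)
  where
    bytes :: Int -> [Int]
    bytes n
      | n < 0x80    = [n]
      | n < 0x800   = [0xC0 .|. (n `shiftR` 6), cont n]
      | n < 0x10000 = [0xE0 .|. (n `shiftR` 12), cont (n `shiftR` 6), cont n]
      | otherwise   = [ 0xF0 .|. (n `shiftR` 18), cont (n `shiftR` 12)
                      , cont (n `shiftR` 6), cont n ]
    -- Continuation byte: 10xxxxxx carrying the low six bits.
    cont x = 0x80 .|. (x .&. 0x3F)

-- Dispatch on the codeset, mirroring the role of encodeDefault.
encodeWith :: Codeset -> String -> Maybe [CChar]
encodeWith Latin1 = encodeLatin1
encodeWith Utf8   = Just . encodeUtf8
```

With this, encodeWith Latin1 on a string containing a Cyrillic letter yields Nothing, which is one way of making the user aware that the conversion can fail.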
And the other way round:
decodeISO_8859_1 :: [CChar] -> String
decodeISO_8859_1 = map castCCharToChar  -- treats each CChar as one unsigned byte
decodeUTF_8 = ...
decodeGuess ('<magic number for UTF-8>':xs) = decodeUTF_8 xs
decodeGuess ...
decodeGuess xs = decodeDefault xs
decodeDefault = case currentCodeset of ...
readFile fname = liftM decodeDefault $ readBinaryFile fname
peekCString sPtr = liftM decodeDefault $ peekArray0 0 sPtr
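A rough sketch of the decoding direction, under a few assumptions that are not settled above: the "magic number" is taken to be the UTF-8 byte order mark (EF BB BF), Latin-1 stands in for decodeDefault, and malformed UTF-8 is replaced by U+FFFD rather than raising an error. The names byte, decodeLatin1, decodeUtf8 and decodeGuess are illustrative only. Note that a CChar may be signed, so it is converted to an unsigned byte value before calling chr.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)
import Foreign.C.Types (CChar)

-- View a (possibly signed) CChar as an unsigned byte value 0..255.
byte :: CChar -> Int
byte c = fromIntegral (fromIntegral c :: Word8)

-- ISO-8859-1: each byte is its own code point.
decodeLatin1 :: [CChar] -> String
decodeLatin1 = map (chr . byte)

-- Lenient UTF-8 decoder: malformed or truncated sequences become U+FFFD.
-- Assumption: overlong forms and surrogates are not rejected.
decodeUtf8 :: [CChar] -> String
decodeUtf8 = go . map byte
  where
    go [] = []
    go (b:bs)
      | b < 0x80           = chr b : go bs
      | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F) bs
      | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F) bs
      | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07) bs
      | otherwise          = '\xFFFD' : go bs
    -- Consume n continuation bytes and fold them into the accumulator.
    multi n acc bs =
      let (conts, rest) = splitAt n bs
      in if length conts == n && all isCont conts
           then safeChr (foldl step acc conts) : go rest
           else '\xFFFD' : go bs
    isCont x = x .&. 0xC0 == 0x80
    step a x = (a `shiftL` 6) .|. (x .&. 0x3F)
    safeChr v
      | v > 0x10FFFF = '\xFFFD'
      | otherwise    = chr v

-- Use UTF-8 if the input starts with a byte order mark, else fall back
-- to Latin-1 (standing in for decodeDefault).
decodeGuess :: [CChar] -> String
decodeGuess cs = case map byte cs of
  (0xEF : 0xBB : 0xBF : _) -> decodeUtf8 (drop 3 cs)
  _                        -> decodeLatin1 cs
```

Whether a decoder should substitute a replacement character or fail is a design choice; the lenient variant is shown only because it keeps the function total.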
Does that sound reasonable? In the documentation of GHC it says:
data CChar = CChar Int8
Is this determined during compilation of GHC?
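Independently of what the documentation says, the representation can be probed on a given installation. The following is only a sketch that relies on the Storable and Bounded instances of CChar; the names ccharSize and ccharIsSigned are made up for illustration.

```haskell
import Foreign.C.Types (CChar)
import Foreign.Storable (sizeOf)

-- Size of CChar in bytes as seen by this compiler.
ccharSize :: Int
ccharSize = sizeOf (undefined :: CChar)

-- True if CChar is represented by a signed type on this platform.
ccharIsSigned :: Bool
ccharIsSigned = (minBound :: CChar) < 0

-- Human-readable summary of the representation.
ccharReport :: String
ccharReport =
  "sizeOf CChar = " ++ show ccharSize
    ++ ", signed = " ++ show ccharIsSigned
    ++ ", range = " ++ show (minBound :: CChar, maxBound :: CChar)
```

Printing ccharReport with putStrLn shows the values for the platform at hand; the signedness in particular may differ between platforms.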
Axel.