UTF-8 library

Axel Simon A.Simon@ukc.ac.uk
Wed, 7 Aug 2002 10:28:08 +0100


On Tue, Aug 06, 2002 at 07:57:50PM +0200, George Russell wrote:
> Axel Simon wrote:
> > 
> > On Tue, Aug 06, 2002 at 06:11:04PM +0200, George Russell wrote:
> > [snip]
> > > Converting CStrings to [Word8] is probably a bad idea anyway, since there is
> > > absolutely no reason to assume a C character will be only 8 bits long, and
> > > under some implementations it isn't.
> > But the interface should be practical. I do not really want to write
> > Haskell programs for architectures where the smallest addressable memory
> > entity (i.e. C's char) is something else than 8 bits.
> Yes but if all we are talking about is practicality, I *do* want to convert 
> between CString's and ordinary String's in the conventional way, and I bet lots
> of other people do.
Let's stick to CChar and provide conversion functions then! (See below)

> I suggest that for conversions CChar to Char all that should be required should
> be that it be a total function and that the printable ASCII characters map in
> the obvious way and that the results should be consistent during the course of
> a Haskell program.  However I see no reason why the conversion should not take 
> account of locale when the program starts and so translate (say) KOI-8 encoded
> Cyrillic characters to their Unicode equivalents.
> 
> For Char to CChar it would surely be easiest to produce an IO error if the character
> doesn't fit.
For safety reasons I think the user should be aware of what he is doing. 
Just using withCString doesn't make the user aware of possible problems. I 
guess we need:

encodeISO_8859_1 :: String -> [CChar]
encodeISO_8859_1 = map castCharToCChar   -- only safe for characters up to '\xFF'

encodeUTF_8 = ...

encodeDefault = case currentCodeset of
  ISO_8859_1 -> encodeISO_8859_1
  UTF_8      -> encodeUTF_8

writeFile fname str = writeBinaryFile fname (encodeDefault str)

withCString str = withArray0 0 (encodeDefault str)
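
To make the encoding side concrete, here is a minimal sketch of what the two encoders could look like. It is not a proposal for the final library code. The names encodeLatin1, encodeCharUTF8 and encodeUTF8 are made up for this illustration, and reporting the offending character through Either (rather than raising an IO error) is an assumption on my part.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Foreign.C.Types (CChar)

-- Latin-1 encoding that reports the first character above U+00FF
-- instead of silently truncating it (hypothetical name).
encodeLatin1 :: String -> Either Char [CChar]
encodeLatin1 = mapM enc
  where
    enc c
      | ord c <= 0xFF = Right (fromIntegral (ord c))
      | otherwise     = Left c

-- UTF-8 encoding of one code point into 1 to 4 bytes.
-- Surrogate code points are not rejected in this sketch.
encodeCharUTF8 :: Char -> [CChar]
encodeCharUTF8 c
  | n < 0x80    = [byte n]
  | n < 0x800   = [byte (0xC0 .|. (n `shiftR` 6)), cont n]
  | n < 0x10000 = [ byte (0xE0 .|. (n `shiftR` 12))
                  , cont (n `shiftR` 6), cont n ]
  | otherwise   = [ byte (0xF0 .|. (n `shiftR` 18))
                  , cont (n `shiftR` 12), cont (n `shiftR` 6), cont n ]
  where
    n      = ord c
    -- fromIntegral wraps values above 0x7F when CChar is signed;
    -- the bit pattern of the byte is preserved.
    byte   = fromIntegral :: Int -> CChar
    -- Continuation byte: 10xxxxxx carrying the low six bits of x.
    cont x = byte (0x80 .|. (x .&. 0x3F))

encodeUTF8 :: String -> [CChar]
encodeUTF8 = concatMap encodeCharUTF8
```

With something like this, a caller who picks the Latin-1 encoder has to deal with the Left case explicitly, which is the kind of awareness I am after.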

And the other way round:

decodeISO_8859_1 :: [CChar] -> String
decodeISO_8859_1 = map castCCharToChar

decodeUTF_8 = ...

decodeGuess ('<magic number for UTF-8>':xs) = decodeUTF_8 xs
decodeGuess ...
decodeGuess = decodeDefault 

decodeDefault = case currentCodeset of ...

readFile fname = liftM decodeDefault $ readBinaryFile fname

peekCString sPtr = liftM decodeDefault $ peekArray0 0 sPtr
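
The decoding direction could be sketched along the following lines. This is only an illustration under a few assumptions: I take the "magic number" to be the UTF-8 byte order mark (EF BB BF), the fallback is plain Latin-1 as a stand-in for the locale default, and the names toByte, decodeLatin1, decodeUTF8 and decodeGuess are hypothetical. The UTF-8 decoder is lenient rather than a full validator: malformed input becomes U+FFFD, and not every overlong form is rejected.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)
import Foreign.C.Types (CChar)

-- Reinterpret a (possibly signed) CChar as a byte value in 0..255.
toByte :: CChar -> Int
toByte c = fromIntegral (fromIntegral c :: Word8)

decodeLatin1 :: [CChar] -> String
decodeLatin1 = map (chr . toByte)

-- Lenient UTF-8 decoder (hypothetical): bad sequences yield U+FFFD.
decodeUTF8 :: [CChar] -> String
decodeUTF8 = go . map toByte
  where
    go [] = []
    go (b:bs)
      | b < 0x80              = chr b : go bs
      | b >= 0xC2 && b < 0xE0 = multi 1 (b .&. 0x1F) bs
      | b >= 0xE0 && b < 0xF0 = multi 2 (b .&. 0x0F) bs
      | b >= 0xF0 && b < 0xF5 = multi 3 (b .&. 0x07) bs
      | otherwise             = '\xFFFD' : go bs
    -- Consume k continuation bytes, or resynchronise after the lead byte.
    multi k acc bs =
      let (cs, rest) = splitAt k bs
      in if length cs == k && all isCont cs
           then safeChr (foldl step acc cs) : go rest
           else '\xFFFD' : go bs
    step a c = (a `shiftL` 6) .|. (c .&. 0x3F)
    isCont c = c .&. 0xC0 == 0x80
    safeChr n
      | n > 0x10FFFF || (n >= 0xD800 && n <= 0xDFFF) = '\xFFFD'
      | otherwise                                    = chr n

-- Guess the encoding: a leading byte order mark selects UTF-8,
-- anything else falls back to Latin-1 (assumed default).
decodeGuess :: [CChar] -> String
decodeGuess cs = case map toByte cs of
  0xEF : 0xBB : 0xBF : _ -> decodeUTF8 (drop 3 cs)
  _                      -> decodeLatin1 cs
```

Going through Word8 in toByte matters because a signed CChar would otherwise hand negative numbers to chr.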

Does that sound reasonable? The GHC documentation says:
data CChar = CChar Int8
Is this determined during compilation of GHC?
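
Whatever the answer, a program can at least inspect what its own compiler and platform chose. The following is a small sketch using only sizeOf and the Bounded instance of CChar; the names cCharSize, cCharIsSigned and cCharRange are made up here. As far as I understand it, the representation is picked when the compiler's libraries are configured for the target platform, but please treat that as an assumption, not a statement from the documentation.

```haskell
import Foreign.C.Types (CChar)
import Foreign.Storable (sizeOf)

-- Size of CChar in bytes as seen by this compiler and platform.
cCharSize :: Int
cCharSize = sizeOf (undefined :: CChar)

-- True when CChar is a signed type on this platform.
cCharIsSigned :: Bool
cCharIsSigned = (minBound :: CChar) < 0

-- Value range of CChar, widened to Integer for printing.
cCharRange :: (Integer, Integer)
cCharRange = (toInteger (minBound :: CChar), toInteger (maxBound :: CChar))
```

Printing these three values from a test program shows whether the Int8 in the documentation matches the installation at hand.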

Axel.