Unicode

John Meacham john@repetae.net
Fri, 25 May 2001 13:12:49 -0700


The algorithms for encoding Unicode characters into the various
transport formats (UTF-8, UTF-16, UTF-32) are well defined, and they can
trivially be implemented in Haskell. For instance,

    encodeUTF8 :: String -> [Byte]
    decodeUTF8 :: [Byte] -> Maybe String

would be easy to define as ordinary pure functions.
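
To make that concrete, here is a rough sketch of the two functions with
the signatures above. It is only an illustration, not a proposal for a
library API: the helper names (encodeChar, multi, isCont) are made up
here, Byte is assumed to be a synonym for Word8, and the encoder does
not reject lone surrogate code points in its input.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Assumed synonym; not (yet) provided by any standard library.
type Byte = Word8

-- Encode each Char as 1 to 4 bytes, depending on its code point.
-- Note: surrogate code points (U+D800..U+DFFF) are not rejected here.
encodeUTF8 :: String -> [Byte]
encodeUTF8 = concatMap (encodeChar . ord)
  where
    encodeChar c
      | c < 0x80    = [fromIntegral c]
      | c < 0x800   = [lead 0xC0 6, cont 0]
      | c < 0x10000 = [lead 0xE0 12, cont 6, cont 0]
      | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]
      where
        -- Lead byte: length tag plus the top bits of the code point.
        lead tag n = fromIntegral (tag .|. (c `shiftR` n))
        -- Continuation byte: 10xxxxxx carrying six payload bits.
        cont n     = fromIntegral (0x80 .|. ((c `shiftR` n) .&. 0x3F))

-- Decode a byte list, returning Nothing on truncated input, stray
-- continuation bytes, overlong forms, surrogates, or values above
-- U+10FFFF.
decodeUTF8 :: [Byte] -> Maybe String
decodeUTF8 [] = Just []
decodeUTF8 (b : bs)
  | b < 0x80           = (chr (fromIntegral b) :) <$> decodeUTF8 bs
  | b .&. 0xE0 == 0xC0 = multi 1 (fromIntegral b .&. 0x1F) 0x80
  | b .&. 0xF0 == 0xE0 = multi 2 (fromIntegral b .&. 0x0F) 0x800
  | b .&. 0xF8 == 0xF0 = multi 3 (fromIntegral b .&. 0x07) 0x10000
  | otherwise          = Nothing
  where
    -- n continuation bytes expected; minCode rules out overlong forms.
    multi :: Int -> Int -> Int -> Maybe String
    multi n start minCode
      | length conts == n, all isCont conts, valid =
          (chr code :) <$> decodeUTF8 rest
      | otherwise = Nothing
      where
        (conts, rest) = splitAt n bs
        code          = foldl step start conts
        step acc x    = (acc `shiftL` 6) .|. (fromIntegral x .&. 0x3F)
        valid         = code >= minCode && code <= 0x10FFFF
                        && (code < 0xD800 || code > 0xDFFF)
    isCont x = x .&. 0xC0 == 0x80
```

The Maybe in the decoder's result is what lets malformed input be
reported without resorting to exceptions; a real implementation would
probably want to say where the error occurred as well.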

BTW, since a Char is no longer a byte, how about adding a standard
type Byte = Word8 declaration in a readily accessible, standard place?
Then we can start revamping the large amount of legacy code that assumes
a Char is 8 bits. The POSIX APIs in particular are no good in this
respect, and neither are many of the example programs out there on the web.
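
With such a Byte type in hand, the question quoted below about getting a
UTF-16 byte stream without going through a file could be answered the
same way. The following is a minimal sketch under the same assumptions
(Byte as a synonym for Word8); the function names encodeUTF16,
encodeUTF16BE and encodeUTF16LE are hypothetical, no byte order mark is
emitted, and lone surrogates in the input are passed through unchanged.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16, Word8)

-- Assumed synonym; not (yet) provided by any standard library.
type Byte = Word8

-- Turn a String into UTF-16 code units. Code points above U+FFFF
-- become a surrogate pair (high unit first).
encodeUTF16 :: String -> [Word16]
encodeUTF16 = concatMap (units . ord)
  where
    units c
      | c < 0x10000 = [fromIntegral c]
      | otherwise   = [ fromIntegral (0xD800 + (c' `shiftR` 10))
                      , fromIntegral (0xDC00 + (c' .&. 0x3FF)) ]
      where
        c' = c - 0x10000

-- Serialize the code units as bytes, most significant byte first.
encodeUTF16BE :: String -> [Byte]
encodeUTF16BE = concatMap bytes . encodeUTF16
  where
    bytes u = [fromIntegral (u `shiftR` 8), fromIntegral (u .&. 0xFF)]

-- Serialize the code units as bytes, least significant byte first.
encodeUTF16LE :: String -> [Byte]
encodeUTF16LE = concatMap bytes . encodeUTF16
  where
    bytes u = [fromIntegral (u .&. 0xFF), fromIntegral (u `shiftR` 8)]
```

Everything stays in memory as a plain list, so the result could be
handed to a foreign interface without touching the file system.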
	John


On Sat, May 26, 2001 at 03:17:40AM +1000, Fergus Henderson wrote:
> On 24-May-2001, Marcin 'Qrczak' Kowalczyk <qrczak@knm.org.pl> wrote:
> > Thu, 24 May 2001 14:41:21 -0700, Ashley Yakeley <ashley@semantic.org> wrote:
> > 
> > >>   - Initial Unicode support - the Char type is now 31 bits.
> > > 
> > > It might be appropriate to have two types for Unicode, a UCS2 type
> > > (16 bits) and a UCS4 type (31 bits).
> > 
> > Actually it's 20.087462841250343 bits. Unicode 3.1 extends to U+10FFFF,
> > ISO-10646-1 is said to shrink to U+10FFFF in future, so maxBound::Char
> > is '\x10FFFF' now.
> > 
> > Among encodings of Unicode in a stream of bytes there are UTF-8,
> > UTF-16 and UTF-32 (with endianness variants). AFAIK terms UCS2 and
> > UCS4 are obsolete: there is a single code space 0..0x10FFFF and
> > various ways to serialize characters.
> > 
> > Ghc is going to support conversion between internal Unicode and
> > some encodings for external byte streams. Among them there will be
> > UTF-{8,16,32} (with endianness variants), all treated as streams
> > of bytes.
> > 
> > There is no point in storing characters in UTF-16 internally.
> > Especially in ghc where characters are boxed objects, and Word16 is
> > represented as a full machine word (32 or 64 bits). UTF-16 will be
> > supported as an external encoding, parallel to ISO-8859-x etc.
> 
> What about for interfacing with Win32, MacOS X, or Java?
> Your talk about "external" versus "internal" worries me a bit,
> since the distinction between these is not always clear.
> Is there a way to convert a Haskell String into a UTF-16
> encoded byte stream without writing to a file and then
> reading the file back in?
> 
> -- 
> Fergus Henderson <fjh@cs.mu.oz.au>  |  "I have always known that the pursuit
>                                     |  of excellence is a lethal habit"
> WWW: <http://www.cs.mu.oz.au/~fjh>  |     -- the last words of T. S. Garp.

-- 
--------------------------------------------------------------
John Meacham   http://www.ugcs.caltech.edu/~john/
California Institute of Technology, Alum.  john@repetae.net
--------------------------------------------------------------