[Haskell-cafe] UTF-16

Bayley, Alistair Alistair_Bayley at invescoperpetual.co.uk
Thu Jul 26 04:29:06 EDT 2007

> From: haskell-cafe-bounces at haskell.org 
> [mailto:haskell-cafe-bounces at haskell.org] On Behalf Of Donald 
> Bruce Stewart
> andrewcoppin:
> > I don't know if anybody cares, but... Today I wrote some trivial
> > code to decode (not encode) UTF-16.
> > 
> > I believe somebody out there has a UTF-8 decoder, but I needed
> > UTF-16 as it happens.
> Perhaps you could polish it up, and provide it in a form suitable for
> use as a patch to:
>     http://code.haskell.org/utf8-string/
> that is, put it in a module:
>     Codec.Binary.UTF16.String
> and provide the functions:
>     encode :: String -> [Word8]
>     decode :: [Word8] -> String
> ? And then submit that as a patch to Eric, the utf8 maintainer.
> -- Don
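
For reference, here's a rough sketch of what such a pure decoder might
look like (big-endian, no BOM detection, ill-formed input silently
skipped; the function names are mine, not from any existing library):

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Char (chr)
import Data.Word (Word16, Word8)

-- Combine big-endian byte pairs into 16-bit code units.
toUnits :: [Word8] -> [Word16]
toUnits (hi:lo:rest) =
  ((fromIntegral hi `shiftL` 8) .|. fromIntegral lo) : toUnits rest
toUnits _ = []  -- a trailing odd byte is dropped in this sketch

-- Turn code units into Chars, combining surrogate pairs.
fromUnits :: [Word16] -> String
fromUnits [] = []
fromUnits (u:us)
  | u >= 0xD800 && u <= 0xDBFF =               -- high surrogate
      case us of
        (l:rest) | l >= 0xDC00 && l <= 0xDFFF ->
          chr (0x10000
               + (fromIntegral u - 0xD800) * 0x400
               + (fromIntegral l - 0xDC00)) : fromUnits rest
        _ -> fromUnits us                      -- unpaired: skipped here
  | u >= 0xDC00 && u <= 0xDFFF = fromUnits us  -- stray low surrogate
  | otherwise = chr (fromIntegral u) : fromUnits us

decodeUTF16BE :: [Word8] -> String
decodeUTF16BE = fromUnits . toUnits

main :: IO ()
main = putStrLn (decodeUTF16BE [0x00,0x48,0x00,0x69])  -- prints "Hi"
```

The encode direction would be the mirror image: split any Char above
U+FFFF into a surrogate pair, then split each code unit into two bytes.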

There is a UTF-16 en/decoder in Foreign.C.String (see cWcharsToChars &
charsToCWchars), but it only seems to be available to Windows users, via
the CWString functions (peekCWString, withCWString, etc.).

In Takusen we also have a UTF8 module (it's about the fourth or fifth
out there, after HXML's, John Meacham's, someone else's (Graham
Klyne's?), and one hidden away in GHC's internals). It has pure
en/decode functions (String <-> [Word8]), naturally (which we ripped off
from John Meacham), but we were more interested in efficient marshalling
from CStrings (or data buffers, if you like), so we wrote specific code
to marshal CString -> String fairly quickly and space-efficiently (see
fromUTF8Ptr, which is wrapped by peekUTF8String{Len}).
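
To illustrate the pure (String <-> [Word8]) side that all these modules
share, here is a stripped-down sketch (1- to 4-byte sequences only, no
validation of lead or continuation bytes; nothing below is Takusen's
actual code):

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Encode up to U+10FFFF as 1-4 bytes.
encodeUTF8 :: String -> [Word8]
encodeUTF8 = concatMap enc
  where
    enc c
      | n < 0x80    = [fromIntegral n]
      | n < 0x800   = [0xC0 .|. fromIntegral (n `shiftR` 6), cont n]
      | n < 0x10000 = [0xE0 .|. fromIntegral (n `shiftR` 12),
                       cont (n `shiftR` 6), cont n]
      | otherwise   = [0xF0 .|. fromIntegral (n `shiftR` 18),
                       cont (n `shiftR` 12), cont (n `shiftR` 6), cont n]
      where
        n = ord c
        cont x = 0x80 .|. fromIntegral (x .&. 0x3F)

-- Decode, trusting the input to be well-formed (no validation).
decodeUTF8 :: [Word8] -> String
decodeUTF8 [] = []
decodeUTF8 (b:bs)
  | b < 0x80  = chr (fromIntegral b) : decodeUTF8 bs
  | b < 0xE0  = multi 1 (b .&. 0x1F) bs   -- 2-byte sequence
  | b < 0xF0  = multi 2 (b .&. 0x0F) bs   -- 3-byte sequence
  | otherwise = multi 3 (b .&. 0x07) bs   -- 4-byte sequence
  where
    multi n acc rest =
      let (conts, rest') = splitAt n rest
          val = foldl (\a c -> (a `shiftL` 6) .|. fromIntegral (c .&. 0x3F))
                      (fromIntegral acc :: Int) conts
      in chr val : decodeUTF8 rest'

main :: IO ()
main = putStrLn (decodeUTF8 (encodeUTF8 "round-trip"))  -- prints "round-trip"
```

The efficient CString marshalling is essentially the same bit-twiddling,
just driven by peekElemOff over a Ptr Word8 instead of pattern-matching
on a list.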

We stuck it in the Foreign.C namespace, rather than Codec, because we're
doing more FFI related stuff. I'm not sure what the best location is;
perhaps there should be a split, with FFI functions (withUTF8String,
peekUTF8String) in Foreign.C, and pure functions in Codec.

(Also, is there a wiki page somewhere which gives advice as to how to
locate/name library modules, and what the currently occupied namespace
is, including user libs like those on Hackage? It's sometimes a bit
tricky to try to figure out where to put a new module.)

Obviously a proliferation of UTF8 modules isn't great for code re-use.
Is there a plan to consolidate and expose UTF8 and UTF16 de- and
encoders in the libraries? I note that the various UTF8 modules have
fairly similar implementations, and differ mainly w.r.t. how much of the
UTF8 codepoint space they handle (for example, HXML's decodes up to 6
bytes, which isn't strictly standards-compliant). Also, some choice as
to how to handle errors in the byte stream would be nice, i.e. the user
could choose between functions which raise errors and functions which
insert substitution chars.
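
Something along these lines, say, where the caller picks the policy up
front (OnError and the function names below are purely hypothetical):

```haskell
import Data.Char (chr)
import Data.Word (Word16)

-- Hypothetical error-handling policy for ill-formed input.
data OnError = RaiseError | Substitute   -- Substitute inserts U+FFFD

-- Decode a single UTF-16 code unit, treating any lone surrogate
-- (0xD800-0xDFFF) as an error handled per the chosen policy.
decodeUnit :: OnError -> Word16 -> Char
decodeUnit onErr u
  | u >= 0xD800 && u <= 0xDFFF = case onErr of
      RaiseError -> error ("lone surrogate code unit: " ++ show u)
      Substitute -> '\xFFFD'
  | otherwise = chr (fromIntegral u)

main :: IO ()
main = putStrLn (map (decodeUnit Substitute) [0x48, 0xD800, 0x69])
```

A Maybe- or Either-returning variant would be friendlier still, letting
the caller decide without partial functions.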

