UTF-8 library

anatoli anatoli@yahoo.com
Sat, 10 Aug 2002 03:03:01 -0700 (PDT)


--- Sven Moritz Hallberg <pesco@gmx.de> wrote:
> I argue _strongly_ against associating some sort of locale state with
> handles.
> 
> 1) In agreement with Ashley's statements, file IO should use octets,
> because that's what's in a file.

By the same token, we should handle the CR/LF/CR-LF/LF-CR mess by hand.
(Files don't have lines in them; they are just sequences of octets.)

I prefer a somewhat higher-level view of files.

> 2) If you need to decode those octets to characters, or vice-versa,
> compose a (de)serialization function before it.

I *always* need that (except for binary IO). We might as well have this
functionality built into a handle.

> 3) A "best shot" character reading(or writing, for that matter)
> function, will be convenient. This should probably use your current
> locale, because when writing a character, you'll probably want to be
> able to write your own language's characters correctly.

I routinely read and write messages in three different languages that
use three different encodings. All of them are my "own" languages.

> 4) For decoding, we'll need some parsing functionality, as someone
> already mentioned. With that we can have functions like parseUTF8.
> "Associating a locale with a stream", as you put it, is a matter of, if
> f is the raw Word8 stream, g = parseUTF8 f, where g is the Char stream,
> parsed as UTF-8-encoded characters from f.

A "Word8 stream" can be either a Handle (a Word8Handle?) or a [Word8].
We can transform a [Word8] into a [Char], but not a Word8Handle into a
CharHandle. I argue that the latter is needed as well.
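
To make the distinction concrete, here is a rough sketch, not a proposal
for an actual API. The list version takes the name parseUTF8 from the
quoted message; CharHandle, openCharHandle and hGetCharContents are
hypothetical names invented for this illustration. It assumes that a
Handle in binary mode yields Chars in the range 0..255, one per octet,
and it replaces malformed input with U+FFFD.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)
import System.IO (Handle, hGetContents, hSetBinaryMode)

-- The easy half: a pure decoder from octets to characters.
-- Malformed, overlong, or truncated sequences become U+FFFD.
parseUTF8 :: [Word8] -> [Char]
parseUTF8 [] = []
parseUTF8 (b : bs)
  | b < 0x80 = chr (fromIntegral b) : parseUTF8 bs
  | b >= 0xC2 && b < 0xE0 = multi 1 (fromIntegral b .&. 0x1F) bs
  | b >= 0xE0 && b < 0xF0 = multi 2 (fromIntegral b .&. 0x0F) bs
  | b >= 0xF0 && b < 0xF5 = multi 3 (fromIntegral b .&. 0x07) bs
  | otherwise = '\xFFFD' : parseUTF8 bs

-- Consume n continuation octets and fold them into the code point.
multi :: Int -> Int -> [Word8] -> [Char]
multi n acc bs
  | length conts == n && all isCont conts && valid = chr cp : parseUTF8 rest
  | otherwise = '\xFFFD' : parseUTF8 bs
  where
    (conts, rest) = splitAt n bs
    isCont c = c .&. 0xC0 == 0x80
    cp = foldl (\a c -> (a `shiftL` 6) .|. fromIntegral (c .&. 0x3F)) acc conts
    valid = cp >= minCp && cp <= 0x10FFFF && not (cp >= 0xD800 && cp <= 0xDFFF)
    minCp
      | n == 1 = 0x80
      | n == 2 = 0x800
      | otherwise = 0x10000

-- The other half (hypothetical): a handle that carries its decoder, so
-- callers never see the octets.
data CharHandle = CharHandle Handle ([Word8] -> [Char])

-- Wrap a raw handle; binary mode switches off any built-in translation.
openCharHandle :: ([Word8] -> [Char]) -> Handle -> IO CharHandle
openCharHandle decoder h = do
  hSetBinaryMode h True
  return (CharHandle h decoder)

-- Read the rest of the stream as decoded characters.
hGetCharContents :: CharHandle -> IO String
hGetCharContents (CharHandle h decoder) = do
  raw <- hGetContents h
  return (decoder (map (fromIntegral . ord) raw))
```

With something like this, `openCharHandle parseUTF8 h` gives a handle
whose reads already return Chars, and a different decoder can be
attached to each handle, which is what I need for my three encodings.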

-- 
a.
