UTF-8 library

anatoli anatoli@yahoo.com
Sat, 10 Aug 2002 09:08:10 -0700 (PDT)


[apologies if you see multiple copies; I forgot to Cc: the list
the first time around.]

--- Sven Moritz Hallberg <pesco@gmx.de> wrote:

> [...] I think that it's
> ugly, though, to do it somewhere outside, pretending the issue's not
> there. I value about Haskell its clean representation of reality.
> Attaching all kinds of state to handles just isn't as clear as "Look
> here, a file: It's a sequence of octets.", "Watch out though, each file
> can use an entirely different encoding.", "The Char versions of the IO
> functions will try to deal with encoding for you.", and "If you know you
> need some special treatment, we have these functions blahblahblah..."

As I view it, a Handle is always a stream of Char data. Why? Simply because Haskell
treats Handles as streams of Char data *today*. There's no good reason to change
that, unless you want to wreak havoc in existing programs.

To make things i18n-friendly, the simplest (IMHO) approach is to declare that
under each Handle (i.e. Char stream) there is a BinaryHandle (i.e. Word8 stream)
*plus* an associated encoding (and perhaps also a CR/LF handler while we're at it).
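
A minimal sketch of that layering, using pure lists in place of real handles.
Every name in it (Encoding, Newline, BinaryStream, CharStream, latin1,
readChars) is hypothetical and only illustrates the idea; none is an existing
library API.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- All names below are hypothetical; none is an existing library API.

-- An encoding is a pair of conversions between octets and characters.
data Encoding = Encoding
  { encName :: String
  , decode  :: [Word8] -> String
  , encode  :: String -> [Word8]
  }

-- How line ends appear in the octet stream.
data Newline = LF | CRLF
  deriving (Eq, Show)

-- Stand-in for a BinaryHandle: just the octets it would deliver.
newtype BinaryStream = BinaryStream [Word8]

-- Stand-in for a Handle: octets plus the state needed to read them as Chars.
data CharStream = CharStream
  { octets   :: BinaryStream
  , encoding :: Encoding
  , newline  :: Newline
  }

-- ISO-8859-1: each octet is the code point of one character.
-- Encoding a Char above '\255' wraps around here; a real version
-- would have to report an error instead.
latin1 :: Encoding
latin1 = Encoding
  { encName = "ISO-8859-1"
  , decode  = map (chr . fromIntegral)
  , encode  = map (fromIntegral . ord)
  }

-- Read the whole stream as Chars: decode first, then apply the newline rule.
readChars :: CharStream -> String
readChars (CharStream (BinaryStream ws) enc nl) = fixNewlines nl (decode enc ws)

-- Turn CR LF pairs into a single '\n' when the stream uses CRLF line ends.
fixNewlines :: Newline -> String -> String
fixNewlines LF   = id
fixNewlines CRLF = go
  where
    go ('\r' : '\n' : cs) = '\n' : go cs
    go (c : cs)           = c : go cs
    go []                 = []
```

The point of the record is that code consuming a CharStream only ever sees
Chars; the octets and the choice of encoding stay underneath.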

I certainly don't want the same Handle type to be able to represent a sequence of 
octets and a sequence of Char at the same time.

> > I routinely read and write messages in three different languages that
> > use three different encodings. All of them are my "own" languages.
> 
> Where is the problem? The system is not going to be able to decide which
> one to use either way, so you must make the encoding explicit. Now we
> just have to come up with a convenient way to do it. Transforming
> between [Word8] and [Char] seems plausible to me.

I want to be able to specify the encoding explicitly *and* be able to use existing
Char IO, because that's what my programs use *today* and I don't want to rework
them. Rewriting all my IO because it's now Word8-based instead of Char-based is
NOT convenient.
 
> > A "Word8 stream" can be either Handle (Word8Handle?) or [Word8]. We can transform
> > [Word8] to [Char], but not Word8Handle to CharHandle. I argue that the latter
> > is needed as well.
> 
> The only reason for that would be efficiency. Simon said something about
> that. I admit that I have no clue about it.

What about backward compatibility? With my approach, in order to make a Haskell
program i18n-aware, you only need to change a few calls to openFile and make them
openFileWithEncoding. Calls left unchanged will just use the default encoding.
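
A rough sketch of what such a call could look like. It assumes the
hSetEncoding function and TextEncoding type found in later versions of GHC's
System.IO. openFileWithEncoding is only the name proposed above, not a library
function, and its argument order and the copyAsUtf8 helper are likewise made up
for illustration.

```haskell
import System.IO

-- openFileWithEncoding is the name proposed in this message; it is not
-- exported by System.IO. openFile, hSetEncoding and TextEncoding are.
openFileWithEncoding :: TextEncoding -> FilePath -> IOMode -> IO Handle
openFileWithEncoding enc path mode = do
  h <- openFile path mode
  hSetEncoding h enc   -- attach the encoding to the ordinary Char handle
  return h

-- Hypothetical helper: the Char-based calls (hGetContents, hPutStr) are the
-- same ones existing programs already use; only the open calls changed.
copyAsUtf8 :: FilePath -> FilePath -> IO ()
copyAsUtf8 src dst = do
  hin  <- openFileWithEncoding latin1 src ReadMode
  hout <- openFileWithEncoding utf8 dst WriteMode
  hGetContents hin >>= hPutStr hout
  hClose hout
  hClose hin
```

A program that keeps calling plain openFile gets whatever encoding the
implementation treats as the default.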

-- 
a.
