Why are strings linked lists?

Glynn Clements glynn.clements at virgin.net
Sat Nov 29 11:44:47 EST 2003


Wolfgang Jeltsch wrote:

> > Right now, values of type Char are, in reality, ISO Latin-1 codepoints
> > padded out to 4 bytes per char.
> 
> No, because this would mean that you wouldn't have chars with codes greater 
> than 255 which is not the case with GHC.

However, the behaviour of codes greater than 255 is undefined. Well,
effectively undefined; I can't imagine anyone wanting to explicitly
define the current behaviour, particularly the fact that:

	putChar c
and:
	putChar (chr (ord c + n * 256))

are equivalent for all integral n.
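
As a rough illustration of that equivalence, here is a minimal model of the truncation, assuming that output simply keeps the low 8 bits of each code point. The names lowOctet and shiftBy are hypothetical and exist only for this sketch; this is not how any particular implementation is actually written.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Assumed model: the octet that reaches the output is the code point
-- reduced modulo 256 (fromIntegral from Int to Word8 wraps around).
lowOctet :: Char -> Word8
lowOctet = fromIntegral . ord

-- Move a character n blocks of 256 code points up or down.
-- The caller must keep the result inside the valid Char range.
shiftBy :: Int -> Char -> Char
shiftBy n c = chr (ord c + n * 256)
```

Under this model, lowOctet (shiftBy n c) == lowOctet c whenever shiftBy n c is a valid Char, which is exactly the aliasing described above.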

> But, of course, I agree with you that currently the main part of Unicode 
> support is missing.

I think that it goes much deeper than that.

Fixing the Char functions (to{Upper,Lower}, is*) is the easy part.

The hard part is dealing with the legacy of the I/O "fiction", i.e. 
the notion that the gap (or, rather, gulf) between characters and
octets can just be waved away, or at least made simple enough that it
can be effectively hidden.

For practical purposes, you need binary I/O, and you need I/O of text
in arbitrary encodings. The correct encoding may be different for
different parts of a program, and for different parts of data obtained
from a single source. The correct encoding may not be known at the
point that I/O occurs (at least, not for input), so you need to be
able to read octets then translate them to Chars once you actually
know the encoding. You also need to be able to handle data where the
encoding is unknown, or which isn't correctly encoded.
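
To make the "read octets first, decode later" idea concrete, here is a small sketch using only base. The Encoding type and the decode function are hypothetical names invented for this example, not an existing library API, and only two encodings are covered. Data that is not correctly encoded is returned as Left with the octets from the first bad position onwards, so the caller can decide what to do with it rather than having it silently mangled.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical type: the encodings this sketch knows about.
data Encoding = Latin1 | UTF8

-- Decode octets once the encoding is known. On malformed input,
-- Left holds the remaining octets starting at the first bad one.
decode :: Encoding -> [Word8] -> Either [Word8] String
decode Latin1 = Right . map (chr . fromIntegral)
decode UTF8   = go
  where
    go [] = Right []
    go (b:bs)
      | b < 0x80               = (chr (fromIntegral b) :) <$> go bs
      | b >= 0xC2 && b <= 0xDF = multi 1 (fromIntegral b .&. 0x1F) 0x80 bs (b:bs)
      | b >= 0xE0 && b <= 0xEF = multi 2 (fromIntegral b .&. 0x0F) 0x800 bs (b:bs)
      | b >= 0xF0 && b <= 0xF4 = multi 3 (fromIntegral b .&. 0x07) 0x10000 bs (b:bs)
      | otherwise              = Left (b:bs)

    -- Consume n continuation octets, rejecting truncated sequences,
    -- overlong forms, surrogates and values above U+10FFFF.
    multi :: Int -> Int -> Int -> [Word8] -> [Word8] -> Either [Word8] String
    multi n acc minCp rest whole =
      case splitAt n rest of
        (conts, rest')
          | length conts == n && all isCont conts ->
              let cp = foldl (\a c -> (a `shiftL` 6) .|. (fromIntegral c .&. 0x3F)) acc conts
              in if cp >= minCp && cp <= 0x10FFFF && not (cp >= 0xD800 && cp <= 0xDFFF)
                   then (chr cp :) <$> go rest'
                   else Left whole
          | otherwise -> Left whole

    isCont :: Word8 -> Bool
    isCont c = c .&. 0xC0 == 0x80
```

The same two octets [0xC3, 0xA9] decode to one character under UTF8 and to two characters under Latin1, while the single octet [0xE9] is fine as Latin1 but comes back as Left under UTF8. Nothing in the octets themselves says which reading is right, which is why the choice has to remain visible to the program.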

This isn't something which can be hidden; at least, not without
reducing Haskell to a toy language (e.g. one which only handles UTF-8,
or only handles the encoding specified by the locale).

-- 
Glynn Clements <glynn.clements at virgin.net>

