FPS/Data.ByteString candidate

Tue Apr 25 12:50:05 EDT 2006

On Tue, 2006-04-25 at 22:34 +1000, Donald Bruce Stewart wrote:
> ross:
> > On Tue, Apr 25, 2006 at 12:08:45PM +0300, Einar Karttunen wrote:
> > > The name Latin1 is particularly bad since there are many other
> > > single byte encodings around.
> > 
> > The name is quite appropriate, since that is the particular encoding of
> > Char that is exposed by the interface.  What's bad is that there's no
> > choice.  Calling it Latin1 is just being honest about that, and leaving
> > room for modules with other encodings or an interface parameterized
> > by encoding.
> 
> Ok. Duncan, Ketil, Ross and Simon make good points here.
> I'll move Data.ByteString.Char -> Data.ByteString.Latin1

Ok one final point from a discussion between me and Einar Karttunen...

(I'm mindful of Simon's comment about sheds... :-) )

There are two common uses of an 8-bit string library, each with
different assumptions and guarantees. (As it happens, they have the same
implementation.)

In one use case, we want to guarantee that the Chars we get out of our
string really are Haskell Chars. That is, they are valid Unicode code
points which we can pass to functions like isUpper and get valid
answers. As an example, consider Char 'Â' (chr 0xC2, Latin capital A
with circumflex). This is not ASCII but it is clearly upper case. If we
don't know that we're working with an 8-bit subset of Unicode then we
can't use Unicode properties like isUpper etc.
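
To make that concrete, here is a minimal sketch of the Latin1 reading of
a byte. The names latin1ToChar and example are made up for illustration;
only chr, isAscii and isUpper come from Data.Char.

```haskell
import Data.Char (chr, isAscii, isUpper)
import Data.Word (Word8)

-- Hypothetical helper: under Latin1 the byte value *is* the Unicode
-- code point, so decoding is just a widening conversion.
latin1ToChar :: Word8 -> Char
latin1ToChar = chr . fromIntegral

-- Byte 0xC2 decodes to 'Â': not ASCII, but upper case per Unicode.
example :: (Bool, Bool)
example = (isAscii c, isUpper c)
  where
    c = latin1ToChar 0xC2
```

This only gives the right answer because we assumed the encoding; with
an unknown encoding the isUpper result would be meaningless.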

The other common use case is where we have some character string
encoding which contains ASCII as a subset. That is, we don't know the
encoding exactly (it may be Latin1, LatinN, UTF8, etc) but we do know
that ASCII chars 0-127 are represented by those same numbers in our byte
stream. One example where this is useful is parsing network protocols.
Several of these use 8-bit extensions of ASCII but the protocol only
gives semantics to chars in the ASCII subset. For this case it would be
very inconvenient to have to use an API based just on Word8, but on the
other hand we can't give a proper guarantee of being able to turn bytes
into Haskell Chars (only for bytes < 128).
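
As a rough illustration of the protocol case, here is a sketch that
splits a "Name: value" line using only ASCII delimiters. It assumes a
Char8-style module like the Data.ByteString.Char8 shipped in the
bytestring package; parseHeader is a made-up name, not part of any
library.

```haskell
import qualified Data.ByteString.Char8 as C

-- Hypothetical helper: split a header line at the first ':' and drop
-- the spaces after it. Only the ASCII bytes ':' and ' ' are given any
-- meaning; every other byte is passed through untouched.
parseHeader :: C.ByteString -> Maybe (C.ByteString, C.ByteString)
parseHeader line =
  case C.break (== ':') line of
    (name, rest)
      | C.null rest -> Nothing
      | otherwise   -> Just (name, C.dropWhile (== ' ') (C.drop 1 rest))
```

The value may contain bytes above 127 in any ASCII-compatible encoding
and the parse is unaffected, which is the property this use case needs.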

So what do we do about this?

Einar was thinking about an API that might look like this:
Data.ByteString.{Char8, Latin1, Latin2, ..., UTF8, ...}

Char8 should provide:
* little overhead
* The right translation for ASCII characters
* c2w . w2c = id
* toUpper and toLower on ASCII only
* Ord with raw byte values
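
A small sketch of two of these properties, assuming the c2w and w2c
conversions exported from Data.ByteString.Internal in the bytestring
package. The names roundTrips and asciiToUpper are invented here for
illustration.

```haskell
import Data.ByteString.Internal (c2w, w2c)
import Data.Char (isAscii, toUpper)
import Data.Word (Word8)

-- c2w . w2c = id, checked exhaustively over all 256 byte values.
roundTrips :: Bool
roundTrips = all (\w -> c2w (w2c w) == w) [minBound .. maxBound :: Word8]

-- Hypothetical ASCII-only toUpper: anything outside ASCII is left
-- alone, since we do not know what those bytes mean.
asciiToUpper :: Char -> Char
asciiToUpper c
  | isAscii c = toUpper c
  | otherwise = c
```

Note that the reverse composition, w2c . c2w, is not the identity for
Chars above 255, since c2w has to truncate them to a single byte.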

Latin1 should guarantee:
* Correct translation for Latin1, C0 and C1 characters
* Really just a subset of Unicode for character handling
* Predicates like isUpper and isLower
* toUpper and toLower per Unicode definition
  (there is no common Latin1 definition afaik)
* Ord per UCA (Unicode Collation Algorithm)
* Or use locale for toUpper/toLower and Ord.
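
One wrinkle with "toUpper per Unicode definition" is that the result
need not fit in a byte: the Unicode upper case of 'ÿ' (0xFF) is U+0178,
which is outside Latin1. The sketch below shows one possible policy,
leaving such characters unchanged. latin1ToUpper is a hypothetical name
and this policy is my assumption, not something the list above settles.

```haskell
import Data.Char (ord, toUpper)

-- Hypothetical helper: apply Unicode toUpper, but keep the original
-- character when the upper-case form no longer fits in one Latin1 byte.
latin1ToUpper :: Char -> Char
latin1ToUpper c
  | ord u <= 0xFF = u
  | otherwise     = c
  where
    u = toUpper c
```

With this policy 'é' (0xE9) maps to 'É' (0xC9) while 'ÿ' stays as it is.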

So basically the .Char8 module is for the ASCII extension case and
the .Latin1 is for the 8-bit Unicode subset case.

I think in fact that darcs would want the .Char8 version, but I expect
that many other users will want a library that can guarantee conversions
to ordinary Haskell Chars (which involves an assumption about the
character encoding).

Duncan