Data.ByteString candidate 3

Einar Karttunen ekarttun at cs.helsinki.fi
Tue Apr 25 19:16:38 EDT 2006


On 25.04 13:46, John Meacham wrote:
> I think all we really need are
> 
> Data.ByteString
> Data.PackedString
> 
> (Though, I suppose Latin1 could be useful)

Using the Word8 API is not very pleasant, because character
constants and the like are Char, not Word8.

As for Latin1 - what semantics do we use for toUpper/toLower and Ord?
Using the Unicode ones or the locale's seems the sensible thing if the data
really is Latin1.

Thus a simple wrapper around the Word8 API is desirable. Make it follow
a few simple rules:
* c2w . w2c = id  (every byte round-trips; the conversion is a bijection
  between Word8 and the Chars '\0'..'\255')
* ASCII characters translated correctly
* toLower/toUpper for ASCII only
* Ord by byte values.
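
A minimal sketch of what such a wrapper could look like, assuming the
conversions are plain code point casts. The names toLowerAscii and
toUpperAscii are made up for illustration; Ord by byte values needs no
code, since it is what the existing Ord instance for Word8 already gives.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Word8 -> Char: every byte maps to the code point with the same number.
w2c :: Word8 -> Char
w2c = chr . fromIntegral

-- Char -> Word8: code points above 255 are truncated modulo 256, so only
-- c2w . w2c = id is guaranteed, not w2c . c2w = id.
c2w :: Char -> Word8
c2w = fromIntegral . ord

-- ASCII-only lower-casing (hypothetical name): bytes outside 'A'..'Z'
-- pass through unchanged, including Latin1 letters such as 0xC4.
toLowerAscii :: Word8 -> Word8
toLowerAscii w
  | w >= 0x41 && w <= 0x5A = w + 0x20
  | otherwise              = w

-- ASCII-only upper-casing (hypothetical name): bytes outside 'a'..'z'
-- pass through unchanged.
toUpperAscii :: Word8 -> Word8
toUpperAscii w
  | w >= 0x61 && w <= 0x7A = w - 0x20
  | otherwise              = w
```

With these, a byte can be tested as w == c2w ':' instead of w == 58.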

This is very useful for many purposes and does not mean that there
should not be a fancy UTF8 module. Rather than arguing about killing
this, wouldn't it be more productive to create the UTF8 module?

> but note, do the people that want latin1 just need ASCII? because it should be
> noted that if we have a UTF8 PackedString, then we can make
> ASCII-specific access routines that are just as fast as the ones in the
> Latin1 variety without giving up the ability to store full unicode
> values in the string.

Case conversions and ordering need to be different. Thus we need to newtype
things to avoid having two conflicting Ord instances. The UTF8 layer
should provide:

* Unicode toUpper/toLower
* Unicode collation (UCA) for Ord
* Graphemes (see Perl6 for good ways to do this)
* Normalisation
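
To illustrate the newtype point, here is a rough sketch with two wrappers
over the same ByteString, each with its own Ord instance. The type names
Char8String and Utf8String and the function collationKey are hypothetical,
and the key is only a stand-in: it folds ASCII case so that the two
orderings visibly differ. A real UTF8 layer would use UCA instead.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- Hypothetical byte-oriented wrapper: Ord is plain byte-value order,
-- inherited from ByteString.
newtype Char8String = Char8String B.ByteString
  deriving (Eq, Ord, Show)

-- Hypothetical Unicode-aware wrapper: same representation, separate Ord.
newtype Utf8String = Utf8String B.ByteString
  deriving Show

-- Stand-in for a real collation key (NOT the UCA): folds ASCII
-- upper case to lower case and leaves all other bytes alone.
collationKey :: B.ByteString -> B.ByteString
collationKey = B.map fold
  where
    fold :: Word8 -> Word8
    fold w
      | w >= 0x41 && w <= 0x5A = w + 0x20
      | otherwise              = w

-- Eq is defined through the same key so that it agrees with Ord.
instance Eq Utf8String where
  Utf8String a == Utf8String b = collationKey a == collationKey b

instance Ord Utf8String where
  compare (Utf8String a) (Utf8String b) =
    compare (collationKey a) (collationKey b)
```

Under these assumptions "B" sorts before "a" as a Char8String but after it
as a Utf8String, and the type checker keeps the two orderings apart.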

- Einar Karttunen

