Data.ByteString candidate 3

John Meacham john at repetae.net
Tue Apr 25 20:26:40 EDT 2006


On Wed, Apr 26, 2006 at 02:16:38AM +0300, Einar Karttunen wrote:
> Using the Word8 API is not very pleasant, because all
> character constants etc are not Word8.

yeah, but the version restricted to latin1 seems like a rather special
case. I can't imagine it will be used in general (and certainly hope it
won't be), except internally by people who are already doing low level
stuff. In this day and age, I expect unicode to work pretty much everywhere.

> This is very useful for many purposes and does not mean that there
> should not be a fancy UTF8 module. Rather than arguing about killing
> this, wouldn't it be more productive to create the UTF8 module?

I am not saying we should kill the latin1 version, since there is
interest in it, just that it doesn't fill the need for a general fast
string replacement.

> > but note, do the people that want latin1 just need ASCII? because it should be
> > noted that if we have a UTF8 PackedString, then we can make
> > ASCII-specific access routines that are just as fast as the ones in the
> > Latin1 variety without giving up the ability to store full unicode
> > values in the string.
> 
> Case conversions and ordering need to be different. Thus we need to newtype
> things to avoid having two conflicting Ord instances. The UTF8 layer
> should provide:

I don't see why. ascii is a subset of utf8, so the routines building a
packedstring from an ascii string or a utf8 string can be identical. if
you know your string is ascii to begin with, you can use an optimized
routine, but the end result is the same as if you used the general utf8
version.
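
a minimal sketch of that point, assuming a plain list of Word8 stands in
for the real packed buffer. The names encodeChar, packUTF8 and packASCII
are hypothetical and not part of any proposed API. The ascii routine
skips the per-character branching but produces the same bytes as the
general one:

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- General path: encode one Char as UTF-8 bytes.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. top 6, cont 0]
  | n < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading bits of the code point, placed in the first byte.
    top :: Int -> Word8
    top s = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six bits of the code point.
    cont :: Int -> Word8
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)

-- Hypothetical general constructor (a list stands in for the buffer).
packUTF8 :: String -> [Word8]
packUTF8 = concatMap encodeChar

-- Hypothetical fast path: only valid when every Char is below 0x80.
packASCII :: String -> [Word8]
packASCII = map (fromIntegral . ord)
```

for ascii-only input, packASCII and packUTF8 agree byte for byte, so a
caller who picks the fast path gets the same string either way.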

> * Unicode toUpper/toLower
> * Unicode collation (UCA) for Ord
> * Graphemes (see Perl6 for good ways to do this)
> * Normalisation

well, none of these are UTF8 specific. we should not worry about the
encoding and just think of what 'PackedString' should do. the encoding
is unimportant to the API and semantics; the fact that you just happen
to be able to quickly convert to/from ascii and utf8 should be the only
visible difference in behavior.

the proper thing for PackedString is to make it behave exactly as the
String instances behave, since it is supposed to be a drop-in
replacement. That means the natural ordering based on the Char order,
and the toLower and toUpper from the libraries.

unicode collation, graphemes, normalization, and localized sorting can be
provided as separate routines as another project (it would be nice to
have them work on both Strings and PackedStrings, so perhaps they could
be in a class?)
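
a rough sketch of what such a class might look like. Everything here is
an assumption for illustration: the class StringLike, its methods, and
the String-backed PackedString stand-in are all hypothetical, and the
case-folding sort is only a placeholder for real collation:

```haskell
{-# LANGUAGE FlexibleInstances #-}
import Data.Char (toLower)
import Data.List (sortBy)
import Data.Ord (comparing)

-- Hypothetical class: anything that can be viewed as a list of Char.
class StringLike s where
  toChars   :: s -> String
  fromChars :: String -> s

instance StringLike [Char] where
  toChars   = id
  fromChars = id

-- Stand-in for the real packed type; a String keeps the sketch short.
newtype PackedString = PackedString String
  deriving (Eq, Show)

instance StringLike PackedString where
  toChars (PackedString s) = s
  fromChars = PackedString

-- One routine, written once, usable on both representations.
-- Case folding via toLower is a placeholder, not Unicode collation.
caseInsensitiveSort :: StringLike s => [s] -> [s]
caseInsensitiveSort = sortBy (comparing (map toLower . toChars))
```

the same caseInsensitiveSort then applies to a list of Strings or a list
of PackedStrings without being written twice.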

certainly a 
newtype LocalizedPackedString = LocalizedPackedString PackedString
with different instances would be a useful thing too.
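
as a sketch of how the two could sit side by side, assuming a
String-backed stand-in for PackedString and using toLower as a
placeholder for real locale-aware comparison (the helper names unpack
and sortKey are hypothetical):

```haskell
import Data.Char (toLower)
import Data.Ord (comparing)

-- Stand-in for the real packed type. The derived Ord compares by Char
-- order, exactly as String does.
newtype PackedString = PackedString String
  deriving (Eq, Ord, Show)

unpack :: PackedString -> String
unpack (PackedString s) = s

-- Same data, different instances.
newtype LocalizedPackedString = LocalizedPackedString PackedString
  deriving Show

-- Placeholder comparison key; real code would use proper collation.
sortKey :: LocalizedPackedString -> String
sortKey (LocalizedPackedString p) = map toLower (unpack p)

-- Eq is defined on the same key so that it stays consistent with Ord.
instance Eq LocalizedPackedString where
  a == b = sortKey a == sortKey b

instance Ord LocalizedPackedString where
  compare = comparing sortKey
```

sorting "apple", "Banana", "cherry" as PackedString puts "Banana" first
(uppercase sorts before lowercase in Char order), while the wrapped
version puts "apple" first.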

but this should be a separate but related project from just getting a
fast string replacement. (as in, it shouldn't hold up PackedString
development)

        John

-- 
John Meacham - ⑆repetae.net⑆john⑈

