Data.ByteString candidate 3

Tue Apr 25 22:17:09 EDT 2006

On Wed, Apr 26, 2006 at 04:48:52AM +0300, Einar Karttunen wrote:
> I would like:
> * Data.ByteString.Word8
> * Data.ByteString.Char8
> * Data.ByteString.UTF
> 
> And select your favorite and make Data.ByteString export that one.
> I think that could be the Word8 or the UTF one.

ByteString should be the pure Word8 version. The others can be based on
it. ByteString is quite a useful data type independent of anything to do
with strings.

I'd like to see Data.PackedString be what you are calling
Data.ByteString.UTF, and PackedString _specifically_ be a drop-in
replacement for String with an abstract internal representation. It
should behave the same as String except when it comes to time and space.
I want to be able to just change a few types and routines from String to
PackedString in a library and be guaranteed I am not affecting the
meaning of a program (or vice versa).

Though, I do much, much prefer the 'Char8' term to 'Latin1'. I think it
better represents what it does, just 'Chars truncated to 8 bits', while
'Latin1' might have other unintended connotations. The fact that the
standard routines will interpret them as Latin-1 can be inferred from the
fact that the standard routines interpret Chars as unicode code points.

In particular, if you do something wacky where you don't store unicode
values in a 'Char', it doesn't magically become 'Latin1' just because you
store it in a latin1 string; it just becomes whatever you put in,
truncated to 8 bits, and hopefully you know what you are doing.
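
A minimal sketch of the 'Chars truncated to 8 bits' reading, using only
base. The names toChar8, fromChar8 and roundTrip are hypothetical helpers
made up for illustration; they are not taken from any proposed
Data.ByteString.Char8 module.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical helper: keep only the low 8 bits of a Char's code point.
-- fromIntegral from Int to Word8 wraps modulo 256.
toChar8 :: Char -> Word8
toChar8 = fromIntegral . ord

-- Hypothetical helper: read a byte back as the Char with that code point.
-- For bytes >= 128 this lands in the Latin-1 range of unicode.
fromChar8 :: Word8 -> Char
fromChar8 = chr . fromIntegral

-- Round trip through the 8-bit representation.
roundTrip :: String -> String
roundTrip = map (fromChar8 . toChar8)
```

Code points below 256 survive the round trip unchanged, while anything
larger silently loses its high bits: U+263A comes back as U+003A (':').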

> > I don't see why. ascii is a subset of utf8, the routines building a
> > packedstring from an ascii string or a utf8 string can be identical, if
> > you know your string is ascii to begin with you can use an optimized
> > routine but the end result is the same as if you used the general utf8
> > version.
> 
> Actually toUpper works differently on ascii + something in the high bytes
> and ISO-8859-1. Same with all the isXXX predicates, fortunately not a problem
> for things like whitespace.

I am not sure what you mean. The data in a PackedString would always be
full unicode values encoded as utf8; there would just be efficient ways
to pull in data you know is ascii, since that can use a memcpy rather
than recoding from whatever format it is in. The fact that it happens to
contain only values < 128 won't make a difference for subsequent
handling of the string (except perhaps some routines will be faster).
When I say ASCII here, I just mean a utf8 string where all values happen
to be < 128, which is happily binary compatible with ASCII.
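
To illustrate why the ASCII fast path and the general utf8 path give the
same result, here is a rough sketch over plain lists of Word8. The names
encodeChar, packUtf8 and packAscii are hypothetical, and the list
representation is only a stand-in for a real packed buffer. The encoder
does not reject surrogate code points, which a production version would
need to handle.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode a single Char as utf8 (1 to 4 bytes).
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [byte n]
  | n < 0x800   = [0xC0 .|. byte (n `shiftR` 6), cont n]
  | n < 0x10000 = [0xE0 .|. byte (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  | otherwise   = [ 0xF0 .|. byte (n `shiftR` 18), cont (n `shiftR` 12)
                  , cont (n `shiftR` 6), cont n ]
  where
    n = ord c
    byte = fromIntegral :: Int -> Word8
    -- Continuation byte: 10xxxxxx carrying the low 6 bits of x.
    cont x = 0x80 .|. byte (x .&. 0x3F)

-- General path: works for any String.
packUtf8 :: String -> [Word8]
packUtf8 = concatMap encodeChar

-- Fast path: only valid when every Char is < 128. It copies code points
-- straight to bytes, standing in for the memcpy mentioned above.
packAscii :: String -> [Word8]
packAscii = map (fromIntegral . ord)
```

For input that really is ASCII, packAscii and packUtf8 produce identical
bytes, so later routines cannot tell which one built the string.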

> > the proper thing for PackedString is to make it behave exactly as the
> > String instances behave, since it is supposed to be a drop in
> > replacement. Which means the natural ordering based on the Char order
> > and the toLower and toUpper from the libraries.
> 
> toUpper and toLower are the correct version in the standard
> and they use the unicode tables. The natural ordering by
> codepoint without any normalization is not very useful for
> text handling, but works for e.g. putting strings in a Map.

Yeah, and it is fast. I always thought we should have two Ord classes,
one for human-digestible ordering and the other for fast,
implementation-dependent ordering for use only in things like Map and
Set. But that is a different issue.
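
A small sketch of what splitting the two orderings might look like. The
HumanOrd class and the Name type are hypothetical, and the
case-insensitive comparison is only a crude stand-in for real collation.
The derived Ord plays the role of the fast ordering that Map and Set
would use.

```haskell
import Data.Char (toLower)

-- Hypothetical class for an ordering meant to be shown to people.
class HumanOrd a where
  humanCompare :: a -> a -> Ordering

-- The derived Ord is the fast code point ordering.
newtype Name = Name String
  deriving (Eq, Ord, Show)

-- Stand-in for collation: compare ignoring case, and fall back to the
-- raw ordering so that distinct names never compare as equal.
instance HumanOrd Name where
  humanCompare (Name a) (Name b) =
    case compare (map toLower a) (map toLower b) of
      EQ -> compare a b
      r  -> r
```

With these definitions Name "Zebra" sorts before Name "apple" under Ord,
because 'Z' has a smaller code point than 'a', but after it under
humanCompare.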

In any case, the point I was trying to make is that PackedString should
behave exactly like String; whether the instances for String are doing
the right thing is a different matter.
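
One way to picture the 'behaves exactly like String' requirement is as a
property relating the two types. The sketch below assumes a utf8-backed
representation over a list of Word8; the PackedString type here and the
names pack, encodeChar and agreesWithString are illustrative only, not a
proposed API. It relies on the fact that utf8 byte order matches code
point order, so the derived instances on bytes agree with the String
ones.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Stand-in representation: utf8 bytes. Eq and Ord compare the bytes.
newtype PackedString = PS [Word8]
  deriving (Eq, Ord)

-- Encode a single Char as utf8 (surrogates are not rejected here).
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [byte n]
  | n < 0x800   = [0xC0 .|. byte (n `shiftR` 6), cont n]
  | n < 0x10000 = [0xE0 .|. byte (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  | otherwise   = [ 0xF0 .|. byte (n `shiftR` 18), cont (n `shiftR` 12)
                  , cont (n `shiftR` 6), cont n ]
  where
    n = ord c
    byte = fromIntegral :: Int -> Word8
    cont x = 0x80 .|. byte (x .&. 0x3F)

pack :: String -> PackedString
pack = PS . concatMap encodeChar

-- The drop-in property: comparing packed values gives the same answer
-- as comparing the original Strings.
agreesWithString :: String -> String -> Bool
agreesWithString a b = compare (pack a) (pack b) == compare a b
```

A property-testing tool could check agreesWithString over random inputs;
the same style of property could be stated for the other String
instances.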

> > unicode collation, graphemes, normalization, and localized sorting can be
> > provided as separate routines as another project (it would be nice to
> > have them work on both Strings and PackedStrings, so perhaps they could
> > be in a class?)
> 
> These are quite essential for really working with unicode characters.
> It didn't matter much before as Haskell didn't provide good ways
> to handle unicode chars with IO, but these are very important,
> otherwise it becomes hard to do many useful things with the parsed
> unicode characters.

Yeah, they would be useful things to have, but there is no need to tie
them specifically to PackedString (though they would most likely operate
on PackedStrings). ginsu and jhc both use unicode extensively without
these routines, so saying it is hard to do useful things is somewhat
strong. But they would definitely be very useful to have, and necessary
for certain applications.

> How are we supposed to process user input without normalization
> e.g. if we need to compare Strings for equivalence?

we implement normalization and provide it as a library :)

> But a simple UTF8 layer with more features added later is a good way.

I don't think these features should be in PackedString proper unless
they are added to String as well (as in, in the default instances).
However, a 'UnicodeString' that is a newtype of PackedString would be
easy enough, with just different instance declarations.

The library routines for performing these transformations can of course
be provided in PackedString, if that makes sense and they don't conflict
with any String operations of the same name.

But being able to do 'normalize a == normalize b' would be useful for
PackedStrings independent of UnicodeString.
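
A rough sketch of the newtype idea. Everything here is hypothetical: the
PackedString type is a stand-in wrapping a plain String, and toyNormalize
composes exactly one sequence ('e' followed by U+0301 becomes U+00E9)
purely to make the example runnable. It is not a real unicode
normalization routine.

```haskell
-- Stand-in for the real abstract type; its instances behave like String.
newtype PackedString = PackedString String
  deriving (Eq, Ord, Show)

-- Toy normalization: handles a single combining sequence only.
toyNormalize :: PackedString -> PackedString
toyNormalize (PackedString s) = PackedString (go s)
  where
    go ('e' : '\x0301' : rest) = '\xE9' : go rest
    go (c : rest)              = c : go rest
    go []                      = []

-- Same representation, different instance: equality up to normalization.
newtype UnicodeString = UnicodeString PackedString

instance Eq UnicodeString where
  UnicodeString a == UnicodeString b = toyNormalize a == toyNormalize b
```

With this, the precomposed and decomposed spellings of "café" differ as
PackedStrings, compare equal as UnicodeStrings, and can still be compared
explicitly on PackedString with toyNormalize a == toyNormalize b.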

        John

-- 
John Meacham - ⑆repetae.net⑆john⑈