[Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString
Twan van Laarhoven
twanvl at gmail.com
Mon Feb 5 07:14:26 EST 2007
Chris Kuklewicz wrote:
>
> Can I be among the first to ask that any Unicode variant of ByteString use a
> recognized encoding?
>
> <snip>
>
> In reading all the poke/peek function I did not see anything that your tag bits
> accomplish that the tag bits in utf-8 do not, except that you want to write only
> a single routine for the poke/peek forwards and backwards operations instead of
> two routines. It is definitely more compact in the worst case, and more "Once
> And Only Once", but at a very high cost of incompatibility.
The reason for inventing my own encoding is that it is easier to work
with and takes less space than UTF-8. The only advantage UTF-8 has is
that it can be read and written directly. I guess this is a trade-off:
faster manipulation and smaller storage versus simpler and faster I/O.
I have not benchmarked it either way, so this is just guesswork for now.
Fortunately, the entire library can easily be converted to use a
different encoding by changing just the peekChar/pokeChar functions.
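
A minimal sketch of that idea, not the library's actual code: the two
per-character primitives are bundled in a record, and everything else
is written against the record. The Encoding type, the list-based
signatures, and encodeWith/decodeWith are assumptions made for
illustration; the real peekChar/pokeChar work on raw buffers. UTF-8 is
used here as the swapped-in encoding.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical record: one value of this type per encoding.
data Encoding = Encoding
  { pokeChar :: Char -> [Word8]                   -- encode one character
  , peekChar :: [Word8] -> Maybe (Char, [Word8])  -- decode one, return the rest
  }

-- UTF-8 as an example instance. Simplification: overlong forms and
-- surrogate code points are not rejected when decoding.
utf8 :: Encoding
utf8 = Encoding { pokeChar = enc, peekChar = dec }
  where
    enc c
      | n < 0x80    = [fromIntegral n]
      | n < 0x800   = [0xC0 .|. hi 6, cont 0]
      | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
      | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
      where
        n      = ord c
        hi s   = fromIntegral (n `shiftR` s)
        cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

    dec [] = Nothing
    dec (b : bs)
      | b < 0x80              = Just (chr (fromIntegral b), bs)
      | b >= 0xC0 && b < 0xE0 = multi 1 (b .&. 0x1F)
      | b >= 0xE0 && b < 0xF0 = multi 2 (b .&. 0x0F)
      | b >= 0xF0 && b < 0xF8 = multi 3 (b .&. 0x07)
      | otherwise             = Nothing
      where
        -- Read k continuation bytes and fold their payload bits together.
        multi k lead =
          let (cs, rest) = splitAt k bs
              ok = length cs == k && all (\x -> x .&. 0xC0 == 0x80) cs
              n  = foldl (\acc x -> (acc `shiftL` 6) .|. fromIntegral (x .&. 0x3F))
                         (fromIntegral lead) cs
          in if ok && n <= 0x10FFFF then Just (chr n, rest) else Nothing

-- Generic operations that never mention a concrete encoding.
encodeWith :: Encoding -> String -> [Word8]
encodeWith e = concatMap (pokeChar e)

decodeWith :: Encoding -> [Word8] -> Maybe String
decodeWith _ [] = Just []
decodeWith e bs = do
  (c, rest) <- peekChar e bs
  (c :) <$> decodeWith e rest
```

Switching encodings then means passing a different Encoding value;
encodeWith and decodeWith stay unchanged.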
> One of the biggest wins with a Unicode ByteString will be the ability to
> transfer the buffer directly to and from the disk and network. Your code will
> always need the data to be rewritten both incoming and outgoing.
>
> The most ideal case would be the ability to load different encodings via import
> statements while using the same API.
I was hoping that there would be only a single string type, with
different encodings handled by functions:
> encode :: CompactString -> ByteString
> decode :: ByteString -> CompactString
This is important if it is not known beforehand how a file is encoded.
For example, on Windows, Unicode files are either UTF-8 or UTF-16,
identified by a byte order mark.
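
As a rough sketch of the byte order mark check, assuming the input is a
strict ByteString: the Detected type and detectBOM are hypothetical
names, not part of Data.CompactString. The function only classifies the
input and strips the mark; a decode function would then dispatch on the
result. UTF-32 marks are ignored here, so a UTF-32LE file would be
reported as UTF-16LE.

```haskell
import qualified Data.ByteString as B

-- Hypothetical result type for the encodings mentioned above.
data Detected = UTF8 | UTF16LE | UTF16BE | Unknown
  deriving (Eq, Show)

-- Inspect the leading bytes; return the guess and the input without the mark.
detectBOM :: B.ByteString -> (Detected, B.ByteString)
detectBOM bs
  | B.pack [0xEF, 0xBB, 0xBF] `B.isPrefixOf` bs = (UTF8,    B.drop 3 bs)
  | B.pack [0xFF, 0xFE]       `B.isPrefixOf` bs = (UTF16LE, B.drop 2 bs)
  | B.pack [0xFE, 0xFF]       `B.isPrefixOf` bs = (UTF16BE, B.drop 2 bs)
  | otherwise                                   = (Unknown, bs)
```

Input without a mark comes back as Unknown with the bytes untouched, so
the caller can fall back to a default encoding.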
Twan