[Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString

Twan van Laarhoven twanvl at gmail.com
Mon Feb 5 07:14:26 EST 2007


Chris Kuklewicz wrote:
> 
> Can I be among the first to ask that any Unicode variant of ByteString use a
> recognized encoding?
> 
> <snip>
> 
> In reading all the poke/peek functions I did not see anything that your tag bits
> accomplish that the tag bits in utf-8 do not, except that you want to write only
> a single routine for the poke/peek forwards and backwards operations instead of
> two routines.  It is definitely more compact in the worst case, and more "Once
> And Only Once", but at a very high cost of incompatibility.

I invented my own encoding because it is easier to work with and takes 
less space than UTF-8. The only advantage of UTF-8 is that it can be 
read and written directly, without conversion. I guess this is a 
trade-off: faster manipulation and smaller storage on one side, simpler 
and faster I/O on the other. I have not benchmarked it either way, so 
this is just guesswork for now.

Fortunately the entire library can be easily converted to use a 
different encoding by just changing the peekChar/pokeChar functions.
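
As a rough sketch of what "swapping the per-character functions" could 
look like: the record below is a hypothetical, pure stand-in for the 
peekChar/pokeChar pair, working on byte lists rather than pointers. The 
names CharCodec, encodeChar, decodeChar, utf8 and encodeWith are 
assumptions made for illustration and are not part of Data.CompactString. 
The decoder is deliberately minimal and does not reject overlong forms 
or surrogates.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical per-character codec: everything else in a string
-- library could be written against this pair of functions.
data CharCodec = CharCodec
  { encodeChar :: Char -> [Word8]
  , decodeChar :: [Word8] -> Maybe (Char, [Word8])
  }

-- One possible instance: plain UTF-8.
utf8 :: CharCodec
utf8 = CharCodec enc dec
  where
    enc :: Char -> [Word8]
    enc c
      | n < 0x80    = [fromIntegral n]
      | n < 0x800   = [0xC0 .|. hi 6,  cont 0]
      | n < 0x10000 = [0xE0 .|. hi 12, cont 6,  cont 0]
      | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
      where
        n = ord c
        -- Leading bits of the code point, placed in the first byte.
        hi :: Int -> Word8
        hi s = fromIntegral (n `shiftR` s)
        -- Continuation byte: 10xxxxxx carrying six payload bits.
        cont :: Int -> Word8
        cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

    dec :: [Word8] -> Maybe (Char, [Word8])
    dec [] = Nothing
    dec (b:bs)
      | b < 0x80           = Just (chr (fromIntegral b), bs)
      | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F)
      | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F)
      | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07)
      | otherwise          = Nothing
      where
        -- Read k continuation bytes and fold them into the code point.
        multi :: Int -> Word8 -> Maybe (Char, [Word8])
        multi k lead =
          let (cs, rest) = splitAt k bs
              v          = foldl step (fromIntegral lead) cs
          in if length cs == k && all isCont cs && v <= 0x10FFFF
               then Just (chr v, rest)
               else Nothing
        isCont :: Word8 -> Bool
        isCont x = x .&. 0xC0 == 0x80
        step :: Int -> Word8 -> Int
        step acc x = (acc `shiftL` 6) .|. fromIntegral (x .&. 0x3F)

-- Encode a whole string with whichever codec is supplied.
encodeWith :: CharCodec -> String -> [Word8]
encodeWith codec = concatMap (encodeChar codec)
```

Under this assumption, switching encodings means passing a different 
CharCodec value; the functions built on top of it stay the same.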

> One of the biggest wins with a Unicode ByteString will be the ability to
> transfer the buffer directly to and from the disk and network.  Your code will
> always need the data to be rewritten both incoming and outgoing.
> 
> The most ideal case would be the ability to load different encodings via import
> statements while using the same API.

I was hoping that there would be only a single string type, with 
different encodings handled by functions:
  > encode :: CompactString -> ByteString
  > decode :: ByteString -> CompactString

This is important if it is not known beforehand how a file is encoded. 
For example, on Windows, Unicode files are either UTF-8 or UTF-16, 
identified by a byte order mark.
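
To illustrate the byte order mark case, here is a minimal sketch of 
how a decode function might first decide which encoding it is looking 
at. The Encoding type and the detectBOM function are hypothetical names, 
not part of the library. Only UTF-8 and UTF-16 are considered, so a 
UTF-32LE file (whose mark begins with the same two bytes as UTF-16LE) 
would be misidentified.

```haskell
import qualified Data.ByteString as B

-- Hypothetical tag for the encodings this sketch recognises.
data Encoding = UTF8 | UTF16LE | UTF16BE
  deriving (Show, Eq)

-- Look for a byte order mark at the start of the input. On success,
-- return the detected encoding and the input with the mark removed.
-- Nothing means no mark was found, so the caller must guess or be told.
detectBOM :: B.ByteString -> Maybe (Encoding, B.ByteString)
detectBOM bs
  | B.pack [0xEF, 0xBB, 0xBF] `B.isPrefixOf` bs = Just (UTF8,    B.drop 3 bs)
  | B.pack [0xFF, 0xFE]       `B.isPrefixOf` bs = Just (UTF16LE, B.drop 2 bs)
  | B.pack [0xFE, 0xFF]       `B.isPrefixOf` bs = Just (UTF16BE, B.drop 2 bs)
  | otherwise                                   = Nothing
```

A decode :: ByteString -> CompactString built this way would dispatch 
on the detected Encoding and then convert the remaining bytes.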

Twan

