[Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString

Mon Feb 5 05:49:31 EST 2007

Twan van Laarhoven wrote:
> Hello all,
> 
> I would like to announce my attempt at making a Unicode version of
> Data.ByteString. The library is named Data.CompactString to avoid
> conflict with other (Fast)PackedString libraries.
> 
> The library uses a variable length encoding (1 to 3 bytes) of Chars into
> Word8s, which are then stored in a ByteString.

Can I be among the first to ask that any Unicode variant of ByteString use a
recognized encoding?

You have invented a new encoding:

> -- Reading/writing chars
> --
> 
> -- Uses a custom encoding which looks like UTF8, but is slightly more efficient.
> 
> -- It requires at most 3 byes, as opposed to 4 for UTF8.
> 
> --
> 
> -- Encoding looks like
> 
> --                    0zzzzzzz -> 0zzzzzzz
> 
> --           00yyyyyy yzzzzzzz -> 1xxxxxxx 1yyyyyyy
> 
> --  000xxxxx xxyyyyyy yzzzzzzz -> 1xxxxxxx 0yyyyyyy 1zzzzzzz
> --
> -- The reasoning behind the tag bits is that this allows the char to be read both forwards
> -- and backwards.
> 
> -- | Write a character and return the size needed
> pokeCharFun :: Char -> (Int, Ptr Word8 -> IO ())
> pokeCharFun c = case ord c of
>  x | x < 0x80   -> (1, \p ->    poke        p   $ fromIntegral  x )
>    | x < 0x4000 -> (2, \p -> do poke        p   $ fromIntegral (x `shiftR`  7) .|. 0x80
>                                 pokeByteOff p 1 $ fromIntegral  x              .|. 0x80 )
>    | otherwise  -> (3, \p -> do poke        p   $ fromIntegral (x `shiftR` 14) .|. 0x80
>                                 pokeByteOff p 1 $ fromIntegral (x `shiftR`  7) .&. 0x7f
>                                 pokeByteOff p 2 $ fromIntegral  x              .|. 0x80 )
> {-# INLINE pokeCharFun #-}
> 
> -- | Write a character and return the size used
> pokeChar :: Ptr Word8 -> Char -> IO Int
> pokeChar p c = case pokeCharFun c of (l,f) -> f p >> return l
> {-# INLINE pokeChar #-}
> 
> -- | Write a character and return the size used
> pokeCharRev :: Ptr Word8 -> Char -> IO Int
> pokeCharRev p c = case pokeCharFun c of (l,f) -> f (p `plusPtr` (1-l)) >> return l
> {-# INLINE pokeCharRev #-}

In reading all the poke/peek function I did not see anything that your tag bits
accomplish that the tag bits in utf-8 do not, except that you want to write only
a single routine for the poke/peek forwards and backwards operations instead of
two routines.  It is definitely more compact in the worst case, and more "Once
And Only Once", but at a very high cost of incompatibility.

One of the biggest wins with with a Unicode ByteString will be the ability to
transfer the buffer directly to and from the disk and network.  Your code will
always need the data to be rewritten both incoming and outgoing.

The most ideal case would be the ability to load different encodings via import
statements while using the same API.