Data encoding library

Sun Oct 14 14:11:14 EDT 2007

Magnus Therning wrote:
>>> 2. Codecs, i.e. encoder/decoder pairs such as charset converters
>>>   data Codec base derived = MkCodec
>>>   {
>>>     encode :: derived -> base,
>>>     decode :: base -> Maybe derived -- or other Monad
>>>   }
>>>   utf8 :: Codec [Word8] String
>>>   xml :: Codec String XML
>>
>>   type ASCII = String
>>   base16    :: Codec ASCII [Word8]
>>   ...
>>
>>   encode base16 [0xde,0xad,0xbe,0xef] :: ASCII
> 
> A similar result could be gotten by using phantom types, right?

Most likely, although I'm not sure whether the choice from your blog is 
the right one. I mean, the only-a-little-bit-phantom type

   newtype Base16 a = Base16 { unBase16 :: a } deriving (Eq,Show)

will do the job too

   instance DataEncoding Base16 where
      encode = Base16 . b16Encode
      decode = b16Decode . unBase16

      chop n = Base16 . b16chop n . unBase16
      unchop = Base16 . b16unchop . unBase16

      liberate    = unBase16
      incarcerate = Base16

Usually, the "normal" phantom type approach would be to make the 
encoding a phantom argument of a string type, not the other way round:

   newtype EncodedString enc = ES String

   data Base16     -- empty type, no constructors

   instance DataEncoding (EncodedString Base16) where
      ...

But your idea of fixing the encoding in the type for more type safety is 
good. Another way to do that would be to have an abstract data type

      -- this is not a String, this is base16-encoded data!
   newtype Base16 = Base16 String

with functions

   encode :: [Word8] -> Base16
   decode :: Base16  -> [Word8]

and functions

   encode :: Base16  -> String
   decode :: String  -> Maybe Base16
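
A minimal sketch of this abstract-type variant, for illustration only. The four function names (encodeBytes, decodeBytes, toString, fromString) are hypothetical and only serve to keep the two encode/decode pairs apart; the lowercase hex output and the "even number of hex digits" check are assumptions as well.

```haskell
import Data.Char (digitToInt, intToDigit, isHexDigit)
import Data.Word (Word8)

-- Abstract type: a real module would not export the constructor,
-- so every Base16 value is known to hold valid base16 text.
newtype Base16 = Base16 String deriving (Eq, Show)

-- Bytes to base16: cannot fail.
encodeBytes :: [Word8] -> Base16
encodeBytes = Base16 . concatMap hex
  where
    hex w = [ intToDigit (fromIntegral (w `div` 16))
            , intToDigit (fromIntegral (w `mod` 16)) ]

-- Base16 to bytes: total, because the invariant was checked on the way in.
decodeBytes :: Base16 -> [Word8]
decodeBytes (Base16 s) = go s
  where
    go (h : l : rest) = fromIntegral (16 * digitToInt h + digitToInt l) : go rest
    go _              = []

-- Forget the invariant.
toString :: Base16 -> String
toString (Base16 s) = s

-- Establish the invariant: this is where failure is possible.
fromString :: String -> Maybe Base16
fromString s
  | even (length s) && all isHexDigit s = Just (Base16 s)
  | otherwise                           = Nothing
```

Here the only function that can fail is the one that turns an arbitrary String into a Base16 value.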

The "normal" phantom type approach has the advantage of making the last 
functions polymorphic

   encode :: EncodedString enc -> String
   decode :: String -> EncodedString enc

   encode (ES s) = s
   decode s = ES s

at the expense of shifting the possible failure to

   decode :: EncodedString Base16 -> Maybe [Word8]
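
For comparison, a sketch of the phantom-type variant. The names toString, fromString, encodeBase16 and decodeBase16 are hypothetical, chosen only to keep the functions apart. Wrapping a String never fails here, so the validity check has to happen when the bytes are extracted.

```haskell
import Data.Char (digitToInt, intToDigit, isHexDigit)
import Data.Word (Word8)

-- The encoding is a phantom argument: it appears only in the type.
newtype EncodedString enc = ES String deriving (Eq, Show)

data Base16  -- empty type, no constructors

-- Polymorphic in the encoding, and neither direction can fail.
toString :: EncodedString enc -> String
toString (ES s) = s

fromString :: String -> EncodedString enc
fromString = ES

encodeBase16 :: [Word8] -> EncodedString Base16
encodeBase16 = ES . concatMap hex
  where
    hex w = map (intToDigit . fromIntegral) [w `div` 16, w `mod` 16]

-- The possible failure has moved here: the wrapped string was never checked.
decodeBase16 :: EncodedString Base16 -> Maybe [Word8]
decodeBase16 (ES s) = go s
  where
    go [] = Just []
    go (h : l : rest)
      | isHexDigit h && isHexDigit l =
          (fromIntegral (16 * digitToInt h + digitToInt l) :) <$> go rest
    go _ = Nothing  -- odd length or a non-hex character
```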

Of course, you can use both phantom types and the codec approach, 
eliminating the need for a type class

   base16 :: Codec [Word8] (EncodedString Base16)
   string :: Codec (EncodedString a) String
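
A sketch of how these two codecs might look. It assumes a Codec record whose first type argument is the unencoded side, which is how the two signatures just above read (the quoted definition orders the arguments the other way round). The combinator andThen is hypothetical and only there to show that the two codecs compose.

```haskell
import Data.Char (digitToInt, intToDigit, isHexDigit)
import Data.Word (Word8)

-- Assumed shape: encode goes from the first type to the second.
data Codec a b = MkCodec
  { encode :: a -> b
  , decode :: b -> Maybe a
  }

newtype EncodedString enc = ES String deriving (Eq, Show)

data Base16  -- empty type, no constructors

base16 :: Codec [Word8] (EncodedString Base16)
base16 = MkCodec enc dec
  where
    enc = ES . concatMap hex
    hex w = map (intToDigit . fromIntegral) [w `div` 16, w `mod` 16]
    dec (ES s) = go s
    go [] = Just []
    go (h : l : rest)
      | isHexDigit h && isHexDigit l =
          (fromIntegral (16 * digitToInt h + digitToInt l) :) <$> go rest
    go _ = Nothing

-- Works for every encoding; decoding merely wraps and never fails.
string :: Codec (EncodedString a) String
string = MkCodec (\(ES s) -> s) (Just . ES)

-- Hypothetical combinator: run one codec after another.
andThen :: Codec a b -> Codec b c -> Codec a c
andThen f g = MkCodec (encode g . encode f) (\c -> decode g c >>= decode f)
```

With these definitions, encode (base16 `andThen` string) [0xde,0xad,0xbe,0xef] gives "deadbeef", and no type class is involved.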

> But then there must be some way of liberating the result.
> I'm not sure yet whether they are worth it.
> 
> AFAIU the example from above then changes to
> 
>    encode [0xde,0xad,0xbe,0xef] :: Base16 ASCII

Concerning the choice between encoding the encoding (... ;-) in the 
types (like Base16) or as values (like  base16 :: Codec ...), the 
observation is that you have to specify the encoding anyway :) either as 
a type annotation ("type argument")

   encode [0xde,0xad,0xbe,0xef] :: EncodedString Base16
   encode' (undefined :: Base16) [0xde,0xad,0xbe,0xef]

or as a value argument

   encode base16 [0xde,0xad,0xbe,0xef]
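
To make the "type argument" style concrete, here is a sketch with a hypothetical class Encoding. The second encoding Base2 and the helper digits are made up for the example; they exist only so that the annotation has something to choose between.

```haskell
import Data.Char (intToDigit)
import Data.Word (Word8)

newtype EncodedString enc = ES String deriving (Eq, Show)

data Base16
data Base2   -- hypothetical second encoding, for contrast

-- The result type alone decides which instance is used.
class Encoding enc where
  encode :: [Word8] -> EncodedString enc

instance Encoding Base16 where
  encode = ES . concatMap (digits 16 2)

instance Encoding Base2 where
  encode = ES . concatMap (digits 2 8)

-- digits b n w: the n lowest base-b digits of w, most significant first.
digits :: Int -> Int -> Word8 -> String
digits b n w =
  reverse (take n (map (intToDigit . (`mod` b)) (iterate (`div` b) (fromIntegral w))))

-- Proxy style: the first argument is never evaluated, only its type matters.
encode' :: Encoding enc => enc -> [Word8] -> EncodedString enc
encode' _ = encode
```

Without an annotation or a proxy argument the call is ambiguous, which is the sense in which the encoding has to be named either way.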

In this case, I would prefer the value argument approach for its brevity 
and mnemonics ("encode in base16 the following data"). However, possible 
strong type guarantees usually are a good argument for the typed approach.

To be honest, I'm not really sure whether strong types would gain us 
anything here.

>> Also, I don't have a clue about what  chop  and  unchop  are supposed
>> to do.
> 
> For some encodings there are standard ways of splitting an encoded
> string over several lines.  Unfortunately it's not always as simple as
> just splitting a string at a particular length.  Uuencode is the most
> complicated I've come across so far.  That's what chop/unchop is for.

Ah, that's what they are for. An idea would be to build the line length 
into the encoding, like

   base16 :: Int -> Codec [Word8] [String]

with the intention that

   encode (base16 70) x

will encode x with a line length of 70 characters. Hm, should

   decode (base16 70) s

fail when the lines are not 70 characters in length, or should it accept 
any line length? Maybe it should be

   base16 :: Maybe Int -> Codec [Word8] [String]

since the programmer may choose to not wrap lines anyway. But perhaps 
the line length is best paired with the data

   base16 :: Codec ([Word8], Maybe Int) [String]

so that

   encode base16 (x, Just 70)

will encode with a line length of 70 characters and

   let Just (_, ll) = decode base16 s in ...

will return the parsed line length in ll .
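
Of these variants, the one taking a Maybe Int can be sketched as follows. Several things are assumptions here: the Codec record has its first type argument on the unencoded side, decoding takes the lenient answer to the question above and accepts any line length, and the helper chunks is hypothetical.

```haskell
import Data.Char (digitToInt, intToDigit, isHexDigit)
import Data.Word (Word8)

-- Assumed shape: encode goes from the first type to the second.
data Codec a b = MkCodec
  { encode :: a -> b
  , decode :: b -> Maybe a
  }

-- Nothing: one unwrapped line.  Just n: lines of at most n characters.
base16 :: Maybe Int -> Codec [Word8] [String]
base16 width = MkCodec (wrap . concatMap hex) (unhex . concat)
  where
    hex w = map (intToDigit . fromIntegral) [w `div` 16, w `mod` 16]
    wrap s = case width of
      Just n | n > 0 -> chunks n s
      _              -> [s]
    -- Lenient decoding: the lines are joined, so their lengths do not matter.
    unhex [] = Just []
    unhex (h : l : rest)
      | isHexDigit h && isHexDigit l =
          (fromIntegral (16 * digitToInt h + digitToInt l) :) <$> unhex rest
    unhex _ = Nothing

-- Hypothetical helper: split a list into pieces of length n (the last may be shorter).
chunks :: Int -> [a] -> [[a]]
chunks _ [] = []
chunks n xs = let (a, b) = splitAt n xs in a : chunks n b
```

For example, encode (base16 (Just 4)) [0xde,0xad,0xbe,0xef] yields ["dead","beef"]. A strict decoder would instead compare each line's length against the width before joining.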

Oh my lambda, it's wondrous how Haskell gives so many possibilities to 
ponder for such a seemingly innocent API design problem :)

Regards,
apfelmus