Text in Haskell: a second proposal

Ashley Yakeley ashley@semantic.org
Tue, 13 Aug 2002 18:55:09 -0700


At 2002-08-13 04:13, Simon Marlow wrote:

>That depends what you mean by efficient: these functions represent an
>extra layer of intermediate list between the handle buffer and the final
>[Char], and furthermore they don't work with partial reads - the input
>has to be a lazy stream gotten from hGetContents.

For ISO-8859-1, each Char is exactly one Word8, so surely it would work
fine with partial reads?

     -- ISO-8859-1 code points coincide with the first 256 Unicode
     -- code points, so conversion is a numeric cast in each direction
     -- (encoding truncates any Char above U+00FF):
     decodeCharISO88591 :: Word8 -> Char;
     decodeCharISO88591 = toEnum . fromIntegral;

     encodeCharISO88591 :: Char -> Word8;
     encodeCharISO88591 = fromIntegral . fromEnum;

     decodeISO88591 :: [Word8] -> [Char];
     decodeISO88591 = fmap decodeCharISO88591;

     encodeISO88591 :: [Char] -> [Word8];
     encodeISO88591 = fmap encodeCharISO88591;
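
And character-at-a-time reading needs no lookahead at all. A sketch,
assuming the hGetWord8 from the earlier proposal quoted below:

     hGetCharISO88591 :: Handle -> IO Char;
     hGetCharISO88591 h = fmap decodeCharISO88591 (hGetWord8 h);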

>> A monadic stream-transformer:
>> 
>>    decodeStreamUTF8 :: (Monad m) => m Word8 -> m Char;
>> 
>>    hGetChar h = decodeStreamUTF8 (hGetWord8 h);
>> 
>> This works provided each Char corresponds to a contiguous block of 
>> Word8s, with no state between them. I think that includes all the 
>> standard character encoding schemes.
>
>This is better: it doesn't force you to use lazy I/O, and when
>specialised to the IO monad it might get decent performance.  The
>problem is that in general I don't think you can assume the lack of
>state.  For example: UTF-7 has a state which needs to be retained
>between characters, and UTF-16 and UTF-32 have an endianness state which
>can be changed by a special sequence at the beginning of the file.  Some
>other encodings have states too.

But it is possible to do this in Haskell: a decoder with state can simply
thread that state explicitly, taking it as an argument and returning the
updated state alongside each decoded Char.
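
A minimal sketch of the idea (DecodeStep and decodeStreamStateful are
names invented here, not part of any existing proposal):

     -- the step function consumes one Word8 and either emits a Char
     -- together with the new decoder state, or asks for more input:
     data DecodeStep state = Emit Char state | NeedMore state;

     decodeStreamStateful :: (Monad m) =>
       (state -> Word8 -> DecodeStep state) ->
       state -> m Word8 -> m (Char, state);
     decodeStreamStateful step s getWord8 = do
       w <- getWord8
       case step s w of
         Emit c s' -> return (c, s')
         NeedMore s' -> decodeStreamStateful step s' getWord8

The caller keeps the returned state between calls. That is enough for
UTF-7's inter-character state, and for UTF-16/UTF-32 the state can carry
the endianness, flipped when the step function sees a byte-order mark.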

The rule for most of the functions in the standard libraries seems to be
"implement as much in Haskell as possible". Why should the file APIs be
any different?

-- 
Ashley Yakeley, Seattle WA