Text in Haskell: a second proposal

Wolfgang Jeltsch wolfgang@jeltsch.net
10 Aug 2002 13:34:39 +0200


On Friday, 2002-08-09, 08:40, CEST, Ashley Yakeley wrote:
> At 2002-08-08 23:10, Ken Shan wrote:
> 
> > 1. Octets.
> > 2. C "char".
> > 3. Unicode code points.
> > 4. Unicode code values, useful only for UTF-16, which is seldom used.
> > 5. "What handles handle".
> ...
> >I suggest that the following Haskell types be used for the five items
> >above:
> >
> > 1. Word8
> > 2. CChar
> > 3. CodePoint
> > 4. Word16
> > 5. Char
> 
> I disagree, they should be:
> 
> 1. Word8
> 2. CChar
> 3. Char
> 4. Word16
> 5. Word8
> 
> >Let me elaborate.  Files are funny because the information units they
> >contain can be treated as both numbers and characters.
> 
> No, a file is always a list of octets. Nothing else (ignoring metadata, 
> forks etc.). Of course, you can interpret those octets as text using 
> "ASCII" or "UTF-8" or whatever; equally, you can interpret those octets 
> as an image using "PNG", "JPEG" etc. But those are secondary 
> transformations, separate from the business of reading from and writing 
> to a file.
> 
> We should have Word8-based interfaces to file and network handles. 
> Whether or not the old Char-based ones should be deprecated, or whatever, 
> I don't know.
> 
> As for Unicode codepoints, if there's to be an internationalisation 
> effort for Haskell, the type of character literals, Char, should be fixed 
> as the type for Unicode codepoints, much as it already is in GHC.
> 
> -- 
> Ashley Yakeley, Seattle WA

Some remarks:
        * A file doesn't have to be a list of octets. On the other hand,
          the assumption that files consist of octets holds on most
          platforms. I therefore think that using Word8 for
          file/stream elements is a good solution.
        * Maybe the traditional character-based I/O operations should
          use the default locale. That would make them very useful for
          reading from and writing to terminals. For file access, I
          would discourage their use and instead promote the
          combination of octet-based I/O with encoding functions and
          decoding parsers.
Apart from these two points, I fully agree with Ashley.
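
To make the second remark concrete, here is a minimal sketch of what
"octet-based I/O plus encoding functions/decoding parsers" could look
like. It is only an illustration under assumptions: it relies on the
bytestring package and on the System.IO encoding functions of a
present-day GHC, it uses ISO-8859-1 because that is the simplest
encoding to write down, and all function names (readOctets,
writeOctets, decodeLatin1, encodeLatin1, readLatin1File,
useLocaleOnTerminal) are hypothetical, not part of any proposal in
this thread.

```haskell
import qualified Data.ByteString as B
import Data.Char (chr, ord)
import Data.Word (Word8)
import System.IO (hSetEncoding, localeEncoding, stdin, stdout)

-- Octet-based I/O: a file is read and written as a list of Word8,
-- with no interpretation of the contents.
readOctets :: FilePath -> IO [Word8]
readOctets path = fmap B.unpack (B.readFile path)

writeOctets :: FilePath -> [Word8] -> IO ()
writeOctets path = B.writeFile path . B.pack

-- Decoding as a separate, pure step. In ISO-8859-1 every octet maps
-- to the code point with the same number, so decoding cannot fail.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Encoding can fail: characters above U+00FF have no ISO-8859-1
-- representation, so the result is Nothing in that case.
encodeLatin1 :: String -> Maybe [Word8]
encodeLatin1 = mapM enc
  where
    enc c
      | ord c <= 0xFF = Just (fromIntegral (ord c))
      | otherwise     = Nothing

-- Text from a file is the composition of the two layers.
readLatin1File :: FilePath -> IO String
readLatin1File = fmap decodeLatin1 . readOctets

-- Character-based I/O on the terminal, following the default locale.
-- (Assumes GHC's System.IO; other implementations may differ.)
useLocaleOnTerminal :: IO ()
useLocaleOnTerminal = do
  hSetEncoding stdin localeEncoding
  hSetEncoding stdout localeEncoding
```

The point of the split is that the choice of encoding is visible in
the program text (readLatin1File names it), while the file layer
itself never sees anything but Word8 values.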

Wolfgang