Text in Haskell: a second proposal
Ashley Yakeley
ashley@semantic.org
Thu, 8 Aug 2002 23:40:42 -0700
At 2002-08-08 23:10, Ken Shan wrote:
> 1. Octets.
> 2. C "char".
> 3. Unicode code points.
> 4. Unicode code values, useful only for UTF-16, which is seldom used.
> 5. "What handles handle".
...
>I suggest that the following Haskell types be used for the five items
>above:
>
> 1. Word8
> 2. CChar
> 3. CodePoint
> 4. Word16
> 5. Char
I disagree. They should be:
1. Word8
2. CChar
3. Char
4. Word16
5. Word8
>Let me elaborate. Files are funny because the information units they
>contain can be treated as both numbers and characters.
No, a file is always a list of octets and nothing else (ignoring metadata,
forks etc.). Of course, you can interpret those octets as text using
"ASCII", "UTF-8" or whatever; equally, you can interpret them as an image
using "PNG", "JPEG" etc. But those are secondary transformations, separate
from the business of reading from and writing to a file.
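To illustrate the separation, here is a minimal sketch in which the octets
are just a [Word8] and each text interpretation is an ordinary function
applied afterwards. The names decodeAscii and decodeUtf8 are hypothetical,
not part of any existing library, and the UTF-8 decoder is deliberately
simplified (it does not reject overlong forms or surrogates).

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical helper: interpret octets as ASCII.
-- Octets >= 0x80 are not ASCII, so the whole input is rejected.
decodeAscii :: [Word8] -> Maybe String
decodeAscii = mapM step
  where
    step w
      | w < 0x80  = Just (chr (fromIntegral w))
      | otherwise = Nothing

-- Hypothetical helper: interpret octets as UTF-8.
-- Simplified: overlong forms and surrogates are not rejected.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just []
decodeUtf8 (w : ws)
  | w < 0x80           = (chr (fromIntegral w) :) <$> decodeUtf8 ws
  | w .&. 0xE0 == 0xC0 = multi 1 (fromIntegral (w .&. 0x1F)) ws
  | w .&. 0xF0 == 0xE0 = multi 2 (fromIntegral (w .&. 0x0F)) ws
  | w .&. 0xF8 == 0xF0 = multi 3 (fromIntegral (w .&. 0x07)) ws
  | otherwise          = Nothing
  where
    -- Consume n continuation octets, accumulating six bits from each.
    multi :: Int -> Int -> [Word8] -> Maybe String
    multi 0 acc rest
      | acc <= 0x10FFFF = (chr acc :) <$> decodeUtf8 rest
      | otherwise       = Nothing
    multi n acc (c : rest)
      | c .&. 0xC0 == 0x80 =
          multi (n - 1) ((acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F)) rest
    multi _ _ _ = Nothing
```

The same octets [0x63, 0x61, 0x66, 0xC3, 0xA9] give Just "caf\233" under
decodeUtf8 and Nothing under decodeAscii. The file contents are identical
in both cases; only the secondary transformation differs.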
We should have Word8-based interfaces to file and network handles.
Whether the old Char-based ones should be deprecated, I don't know.
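As a rough sketch of what such an interface might look like, the following
wraps the buffer-based hGetBuf and hPutBuf from System.IO as list-of-octet
operations. The names hGetOctets, hPutOctets and roundTrip are hypothetical,
and a list of Word8 is chosen only for simplicity, not efficiency.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Marshal.Array (peekArray, withArrayLen)
import System.IO (Handle, IOMode (..), hGetBuf, hPutBuf, withBinaryFile)

-- Hypothetical: read up to n octets from a handle.
-- Fewer than n are returned if end of file is reached first.
hGetOctets :: Handle -> Int -> IO [Word8]
hGetOctets h n = allocaBytes n $ \buf -> do
  got <- hGetBuf h buf n
  peekArray got buf

-- Hypothetical: write a list of octets to a handle.
hPutOctets :: Handle -> [Word8] -> IO ()
hPutOctets h ws = withArrayLen ws $ \len buf -> hPutBuf h buf len

-- Write octets to a file, then read the same number back.
roundTrip :: FilePath -> [Word8] -> IO [Word8]
roundTrip path ws = do
  withBinaryFile path WriteMode $ \h -> hPutOctets h ws
  withBinaryFile path ReadMode $ \h -> hGetOctets h (length ws)
```

No Char appears anywhere in these signatures; any text decoding would be
layered on top of the [Word8] result.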
As for Unicode code points, if there's to be an internationalisation
effort for Haskell, the type of character literals, Char, should be fixed
as the type for Unicode code points, much as it already is in GHC.
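A small sketch of what this means in practice under GHC, where Char
already ranges over code points up to U+10FFFF. The names maxCodePoint and
codePoints are made up for illustration.

```haskell
import Data.Char (ord)

-- Under GHC, the largest Char is the largest Unicode code point.
maxCodePoint :: Int
maxCodePoint = ord (maxBound :: Char)

-- With Char as the code point type, a String is a list of code points.
codePoints :: String -> [Int]
codePoints = map ord
```

For example, codePoints "A\x20AC" is [0x41, 0x20AC]: one list element per
code point, regardless of how many octets any encoding would use.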
--
Ashley Yakeley, Seattle WA