[Haskell-cafe] invalid character encoding

Keean Schupke k.schupke at imperial.ac.uk
Thu Mar 17 16:47:04 EST 2005


I cannot help feeling that all this multi-language support is a mess.

All strings should be encoded in a universal encoding (like UTF-8) so
that the code for a character is the same independent of locale.

It seems stupid that the locale affects the character encodings... the
code for an 'a' should be the same all over the world... as should the
code for a particular Japanese character.

In other words the locale should have no effect on character encodings;
it should select between multilingual error messages, which are supplied
as distinct strings for each region.

While we may have to interoperate with C code, we could have a Haskell
library that does things properly.
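As a rough sketch of what such a library could do, here is a minimal
UTF-8 encoder for a single character (utf8Encode is a name invented
for illustration, not an existing library function); its output depends
only on the Unicode code point, never on the locale:

  import Data.Bits (shiftR, (.&.), (.|.))
  import Data.Char (ord)
  import Data.Word (Word8)

  -- Encode one Char as its UTF-8 byte sequence.  The result is the
  -- same everywhere: 'a' -> [0x61], '\x65E5' (a Japanese character)
  -- -> [0xE6,0x97,0xA5], regardless of any locale setting.
  utf8Encode :: Char -> [Word8]
  utf8Encode c
    | n < 0x80    = [fromIntegral n]
    | n < 0x800   = [0xC0 .|. hi 6,  lo 0]
    | n < 0x10000 = [0xE0 .|. hi 12, lo 6,  lo 0]
    | otherwise   = [0xF0 .|. hi 18, lo 12, lo 6, lo 0]
    where
      n    = ord c
      hi s = fromIntegral (n `shiftR` s)                     -- leading byte
      lo s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F) -- continuation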

    Keean.

Marcin 'Qrczak' Kowalczyk wrote:

>Glynn Clements <glynn at gclements.plus.com> writes:
>
>>The (non-wchar) curses API functions take byte strings (char*),
>>so the Haskell bindings should take CString or [Word8] arguments.
>
>Programmers will not want to use such an interface. When they want to
>display a string, it will be in the Haskell String type.
>
>And it prevents having a single Haskell interface which uses either
>the narrow or the wide version of the curses interface, depending on
>what is available.
>
>>If you provide "wrapper" functions which take String arguments,
>>either they should have an encoding argument or the encoding should
>>be a mutable per-terminal setting.
>
>There is already a mutable setting. It's called "locale".
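
To make the two styles concrete, here is a minimal sketch of the
wrapper Glynn describes, with the encoding as an explicit argument.
Encoding, encode, putBytes and addStr are names invented for
illustration, not an existing curses binding:

  import Data.Char (ord)
  import Data.Word (Word8)

  -- Hypothetical encoding type; a real library would cover many more.
  data Encoding = Ascii | Latin1

  -- Deliberately lossy placeholder conversions, for illustration only.
  encode :: Encoding -> String -> [Word8]
  encode Ascii  = map (fromIntegral . min 0x7F . ord)
  encode Latin1 = map (fromIntegral . min 0xFF . ord)

  -- Stands in for the byte-oriented curses primitive (waddstr etc.).
  putBytes :: [Word8] -> IO ()
  putBytes = print

  -- The caller passes the encoding explicitly instead of the library
  -- consulting mutable locale state.
  addStr :: Encoding -> String -> IO ()
  addStr enc = putBytes . encode enc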
>
>>I don't know enough about the wchar version of curses to comment on
>>that.
>
>It uses wcsrtombs or equivalents to display characters, and the
>reverse to interpret keystrokes.
>
>>It is possible for curses to be used with a terminal which doesn't
>>use the locale's encoding.
>
>No, it will break under the new wide character curses API, and it will
>confuse programs which use the old narrow character API.
>
>The user (or the administrator) is responsible for matching the locale
>encoding with the terminal encoding.
>
>>Also, it's quite common to use non-standard encodings with terminals
>>(e.g. codepage 437, which has graphic characters beyond the ACS_* set
>>which terminfo understands).
>
>curses doesn't support that.
>
>>>The locale encoding is the right encoding to use for conversion of the
>>>result of strerror, gai_strerror, msg member of gzip compressor state
>>>etc. When an I/O error occurs and the error code is translated to a
>>>Haskell exception and then shown to the user, why would the application
>>>need to specify the encoding and how?
>>
>>Because the application may be using multiple locales/encodings.
>
>But strerror always returns messages in the locale encoding.
>Just like Gtk+2 always accepts texts in UTF-8.
>
>For compatibility the default locale is "C", but new programs which
>are prepared for I18N should call setlocale(LC_CTYPE, "") and
>setlocale(LC_MESSAGES, "").
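
A minimal FFI sketch of that initialisation from Haskell. The category
numbers below are glibc's values and are an assumption; a real binding
would take them from <locale.h> rather than hard-coding them:

  {-# LANGUAGE ForeignFunctionInterface #-}

  import Foreign.C.String (CString, withCString)
  import Foreign.C.Types (CInt)

  foreign import ccall unsafe "locale.h setlocale"
    c_setlocale :: CInt -> CString -> IO CString

  -- glibc's category numbers; they vary between platforms.
  lcCTYPE, lcMESSAGES :: CInt
  lcCTYPE    = 0
  lcMESSAGES = 5

  -- Adopt the user's locale for character handling and messages.
  initI18N :: IO ()
  initI18N =
    withCString "" $ \s -> do
      _ <- c_setlocale lcCTYPE s
      _ <- c_setlocale lcMESSAGES s
      return ()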
>
>There are places where the encoding is settable independently, or
>stored explicitly. For those, Haskell should have withCString /
>peekCString / etc. with an explicit encoding. And there are places
>which use the locale encoding instead of having a separate switch.
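
A sketch of what such an interface might look like, modelled on
Foreign.C.String. Encoding, latin1 and withCStringEnc are hypothetical
names, and only a lossy Latin-1 encoder is shown:

  import Data.Char (ord)
  import Data.Word (Word8)
  import Foreign.C.String (CString)
  import Foreign.Marshal.Array (withArray0)
  import Foreign.Ptr (castPtr)

  -- A hypothetical encoding: how to turn one Char into bytes.  The
  -- decoder needed for a matching peekCStringEnc is omitted here.
  newtype Encoding = Encoding { encodeChar :: Char -> [Word8] }

  latin1 :: Encoding
  latin1 = Encoding (\c -> [fromIntegral (min 0xFF (ord c))])  -- lossy

  -- Like withCString, but the encoding is an explicit parameter
  -- rather than being taken from the global locale state.
  withCStringEnc :: Encoding -> String -> (CString -> IO a) -> IO a
  withCStringEnc enc s act =
    withArray0 0 (concatMap (encodeChar enc) s) (act . castPtr)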
>
>>[The most common example is printf("%f"). You need to use the C
>>locale (decimal point) for machine-readable text but the user's
>>locale (locale-specific decimal separator) for human-readable text.
>
>This is a different thing, and it is what IMHO C did wrong.
>
>>This isn't directly related to encodings per se, but a good example
>>of why parameters are preferable to state.]
>
>The LC_* environment variables are the parameters for the encoding.
>There is no other convention for passing the encoding to be used for
>textual output to stdout, for example.
>
>>C libraries which use the locale do so as a last resort.
>
>No, they do it by default.
>
>>The only reason that the C locale mechanism isn't a major nuisance
>>is that you can largely ignore it altogether.
>
>Then how would a Haskell program know what encoding to use for stdout
>messages? How would it know how to interpret filenames for graphical
>display?
>
>Do you want to invent a separate mechanism for communicating that, so
>that an administrator has to set up a dozen environment variables and
>teach each program separately about the encoding it should assume by
>default? We had this mess 10 years ago, and parts of it are still
>alive today - you must sometimes configure xterm or Emacs separately,
>but it's becoming more common that programs use the system-supplied
>setting and don't have to be configured separately.
>
>>Code which requires real I18N can use other mechanisms, and code
>>which doesn't require any I18N can just pass byte strings around and
>>leave encoding issues to code which actually has enough context to
>>handle them correctly.
>
>Haskell can't just pass byte strings around without turning its
>Unicode support into a joke (which it is now).



More information about the Haskell-Cafe mailing list