[xmonad] spawn functions are not unicode safe

Khudyakov Alexey alexey.skladnoy at gmail.com
Sat Jan 17 10:10:07 EST 2009


> If I'm understanding you, the answer is 'you can safely call
> encodeString on ASCII text, and UTF text, but you cannot on ISO8859-1
> & ASCII Extended'. So we can either default to calling encodeString,
> checking whether it's ISO/Extended (and not calling encodeString if
> True); or we can default to not calling encodeString, and enabling it
> if a check for UTF returns true.
>
> I guess since Alexey has already provided a check for UTF, then we
> should probably use the latter strategy.

The only problem is users with one-byte encodings. encodeString can safely
be called on a string which contains only ASCII characters. For ASCII input
encodeString == id, so nothing will change for ASCII. Users with a UTF-8
locale will get correctly encoded strings. Users with one-byte locales will
get garbage, but they get garbage anyway; it will only look different.
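
To make the "encodeString == id for ASCII" claim concrete, here is a minimal
sketch of a UTF-8 encoder that returns each output byte as a Char. The name
utf8Bytes is made up for illustration; it is a stand-in written against base
only, not the actual code of encodeString.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)

-- | Illustrative stand-in for encodeString (an assumption, not the library's
-- code): UTF-8 encode every code point and return each byte as a Char.
utf8Bytes :: String -> String
utf8Bytes = concatMap (map chr . encodeChar . ord)
  where
    -- One code point (as an Int) to its list of UTF-8 byte values.
    encodeChar c
      | c < 0x80    = [c]  -- ASCII: the byte is the code point itself
      | c < 0x800   = [0xc0 .|. (c `shiftR` 6), cont c]
      | c < 0x10000 = [0xe0 .|. (c `shiftR` 12), cont (c `shiftR` 6), cont c]
      | otherwise   = [ 0xf0 .|. (c `shiftR` 18), cont (c `shiftR` 12)
                      , cont (c `shiftR` 6), cont c ]
    -- Continuation byte: 10xxxxxx carrying the low six bits.
    cont x = 0x80 .|. (x .&. 0x3f)
```

With this sketch, utf8Bytes "xterm -e ls" returns its argument unchanged,
while map ord (utf8Bytes "\x44b") gives [209,139], i.e. the two bytes
0xD1 0x8B.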

So I think it is a sound solution to wrap everything in encodeString. It is
not The Right Way To Do Things. It's only a workaround, but still better than
nothing. This is a job for the standard libraries, not for software developers.

For more information on the issue, read below.




There is no such thing as a [some encoding]-encoded string in Haskell (most
of the time). Strings in Haskell _are_ Unicode. A Char is a valid Unicode
code point, not a byte, a Word32, etc. It is a fairly abstract code point.
Strings usually contain only ASCII characters, but they are Unicode
nevertheless.

A Char is represented somehow in memory, but one shouldn't have to care about
that most of the time. Problems arise when a Char is passed to the outside
world. The world understands sequences of bytes, so strings must be encoded
somehow.

The standard library uses a very simple method, in effect
(\c -> c .&. 0xff) applied to the code point: every character is translated
to one byte. Simple, but it works only for ASCII (and for Latin-1, if the
output is read as Latin-1). Because of that behavior, all those encodeString
calls are needed.
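
The lambda above is shorthand, since .&. does not work on a Char directly. A
typed sketch of the same truncation, with the hypothetical names
truncateToByte and truncateString, might look like this. It only models the
behavior described here and is not the library's own code.

```haskell
import Data.Bits ((.&.))
import Data.Char (chr, ord)

-- | Model of the truncation described above: keep only the low 8 bits of
-- the code point, so every Char ends up as exactly one byte.
truncateToByte :: Char -> Char
truncateToByte = chr . (.&. 0xff) . ord

-- | Apply the truncation to a whole string, as happens on output.
truncateString :: String -> String
truncateString = map truncateToByte
```

ASCII text passes through unchanged, while truncateToByte '\x44b' yields 'K',
because 0x44b .&. 0xff == 0x4b.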

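Some examples to illustrate the above (encodeString comes from
Codec.Binary.UTF8.String in the utf8-string package, .&. from Data.Bits):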

ы U+044B Name: CYRILLIC SMALL LETTER YERU
Prelude> fromEnum $ 'ы'
1099  -- (1099 == 0x44b)
Prelude> putStrLn . encodeString $ [toEnum 0x44b]
ы
Prelude> putStrLn $ [toEnum 0x44b]
K
Prelude> putStrLn . encodeString $ [toEnum $ 0xff .&. 0x44b]
K


P.S. It is not safe to call encodeString on a string that is already UTF-8
encoded.

No encoding. Pass the string as it is:
Prelude> putStrLn "Ну что тут с уникодом?"
C GB> BCB A C=8:>4><?

Encode the string in UTF-8:
Prelude> putStrLn . encodeString $ "Ну что тут с уникодом?"
Ну что тут с уникодом?

Encode a string which is already UTF-8 encoded:
Prelude> putStrLn . encodeString . encodeString $ "Ну что тут с уникодом?"
ÐÑ ÑÑо ÑÑÑ Ñ Ñникодом?
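
One way to avoid the double encoding shown above is to check whether a string
already looks like UTF-8 bytes before encoding it. The following is only a
heuristic sketch under stated assumptions: looksUtf8Encoded and
encodeUnlessEncoded are hypothetical names, not functions from xmonad or
utf8-string, and this is not the check mentioned in the quoted text. Text in
a one-byte encoding that happens to form valid UTF-8 would be misdetected.

```haskell
import Data.Char (ord)

-- | Hypothetical heuristic: True when every Char fits in one byte and the
-- bytes form well-formed UTF-8 with at least one multi-byte sequence.
-- Not a full validator: overlong 3- and 4-byte forms and surrogates are
-- not rejected.
looksUtf8Encoded :: String -> Bool
looksUtf8Encoded s = all (<= 0xff) bytes && go False bytes
  where
    bytes = map ord s

    -- The Bool records whether a multi-byte sequence has been seen so far.
    go :: Bool -> [Int] -> Bool
    go seen [] = seen
    go seen (b : rest)
      | b < 0x80               = go seen rest         -- plain ASCII byte
      | b >= 0xc2 && b <= 0xdf = continuation 1 rest  -- lead of 2-byte form
      | b >= 0xe0 && b <= 0xef = continuation 2 rest  -- lead of 3-byte form
      | b >= 0xf0 && b <= 0xf4 = continuation 3 rest  -- lead of 4-byte form
      | otherwise              = False                -- stray or invalid byte

    -- Expect exactly n continuation bytes, then carry on with the rest.
    continuation :: Int -> [Int] -> Bool
    continuation n xs =
      let (cs, rest) = splitAt n xs
      in length cs == n && all isCont cs && go True rest

    isCont b = b >= 0x80 && b <= 0xbf

-- | Apply the given encoder (for example encodeString) only when the
-- string does not already look encoded.
encodeUnlessEncoded :: (String -> String) -> String -> String
encodeUnlessEncoded enc s
  | looksUtf8Encoded s = s
  | otherwise          = enc s
```

Here looksUtf8Encoded "\xd1\x8b" is True (the UTF-8 bytes of U+044B), while
it is False for the unencoded "\x44b", for pure ASCII, and for a lone
Latin-1 byte such as "\xe9".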

