[xmonad] spawn functions are not unicode safe
alexey.skladnoy at gmail.com
Thu Jan 15 11:04:04 EST 2009
On Thursday 15 January 2009 16:53:49 Roman Cheplyaka wrote:
> RFC 3629 [1] states:
> o UTF-8 strings can be fairly reliably recognized as such by a
> simple algorithm, i.e., the probability that a string of
> characters in any other encoding appears as valid UTF-8 is low,
> diminishing with increasing string length.
> However, no references to the algorithm itself are given.
> Google brought me this sample algorithm [2].
> It is probably worth implementing something like that and including it in
> utf8-string if it's not already there.
> 1. http://www.ietf.org/rfc/rfc3629.txt
> 2. http://mail.nl.linux.org/linux-utf8/1999-09/msg00110.html
Something like this? (code below) The algorithm is trivial: check for impossible
byte combinations. If there are no such bytes, pairs, etc., the byte sequence is
probably a UTF-8 encoded string.
But the problem is not with decoding Unicode strings, i.e. not with functions like
fromUnicode :: [Word8] -> [Char]
but with encoding strings. A Char represents a Unicode symbol, so
everything is OK at this point. However, Unix system calls know nothing about
Unicode and accept (char*), or [Word8] in Haskell terminology.
The conversion from [Char] to [Word8] is the problem. It arises whenever Haskell
needs to pass a string to the outside world. Currently each Char is simply
truncated to one byte regardless of its value. That is why an `encode' function
is needed. executeFile is not the only function affected.
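To make the difference concrete, here is a minimal sketch contrasting truncation with UTF-8 encoding. The names truncateChars and encodeUTF8 are hypothetical and written only for illustration; this is not the implementation used by utf8-string or by the unix package. The sketch does not reject surrogate code points, which a strict encoder would.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical: what truncating each Char to one byte amounts to.
truncateChars :: String -> [Word8]
truncateChars = map (fromIntegral . ord)

-- Hypothetical: encode a single Char as one to four UTF-8 bytes.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6,  cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6,  cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Payload of the leading byte: the bits above position s.
    hi :: Int -> Word8
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation byte 10xxxxxx carrying six bits starting at position s.
    cont :: Int -> Word8
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)

-- Hypothetical: encode a whole String.
encodeUTF8 :: String -> [Word8]
encodeUTF8 = concatMap encodeChar
```

For U+20AC (the euro sign), truncateChars keeps only the low byte 0xAC, while encodeUTF8 yields the three bytes 0xE2 0x82 0xAC.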
> import Control.Monad
> import Data.Word
> import Data.Bits
> import Data.Maybe
> is11,is10,is0x :: Word8 -> Bool
> is11 b = (b `shiftR` 6) == 3
> is10 b = (b `shiftR` 6) == 2
> is0x b = b < 128
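> -- Test whether a pair of adjacent bytes is allowed in a UTF-8 encoded
> -- string. a is the previous byte, b is the current one. Rejected:
> -- bytes 0xFE and 0xFF, a continuation byte after an ASCII byte, and a
> -- non-continuation byte after a lead byte.
> validPair :: Word8 -> Word8 -> Maybe Word8
> validPair a b = if (b < 254) && not ((is0x a && is10 b) ||
>                                      (is11 a && not (is10 b)))
>                 then Just b
>                 else Nothing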
> -- Check whether a sequence of bytes is a UTF-8 encoded string. Note that
> -- this check is probabilistic. If the function returns False, the string
> -- is not UTF-8. If it returns True, the string may still fail to decode.
> isUTF8 :: [Word8] -> Bool
> isUTF8 = isJust . foldM validPair 0
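The pairwise check accepts some sequences that cannot be decoded, for instance a three-byte lead byte followed by only one continuation byte (0xE2 0x82 0x41). As a rough sketch of a stricter variant, one can count the continuation bytes that each lead byte requires. The name isUTF8Strict is hypothetical, and the check is still incomplete: it does not reject overlong encodings, surrogates, or code points above U+10FFFF.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- Hypothetical stricter check: every lead byte must be followed by
-- exactly the number of continuation bytes its bit pattern announces.
isUTF8Strict :: [Word8] -> Bool
isUTF8Strict = go
  where
    go [] = True
    go (b:bs)
      | b < 0x80           = go bs        -- 0xxxxxxx: ASCII
      | b .&. 0xE0 == 0xC0 = conts 1 bs   -- 110xxxxx: one continuation
      | b .&. 0xF0 == 0xE0 = conts 2 bs   -- 1110xxxx: two continuations
      | b .&. 0xF8 == 0xF0 = conts 3 bs   -- 11110xxx: three continuations
      | otherwise          = False        -- stray continuation or invalid lead

    -- Consume n continuation bytes (10xxxxxx), then resume with go.
    conts :: Int -> [Word8] -> Bool
    conts 0 bs = go bs
    conts n (b:bs) | b .&. 0xC0 == 0x80 = conts (n - 1) bs
    conts _ _ = False
```

With this variant, isUTF8Strict [0xE2, 0x82, 0xAC] holds, while the truncated sequence [0xE2, 0x82] and the sequence [0xE2, 0x82, 0x41] are both rejected.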