[Haskell-cafe] Re: getting crazy with character encoding

Andrea Rossato mailing_list at istitutocolli.org
Thu Sep 13 05:40:01 EDT 2007

On Thu, Sep 13, 2007 at 11:07:03AM +0200, Stephane Bortzmeyer wrote:
> On Thu, Sep 13, 2007 at 12:23:33AM +0000,
>  Aaron Denney <wnoise at ofb.net> wrote 
>  a message of 76 lines which said:
> > the characters read and written should correspond to the native
> > environment notions and encodings.  These are, under Unix,
> > determined by the locale system.
> Locales, while fine for things like the language of the error messages
> or the format to use to display the time, are *not* a good solution
> for things like file names and file contents.
> Even on a single Unix machine (without networking), there are
> *several* users. Using the locale to find out the charset used for a
> file name won't work if these users use different locales.

Yes indeed. And I find it a real mess. And I don't see any way out.

> Same thing for file contents. The charset used must be marked in the
> file (XML...) or in the metadata, somehow. Otherwise, there is no way
> to exchange files or even to change the locale (if I switch from
> Latin1 to UTF-8, what do my files become?)

Ok, you are perfectly right, but we live in an imperfect world and we
must come up with a solution. In my case I'm developing this prompt
for xmonad and a Chinese user wants directory and file names to be
correctly displayed. What else can I do but using locale technologies?
This is something I don't know.

The code below is not perfect but it works to some extent.
Nonetheless, if you have 2 users using an iso-8859-1 locale the first
and utf-8 one the second, non ascii characters in file names of the
first users will produce invalid character sequences for the second
users. The reverse will work, though.

I'm still puzzled and still find the thread title appropriate.

Thanks for your kind attention.


The locale aware version of the previous code (needs hsc2hs)

{-# OPTIONS -fglasgow-exts #-}
import Prelude hiding (catch)
import System.Process
import System.IO
import Control.Monad
import System.Directory
import Foreign
import Foreign.C
import Data.Char
import Control.Exception

runProcessWithInput cmd args input = do
  (pin, pout, perr, ph) <- runInteractiveProcess cmd args Nothing Nothing
  hPutStr pin input
  hClose pin
  output <- hGetContents pout
  when (output==output) $ return ()
  hClose pout
  hClose perr
  waitForProcess ph
  return output

main = do
  l <- fmap lines $ runProcessWithInput "/bin/bash" [] "ls ab*\n"
  l' <- mapM fromLocale l
  l'' <- mapM toLocale l'
  putStrLn (show l')
  mapM_ putStrLn l''
  mapM_ (putStrLn . show . length) l'

-- This code comes from John Meacham's HsLocale
-- http://repetae.net/john/repos/HsLocale/
toLocale :: String -> IO String
toLocale s = catch (stringToBytes s >>= return . map (chr . fromIntegral))
                   (const $ return "invalid character sequence")

fromLocale :: String -> IO String
fromLocale s = bytesToString (map (fromIntegral . ord) s) 
                  `catch` \_ ->  return "invalid character sequence" 

stringToBytes :: String -> IO [Word8]
stringToBytes cs = (withIConv "" "UTF-32" $ \ic -> convertRaw ic cs) 

bytesToString :: [Word8] -> IO String
bytesToString xs =  (withIConv "UTF-32" "" $ \ic ->  convertRaw ic xs) >>= return . f where
    f ('\65279':xs) = xs   -- discard byte order marker
    f xs = xs

newtype IConv = IConv (#type intptr_t)

foreign import ccall unsafe "iconv.h iconv_open"
  iconv_open :: Ptr CChar -> Ptr CChar -> IO IConv
foreign import ccall unsafe "iconv.h iconv_close"
  iconv_close :: IConv -> IO CInt
foreign import ccall unsafe "iconv.h iconv" 
  iconv :: IConv -> Ptr (Ptr CChar) -> Ptr CSize -> Ptr (Ptr CChar) -> Ptr CSize -> IO CInt

withIConv :: String -> String -> (IConv -> IO a) -> IO a 
withIConv to from action = bracket open close action where
    close ic = throwErrnoIfMinus1_ "iconv_close" (iconv_close ic)
    open = throwErrnoIfMinus1 "iconv_open" iopen
    iopen = do
        withCAString to $ \to -> do
        withCAString from $ \from -> do
        iconv_open to from

convertRaw :: (Storable a, Storable b) => IConv -> [a] -> IO [b]
convertRaw ic xs = do 
    with (fromIntegral $ sizeOf (head xs) * length xs) $ \inptrSz -> do
    withArray xs $ \arr -> do  
    with (castPtr arr) $ \inptr -> do
    allocaBytes (1024) $ \outptr -> do
    with outptr $ \outptrptr -> do
    with 1024 $ \outptrSz -> do
    let outSz = fromIntegral $ sizeOf $ unsafePerformIO (peek outptr) 
        go = do 
            ret <- iconv ic inptr inptrSz (castPtr outptrptr) outptrSz 
            err <- getErrno
            case (ret,err) of
                (-1,_) | err == e2BIG -> do
                    oz <- peek outptrSz
                    x <- peekArray ((1024 - fromIntegral oz) `div` outSz) (castPtr outptr) 
                    poke outptrptr outptr
                    poke outptrSz 1024
                    y <- go
                    return $ x ++ y
                (-1,_) -> throwErrno "iconv"
                (_,_) -> do
                    oz <- peek outptrSz
                    peekArray ((1024 - fromIntegral oz) `div` outSz) outptr 

#include <locale.h>
foreign import ccall unsafe "locale.h setlocale"
    setlocale :: CInt -> CString -> IO CString

setupLocale :: IO ()
setupLocale = withCString "" $ \s -> do
                setlocale (#const LC_ALL) s
                return ()

More information about the Haskell-Cafe mailing list