[Haskell-cafe] character encoding problems with hxt (I think)
nadine.and.henry at pobox.com
nadine.and.henry at pobox.com
Sat Oct 13 16:37:05 CEST 2012
Dear Haskellers,
I'm trying to write some code that grabs countries and provinces from the
iso_3166 files on Linux systems. I seem to be running into some kind of
character encoding problem. file says iso_3166_2.xml is a utf8 file, and
isutf8 agrees, but when I run the following code, it crashes.
uft8Copy makes a byte for byte copy as expected.
noCrash read and writes the document without crashing, but the accented
characters in the strings show up garbled. Just search for "DE" and you'll
see what I mean. crash (on my system, (Debian testing)) produces the error
message below.
Can anyone enlighten me on what is going on?
Thanks in advance.
Henry Laxen
------------------------------------------------------------------------
{-# LANGUAGE Arrows #-}
import Text.XML.HXT.Core
import Data.List
import qualified System.IO.UTF8 as U
isoFile = "/usr/share/xml/iso-codes/iso_3166_2.xml"
countZerosInLines = length . filter (\x -> x == '0') . concat
utf8Copy = do
xml <- U.readFile isoFile
let doc = readString [withValidate no] xml
U.writeFile "iso_written.xml" xml
putStrLn "Ran utf8Copy"
noCrash :: IO ()
noCrash = do
lines <- runX (readDocument [withValidate no] isoFile >>>
writeDocumentToString [withShowTree yes] )
-- mapM_ putStrLn lines
print $ countZerosInLines lines
putStrLn "Ran noCrash"
crash :: IO ()
crash = do
xml <- U.readFile isoFile
let doc = readString [withValidate no] xml
src <- runX $ doc //> writeDocumentToString [withShowTree yes]
print $ countZerosInLines src
putStrLn "Ran crash (didn't)"
{-
Ran utf8Copy
1236
Ran noCrash
error: UTF-8 encoding error at input position 2640: InvalidLaterByte 1
error: UTF-8 encoding error at input position 2646: InvalidLaterByte 1
*** Exception: Enum.toEnum{Word8}: tag (363) is outside of bounds (0,255)
-}
main = do
utf8Copy
noCrash
crash
------------------------------------------------------------------------
More information about the Haskell-Cafe
mailing list