[Haskell-cafe] UTF-8 problems when decoding JSON data coming from Network.HTTP

Michael Snoyman michael at snoyman.com
Sun Oct 17 08:37:41 EDT 2010


On Sun, Oct 17, 2010 at 2:26 PM, Ionut G. Stan <ionut.g.stan at gmail.com> wrote:
> On 17/Oct/10 8:02 AM, Michael Snoyman wrote:
>>
>> In the gist you sent, the problem is that you are reading the HTTP
>> response as a String. The HTTP library doesn't deal well with
>> non-Latin characters when doing String requests; you should be using
>> ByteString and then converting. It's a little tedious using the HTTP
>> library with ByteStrings, which is one of the reasons I wrote
>> http-enumerator. Here's some working code. The main point is to
>> convert the UTF8 octets to a String.
>>
>> You could also consider using one of the JSON libraries that support
>> bytestrings directly instead of strings, which will likely result in
>> much better performance. Contenders include JSONb[1] and
>> yajl-enumerator[2].
>>
>> import Network.HTTP.Enumerator
>> import qualified Text.JSON as JSON
>> import qualified Data.ByteString.Lazy.UTF8 as BSLU
>>
>> data GithubUser = GithubUser {
>>         name     :: String,
>>         location :: String
>>     } deriving (Eq, Show)
>>
>>
>> instance JSON.JSON GithubUser where
>>     readJSON (JSON.JSObject object) =
>>         let (Just a)          = lookupM "user" $ JSON.fromJSObject object
>>             (JSON.JSObject b) = a
>>             user              = JSON.fromJSObject b
>>         in do name     <- lookupM "name"     user >>= JSON.readJSON
>>               location <- lookupM "location" user >>= JSON.readJSON
>>               return $ GithubUser {
>>                   name     = name,
>>                   location = location
>>               }
>>
>>     showJSON user = JSON.makeObj [
>>                         ("name",     JSON.showJSON $ name user),
>>                         ("location", JSON.showJSON $ location user)
>>                     ]
>>
>>
>> lookupM :: (Monad m) => String -> [(String, a)] -> m a
>> lookupM x xs = maybe (fail $ "No such element: " ++ x) return (lookup x xs)
>>
>> main = do jsonLbs <- simpleHttp "http://github.com/api/v2/json/user/show/igstan"
>>           let jsonText = BSLU.toString jsonLbs
>>           let result = JSON.decode jsonText :: JSON.Result GithubUser
>>           showResult result
>>        where showResult (JSON.Ok json) = putStrLn $ name json
>>              showResult (JSON.Error e) = putStrLn e
>>
>> Michael
>>
>> [1] http://hackage.haskell.org/package/JSONb-1.0.2
>> [2] http://hackage.haskell.org/package/yajl-enumerator
>
> Thanks Michael, now it works indeed. But I don't understand, is there any
> inherent problem with Haskell's built-in String? Should one choose
> ByteString when dealing with Unicode stuff? Or, is there any resource that
> describes in one place all the problems Haskell has with Unicode?

There's no problem with String; you just need to remember what it
means. A String is a list of Chars, and a Char is a Unicode code point.
On the other hand, the HTTP protocol deals with *bytes*, not Unicode
code points. To convert between the two, you need a character
encoding; in the case of JSON, I believe this is UTF-8 by default.

The problem for you is that the HTTP package does *not* perform UTF-8
decoding of the raw bytes sent over the network. Instead, I believe it
does the naive byte-to-code-point conversion, i.e. Latin-1 decoding,
so each multi-byte UTF-8 sequence turns into several wrong characters.
By downloading the data as bytes (i.e., a ByteString), you can then
explicitly state that you want UTF-8 decoding instead of Latin-1.
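
To make the difference concrete, here is a minimal sketch that decodes
the same two bytes both ways. It assumes the utf8-string package (the
one that provides the Data.ByteString.Lazy.UTF8 module used in the
quoted code); the names utf8Bytes, latin1Decoded and utf8Decoded are
made up for the illustration. Data.ByteString.Char8.unpack maps each
byte to one Char, which is what Latin-1 decoding amounts to.

```haskell
import qualified Data.ByteString       as BS
import qualified Data.ByteString.Char8 as BSC
import qualified Data.ByteString.UTF8  as BSU

-- U+00E9 (e with acute accent) encoded as UTF-8 is the two bytes C3 A9.
utf8Bytes :: BS.ByteString
utf8Bytes = BS.pack [0xC3, 0xA9]

-- Byte-to-code-point conversion (Latin-1 style): one Char per byte.
latin1Decoded :: String
latin1Decoded = BSC.unpack utf8Bytes

-- Proper UTF-8 decoding: the two bytes form a single code point.
utf8Decoded :: String
utf8Decoded = BSU.toString utf8Bytes

main :: IO ()
main = do
  print (map fromEnum latin1Decoded)  -- [195,169]: two wrong characters
  print (map fromEnum utf8Decoded)    -- [233]: the intended character
```

The first result is the kind of garbage you were seeing; the second is
what you get once the bytes are decoded as UTF-8.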

It would be entirely possible to write an HTTP library that does this
automatically, but it would be inherently limited to a single
character encoding. By dealing directly with bytestrings, you can work
with any character encoding, as well as binary data such as images,
which have no character encoding at all.
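
As a rough sketch of what "any character encoding" can look like in
practice, the following picks a decoder after the bytes have arrived.
It assumes the text package (not used in the code above); Charset and
decodeBody are hypothetical names for this example, and a real program
would take the charset from the response's Content-Type header rather
than hard-coding it.

```haskell
import qualified Data.ByteString    as BS
import qualified Data.Text          as T
import qualified Data.Text.Encoding as TE

-- Hypothetical tag for the charset the server announced.
data Charset = Utf8 | Latin1 | Utf16LE
  deriving (Eq, Show)

-- Choose the decoder once the raw bytes are in hand.
-- Note: decodeUtf8 and decodeUtf16LE throw an exception on invalid input.
decodeBody :: Charset -> BS.ByteString -> T.Text
decodeBody Utf8    = TE.decodeUtf8
decodeBody Latin1  = TE.decodeLatin1
decodeBody Utf16LE = TE.decodeUtf16LE

main :: IO ()
main = do
  let expected = T.pack "\233"  -- U+00E9
  -- Three different byte sequences, one per encoding, same character.
  print (decodeBody Utf8    (BS.pack [0xC3, 0xA9]) == expected)
  print (decodeBody Latin1  (BS.pack [0xE9])       == expected)
  print (decodeBody Utf16LE (BS.pack [0xE9, 0x00]) == expected)
```

All three comparisons print True. A library that hands you a String
has already made this choice for you.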

Michael

