[Haskell-cafe] UTF-8 problems when decoding JSON data coming from Network.HTTP

Ionut G. Stan ionut.g.stan at gmail.com
Sun Oct 17 08:57:06 EDT 2010


On 17/Oct/10 3:37 PM, Michael Snoyman wrote:
> On Sun, Oct 17, 2010 at 2:26 PM, Ionut G. Stan <ionut.g.stan at gmail.com> wrote:
>> Thanks Michael, now it works indeed. But I don't understand, is there any
>> inherent problem with Haskell's built-in String? Should one choose
>> ByteString when dealing with Unicode stuff? Or, is there any resource that
>> describes in one place all the problems Haskell has with Unicode?
>
> There's no problem with String; you just need to remember what it
> means. A String is a list of Chars, and a Char is a Unicode code point.
> On the other hand, the HTTP protocol deals with *bytes*, not Unicode
> code points. In order to convert between the two, you need some type of
> encoding; in the case of JSON, I believe this is UTF-8 by default.
>
> The problem for you is that the HTTP package does *not* perform UTF-8
> decoding of the raw bytes sent over the network. Instead, I believe it
> is doing the naive byte-to-code-point conversion, aka Latin-1 decoding.
> By downloading the data as bytes (i.e., a ByteString), you can then
> explicitly state that you want to do UTF-8 decoding instead of
> Latin-1.
>
> It would be entirely possible to write an HTTP library that does this
> automatically, but it would be inherently limited to a single
> encoding. By dealing directly with bytestrings, you can work with any
> character encoding, as well as binary data such as images, which do
> not have any character encoding.

OK, I think I understand now. I was under the assumption that the
Network.HTTP package would take a look at the Content-Type header and do
a behind-the-scenes conversion when decoding those bytes.
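
For the archive, here is a minimal sketch of the difference Michael
describes, as far as I understand it. It assumes the bytestring and text
packages and does not touch the HTTP package at all; the names rawBytes,
asLatin1 and asUtf8 are made up for illustration, and the bytes are
hard-coded rather than fetched from a server.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical payload: the UTF-8 encoding of "café".
-- The "é" (U+00E9) takes two bytes in UTF-8: 0xC3 0xA9.
rawBytes :: B.ByteString
rawBytes = B.pack [0x63, 0x61, 0x66, 0xC3, 0xA9]

-- Naive byte-to-code-point conversion (Latin-1):
-- every byte becomes exactly one Char.
asLatin1 :: B.ByteString -> String
asLatin1 = BC.unpack

-- Explicit UTF-8 decoding: multi-byte sequences collapse into one Char.
-- Note that decodeUtf8 throws an exception on invalid input.
asUtf8 :: B.ByteString -> String
asUtf8 = T.unpack . TE.decodeUtf8

main :: IO ()
main = do
  print (length (asLatin1 rawBytes))   -- 5 Chars, the last two are mojibake
  print (length (asUtf8 rawBytes))     -- 4 Chars
  print (asUtf8 rawBytes == "caf\xE9") -- True
```

The same five bytes give a different String depending on which decoding
is applied, which is why getting the response as a ByteString and
decoding it explicitly fixed my problem.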

Thanks for your help.

-- 
Ionuț G. Stan  |  http://igstan.ro

