HTTP and character encodings
ganesh at earth.li
Tue Sep 11 00:22:33 CEST 2012
tl;dr: I'd like to remove the String instances from the HTTP package.
The HTTP library is overloaded on the type for request and response
bodies; there are instances for String and both strict and lazy ByteStrings.
Unfortunately, the String instance is rather broken. A String ought to
represent Unicode data, but the HTTP wire format is bytes, and HTTP
makes no attempt to handle encoding.
In particular uploaded data (e.g. in POSTs) gets silently truncated and
downloaded data is improperly embedded as one byte per character no
matter what encoding the server advertises in the Content-Type header.
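
To make the two failure modes concrete, here is a minimal sketch. It does not use HTTP's own code; it assumes Data.ByteString.Char8 as a stand-in for "one byte per character" handling, and uses the text package to show what a UTF-8 encoding would look like. The names sample, truncated, utf8 and misdecoded are made up for the illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- "café €": U+00E9 fits in one byte, U+20AC does not.
sample :: String
sample = "caf\233 \8364"

-- Upload side: Char8.pack keeps only the low 8 bits of each Char,
-- so the euro sign (0x20AC) silently becomes the byte 0xAC.
truncated :: [Word8]
truncated = B.unpack (BC.pack sample)

-- What a UTF-8 encoding of the same String looks like on the wire.
utf8 :: B.ByteString
utf8 = TE.encodeUtf8 (T.pack sample)

-- Download side: reading UTF-8 bytes as one Char per byte gives mojibake.
misdecoded :: String
misdecoded = BC.unpack utf8

main :: IO ()
main = do
  print truncated        -- [99,97,102,233,32,172]
  print (B.unpack utf8)  -- [99,97,102,195,169,32,226,130,172]
  print misdecoded       -- "caf\195\169 \226\130\172"
```

Neither direction raises an error, which is why the breakage goes unnoticed.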
I've spent a while investigating the option of making HTTP encode and
decode Strings appropriately, but my tentative conclusion is that it's
not practical in the short term:
- on upload we'd have to pick an encoding by default - probably UTF-8 -
and also add it to the Content-Type header which may involve messing
with any header supplied by the user. If the user supplied a different
encoding in Content-Type then we would probably need to notice and
respect it.
- on upload Content-Length may also need to be managed somehow.
- on download we'd need to be able to handle at least common encodings
that the server might send, but on Windows even common encodings like
iso-8859-* don't exist and there aren't always appropriate substitutes.
- on download we'd also really want to parse HTML/XML documents looking
for in-document specifications of the encoding in META tags and XML
declarations (see http://www.w3.org/QA/2008/03/html-charset.html)
- we'd need to also parse Content-Type to detect when the data is
supposed to be binary, and then check that it is actually 8-bit clean on
upload. If the user doesn't supply Content-Type at all, then what?
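
As a rough idea of the smallest piece of the Content-Type handling described in the points above, here is a sketch that pulls the charset parameter out of a header value using only base. The function charsetOf and its helpers are hypothetical, not part of HTTP, and the parsing is deliberately naive: it does not cope with quoted parameter values that contain a semicolon.

```haskell
import Data.Char (isSpace, toLower)
import Data.List (isPrefixOf)
import Data.Maybe (listToMaybe)

-- Split a string on a separator character.
splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
  (a, [])       -> [a]
  (a, _ : rest) -> a : splitOn c rest

-- Drop leading and trailing whitespace.
trim :: String -> String
trim = f . f
  where f = reverse . dropWhile isSpace

-- Hypothetical helper: the lower-cased charset parameter of a
-- Content-Type value, if there is one.
charsetOf :: String -> Maybe String
charsetOf header = listToMaybe
  [ filter (/= '"') (drop (length key) p')
  | p <- drop 1 (splitOn ';' header)   -- skip the media type itself
  , let p' = map toLower (trim p)
  , key `isPrefixOf` p'
  ]
  where key = "charset="

main :: IO ()
main = do
  print (charsetOf "text/html; charset=UTF-8")  -- Just "utf-8"
  print (charsetOf "application/octet-stream")  -- Nothing
```

Even with this in place, the Nothing case still leaves open which default to assume, and the result says nothing about encodings declared inside the document itself.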
I think the right way to do this would be to have proper high-level and
low-level APIs where only the high-level API supports strings but also
does a lot more active management of standard HTTP headers like
Content-Type/Content-Length. But HTTP as it stands is a long way from
doing that and a short-term fix is needed.
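
To sketch what such a split might look like: the low-level layer deals only in bytes, and a high-level function accepts a String, picks UTF-8, and sets the headers to match. Everything here (RawRequest, textRequest) is a hypothetical illustration rather than HTTP's actual API, and it assumes the bytestring and text packages.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical low-level request: headers plus a body of raw bytes.
data RawRequest = RawRequest
  { rawHeaders :: [(String, String)]
  , rawBody    :: B.ByteString
  } deriving (Eq, Show)

-- Hypothetical high-level entry point: encode the String as UTF-8 and
-- manage Content-Type and Content-Length on the caller's behalf.
textRequest :: String -> String -> RawRequest
textRequest mediaType body = RawRequest
  { rawHeaders =
      [ ("Content-Type", mediaType ++ "; charset=utf-8")
      , ("Content-Length", show (B.length bytes))  -- bytes, not Chars
      ]
  , rawBody = bytes
  }
  where
    bytes = TE.encodeUtf8 (T.pack body)

main :: IO ()
main = print (textRequest "text/plain" "caf\233")
```

For the four-character body "café" the Content-Length comes out as 5, since the length has to be computed after encoding rather than from the String.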
So I'm reluctantly drawn to the conclusion that the only reasonable
thing to do is to remove the String instances from HTTP completely for now.
I imagine this could be quite disruptive, but on the other hand people
using the String instance are getting silently broken behaviour and a
couple of people have been bitten by this recently.