HTTP and character encodings
Christian.Maeder at dfki.de
Tue Sep 11 10:30:54 CEST 2012
On 11.09.2012 00:22, Ganesh Sittampalam wrote:
> tl;dr: I'd like to remove the String instances from the HTTP package.
> The HTTP library is overloaded on the type for request and response
> bodies; there are instances for String and both strict and lazy ByteStrings.
> Unfortunately, the String instance is rather broken. A String ought to
> represent Unicode data, but the HTTP wire format is bytes, and HTTP
> makes no attempt to handle encoding.
If you remove the String instance, I would need to encode my strings
manually (and would perhaps do it worse than it is done now).
Which instance does the package cabal-install use?
Which alternative (better maintained) packages could I use if I have to
change my code anyway?
The header of Network.HTTP contains a "Portability" saying "non-portable
(not tested)", but the package contains a test-suite.
Are tests (or their lack) a portability issue?
(I've seen packages claiming portability while using plenty of GHC extensions,
which probably only work for certain GHC versions on a few architectures.)
> In particular uploaded data (e.g. in POSTs) gets silently truncated and
> downloaded data is improperly embedded as one byte per character no
> matter what encoding the server advertises in the Content-Type header.
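To make the two failure modes concrete, here is a minimal sketch using only base. It assumes that "truncated" means keeping the low 8 bits of each code point, and that download maps each byte to the character with the same number; the helper names truncateToBytes and bytesToString are hypothetical and are not part of the HTTP package.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical helper (not from the HTTP package): keep only the low
-- 8 bits of each code point. fromIntegral from Int to Word8 wraps
-- modulo 256, so anything above U+00FF is silently mangled.
truncateToBytes :: String -> [Word8]
truncateToBytes = map (fromIntegral . ord)

-- Hypothetical helper: turn each received byte into one Char,
-- ignoring whatever charset the Content-Type header declared.
bytesToString :: [Word8] -> String
bytesToString = map (chr . fromIntegral)
```

Under these assumptions, the euro sign (U+20AC) is sent as the single byte 0xAC, and a UTF-8 encoded 'é' (bytes 0xC3 0xA9) is received as the two characters U+00C3 and U+00A9 rather than one.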
> I've spent a while investigating the option of making HTTP encode and
> decode Strings appropriately, but my tentative conclusion is that it's
> too hard:
> - on upload we'd have to pick an encoding by default - probably UTF-8 -
> and also add it to the Content-Type header which may involve messing
> with any header supplied by the user. If the user supplied a different
> encoding in Content-Type then we probably would need to notice and
> respect that.
> - on upload Content-Length may also need to be managed somehow.
> - on download we'd need to be able to handle at least common encodings
> that the server might send, but on Windows even common encodings like
> iso-8859-* don't exist and there aren't always appropriate substitutes.
> - on download we'd also really want to parse HTML/XML documents looking
> for in-document specifications of the encoding in META tags and XML
> declarations (see http://www.w3.org/QA/2008/03/html-charset.html)
> - we'd need to also parse Content-Type to detect when the data is
> supposed to be binary, and then check that it is actually 8-bit clean on
> upload. If the user doesn't supply Content-Type at all, then what?
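As an illustration of the upload side only, here is a rough sketch of what encoding a String body as UTF-8 and deriving matching headers could involve. It uses only base; the names encodeChar, encodeUtf8 and bodyHeaders are hypothetical, headers are modelled as plain pairs rather than the HTTP package's own types, and the text/plain media type is an assumption. Real code would more likely take the encoder from a library such as text or utf8-string.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one character as UTF-8 (1 to 4 bytes).
-- Surrogate code points are not rejected in this sketch.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [lead 0xC0 6, cont 0]
  | n < 0x10000 = [lead 0xE0 12, cont 6, cont 0]
  | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading byte: length tag plus the top bits of the code point.
    lead tag s = fromIntegral (tag .|. (n `shiftR` s))
    -- Continuation byte: 10xxxxxx carrying six bits of the code point.
    cont s = fromIntegral (0x80 .|. ((n `shiftR` s) .&. 0x3F))

-- Hypothetical helper: encode a whole String as UTF-8 bytes.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeChar

-- Hypothetical helper: headers that agree with the encoded body.
-- Content-Length counts bytes after encoding, not characters.
bodyHeaders :: String -> [(String, String)]
bodyHeaders body =
  [ ("Content-Type", "text/plain; charset=utf-8")
  , ("Content-Length", show (length (encodeUtf8 body)))
  ]
```

For the four-character string "café" this yields a Content-Length of 5, because 'é' takes two bytes. The sketch does not address the harder points in the list above: merging with a user-supplied Content-Type, binary bodies, or decoding on download.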
> I think the right way to do this would be to have proper high-level and
> low-level APIs where only the high-level API supports strings but also
> does a lot more active management of standard HTTP headers like
> content-type/content-length. But HTTP as it stands is a long way from
> doing that and a short-term fix is needed.
> So I'm reluctantly drawn to the conclusion that the only reasonable
> thing to do is to remove the String instances from HTTP completely for now.
> I imagine this could be quite disruptive, but on the other hand people
> using the String instance are getting silently broken behaviour and a
> couple of people have been bitten by this recently.
> Any thoughts?