Adding Network.URI.escape

Fri Dec 25 16:19:09 EST 2009

Gwern Branwen wrote:
> Network.URI.escapeURIString is pretty much always used to make a
> String a URL or a part of a URL.
> 
> The existing definition
> http://www.haskell.org/ghc/docs/6.10.4/html/libraries/network/Network-URI.html#v%3AescapeURIString
> forces one to do extra work by having to specify a `Char -> Bool`.
> 
> More than a few packages & libraries simply define an 'escape'
> function `escapeURIString isAllowedInURI` (either inline or as a named
> function). This sort of repetition is unfortunate.

Hmmm... I think that's not strictly correct - it should be 'escapeURIString
isUnescapedInURI'.  The form used above would leave literal '%' characters
unescaped.
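
A minimal sketch of the difference, assuming the Network.URI module from the
network package. The names 'escapeAllowed' and 'escapeUnescaped' are made up
for illustration and are not part of the library:

```haskell
import Network.URI (escapeURIString, isAllowedInURI, isUnescapedInURI)

-- Hypothetical helper names, for illustration only.

-- isAllowedInURI accepts '%', so a literal '%' passes through untouched.
escapeAllowed :: String -> String
escapeAllowed = escapeURIString isAllowedInURI

-- isUnescapedInURI rejects '%', so a literal '%' is escaped as "%25".
escapeUnescaped :: String -> String
escapeUnescaped = escapeURIString isUnescapedInURI
```

Given the input "100% sure", 'escapeAllowed' yields "100%%20sure" (not a valid
percent-encoding), while 'escapeUnescaped' yields "100%25%20sure".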

> The name 'escape' is commonly used to express exactly that
> functionality: http://holumbus.fh-wedel.de/hayoo/hayoo.html#0:escape
> 
> What would people say to adding such a function?

The reason that 'escapeURIString' always takes the Char -> Bool function is
that the rules for escaping can vary between URI schemes, and between components
within a single URI.  For example, a literal '/' or '?' appearing within a path
segment in an http: URI would need to be escaped, but that's not included by the
common case of 'escapeURIString isUnescapedInURI'.
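
To illustrate the path-segment case, here is a rough sketch, again assuming
Network.URI from the network package. 'escapeSegment' is a hypothetical name,
and using 'isUnreserved' as the predicate is a deliberately conservative
choice: it escapes more than RFC 3986 strictly requires within a segment.

```haskell
import Network.URI (escapeURIString, isUnescapedInURI, isUnreserved)

-- Hypothetical helper: escape a single path segment, leaving only the
-- RFC 3986 "unreserved" characters (letters, digits, '-', '.', '_', '~')
-- as they are. Delimiters such as '/' and '?' are percent-encoded.
escapeSegment :: String -> String
escapeSegment = escapeURIString isUnreserved

-- The common whole-URI form, for comparison: reserved characters,
-- including '/' and '?', are left alone.
escapeWhole :: String -> String
escapeWhole = escapeURIString isUnescapedInURI
```

For the input "a/b?c", 'escapeWhole' returns the string unchanged, whereas
'escapeSegment' returns "a%2Fb%3Fc".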

The 'isAllowedInURI' function, IIRC, is a kind of least-common-denominator
function that causes non-URI characters to be escaped so that the resulting
string is at least syntactically valid according to RFC 3986.  But in some cases
(i.e. for some schemes) this may not be enough - see RFC 3986, section 2.1 ("A
percent-encoding mechanism is used to represent a data octet in a component when
that octet's corresponding character is outside the allowed set or is being used
as a delimiter of, or within, the component" --
http://www.apps.ietf.org/rfc/rfc3986.html#sec-2.1 ); see also section 2.4.

So, while one could define an additional function as you suggest, I'm not sure
it is necessarily wise, because having the explicit function to designate
characters to be escaped does at least draw attention to exactly which
characters would be escaped in the context of use.  But OTOH, if implementations
tend to use 'escapeURIString isAllowedInURI' as you say, maybe this just creates
an opportunity for additional errors.

URI escaping is, to some extent, a necessarily messy and error-prone business -
it's really hard to define a generic escaping mechanism that neatly covers all
eventualities, because of the multiple stages of interpretation that can take
place when actually using a URI.

#g