Adding Network.URI.escape

Mon Jan 4 08:35:16 EST 2010

Gwern Branwen wrote:
> On Fri, Dec 25, 2009 at 4:17 PM, Graham Klyne <GK-lists at ninebynine.org> wrote:
>> Gwern Branwen wrote:
>>> Network.URI.escapeURIString is pretty much always used to make a
>>> String a URL or a part of a URL.
>>>
>>> The existing definition
>>>
>>> http://www.haskell.org/ghc/docs/6.10.4/html/libraries/network/Network-URI.html#v%3AescapeURIString
>>> forces one to do extra work by having to specify a `Char -> Bool`.
>>>
>>> More than a few packages & libraries simply define an 'escape'
>>> function `escapeURIString isAllowedInURI` (either inline or as a named
>>> function). This sort of repetition is unfortunate.
>> Hmmm... I think that's not strictly correct - it should be 'escapeURIString
>> isUnescapedInURI'.  The form used above would leave literal '%' characters
>> unescaped.
> 
> That's unfortunate! But it also takes care of a long-niggling worry -
> I had come across an old #haskell log where someone said that that
> definition is wrong, but they didn't explain how. I guess I ought to
> go around to every user of that definition, like Gitit, and correct
> them...
> 
>>> The name 'escape' is commonly used to express exactly that
>>> functionality: http://holumbus.fh-wedel.de/hayoo/hayoo.html#0:escape
>>>
>>> What would people say to adding such a function?
>> The reason that the 'escapeURIString' always takes the Char -> Bool function
>> is that the rules for escaping can vary between URI schemes, and between
>> components within a single URI.  For example, a literal '/' or '?' appearing
>> within a path segment in an http: URI would need to be escaped, but that's
>> not included by the common case of 'escapeURIString isUnescapedInURI'.
>>
>> The 'isAllowedInURI' function, IIRC, is a kind of least-common-denominator
>> function that causes non-URI characters to be escaped so that the resulting
>> string is at least syntactically valid according to RFC 3986.  But in some
>> cases (i.e. for some schemes) this may not be enough - see RFC 3986, section
>> 2.1 ("A percent-encoding mechanism is used to represent a data octet in a
>> component when that octet's corresponding character is outside the allowed
>> set or is being used as a delimiter of, or within, the component" --
>> http://www.apps.ietf.org/rfc/rfc3986.html#sec-2.1 ); see also section 2.4.
>>
>> So, while one could define an additional function as you suggest, I'm not
>> sure it is necessarily wise, because having the explicit function to
>> designate characters to be escaped does at least draw attention to exactly
>> which characters would be escaped in the context of use.  But OTOH, if
>> implementations tend to use 'escapeURIString isAllowedInURI' as you say,
>> maybe this just creates an opportunity for additional errors.
>>
>> URI escaping is, to some extent, a necessarily messy and error-prone
>> business - it's really hard to define a generic escaping mechanism that
>> neatly covers all eventualities, because of the multiple stages of
>> interpretation that can take place when actually using a URI.
>>
>> #g
> 
> Thanks for the information; I start to see what you mean by the
> difficulty. But as you say, while an 'escape' may be dangerous, it's
> not like people are being safe now without an 'escape'.
> 
> Is it possible to identify the most common escaping scenarios and come
> up with the correct shortcuts?
> 
> For example, perhaps we could define an 'escapeURL = escapeURIString
> isUnescapedInURI' which is suitable for garden-variety tasks like
> `"http://gitit.net"++escapeURL pagename`, and then another for the
> octets you mention ('escapeOctet'?).

It's clearly *possible*, but where do we stop?  That said, I guess a *small*
number of special cases would make sense, e.g.

   escapeHttpOrFileUri

with carefully written health warnings in the associated documentation; e.g.
"This function applies URI escaping to an http: or file: URI on the assumption
that the individual path segments within the URI do not contain '/' or '?' or
'#' or [...] characters.  If any of these characters are present in any path
segment then the URI components and path segments should be escaped separately
before being assembled into a final URI, and no further escaping should be
applied once the URI has been constructed (cf. RFC 3986 [...])", etc., etc.
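
A minimal sketch of the "escape the path segments separately" approach that
such a warning describes. The names escapeSegment and buildPath are
hypothetical, not part of Network.URI, and escaping everything outside the
RFC 3986 "unreserved" set is a deliberately conservative assumption that
over-escapes some characters a segment may legally carry. The input is assumed
to be raw text that has not already been percent-encoded.

```haskell
import Network.URI (escapeURIString, isUnreserved)

-- | Hypothetical helper: escape a single raw path segment.
-- Only RFC 3986 "unreserved" characters are left alone, so a literal
-- '/', '?', '#' or '%' inside the segment is percent-encoded.
escapeSegment :: String -> String
escapeSegment = escapeURIString isUnreserved

-- | Hypothetical helper: assemble an absolute path from raw segments.
-- Each segment is escaped on its own; the '/' separators are added
-- afterwards and no further escaping is applied to the result.
buildPath :: [String] -> String
buildPath = concatMap (('/' :) . escapeSegment)
```

For example, buildPath ["docs", "a/b?c"] gives "/docs/a%2Fb%3Fc", whereas
applying 'escapeURIString isUnescapedInURI' to "/docs/a/b?c" would return it
unchanged and so alter the path structure.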

My point here is that crafting a clear description of when the provided escaping
is correct to use will be somewhat harder than writing the functions.  Also, I
suspect that the escaping function will need to be a little more subtle than
'escapeURIString isUnescapedInURI', at least to the extent of splitting off the
query and fragment before escaping the pieces separately, then re-assembling.
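
The splitting and re-assembling step could be sketched roughly as follows.
This is an illustration under stated assumptions, not a vetted implementation:
splitQueryFragment is a hypothetical helper, the body given here for
escapeHttpOrFileUri is only one guess at what such a function might do, and
the input is assumed to be raw (not already percent-encoded) with its first
'#' and first '?' acting as the real delimiters.

```haskell
import Network.URI (escapeURIString, isUnescapedInURI)

-- | Hypothetical helper: split into (part before the query, query including
-- its leading '?', fragment including its leading '#').  The fragment is
-- split off first, at the first '#'; the query then starts at the first '?'
-- of what remains.
splitQueryFragment :: String -> (String, String, String)
splitQueryFragment s = (base, query, frag)
  where
    (rest, frag)  = break (== '#') s
    (base, query) = break (== '?') rest

-- | Sketch only: escape each piece separately, then re-assemble.
escapeHttpOrFileUri :: String -> String
escapeHttpOrFileUri s = esc base ++ esc query ++ escFrag frag
  where
    (base, query, frag) = splitQueryFragment s
    esc = escapeURIString isUnescapedInURI
    -- Keep the fragment's leading '#', but escape any later '#',
    -- which cannot appear literally inside a fragment.
    escFrag ""       = ""
    escFrag (d:rest) = d : escapeURIString inFragment rest
    inFragment c = isUnescapedInURI c && c /= '#'
```

With these definitions, escapeHttpOrFileUri "http://example.org/a b?q=x y#sec#2"
yields "http://example.org/a%20b?q=x%20y#sec%232". Path segments containing a
literal '/' or '?' are still not handled; those need escaping before assembly.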

Maybe the greatest value in doing this would be to demonstrate concretely the
complexities inherent in URI escaping, and provide some code that can be adapted
for different schemes and circumstances.

#g
--