[web-devel] Type-safe URL handling
jeremy at n-heptane.com
Mon Mar 22 22:37:04 EDT 2010
On Mon, Mar 22, 2010 at 9:11 PM, Michael Snoyman <michael at snoyman.com> wrote:
> On Mon, Mar 22, 2010 at 4:20 PM, Jeremy Shaw <jeremy at n-heptane.com> wrote:
>> On Sun, Mar 21, 2010 at 12:04 AM, Michael Snoyman <michael at snoyman.com> wrote:
>>> That made perfect sense, thank you for doing such thorough research on
>>> I've attached two files; test1.html is UTF-8 encoded, test3.html is
>>> windows-1255 (Hebrew). On my system, both links point to the same location,
>>> implying to me that you are spot on that UTF-8 should always be used for
>>> URLs. I had made a mistake with my test on Friday; apparently we only have
>>> the encoding issue with the query string.
>> Hmm. Those files do not contain valid urls. The strings in the hrefs
>> contain characters that are not in the limited set allowed by the URI spec.
>> The part that is true is that even though the files have different encodings
>> (utf-8 vs windows-1255) the characters in the strings are the same, so the
>> urls are the same. I guess maybe the reason you put in invalid characters is
>> because it is hard to test whether different encodings matter if you are
>> only testing characters that are represented by the same octets in both
> Well, you guessed correctly at my reason for constructing the files as I
> did. Not that this is actually relevant to the discussion at hand, I believe that
> it is valid HTML to put values in the HREF fields that are not in the
> appropriate character range and assume the web browser will take care of
> things. </off-topic>
I believe the HTML 4.0 spec explicitly states that it is illegal here (though it
recommends that user agents do something sensible anyway):
>> The big trip up would be forms with method GET. The form submission is
>> handled by taking the form set data, encoding it as
>> application/x-www-form-urlencoded, and then append ? and the encoded data to
>> the end of the action. If the action already contained a ?, that would not
>> work out.
> You can't have a URL containing a ?; the closest you can come is a URL
> containing an *escaped* ?, which will simply be absorbed by the [String]
> piece of the URL. Unless I'm missing your point here.
What I meant is that if the url supplied to the action already had a query
string, then something undesirable would probably happen.
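To make the concern concrete, here is a minimal sketch of the naive submission step described above. The names formEncode and naiveGetUrl are hypothetical, the set of unescaped characters is a conservative assumption, and UTF-8 is assumed for the form data (a real browser may use the page's encoding instead).

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (intToDigit, isAlphaNum, isAscii, ord, toUpper)
import Data.List (intercalate)

-- UTF-8 encode a single character into its octets (assumed encoding).
utf8 :: Char -> [Int]
utf8 c
  | n < 0x80    = [n]
  | n < 0x800   = [0xC0 .|. shiftR n 6, cont n]
  | n < 0x10000 = [0xE0 .|. shiftR n 12, cont (shiftR n 6), cont n]
  | otherwise   = [0xF0 .|. shiftR n 18, cont (shiftR n 12), cont (shiftR n 6), cont n]
  where
    n = ord c
    cont x = 0x80 .|. (x .&. 0x3F)

-- Render one octet as %XX with uppercase hex digits.
hex2 :: Int -> String
hex2 n = ['%', hexDigit (shiftR n 4), hexDigit (n .&. 0xF)]
  where hexDigit = toUpper . intToDigit

-- Encode one name or value: space becomes '+', a small safe set is kept,
-- everything else is percent-escaped.
formEncodeValue :: String -> String
formEncodeValue = concatMap enc
  where
    enc ' ' = "+"
    enc c
      | (isAscii c && isAlphaNum c) || c `elem` "-_.*" = [c]
      | otherwise = concatMap hex2 (utf8 c)

-- Hypothetical helper: name=value pairs joined with '&'.
formEncode :: [(String, String)] -> String
formEncode fields =
  intercalate "&" [formEncodeValue k ++ "=" ++ formEncodeValue v | (k, v) <- fields]

-- Hypothetical helper: append '?' and the encoded data to the action, nothing more.
naiveGetUrl :: String -> [(String, String)] -> String
naiveGetUrl action fields = action ++ "?" ++ formEncode fields
```

With an action of "/search?lang=he" and a single field q = "a b", this sketch yields "/search?lang=he?q=a+b", which has two ? characters and is the undesirable case.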
>> encodePathInfo / decodePathInfo is probably a good choice of names. Adding
>> them to web-encodings is likely useful, but I will just use local copies in
>> urlt, because web-encodings brings in too many extra dependencies that I
>> don't want at that level. I don't think I will export them though, so it
>> should not cause a conflict.
> I have no problem with that decision, but out of curiosity which
> dependencies are problematic? The only non-HP packages are failure, safe,
> text and wai. The only ones which could in theory be eliminated are failure
> and safe; if there is desire for me to do so, I'll look into it.
Well, I see no reason to make all of urlt depend on failure, safe, text,
wai, and web-encodings when two small local functions would do the trick.
Using the functions from web-encodings would not really increase
compatibility / interoperability in any way, and I don't expect a lot of bug
fixes that will have to be applied to multiple locations.
Remember that I plan to split urlt up into a few pieces soon. I don't want
happstack users complaining they have to install wai, or wai users
complaining they have to install happstack. Even if happstack is ported to
wai, there are extra layers that happstack adds which might benefit from
some extra functions in urlt.
>> Also, my implementation is not quite right. It escapes more characters than
>> is strictly required. path segments have the following ABNF:
>> path_segments = segment *( "/" segment )
>> segment = *pchar *( ";" param )
>> param = *pchar
>> pchar = unreserved | escaped |
>> ":" | "@" | "&" | "=" | "+" | "$" | ","
>> Also, . and .. are allowed in a path segment, but have special meaning.
>> Not sure what we want to do about those. I like the property that *any*
>> String value is automatically escaped and has no special meaning. So the
>> same should be true for '.' and '..'. But if you do need to use '.' and '..'
>> for some reason, there is no mechanism to do it in the current system.
>> Though I am not sure what a compelling use case would be, so I am ok with
>> just not allowing them for now.
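As an illustration of the ABNF above, a minimal sketch of an encoder that escapes only what pchar does not allow might look like the following. This is not the urlt implementation. It assumes UTF-8 for non-ASCII characters and takes unreserved to be alphanumerics plus the marks - _ . ! ~ * ' ( ) as in RFC 2396. The helper names isPchar and encodeSegment are hypothetical. Since every String is meant to be taken literally, '%', ';' and '/' are always escaped.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (intToDigit, isAlphaNum, isAscii, ord, toUpper)
import Data.List (intercalate)

-- UTF-8 encode a single character into its octets (assumed encoding).
utf8 :: Char -> [Int]
utf8 c
  | n < 0x80    = [n]
  | n < 0x800   = [0xC0 .|. shiftR n 6, cont n]
  | n < 0x10000 = [0xE0 .|. shiftR n 12, cont (shiftR n 6), cont n]
  | otherwise   = [0xF0 .|. shiftR n 18, cont (shiftR n 12), cont (shiftR n 6), cont n]
  where
    n = ord c
    cont x = 0x80 .|. (x .&. 0x3F)

-- Render one octet as %XX with uppercase hex digits.
hex2 :: Int -> String
hex2 n = ['%', hexDigit (shiftR n 4), hexDigit (n .&. 0xF)]
  where hexDigit = toUpper . intToDigit

-- Characters that may appear literally in a segment:
-- unreserved (alphanum and marks) plus the extra pchar characters.
isPchar :: Char -> Bool
isPchar c =
  (isAscii c && isAlphaNum c) || c `elem` "-_.!~*'()" || c `elem` ":@&=+$,"

-- Escape everything in one segment that is not a literal pchar.
encodeSegment :: String -> String
encodeSegment = concatMap enc
  where
    enc c
      | isPchar c = [c]
      | otherwise = concatMap hex2 (utf8 c)

-- Join the escaped segments with '/'.
encodePathInfo :: [String] -> String
encodePathInfo = intercalate "/" . map encodeSegment
```

Note that this sketch leaves '.' unescaped because it is an unreserved character, so the segments "." and ".." pass through unchanged; it does not settle the question of what to do with them.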
> I'm not sure if they have meaning at the HTTP level. At the HTML level,
> they specify relative paths, but I don't think they mean anything once it
> enters HTTP.
I'm not sure what to do with this information. It is true that they may be
normalized by the browser before they are passed to the server. But urlt is
being used primarily to create URLs that will be used in HTML pages. So, I
think we will still have to decide what to do with them. Also, we shouldn't
assume that the client normalized the .. stuff. Perhaps a malicious client
won't in the hopes that it can retrieve
http://example.com/../../../../etc/passwd or something.
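One possible policy, sketched here purely as an assumption and not something settled in this thread, is for the server side to refuse any request whose path contains a "." or ".." segment. The function name rejectDotSegments is hypothetical, and the sketch assumes the segments have already been split on '/' and percent-decoded, so that an escaped "%2E%2E" is caught as well.

```haskell
-- Return the segments unchanged if none of them is "." or "..",
-- otherwise refuse the whole path.
-- Assumes the input segments are already percent-decoded.
rejectDotSegments :: [String] -> Maybe [String]
rejectDotSegments segs
  | any isDotSegment segs = Nothing
  | otherwise             = Just segs
  where
    isDotSegment s = s == "." || s == ".."
```

For the example above, the segments ["..", "..", "..", "..", "etc", "passwd"] would be refused, while ["etc", "passwd"] would be passed through as they are.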