[web-devel] Type-safe URL handling
michael at snoyman.com
Mon Mar 22 22:41:51 EDT 2010
On Mon, Mar 22, 2010 at 7:37 PM, Jeremy Shaw <jeremy at n-heptane.com> wrote:
> On Mon, Mar 22, 2010 at 9:11 PM, Michael Snoyman <michael at snoyman.com> wrote:
>> On Mon, Mar 22, 2010 at 4:20 PM, Jeremy Shaw <jeremy at n-heptane.com> wrote:
>>> On Sun, Mar 21, 2010 at 12:04 AM, Michael Snoyman <michael at snoyman.com> wrote:
>>>> That made perfect sense, thank you for doing such thorough research on
>>>> I've attached two files; test1.html is UTF-8 encoded, test3.html is
>>>> windows-1255 (Hebrew). On my system, both links point to the same location,
>>>> implying to me that you are spot on that UTF-8 should always be used for
>>>> URLs. I had made a mistake with my test on Friday; apparently we only have
>>>> the encoding issue with the query string.
>>> Hmm. Those files do not contain valid URLs. The strings in the hrefs
>>> contain characters that are not in the limited set allowed by the URI spec.
>>> The part that is true is that even though the files have different encodings
>>> (utf-8 vs windows-1255) the characters in the strings are the same, so the
>>> urls are the same. I guess maybe the reason you put in invalid characters is
>>> because it is hard to test whether different encodings matter if you are
>>> only testing characters that are represented by the same octets in both
>>> Well, you guessed correctly at my reason for constructing the files as I
>> did. Not that this is actually relevant to the discussion at hand, but I believe that
>> it is valid HTML to put values in the HREF fields that are not in the
>> appropriate character range and assume the web browser will take care of
>> things. </off-topic>
> I believe the HTML 4.0 spec explicitly states that it is illegal here (though it
> recommends that user agents do something sensible anyway):
> The big trip up would be forms with method GET. The form submission is
>>> handled by taking the form data set, encoding it as
>>> application/x-www-form-urlencoded, and then appending ? and the encoded data to
>>> the end of the action. If the action already contained a ?, that would not
>>> work out.
>>> You can't have a URL containing a ?; the closest you can come is a URL
>> containing an *escaped* ?, which will simply be absorbed by the [String]
>> piece of the URL. Unless I'm missing your point here.
> What I meant is that if the url supplied to the action already had a query
> string, then something undesirable would probably happen.
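A minimal sketch of the problem described above, assuming the user agent does nothing smarter than the literal "append ? and the encoded data" step. The name submitGet is hypothetical, and the field names and values are assumed to be already application/x-www-form-urlencoded:

```haskell
import Data.List (intercalate)

-- Hypothetical helper: build the URL for a GET form submission by
-- appending "?" and the (already encoded) fields to the action.
submitGet :: String -> [(String, String)] -> String
submitGet action fields =
  action ++ "?" ++ intercalate "&" [k ++ "=" ++ v | (k, v) <- fields]
```

With an action of "/search?lang=he" and a single field ("q", "urlt"), this yields "/search?lang=he?q=urlt", so the original query string is no longer well-formed.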
> encodePathInfo / decodePathInfo is probably a good choice of names. Adding
>>> them to web-encodings is likely useful, but I will just use local copies in
>>> urlt, because web-encodings brings in too many extra dependencies that I
>>> don't want at that level. I don't think I will export them though, so it
>>> should not cause a conflict.
>>> I have no problem with that decision, but out of curiosity which
>> dependencies are problematic? The only non-HP packages are failure, safe,
>> text and wai. The only ones which could in theory be eliminated are failure
>> and safe; if there is desire for me to do so, I'll look into it.
> Well, I see no reason to make all of urlt depend on failure, safe, text,
> wai, and web encodings when two small local functions would do the trick.
> Using the functions from web-encodings would not really increase
> compatibility / interoperability in any way, and I don't expect a lot of bug
> fixes that will have to be applied to multiple locations.
> Remember that I plan to split urlt up into a few pieces soon. I don't want
> happstack users complaining they have to install wai, or wai users
> complaining they have to install happstack. Even if happstack is ported to
> wai, there are extra layers that happstack adds which might benefit from
> some extra functions in urlt.
I was asking more in general if people took issue with the dependency list.
I agree that URLT should not depend on web-encodings.
> Also, my implementation is not quite right. It escapes more characters than
>>> is strictly required. path segments have the following ABNF:
>>> path_segments = segment *( "/" segment )
>>> segment = *pchar *( ";" param )
>>> param = *pchar
>>> pchar = unreserved | escaped |
>>> ":" | "@" | "&" | "=" | "+" | "$" | ","
>>> Also, . and .. are allowed in a path segment, but have special meaning.
>>> Not sure what we want to do about those. I like the property that *any*
>>> String value is automatically escaped and has no special meaning. So the
>>> same should be true for '.' and '..'. But if you do need to use '.' and '..'
>>> for some reason, there is no mechanism to do it in the current system.
>>> Though I am not sure what a compelling use case would be, so I am ok with
>>> just not allowing them for now.
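As a rough illustration of the ABNF above, here is a sketch of an encoder that escapes only the characters outside pchar. It is not the urlt or web-encodings implementation. It assumes UTF-8 for non-ASCII characters (as concluded earlier in the thread) and the RFC 2396 definition of unreserved; the helper names isUnreserved, isPchar, utf8, escapeByte, and encodeSegment are made up for this example. It depends only on base:

```haskell
import Data.Char (intToDigit, isAlphaNum, isAscii, ord, toUpper)
import Data.List (intercalate)

-- unreserved = alphanum | mark (RFC 2396)
isUnreserved :: Char -> Bool
isUnreserved c = isAscii c && (isAlphaNum c || c `elem` "-_.!~*'()")

-- pchar without the "escaped" alternative: a raw '%' must itself be escaped.
isPchar :: Char -> Bool
isPchar c = isUnreserved c || c `elem` ":@&=+$,"

-- UTF-8 encode a single character as a list of octets.
utf8 :: Char -> [Int]
utf8 c
  | n < 0x80    = [n]
  | n < 0x800   = [0xC0 + n `div` 64, 0x80 + n `mod` 64]
  | n < 0x10000 = [ 0xE0 + n `div` 4096
                  , 0x80 + (n `div` 64) `mod` 64
                  , 0x80 + n `mod` 64 ]
  | otherwise   = [ 0xF0 + n `div` 262144
                  , 0x80 + (n `div` 4096) `mod` 64
                  , 0x80 + (n `div` 64) `mod` 64
                  , 0x80 + n `mod` 64 ]
  where n = ord c

-- Percent-escape one octet, e.g. 0x2F becomes "%2F".
escapeByte :: Int -> String
escapeByte b = ['%', hex (b `div` 16), hex (b `mod` 16)]
  where hex = toUpper . intToDigit

-- Escape everything in a segment that is not a pchar (so "/", ";", "?"
-- and "%" are always escaped and carry no special meaning).
encodeSegment :: String -> String
encodeSegment = concatMap enc
  where
    enc c
      | isPchar c = [c]
      | otherwise = concatMap escapeByte (utf8 c)

-- Join the escaped segments with "/".
encodePathInfo :: [String] -> String
encodePathInfo = intercalate "/" . map encodeSegment
```

For example, encodePathInfo ["a b", "c/d"] gives "a%20b/c%2Fd". Note that encodePathInfo ["..", "etc"] gives "../etc" unchanged, because '.' is unreserved; that is exactly the '.' and '..' question raised above.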
>>> I'm not sure if they have meaning at the HTTP level. At the HTML level,
>> they specify relative paths, but I don't think they mean anything once it
>> enters HTTP.
> I'm not sure what to do with this information. It is true that they may be
> normalized by the browser before they are passed to the server. But urlt is
> being used primarily to create URLs that will be used in HTML pages. So, I
> think we will still have to decide what to do with them. Also, we shouldn't
> assume that the client normalized the .. stuff. Perhaps a malicious client
> won't in the hopes that it can retrieve
> http://example.com/../../../../etc/passwd or something.
> - jeremy
What I meant to say is I think we should just leave the . and .. in the data
and let the client deal with it, which I *think* is what you're saying.
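One possible reading of "let the client deal with it", sketched under the assumption that the client is the application consuming the decoded [String] and that it wants to refuse dot segments outright. The names isDotSegment and safeSegments are hypothetical and not part of urlt:

```haskell
-- A segment with special meaning in relative path resolution.
isDotSegment :: String -> Bool
isDotSegment s = s == "." || s == ".."

-- Application-side check: reject any decoded path that contains a
-- "." or ".." segment, otherwise pass the segments through untouched.
safeSegments :: [String] -> Maybe [String]
safeSegments segs
  | any isDotSegment segs = Nothing
  | otherwise             = Just segs
```

Under this sketch, the segments of the /../../../../etc/passwd example are rejected with Nothing, while a segment such as "a.b" or "..." is left alone.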
If I'm not mistaken, I think that addresses all the issues on the table; is
there anything left to decide? I look forward to seeing a sample URLT :).