[web-devel] Type-safe URL handling
michael at snoyman.com
Mon Mar 22 22:11:02 EDT 2010
On Mon, Mar 22, 2010 at 4:20 PM, Jeremy Shaw <jeremy at n-heptane.com> wrote:
> On Sun, Mar 21, 2010 at 12:04 AM, Michael Snoyman <michael at snoyman.com> wrote:
>> That made perfect sense, thank you for doing such thorough research on
>> I've attached two files; test1.html is UTF-8 encoded, test3.html is
>> windows-1255 (Hebrew). On my system, both links point to the same location,
>> implying to me that you are spot on that UTF-8 should always be used for
>> URLs. I had made a mistake with my test on Friday; apparently we only have
>> the encoding issue with the query string.
> Hmm. Those files do not contain valid urls. The strings in the hrefs
> contain characters that are not in the limited set allowed by the URI spec.
> The part that is true is that even though the files have different encodings
> (utf-8 vs windows-1255) the characters in the strings are the same, so the
> urls are the same. I guess maybe the reason you put in invalid characters is
> because it is hard to test whether different encodings matter if you are
> only testing characters that are represented by the same octets in both
Well, you guessed correctly at my reason for constructing the files as I
did. Not that this is actually relevant to the discussion at hand, but I believe that
it is valid HTML to put values in the HREF fields that are not in the
appropriate character range and assume the web browser will take care of
> Regarding your encoding issue with the query string. I believe there may
> have been 'nothing wrong'. At the URI level there is no specification as to
> how the query string is to be interpreted, or what underlying charset it
> should be associated with. It does have the requirement that it can only
> contain a limited set of characters, and that other characters must be
> converted to octets and then percent encoded.
> Now, things get interesting when you look at forms
> and application/x-www-form-urlencoded. When you create a form you have a
> form element that looks something like this:
> <form action="/submit" method=POST
> Except internet explorer, and a bunch of servers get stupid if you actually
> set the charset=utf-8. So the de facto standard is that the form is
> submitted using the same character encoding as the page it came from. So if
> the <head> contains <meta charset="windows-1255">, then the form data will
> be encoded as windows-1255, converted to octets, and then percent encoded,
> plus the other things that url encoding does (such as + for spaces). You can
> also add the attribute accept-charset="utf-8" if you want to override the default and
> have the form submit some other character encoding. Not sure how widely
> supported that is.
> Now, if we were to change the method=POST to method=GET, then the
> urlencoded data would be passed as a query string, with its windows-1255
> encoded payload. And that is perfectly valid.
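
To make the charset point concrete, here is a minimal Haskell sketch of how the
same character ends up as different percent-encoded octets depending on the
page encoding. The helper names (toWindows1255, toUtf8, percentEncode) are
hypothetical, and both encoders are deliberately partial: ASCII plus the Hebrew
letters alef..tav for windows-1255, and code points below U+0800 for UTF-8.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (intToDigit, ord, toUpper)

-- Hypothetical helper: windows-1255 octets for ASCII and the Hebrew letters
-- alef..tav (U+05D0..U+05EA map to 0xE0..0xFA). Anything else is unsupported.
toWindows1255 :: Char -> Maybe [Int]
toWindows1255 c
  | n < 0x80                   = Just [n]
  | n >= 0x05D0 && n <= 0x05EA = Just [n - 0x05D0 + 0xE0]
  | otherwise                  = Nothing
  where n = ord c

-- Hypothetical helper: UTF-8 octets, limited to one- and two-byte sequences.
toUtf8 :: Char -> Maybe [Int]
toUtf8 c
  | n < 0x80  = Just [n]
  | n < 0x800 = Just [0xC0 .|. (n `shiftR` 6), 0x80 .|. (n .&. 0x3F)]
  | otherwise = Nothing
  where n = ord c

-- Percent-encode every octet, for illustration only (a real encoder would
-- leave unreserved characters alone).
percentEncode :: [Int] -> String
percentEncode = concatMap escape
  where
    escape b = ['%', hex (b `shiftR` 4), hex (b .&. 0x0F)]
    hex = toUpper . intToDigit
```

With these definitions, alef (U+05D0) becomes "%E0" via toWindows1255 and
"%D7%90" via toUtf8, so a server cannot decode a query string correctly
without knowing which encoding the form used.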
> So, the choice of how to encode the pathInfo and query string is pretty
> much application specific. For the URLT stuff we are both generating and
> parsing the path components, so we can choose whatever encoding we want --
> with utf-8 being a good choice.
I agree; the issue of query-string encoding not being under our control is
further reason to discourage its inclusion in URLT.
>> Now, back to your point: I'm not sure why you want to include the query
>> string and fragment as part of the URL. Regarding the fragment: it will
>> never be passed to the server, so it's *impossible* to consider it for
>> parsing URLs. I understand that you might want to generate URLs with a
>> fragment, but we would then need to have parse and render functions which do
>> not parallel each other properly.
> Right. I forgot about how fragments actually work.
>> Regarding the query string, I can see more of an argument being made to
>> include it, but it feels wrong to me. Precedence in most places does not
>> allow you to route requests based on the query string, and this seems like a
>> Good Idea. I know it would be nice to be guaranteed that there is a certain
>> GET parameter present, but I really think this should be dealt with at the
>> handler level.
> What do you mean by 'precedence' ?
I mean I've never seen a system that allows routing based on the query
string. In PHP, you create files that match the pathinfo; in Django, you
match regexes on the path info; I believe the same is true for Rails. This
isn't a proof that this is the Right Thing, merely an observation.
> Including query string in urlt is certainly nice for some contexts. For
> data UserURL = AllUsers SortOrder
> data SortOrder = Asc | Desc
> Here the sort order is required. But the sort order does not really add
> hierarchy to the system, so it belongs more in the query string and less in
> the path. We might want a URL like:
On the other hand, those two possible URLs are not really *unique
resources* (to use more RESTful terminology). The sortOrder is not really
specifying *what* to return, just *how* to return it. Most well-designed URL
schemes would work that way. The badly designed ones, like
/user.php?id=5&name=michael&... shouldn't really be considered I think.
Now let's say we wrap that up in a larger site:
> data SiteURL = Users UserURL
> The Users constructor is adding hierarchy, so it shouldn't be modifying the
> query string. So it will just add something like:
> So only the last component gets to add a query string.
> Not quite sure how we should enforce something like that.
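
One way to picture the "only the last component adds a query string" idea is
the rough Haskell sketch below. It is not URLT's actual design: the Rendered
type, the function names, and the "users" / "all" / "sort" strings are all
assumptions made up for illustration. The nested SiteURL layer only prepends a
path segment and passes the query parameters through untouched.

```haskell
import Data.List (intercalate)

data SortOrder = Asc | Desc deriving (Eq, Show)
data UserURL = AllUsers SortOrder deriving (Eq, Show)
data SiteURL = Users UserURL deriving (Eq, Show)

-- Path segments plus query parameters (hypothetical representation).
type Rendered = ([String], [(String, String)])

-- The leaf type decides both its path segment and its query parameters.
renderUser :: UserURL -> Rendered
renderUser (AllUsers order) = (["all"], [("sort", sortName order)])
  where
    sortName Asc  = "asc"
    sortName Desc = "desc"

-- The wrapping type adds hierarchy only; the query is passed through.
renderSite :: SiteURL -> Rendered
renderSite (Users u) = ("users" : path, query)
  where (path, query) = renderUser u

parseUser :: Rendered -> Maybe UserURL
parseUser (["all"], query) = AllUsers <$> (lookup "sort" query >>= parseSort)
  where
    parseSort "asc"  = Just Asc
    parseSort "desc" = Just Desc
    parseSort _      = Nothing
parseUser _ = Nothing

parseSite :: Rendered -> Maybe SiteURL
parseSite ("users" : path, query) = Users <$> parseUser (path, query)
parseSite _ = Nothing

-- Flatten to a string. No escaping is done here, to keep the sketch short.
showRendered :: Rendered -> String
showRendered (path, query) = concatMap ('/' :) path ++ queryPart
  where
    queryPart
      | null query = ""
      | otherwise  = '?' : intercalate "&" [k ++ "=" ++ v | (k, v) <- query]
```

Here showRendered (renderSite (Users (AllUsers Desc))) gives
"/users/all?sort=desc", and parseSite inverts renderSite. Nothing in the types
stops an outer layer from touching the query, which is the enforcement problem
raised above.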
> The big trip up would be forms with method GET. The form submission is
> handled by taking the form set data, encoding it as
> application/x-www-form-urlencoded, and then append ? and the encoded data to
> the end of the action. If the action already contained a ?, that would not
> work out.
You can't have a URL containing a ?; the closest you can come is a URL
containing an *escaped* ?, which will simply be absorbed by the [String]
piece of the URL. Unless I'm missing your point here.
> So, the toUrl / fromUrl instances would have to know if the url was going to
> be used as the target for an action and prohibit the use of a query string.
> That could be tricky :-/
> Also, in my example, I am handling parameters that are url specific. But
> many sites might have some sort of global parameters that can be tacked on
> to every query string. Not really sure how that would work out either.
>> If we can agree on this, I don't see a necessity to rely on an external
>> package to provide the URL datatype (since we would just be using [String]).
>> I can provide the encodeURL/decodeURL functions in web-encodings if that's
>> acceptable- your implementation seems correct to me. However, since it does
>> not function on fully-qualified URLs, perhaps we should call it
> encodePathInfo / decodePathInfo is probably a good choice of names. Adding
> them to web-encodings is likely useful, but I will just use local copies in
> urlt, because web-encodings brings in too many extra dependencies that I
> don't want at that level. I don't think I will export them though, so it
> should not cause a conflict.
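
For reference, a pair of functions with those names might look roughly like
the following. This is a sketch under stated assumptions, not the code from
urlt or web-encodings: it always uses UTF-8, it escapes everything except
ASCII letters, digits and "-", ".", "_", "~" (more than strictly required), and
the helper names other than encodePathInfo / decodePathInfo are invented here.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, digitToInt, intToDigit, isAlphaNum, isAscii,
                  isHexDigit, ord, toUpper)

-- UTF-8 encode one character into octets.
utf8Encode :: Char -> [Int]
utf8Encode c
  | n < 0x80    = [n]
  | n < 0x800   = [0xC0 .|. (n `shiftR` 6), cont n]
  | n < 0x10000 = [0xE0 .|. (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  | otherwise   = [0xF0 .|. (n `shiftR` 18), cont (n `shiftR` 12),
                   cont (n `shiftR` 6), cont n]
  where
    n = ord c
    cont x = 0x80 .|. (x .&. 0x3F)

-- Lenient UTF-8 decoder: malformed input becomes U+FFFD.
utf8Decode :: [Int] -> String
utf8Decode [] = []
utf8Decode (b : bs)
  | b < 0x80  = chr b : utf8Decode bs
  | b >= 0xF0 = multi 3 (b .&. 0x07)
  | b >= 0xE0 = multi 2 (b .&. 0x0F)
  | b >= 0xC0 = multi 1 (b .&. 0x1F)
  | otherwise = '\xFFFD' : utf8Decode bs
  where
    isCont x = x .&. 0xC0 == 0x80
    multi k lead
      | length cs == k && all isCont cs && v <= 0x10FFFF = chr v : utf8Decode rest
      | otherwise = '\xFFFD' : utf8Decode bs
      where
        (cs, rest) = splitAt k bs
        v = foldl (\acc x -> (acc `shiftL` 6) .|. (x .&. 0x3F)) lead cs

isUnreserved :: Char -> Bool
isUnreserved c = isAscii c && (isAlphaNum c || c `elem` "-._~")

encodeSegment :: String -> String
encodeSegment = concatMap enc
  where
    enc c
      | isUnreserved c = [c]
      | otherwise      = concatMap escape (utf8Encode c)
    escape b = ['%', hex (b `shiftR` 4), hex (b .&. 0x0F)]
    hex = toUpper . intToDigit

-- Each segment is escaped on its own, so a '/' inside a segment survives.
encodePathInfo :: [String] -> String
encodePathInfo = concatMap (\seg -> '/' : encodeSegment seg)

-- Turn percent escapes back into octets; other characters pass through.
unescape :: String -> [Int]
unescape ('%' : a : b : rest)
  | isHexDigit a && isHexDigit b = (digitToInt a * 16 + digitToInt b) : unescape rest
unescape (c : rest) = utf8Encode c ++ unescape rest
unescape [] = []

splitSlash :: String -> [String]
splitSlash s = case break (== '/') s of
  (seg, [])       -> [seg]
  (seg, _ : rest) -> seg : splitSlash rest

decodePathInfo :: String -> [String]
decodePathInfo ""        = []
decodePathInfo ('/' : s) = map (utf8Decode . unescape) (splitSlash s)
decodePathInfo s         = map (utf8Decode . unescape) (splitSlash s)
```

For example, encodePathInfo ["a b", "a/b"] yields "/a%20b/a%2Fb", and
decodePathInfo maps that string back to the original list. Note that "." and
".." pass through unescaped in this sketch.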
I have no problem with that decision, but out of curiosity which
dependencies are problematic? The only non-HP packages are failure, safe,
text and wai. The only ones which could in theory be eliminated are failure
and safe; if there is desire for me to do so, I'll look into it.
> Also, my implementation is not quite right. It escapes more characters than
> is strictly required. path segments have the following ABNF:
> path_segments = segment *( "/" segment )
> segment = *pchar *( ";" param )
> param = *pchar
> pchar = unreserved | escaped |
> ":" | "@" | "&" | "=" | "+" | "$" | ","
> Also, . and .. are allowed in a path segment, but have special meaning. Not
> sure what we want to do about those. I like the property that *any* String
> value is automatically escaped and has no special meaning. So the same
> should be true for '.' and '..'. But if you do need to use '.' and '..' for
> some reason, there is no mechanism to do it in the current system. Though I
> am not sure what a compelling use case would be, so I am ok with just not
> allowing them for now.
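
The pchar rule quoted above translates fairly directly into a predicate. The
sketch below assumes the RFC 2396 definition of unreserved (alphanumerics plus
the "mark" characters), which the quoted ABNF refers to but does not spell out;
the function names are made up for illustration.

```haskell
import Data.Char (isAlphaNum, isAscii)

-- "mark" characters from RFC 2396 (assumed; not part of the quoted ABNF).
isMark :: Char -> Bool
isMark c = c `elem` "-_.!~*'()"

isUnreserved :: Char -> Bool
isUnreserved c = (isAscii c && isAlphaNum c) || isMark c

-- Characters that may appear literally in a path segment, per the pchar rule.
isPcharLiteral :: Char -> Bool
isPcharLiteral c = isUnreserved c || c `elem` ":@&=+$,"

-- Everything else has to be converted to octets and percent encoded.
needsEscape :: Char -> Bool
needsEscape = not . isPcharLiteral
```

Under this predicate '/', '?', ';' and ' ' still need escaping, while ':' and
'@' do not. Because '.' is unreserved, the segments "." and ".." are left
as-is, so their special meaning is not addressed here.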
I'm not sure if they have meaning at the HTTP level. At the HTML level,
they specify relative paths, but I don't think they mean anything once it