[web-devel] Type-safe URL handling

Mon Mar 22 22:11:02 EDT 2010

On Mon, Mar 22, 2010 at 4:20 PM, Jeremy Shaw <jeremy at n-heptane.com> wrote:

> On Sun, Mar 21, 2010 at 12:04 AM, Michael Snoyman <michael at snoyman.com>wrote:
>
>> That made perfect sense, thank you for doing such thorough research on
>> this.
>>
>> I've attached two files; test1.html is UTF-8 encoded, test3.html is
>> windows-1255 (Hebrew). On my system, both links point to the same location,
>> implying to me that you are spot on that UTF-8 should always be used for
>> URLs. I had made a mistake with my test on Friday; apparently we only have
>> the encoding issue with the query string.
>>
>
> Hmm. Those files do not contain valid urls. The strings in the hrefs
> contain characters that are not in the limited set allowed by the URI spec.
> The part that is true is that even though the files have different encodings
> (utf-8 vs windows-1255) the characters in the strings are the same, so the
> urls are the same. I guess maybe the reason you put in invalid characters is
> because it is hard to test whether different encodings matter if you are
> only testing characters that are represented by the same octets in both
> encodings.
>
Well, you guessed correctly at my reason for constructing the files as I
did. Not that this is actually relevant to the discussion at hand, but I believe that
it is valid HTML to put values in the HREF fields that are not in the
appropriate character range and assume the web browser will take care of
things. </off-topic>

> Regarding your encoding issue with the query string. I believe there may
> have been 'nothing wrong'. At the URI level there is no specification as to
> how the query string is to be interpreted, or what underlying charset it
> should be associated with. It does have the requirement that it can only
> contain a limited set of characters, and that other characters must be
> converted to octets and then percent encoded.
>
> Now, things get interesting when you look at forms
> and application/x-www-form-urlencoded. When you create a form you have a
> form element that looks something like this:
>
> <form action="/submit" method=POST
> enctype="application/x-www-form-urlencoded;charset=utf-8">...</form>
>
> Except Internet Explorer, and a bunch of servers, get stupid if you actually
> set the charset=utf-8. So the de facto standard is that the form is
> submitted using the same character encoding as the page it came from.  So if
> the <head> contains <meta charset="windows-1255">, then the form data will
> be encoded as windows-1255, converted to octets, and then percent encoded,
> plus the other things that url encoding does (such as + for spaces). You can
> also add the accept-charset="utf-8" attribute if you want to override the
> default and have the form submit some other character encoding. Not sure how widely
> supported that is.
>
> Now, if we were to change the method=POST to method=GET, then the
> urlencoded data would be passed as a query string, with its windows-1255
> encoded payload. And that is perfectly valid.
>
> So, the choice of how to encode the pathInfo and query string is pretty
> much application specific. For the URLT stuff we are both generating and
> parsing the path components, so we can choose whatever encoding we want --
> with utf-8 being a good choice.
>

>
I agree; the issue of query-string encoding not being under our control is
further reason to discourage its inclusion in URLT.
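
To illustrate why the query string's charset is outside our control, here is a minimal sketch of undoing the percent-encoding down to raw octets, which is as far as the URI level can take us. The name percentDecode is made up for this example. Treating + as a space follows application/x-www-form-urlencoded.

```haskell
import Data.Char (digitToInt, isHexDigit, ord)
import Data.Word (Word8)

-- | Undo percent-encoding, yielding raw octets. Returns Nothing on a
-- malformed escape or a non-ASCII character. '+' becomes a space, as in
-- application/x-www-form-urlencoded.
percentDecode :: String -> Maybe [Word8]
percentDecode [] = Just []
percentDecode ('%' : h : l : rest)
  | isHexDigit h && isHexDigit l =
      (fromIntegral (16 * digitToInt h + digitToInt l) :) <$> percentDecode rest
percentDecode ('%' : _) = Nothing
percentDecode ('+' : rest) = (0x20 :) <$> percentDecode rest
percentDecode (c : rest)
  | ord c < 128 = (fromIntegral (ord c) :) <$> percentDecode rest
  | otherwise = Nothing
```

The Hebrew letter alef arrives as %D7%90 (octets 215, 144) from a UTF-8 page and as %E0 (octet 224) from a windows-1255 page. Both are valid query strings, and nothing in the octets says which charset to use when turning them back into text.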

>> Now, back to your point: I'm not sure why you want to include the query
>> string and fragment as part of the URL. Regarding the fragment: it will
>> never be passed to the server, so it's *impossible* to consider it for
>> parsing URLs. I understand that you might want to generate URLs with a
>> fragment, but we would then need to have parse and render functions which do
>> not parallel each other properly.
>>
>
> Right. I forgot about how fragments actually work.
>
>
>> Regarding the query string, I can see more of an argument being made to
>> include it, but it feels wrong to me. Precedence in most places does not
>> allow you to route requests based on the query string, and this seems like a
>> Good Idea. I know it would be nice to be guaranteed that there is a certain
>> GET parameter present, but I really think this should be dealt with at the
>> handler level.
>>
>
> What do you mean by 'precedence' ?
>
I mean I've never seen a system that allows routing based on the query
string. In PHP, you create files that match the pathinfo; in Django, you
match regexes on the path info; I believe the same is true for Rails. This
isn't a proof that this is the Right Thing, merely an observation.
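
To make the observation concrete, here is a minimal sketch of routing on the path info alone, with the query string left to the handler. The Query and Handler types and the handler names are hypothetical, invented for this illustration; they are not part of URLT or any existing package.

```haskell
-- Hypothetical types for illustration only.
type Query = [(String, String)]
type Handler = Query -> String

-- Routing inspects only the decoded path segments.
route :: [String] -> Maybe Handler
route ["users"] = Just listUsers
route ["users", name] = Just (showUser name)
route _ = Nothing

-- The handler, not the router, decides what a GET parameter means.
listUsers :: Handler
listUsers query =
  case lookup "sortOrder" query of
    Just "desc" -> "users, descending"
    _ -> "users, ascending" -- missing or unknown value falls back to a default

showUser :: String -> Handler
showUser name _ = "user " ++ name
```

With this shape, a request for /users?sortOrder=desc and one for /users reach the same handler, and only the handler sees the difference.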

> Including the query string in urlt is certainly nice for some contexts. For
> example:
>
> data UserURL = AllUsers SortOrder
>
> data SortOrder = Asc | Desc
>
> Here the sort order is required. But the sort order does not really add
> hierarchy to the system, so it belongs more in the query string and less in
> the path. We might want a URL like:
>
> /allusers?sortOrder=asc
>
On the other hand, those two possible URLs are not really *unique
resources* (to use more RESTful terminology). The sortOrder is not really
specifying *what* to return, just *how* to return it. Most well-designed URL
schemes would work that way. The badly designed ones, like
/user.php?id=5&name=michael&... shouldn't really be considered I think.

> Now let's say we wrap that up in a larger site:
>
> data SiteURL = Users UserURL
>
> The Users constructor is adding hierarchy, so it shouldn't be modifying the
> query string. So it will just add something like:
>
> /users/allusers?sortOrder=asc
>
> So only the last component gets to add a query string.
>
> Not quite sure how we should enforce something like that.
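
One way the "only the last component adds a query string" idea could look is sketched below. The Rendered type and the render functions are assumptions made up for this example. Nothing in the types enforces the rule; the wrapper simply follows the convention of prepending a segment and passing the query through unchanged. Escaping is omitted, so the segments and parameters are assumed to be URL-safe already.

```haskell
import Data.List (intercalate)

data SortOrder = Asc | Desc
data UserURL = AllUsers SortOrder
data SiteURL = Users UserURL

-- Hypothetical intermediate form: path segments plus query parameters.
type Rendered = ([String], [(String, String)])

showOrder :: SortOrder -> String
showOrder Asc = "asc"
showOrder Desc = "desc"

-- The leaf component is the only place a query parameter is introduced.
renderUser :: UserURL -> Rendered
renderUser (AllUsers order) = (["allusers"], [("sortOrder", showOrder order)])

-- The wrapper adds hierarchy only: it prepends a segment and leaves the query alone.
renderSite :: SiteURL -> Rendered
renderSite (Users u) =
  let (path, query) = renderUser u
  in ("users" : path, query)

-- Flatten to a string; no escaping is performed in this sketch.
toString :: Rendered -> String
toString (path, query) = concatMap ('/' :) path ++ queryPart
  where
    queryPart
      | null query = ""
      | otherwise = '?' : intercalate "&" [k ++ "=" ++ v | (k, v) <- query]
```

Here toString (renderSite (Users (AllUsers Asc))) gives "/users/allusers?sortOrder=asc".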

> The big trip up would be forms with method GET. The form submission is
> handled by taking the form data set, encoding it as
> application/x-www-form-urlencoded, and then appending ? and the encoded data to
> the end of the action. If the action already contained a ?, that would not
> work out.
>
You can't have a URL containing a ?; the closest you can come is a URL
containing an *escaped* ?, which will simply be absorbed by the [String]
piece of the URL. Unless I'm missing your point here.

> So, the toUrl / fromUrl instances would have to know if the url was going to
> be used as the target for an action and prohibit the use of a query string.
> That could be tricky :-/
>
> Also, in my example, I am handling parameters that are url specific. But
> many sites might have some sort of global parameters that can be tacked on
> to every query string. Not really sure how that would work out either.
>
>
>> If we can agree on this, I don't see a necessity to rely on an external
>> package to provide the URL datatype (since we would just be using [String]).
>> I can provide the encodeURL/decodeURL functions in web-encodings if that's
>> acceptable- your implementation seems correct to me. However, since it does
>> not function on fully-qualified URLs, perhaps we should call it
>> encodePathInfo/decodePathInfo?
>>
>
> encodePathInfo  / decodePathInfo is probably a good choice of names. Adding
> them to web-encodings is likely useful, but I will just use local copies in
> urlt, because web-encodings brings in too many extra dependencies that I
> don't want at that level.  I don't think I will export them though, so it
> should not cause a conflict.
>
I have no problem with that decision, but out of curiosity which
dependencies are problematic? The only non-HP packages are failure, safe,
text and wai. The only ones which could in theory be eliminated are failure
and safe; if there is desire for me to do so, I'll look into it.
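
For reference, a minimal base-only sketch of what encodePathInfo / decodePathInfo could look like, assuming UTF-8 for the path segments. This is not the urlt or web-encodings implementation. It escapes conservatively, leaving only ASCII letters, digits and "-_.~" unescaped, and the UTF-8 decoder does not reject overlong forms.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, digitToInt, intToDigit, isAlphaNum, isAscii, isHexDigit, ord, toUpper)

-- UTF-8 encode one character into octets (Ints in 0..255).
utf8 :: Char -> [Int]
utf8 c
  | n < 0x80 = [n]
  | n < 0x800 = [0xC0 .|. (n `shiftR` 6), cont n]
  | n < 0x10000 = [0xE0 .|. (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  | otherwise = [0xF0 .|. (n `shiftR` 18), cont (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  where
    n = ord c
    cont x = 0x80 .|. (x .&. 0x3F)

-- Characters left unescaped (a deliberately small set).
unreserved :: Char -> Bool
unreserved c = isAscii c && (isAlphaNum c || c `elem` "-_.~")

encodeSegment :: String -> String
encodeSegment = concatMap enc
  where
    enc c
      | unreserved c = [c]
      | otherwise = concatMap hex (utf8 c)
    hex b = ['%', hexDigit (b `shiftR` 4), hexDigit (b .&. 0xF)]
    hexDigit = toUpper . intToDigit

-- Each segment is escaped and prefixed with '/'.
encodePathInfo :: [String] -> String
encodePathInfo = concatMap (('/' :) . encodeSegment)

-- Inverse of encodePathInfo; Nothing on malformed input.
decodePathInfo :: String -> Maybe [String]
decodePathInfo "" = Just []
decodePathInfo ('/' : rest) = mapM decodeSegment (splitOn '/' rest)
decodePathInfo _ = Nothing

splitOn :: Char -> String -> [String]
splitOn d s = case break (== d) s of
  (a, []) -> [a]
  (a, _ : rest) -> a : splitOn d rest

decodeSegment :: String -> Maybe String
decodeSegment s = octets s >>= fromUtf8

-- Percent-decode to octets.
octets :: String -> Maybe [Int]
octets [] = Just []
octets ('%' : h : l : rest)
  | isHexDigit h && isHexDigit l = (16 * digitToInt h + digitToInt l :) <$> octets rest
octets ('%' : _) = Nothing
octets (c : rest)
  | isAscii c = (ord c :) <$> octets rest
  | otherwise = Nothing

-- Decode UTF-8 octets (overlong forms are not rejected in this sketch).
fromUtf8 :: [Int] -> Maybe String
fromUtf8 [] = Just []
fromUtf8 (b : bs)
  | b < 0x80 = (chr b :) <$> fromUtf8 bs
  | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F)
  | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F)
  | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07)
  | otherwise = Nothing
  where
    isCont x = x .&. 0xC0 == 0x80
    multi k lead
      | length cs == k && all isCont cs && n <= 0x10FFFF = (chr n :) <$> fromUtf8 rest
      | otherwise = Nothing
      where
        (cs, rest) = splitAt k bs
        n = foldl (\acc x -> (acc `shiftL` 6) .|. (x .&. 0x3F)) lead cs
```

Because a literal / inside a segment is escaped as %2F, splitting on / before unescaping is safe, and decodePathInfo (encodePathInfo xs) should give back Just xs.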

> Also, my implementation is not quite right. It escapes more characters than
> is strictly required. path segments have the following ABNF:
>
>       path_segments = segment *( "/" segment )
>       segment       = *pchar *( ";" param )
>       param         = *pchar
>
>       pchar         = unreserved | escaped |
>                       ":" | "@" | "&" | "=" | "+" | "$" | ","
>
> Also, . and .. are allowed in a path segment, but have special meaning. Not
> sure what we want to do about those. I like the property that *any* String
> value is automatically escaped and has no special meaning. So the same
> should be true for '.' and '..'. But if you do need to use '.' and '..' for
> some reason, there is no mechanism to do it in the current system. Though I
> am not sure what a compelling use case would be, so I am ok with just not
> allowing them for now.
>
I'm not sure if they have meaning at the HTTP level. At the HTML level,
they specify relative paths, but I don't think they mean anything once it
enters HTTP.
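
The pchar production quoted above can be turned into a predicate fairly directly. The sketch below assumes the RFC 2396 definition of unreserved (alphanumerics plus the marks "-_.!~*'()"). The function names are made up for this example, and isDotSegment is just one possible way to detect "." and ".." if we decide to disallow them.

```haskell
import Data.Char (isAlphaNum, isAscii)

-- unreserved = alphanum | mark, with mark taken from RFC 2396.
isUnreserved :: Char -> Bool
isUnreserved c = (isAscii c && isAlphaNum c) || c `elem` "-_.!~*'()"

-- pchar minus the "escaped" alternative, which is a three-character
-- sequence (%XX) rather than a single character.
isPcharLiteral :: Char -> Bool
isPcharLiteral c = isUnreserved c || c `elem` ":@&=+$,"

-- Segments with special meaning in relative references.
isDotSegment :: String -> Bool
isDotSegment s = s == "." || s == ".."
```

An encoder that escapes only characters failing isPcharLiteral would escape less than one that keeps just the unreserved set, while still escaping /, ? and ; inside a segment.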

Michael