[web-devel] Type-safe URL handling

Jeremy Shaw jeremy at n-heptane.com
Mon Mar 22 19:20:47 EDT 2010


On Sun, Mar 21, 2010 at 12:04 AM, Michael Snoyman <michael at snoyman.com> wrote:

> That made perfect sense, thank you for doing such thorough research on
> this.
>
> I've attached two files; test1.html is UTF-8 encoded, test3.html is
> windows-1255 (Hebrew). On my system, both links point to the same location,
> implying to me that you are spot on that UTF-8 should always be used for
> URLs. I had made a mistake with my test on Friday; apparently we only have
> the encoding issue with the query string.
>

Hmm. Those files do not contain valid URLs. The strings in the hrefs contain
characters that are not in the limited set allowed by the URI spec. The part
that is true is that even though the files have different encodings (utf-8
vs windows-1255), the characters in the strings are the same, so the URLs are
the same. I guess the reason you put in invalid characters is that it is
hard to test whether different encodings matter if you are only testing
characters that are represented by the same octets in both encodings.

Regarding your encoding issue with the query string: I believe there may
have been 'nothing wrong'. At the URI level there is no specification as to
how the query string is to be interpreted, or what underlying charset it
should be associated with. The only requirement is that it can contain only
a limited set of characters, and that other characters must be converted to
octets and then percent encoded.
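
To make the "convert to octets, then percent encode" step concrete, here is
a minimal sketch. It assumes UTF-8 as the charset, which is only one possible
choice since the URI level does not fix one, and the names utf8Octets,
percentOctet, isUnreserved and percentEncode are made up for illustration.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (intToDigit, isAsciiLower, isAsciiUpper, isDigit, ord, toUpper)
import Data.Word (Word8)

-- | Encode one character as its UTF-8 octets (assumed charset).
utf8Octets :: Char -> [Word8]
utf8Octets c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    hi s   = fromIntegral (n `shiftR` s)
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- | Render one octet as %XX with uppercase hex digits.
percentOctet :: Word8 -> String
percentOctet w = ['%', hex (w `shiftR` 4), hex (w .&. 0x0F)]
  where
    hex = toUpper . intToDigit . fromIntegral

-- | The "unreserved" set: ASCII letters, digits and a few marks.
isUnreserved :: Char -> Bool
isUnreserved c =
  isAsciiUpper c || isAsciiLower c || isDigit c || c `elem` "-_.!~*'()"

-- | Percent encode everything outside the unreserved set.
percentEncode :: String -> String
percentEncode = concatMap enc
  where
    enc c
      | isUnreserved c = [c]
      | otherwise      = concatMap percentOctet (utf8Octets c)
```

Swapping utf8Octets for a windows-1255 encoder would give different octets
for the same Hebrew characters, and both results would be equally valid at
the URI level.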

Now, things get interesting when you look at forms
and application/x-www-form-urlencoded. When you create a form you have a
form element that looks something like this:

<form action="/submit" method=POST
enctype="application/x-www-form-urlencoded;charset=utf-8">...</form>

Except that Internet Explorer and a bunch of servers get stupid if you
actually set charset=utf-8. So the de facto standard is that the form is
submitted using the same character encoding as the page it came from. So if
the <head> contains <meta charset="windows-1255">, then the form data will
be encoded as windows-1255, converted to octets, and then percent encoded,
plus the other things that url encoding does (such as + for spaces). You can
also add the attribute accept-charset="utf-8" to the form if you want to
override the default and have the form submitted in some other character
encoding. Not sure how widely supported that is.

Now, if we were to change the method=POST to method=GET, then the urlencoded
data would be passed as a query string, with its windows-1255 encoded
payload. And that is perfectly valid.

So, the choice of how to encode the pathInfo and query string is pretty much
application specific. For the URLT stuff we are both generating and parsing
the path components, so we can choose whatever encoding we want -- with
utf-8 being a good choice.


> Now, back to your point: I'm not sure why you want to include the query
> string and fragment as part of the URL. Regarding the fragment: it will
> never be passed to the server, so it's *impossible* to consider it for
> parsing URLs. I understand that you might want to generate URLs with a
> fragment, but we would then need to have parse and render functions which do
> not parallel each other properly.
>

Right. I forgot about how fragments actually work.


> Regarding the query string, I can see more of an argument being made to
> include it, but it feels wrong to me. Precedence in most places does not
> allow you to route requests based on the query string, and this seems like a
> Good Idea. I know it would be nice to be guaranteed that there is a certain
> GET parameter present, but I really think this should be dealt with at the
> handler level.
>

What do you mean by 'precedence' ?

Including query string in urlt is certainly nice for some contexts. For
example:

data UserURL = AllUsers SortOrder

data SortOrder = Asc | Desc

Here the sort order is required. But the sort order does not really add
hierarchy to the system, so it belongs more in the query string and less in
the path. We might want a URL like:

/allusers?sortOrder=asc

Now let's say we wrap that up in a larger site:

data SiteURL = Users UserURL

The Users constructor is adding hierarchy, so it shouldn't be modifying the
query string. So it will just add something like:

/users/allusers?sortOrder=asc

So only the last component gets to add a query string.
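
A rough sketch of how that could look, using the two types above. The
Rendered type and the functions renderUser, renderSite and toText are
hypothetical and not part of urlt, and escaping is left out to keep the
example short.

```haskell
import Data.List (intercalate)

data SortOrder = Asc | Desc deriving (Eq, Show)

data UserURL = AllUsers SortOrder deriving (Eq, Show)

data SiteURL = Users UserURL deriving (Eq, Show)

-- | Path segments plus query parameters (hypothetical representation).
type Rendered = ([String], [(String, String)])

-- | The innermost URL type decides both its path and its query string.
renderUser :: UserURL -> Rendered
renderUser (AllUsers o) = (["allusers"], [("sortOrder", sortParam o)])
  where
    sortParam Asc  = "asc"
    sortParam Desc = "desc"

-- | A wrapping constructor only adds hierarchy: it prepends a path segment
-- and passes the query parameters through untouched.
renderSite :: SiteURL -> Rendered
renderSite (Users u) =
  let (segs, q) = renderUser u
  in ("users" : segs, q)

-- | Flatten to text. No escaping is performed in this sketch.
toText :: Rendered -> String
toText (segs, q) = concatMap ('/' :) segs ++ queryPart
  where
    queryPart
      | null q    = ""
      | otherwise = '?' : intercalate "&" [k ++ "=" ++ v | (k, v) <- q]
```

With these definitions, toText (renderSite (Users (AllUsers Asc))) gives
/users/allusers?sortOrder=asc.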

The big trip-up would be forms with method GET. The form submission is
handled by taking the form data set, encoding it as
application/x-www-form-urlencoded, and then appending ? and the encoded data
to the end of the action. If the action already contained a ?, that would
not work out.
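
The clash can be illustrated with a toy model of the submission step as
described above. submitGet and safeGetAction are hypothetical names, the
fields are assumed to be already urlencoded, and real browsers may treat an
existing query string differently.

```haskell
import Data.List (intercalate)

-- | Toy model of a GET submission: append '?' and the already-encoded
-- fields to the action, with no special handling of an action that
-- carries its own query string.
submitGet :: String -> [(String, String)] -> String
submitGet action fields =
  action ++ "?" ++ intercalate "&" [k ++ "=" ++ v | (k, v) <- fields]

-- | In this model an action is a safe GET form target only if it has no
-- query string of its own.
safeGetAction :: String -> Bool
safeGetAction = notElem '?'
```

In this model, submitting a field page=2 to the action
/users/allusers?sortOrder=asc yields /users/allusers?sortOrder=asc?page=2,
which is not what anyone wants.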

So, the toUrl / fromUrl instances would have to know if the url was going to
be used as the target for an action and prohibit the use of a query string.
That could be tricky :-/

Also, in my example, I am handling parameters that are url specific. But
many sites might have some sort of global parameters that can be tacked on
to every query string. Not really sure how that would work out either.


> If we can agree on this, I don't see a necessity to rely on an external
> package to provide the URL datatype (since we would just be using [String]).
> I can provide the encodeURL/decodeURL functions in web-encodings if that's
> acceptable- your implementation seems correct to me. However, since it does
> not function on fully-qualified URLs, perhaps we should call it
> encodePathInfo/decodePathInfo?
>

encodePathInfo / decodePathInfo is probably a good choice of names. Adding
them to web-encodings is likely useful, but I will just use local copies in
urlt, because web-encodings brings in too many extra dependencies that I
don't want at that level.  I don't think I will export them though, so it
should not cause a conflict.

Also, my implementation is not quite right. It escapes more characters than
is strictly required. Path segments have the following ABNF (RFC 2396):

      path_segments = segment *( "/" segment )
      segment       = *pchar *( ";" param )
      param         = *pchar

      pchar         = unreserved | escaped |
                      ":" | "@" | "&" | "=" | "+" | "$" | ","

Also, . and .. are allowed in a path segment, but have special meaning. Not
sure what we want to do about those. I like the property that *any* String
value is automatically escaped and has no special meaning. So the same
should be true for '.' and '..'. But if you do need to use '.' and '..' for
some reason, there is no mechanism to do it in the current system. Though I
am not sure what a compelling use case would be, so I am ok with just not
allowing them for now.
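
For illustration, an encoder that escapes only what the pchar rule above
requires might look roughly like this. It is a sketch, not the code in urlt:
the names isPchar, escapeOctet, latin1Octets and encodePathInfoWith are
invented, and the character-to-octet step is passed in as a parameter
because the choice of charset is up to the application.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (intToDigit, isAsciiLower, isAsciiUpper, isDigit, ord, toUpper)
import Data.List (intercalate)
import Data.Word (Word8)

-- | Characters a segment may carry unescaped: unreserved plus the extra
-- characters listed in the pchar rule. Note that ';', '/' and '%' are not
-- in this set, so they are always escaped.
isPchar :: Char -> Bool
isPchar c =
  isAsciiUpper c || isAsciiLower c || isDigit c
    || c `elem` "-_.!~*'()"  -- unreserved marks
    || c `elem` ":@&=+$,"    -- extra characters permitted by pchar

-- | Render one octet as %XX with uppercase hex digits.
escapeOctet :: Word8 -> String
escapeOctet w = ['%', hex (w `shiftR` 4), hex (w .&. 0x0F)]
  where
    hex = toUpper . intToDigit . fromIntegral

-- | Stand-in charset for the example; only meaningful for code points
-- below 256. A real application would more likely pass a UTF-8 encoder.
latin1Octets :: Char -> [Word8]
latin1Octets c = [fromIntegral (ord c)]

-- | Escape each segment independently, then join the segments with '/'.
encodePathInfoWith :: (Char -> [Word8]) -> [String] -> String
encodePathInfoWith toOctets = intercalate "/" . map (concatMap enc)
  where
    enc c
      | isPchar c = [c]
      | otherwise = concatMap escapeOctet (toOctets c)
```

Because '/' and ';' inside a segment are escaped, any String stays a single
segment with no special meaning. The sketch does nothing special for the
segments '.' and '..', so those would still need to be rejected or handled
separately.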