Network.URI module
Graham Klyne
gk at ninebynine.org
Mon Feb 16 16:12:37 EST 2004
I went rather quiet on this topic, partly because there has been a flurry
of late comments and changes to the RFC2396bis work-in-progress spec [1]
(some with the effect of simplifying the syntax).
I plan to update my implementation accordingly within the next 2 weeks, and
add some (separate) path-normalization logic so that I can run all of my
test cases.
Meanwhile, there were a couple of questions in my previous message to which
I've not noticed any answers...
What are the ground rules for potentially non-backward-compatible changes?
What are the procedures for lodging new releases (CVS?)?
I also have a proposal to change the code structure while (mostly) preserving
backward compatibility (details below).
#g
--
[1] http://gbiv.com/protocols/uri/rev-2002/rfc2396bis.html
At 10:44 03/02/04 +0000, Simon Marlow wrote:
> > (1) Network.URI
> >
> > I've written a new parser, and extended the module interface
> > slightly, thus:
>
>[snip]
>
>If you'd like to become the maintainer of this module and incorporate
>your changes, you're entirely welcome (I'm not using this code actively
>at the moment).
I'd be happy to do that.
What are the ground rules for potentially non-backward-compatible changes?
What are the procedures for lodging new releases (CVS?)?
>You mentioned that there were problems with the existing implementation
>- perhaps you could explain further? As far as I'm aware, the regular
>expression and the test cases were taken directly from RFC 2396, and the
>implementation was correct at the time - did something change? The
>current testcases are in testsuite/tests/ghc-regress/lib/net/uri001.hs.
I don't think it was a problem with the regular expression per se:
(a) the regex in RFC2396 doesn't tell you (reliably) if a URI is or is not
valid. What it does do, assuming a valid URI is presented, is pick
apart the various components.
(b) I would have stuck with the regex-based implementation here, except
that the regex module used is not available on Windows. For me, it was
easier to construct a URI parser using Parsec, which doesn't depend on
system-dependent modules.
(c) there are some small changes in syntax that might affect the regex
implementation: reserving '[' and ']' for use in IPv6 literals comes to
mind. I haven't checked the details.
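To illustrate point (a), here is a rough sketch of a splitter that mimics what the RFC 2396 Appendix B regex does. It is a hand-written stand-in using only the Prelude, not the code in the existing module, and the name splitURI is made up for this example. It keeps the punctuation attached to each component.

```haskell
-- Hypothetical stand-in for the RFC 2396 Appendix B regex:
--   ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
-- It only splits a string into components; it never rejects anything.
splitURI :: String -> (String, String, String, String, String)
splitURI s0 = (scheme, auth, path, query, frag)
  where
    -- Scheme: one or more characters up to the first ':', provided no
    -- '/', '?' or '#' comes first.
    (scheme, s1) = case break (`elem` ":/?#") s0 of
        (sch@(_:_), ':':rest) -> (sch ++ ":", rest)
        _                     -> ("", s0)
    -- Authority: "//" followed by everything up to '/', '?' or '#'.
    (auth, s2) = case s1 of
        '/':'/':rest -> let (a, r) = break (`elem` "/?#") rest
                        in ("//" ++ a, r)
        _            -> ("", s1)
    -- Path: everything up to '?' or '#'.
    (path, s3) = break (`elem` "?#") s2
    -- Query: a leading '?' and everything up to '#'.
    (query, s4) = break (== '#') s3
    -- Fragment: whatever is left, including the leading '#'.
    frag = s4
```

Given "http://example.org/path/resource#" this yields the five pieces one would expect, but it splits a string such as "ht tp://exa mple/<>" just as readily, which is why component splitting alone cannot serve as a validity check.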
My parser follows the syntax in the RFC2396bis proposal very closely. As
such, it will reject some URIs that the regex implementation would
accept. My own test suite includes all the RFC2396 test cases. (The
RFC2396bis proposal has already had extensive review and broad consensus in the
URI working group; my Haskell work is providing some implementation feedback.)
The problems that I did note with the behaviour of the current implementation
are covered below...
> > I have some concerns about the way URI strings are reassembled from
> > the component parts using the current URI module interface (e.g.
> > problem with empty fragment handling noted in a previous message). I
> > think the URI implementation should be changed so that all the
> > punctuation characters ("//", "?", "#", etc.) are stored as part of
> > the component values in a URI structure, but I don't know what impact
> > that might have on existing code.
>
>If that's an unforced change I'd vote to keep the current behaviour, to
>avoid breaking code.
It's not entirely "unforced"... it has to do with the way a URI is stored
internally, and the consequences for reconstructing a URI string from the
URI components; e.g.
    file:///path/name
is reconstructed as:
    file:/path/name

    http://example.org/path/resource#
is reconstructed as:
    http://example.org/path/resource
I have a question [1] outstanding with the URI WG about the validity of the
first, and I do believe that the second is incorrect (there has been some
discussion that the presence of a fragment is significant in some web
applications). There is a general presumption in Web circles that a URI
should be used in exactly the form given; cf. [2].
[1] http://www.w3.org/mid/5.1.0.14.2.20040202132114.00bd6ec8@127.0.0.1
(Can't get proper URI yet ... lists.w3.org is down as I write)
[2] http://www.w3.org/2001/tag/webarch/#lc-uri-chars
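The loss of information can be sketched with a simplified model. The type BareURI and the function showBare below are invented for this illustration and are only assumed to resemble how the current module behaves: components are held without their punctuation, and the punctuation is put back only when a component is non-empty.

```haskell
-- Hypothetical simplified model, NOT the actual Network.URI type:
-- each component is stored without its distinguishing punctuation.
data BareURI = BareURI
    { bScheme    :: String   -- e.g. "http"        (no trailing ':')
    , bAuthority :: String   -- e.g. "example.org" (no leading "//")
    , bPath      :: String   -- e.g. "/path/resource"
    , bQuery     :: String   -- e.g. "query"       (no leading '?')
    , bFragment  :: String   -- e.g. "frag"        (no leading '#')
    }

-- Reassemble a URI string, adding punctuation only for non-empty
-- components (an assumption about the reconstruction rule).
showBare :: BareURI -> String
showBare u = concat
    [ if null (bScheme u)    then "" else bScheme u ++ ":"
    , if null (bAuthority u) then "" else "//" ++ bAuthority u
    , bPath u
    , if null (bQuery u)     then "" else "?" ++ bQuery u
    , if null (bFragment u)  then "" else "#" ++ bFragment u
    ]

-- "file:///path/name" has an empty authority, so the "//" is lost.
fileExample :: BareURI
fileExample = BareURI "file" "" "/path/name" "" ""

-- "http://example.org/path/resource#" has an empty fragment, which is
-- indistinguishable here from having no fragment at all.
fragExample :: BareURI
fragExample = BareURI "http" "example.org" "/path/resource" "" ""
```

In this model an empty component and an absent component have the same representation, so showBare fileExample gives "file:/path/name" and showBare fragExample gives "http://example.org/path/resource", matching the two cases above.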
I suppose it would be possible to make a new implementation of the URI
structure that presents the same interface, but remembers the presence of
empty fields, but I'm concerned that would be locking in undesirable
complexity and propagating a debatable design. (Question: why would one
wish the URI components to be stored without their distinguishing punctuation?)
...
Here's a proposal:
(a) change URI thus:
[[
data URI = URI
    { uriScheme    :: String    -- ^ @http:@
    , uriAuthority :: String    -- ^ @//www.haskell.org@
    , uriPath      :: String    -- ^ @/ghc@
    , uriQuery     :: String    -- ^ @?query@
    , uriFragment  :: String    -- ^ @#frag@
    }
]]
(b) implement access functions that behave like the original field selectors.
Then the visible change in behaviour would be that 'show' of any URI would
reconstruct exactly the string supplied to construct it. If it turns out
that the alternative access functions are not needed, they could be dropped
in a later revision (hmmm... do Haskell implementations offer a deprecated
flag?).
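A minimal sketch of how (a) and (b) might fit together follows. The Show instance and the access function names (schemeName, authorityName, queryText, fragmentText) are assumptions made up for this example, not part of any existing interface.

```haskell
-- The proposed record: every component keeps its own punctuation.
data URI = URI
    { uriScheme    :: String    -- ^ @http:@
    , uriAuthority :: String    -- ^ @//www.haskell.org@
    , uriPath      :: String    -- ^ @/ghc@
    , uriQuery     :: String    -- ^ @?query@
    , uriFragment  :: String    -- ^ @#frag@
    }

-- Because punctuation is stored, 'show' is plain concatenation and
-- gives back the original string.
instance Show URI where
    show u = concat
        [ uriScheme u, uriAuthority u, uriPath u
        , uriQuery u, uriFragment u ]

-- Hypothetical access functions that strip the punctuation again, to
-- behave like the original (punctuation-free) field selectors.
schemeName :: URI -> String
schemeName = takeWhile (/= ':') . uriScheme

authorityName :: URI -> String
authorityName u = case uriAuthority u of
    '/':'/':a -> a
    a         -> a

queryText :: URI -> String
queryText u = case uriQuery u of
    '?':q -> q
    q     -> q

fragmentText :: URI -> String
fragmentText u = case uriFragment u of
    '#':f -> f
    f     -> f
```

With this layout, URI "http:" "//example.org" "/path/resource" "" "#" shows as the string with its trailing "#" intact, while fragmentText still returns the empty string that older code would expect.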
#g
------------
Graham Klyne
For email:
http://www.ninebynine.org/#Contact