Network.URI module

Mon Feb 16 16:12:37 EST 2004

I went rather quiet on this topic, partly because there has been a flurry 
of late comments on, and changes to, the RFC2396bis work-in-progress spec [1] 
(some with the effect of simplifying the syntax).

I plan to update my implementation accordingly within the next 2 weeks, and 
add some (separate) path-normalization logic so that I can run all of my 
test cases.

Meanwhile, my previous message contained a couple of questions and a 
proposal to which I've not noticed any answers:

   What are the ground rules for potentially non-backward-compatible changes?
   What are the procedures for lodging new releases (CVS?)
   A proposal to change the code structure while (mostly) preserving
   backward compatibility (details below).

#g
--

[1] http://gbiv.com/protocols/uri/rev-2002/rfc2396bis.html

At 10:44 03/02/04 +0000, Simon Marlow wrote:
> > (1) Network.URI
> >
> > I've written a new parser, and extended the module interface
> > slightly, thus:
>
>[snip]
>
>If you'd like to become the maintainer of this module and incorporate
>your changes, you're entirely welcome (I'm not using this code actively
>at the moment).

I'd be happy to do that.

What are the ground rules for potentially non-backward-compatible changes?

What are the procedures for lodging new releases (CVS?).

>You mentioned that there were problems with the existing implementation
>- perhaps you could explain further?  As far as I'm aware, the regular
>expression and the test cases were taken directly from RFC 2396, and the
>implementation was correct at the time - did something change?  The
>current testcases are in testsuite/tests/ghc-regress/lib/net/uri001.hs.

I don't think it was a problem with the regular expression per se:
(a) the regex in RFC2396 doesn't tell you (reliably) whether a URI is 
valid.  What it does do, assuming a valid URI is presented, is pick 
apart the various components.
(b) I would have stuck with the regex-based implementation here, except 
that the regex module used is not available on Windows.  For me, it was 
easier to construct a URI parser using Parsec, which doesn't depend on 
system-dependent modules.
(c) there are some small changes in syntax that might affect the regex 
implementation:  reserving '[' and ']' for use in IPv6 literals comes to 
mind.  I haven't checked the details.
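
To illustrate point (a), here is a rough sketch of a hand-written splitter that mirrors the shape of the component regex in RFC 2396 Appendix B. It is not the regex-based code from the existing module, and the name splitURI is made up for this example. It happily picks apart strings that are not valid URIs, which is the limitation described above.

```haskell
-- Hypothetical splitter following the shape of the RFC 2396 Appendix B regex:
--   ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
-- Each component keeps its punctuation, so concatenating the five parts
-- gives back the input.  No validation is performed.
splitURI :: String -> (String, String, String, String, String)
splitURI s0 = (sch, auth, pth, qry, frg)
  where
    -- scheme: a non-empty run of non-delimiters followed by ':'
    (sch, s1) = case break (`elem` ":/?#") s0 of
                  (x@(_:_), ':':rest) -> (x ++ ":", rest)
                  _                   -> ("", s0)
    -- authority: "//" followed by anything up to the next '/', '?' or '#'
    (auth, s2) = case s1 of
                  '/':'/':rest -> let (a, r) = break (`elem` "/?#") rest
                                  in ("//" ++ a, r)
                  _            -> ("", s1)
    -- path: everything up to '?' or '#'
    (pth, s3)  = break (`elem` "?#") s2
    -- query: from '?' up to '#'; fragment: from '#' to the end
    (qry, frg) = break (== '#') s3

-- splitURI "http://example.org/path/resource#"
--   == ("http:", "//example.org", "/path/resource", "", "#")
-- splitURI "http://[bad host/%zz"
--   == ("http:", "//[bad host", "/%zz", "", "")   -- split, but not a valid URI
```

A real parser for the RFC2396bis grammar would reject the second input rather than return components for it.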

My parser follows the syntax in the RFC2396bis proposal very closely.   As 
such, it will reject some URIs that the regex implementation would 
accept.  My own test suite includes all the RFC2396 test cases.  (The 
RFC2396bis proposal has already had extensive review and has broad consensus 
in the URI working group;  my Haskell work is providing some implementation 
feedback.)

The problems with behaviour of the current implementation that I did note 
are covered below...

> > I have some concerns about the way URI strings are
> > reassembled from the
> > component parts using the current URI module interface (e.g.
> > problem with
> > empty fragment handling noted in a previous message).  I
> > think the URI
> > implementation should be changed so that all the punctuation
> > characters
> > ("//", "?", "#", etc.) are stored as part of the component
> > values in a URI
> > structure, but I don't know what impact that might have on
> > existing code.
>
>If that's an unforced change I'd vote to keep the current behaviour, to
>avoid breaking code.

It's not entirely "unforced"...  it has to do with the way a URI is stored 
internally, and the consequences for reconstructing a URI string from the 
URI components; e.g.

     file:///path/name
is reconstructed as:
     file:/path/name

     http://example.org/path/resource#
is reconstructed as:
     http://example.org/path/resource

I have a question [1] outstanding with the URI WG about the validity of the 
first, and do believe that the second is incorrect (there has been some 
discussion that the presence of a fragment is significant in some web 
applications).  There is a general presumption in Web circles that a URI 
should be used in exactly the form given; cf. [2].

[1] http://www.w3.org/mid/5.1.0.14.2.20040202132114.00bd6ec8@127.0.0.1
(Can't get proper URI yet ... lists.w3.org is down as I write)

[2] http://www.w3.org/2001/tag/webarch/#lc-uri-chars
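
The two reconstructions above can be reproduced with a simplified model of a URI record whose fields are stored without punctuation. This is only a sketch under that assumption, not the code in the existing module; BareURI, render and the field names are invented for the example.

```haskell
-- Hypothetical, simplified model: components are stored WITHOUT their
-- punctuation, so an empty field and an absent field look the same.
data BareURI = BareURI
    { bScheme    :: String   -- e.g. "http"
    , bAuthority :: String   -- e.g. "example.org"
    , bPath      :: String   -- e.g. "/path/resource"
    , bQuery     :: String   -- e.g. "query"
    , bFragment  :: String   -- e.g. "frag"
    } deriving (Eq, Show)

-- Reassemble a string, emitting punctuation only for non-empty fields.
render :: BareURI -> String
render u = concat
    [ if null (bScheme u)    then "" else bScheme u ++ ":"
    , if null (bAuthority u) then "" else "//" ++ bAuthority u
    , bPath u
    , if null (bQuery u)     then "" else "?" ++ bQuery u
    , if null (bFragment u)  then "" else "#" ++ bFragment u
    ]

-- "file:///path/name" has an empty authority, which this model cannot
-- distinguish from no authority at all.
fileURI :: BareURI
fileURI = BareURI "file" "" "/path/name" "" ""

-- "http://example.org/path/resource#" has an empty fragment, which this
-- model cannot distinguish from no fragment at all.
emptyFragURI :: BareURI
emptyFragURI = BareURI "http" "example.org" "/path/resource" "" ""

-- render fileURI      == "file:/path/name"
-- render emptyFragURI == "http://example.org/path/resource"
```

In both cases the information needed to restore the original string is already gone by the time render is called.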

I suppose it would be possible to make a new implementation of the URI 
structure that presents the same interface while remembering the presence 
of empty fields, but I'm concerned that this would lock in undesirable 
complexity and propagate a debatable design.  (Question:  why would one 
wish the URI components to be stored without their distinguishing punctuation?)

...

Here's a proposal:

(a) change URI thus:

[[
data URI = URI
     { uriScheme    :: String   -- ^ @http:@
     , uriAuthority :: String   -- ^ @//www.haskell.org@
     , uriPath      :: String   -- ^ @/ghc@
     , uriQuery     :: String   -- ^ @?query@
     , uriFragment  :: String   -- ^ @#frag@
     }
]]

(b) implement access functions that behave like the original field selectors.

Then the visible change in behaviour would be that 'show' of any URI would 
reconstruct exactly the string supplied to construct it.  If it turns out 
that the alternative access functions are not needed, they could be dropped 
in a later revision (hmmm... do Haskell implementations offer a deprecated 
flag?).
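
A minimal sketch of how (a) and (b) might fit together follows. The record is the one proposed above; the accessor names (scheme, authority, path, query, fragment) and the helper dropPrefix are assumptions made for this example, standing in for whatever the original field selectors are called.

```haskell
import Data.List (stripPrefix)
import Data.Maybe (fromMaybe)

-- Proposed record: each component carries its own punctuation.
data URI = URI
    { uriScheme    :: String   -- e.g. "http:"
    , uriAuthority :: String   -- e.g. "//www.haskell.org"
    , uriPath      :: String   -- e.g. "/ghc"
    , uriQuery     :: String   -- e.g. "?query"
    , uriFragment  :: String   -- e.g. "#frag"
    } deriving Eq

-- 'show' is plain concatenation, so it reproduces the original string.
instance Show URI where
    show (URI s a p q f) = s ++ a ++ p ++ q ++ f

-- Hypothetical helper: remove a prefix if it is present.
dropPrefix :: String -> String -> String
dropPrefix pre s = fromMaybe s (stripPrefix pre s)

-- Hypothetical access functions that return the components without
-- punctuation, as the old-style selectors are assumed to have done.
scheme :: URI -> String
scheme u = case uriScheme u of
    "" -> ""
    s | last s == ':' -> init s
      | otherwise     -> s

authority :: URI -> String
authority = dropPrefix "//" . uriAuthority

path :: URI -> String
path = uriPath

query :: URI -> String
query = dropPrefix "?" . uriQuery

fragment :: URI -> String
fragment = dropPrefix "#" . uriFragment

-- show (URI "http:" "//example.org" "/path/resource" "" "#")
--   == "http://example.org/path/resource#"
-- fragment (URI "http:" "//example.org" "/path/resource" "" "#") == ""
```

With this layout the empty-fragment and empty-authority cases survive a round trip, while code written against the bare accessors would still see the values it expects.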

#g

------------
Graham Klyne
For email:
http://www.ninebynine.org/#Contact