[Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?
Jason Dusek
jason.dusek at gmail.com
Mon Mar 12 06:46:56 CET 2012
2012/3/12 Jeremy Shaw <jeremy at n-heptane.com>:
> > The syntax of URIs is a mechanism for describing data octets,
> > not Unicode code points. It is at variance to describe URIs in
> > terms of Unicode code points.
>
> Not sure what you mean by this. As the RFC says, a URI is defined entirely
> by the identity of the characters that are used. There is definitely no
> single, correct byte sequence for representing a URI. If I give you a
> sequence of bytes and tell you it is a URI, the only way to decode it is to
> first know what encoding the byte sequence represents: ASCII, UTF-16, etc.
> Once you have decoded the byte sequence into a sequence of characters, only
> then can you parse the URI.
Hmm. Well, I have been reading the spec the other way around:
first you parse the URI to get the bytes, then you use encoding
information to interpret the bytes. This curious passage from
Section 2.5 seems relevant here:
For most systems, an unreserved character appearing within a URI
component is interpreted as representing the data octet corresponding
to that character's encoding in US-ASCII. Consumers of URIs assume
that the letter "X" corresponds to the octet "01011000", and even
when that assumption is incorrect, there is no harm in making it. A
system that internally provides identifiers in the form of a
different character encoding, such as EBCDIC, will generally perform
character translation of textual identifiers to UTF-8 [STD63] (or
some other superset of the US-ASCII character encoding) at an
internal interface, thereby providing more meaningful identifiers
than those resulting from simply percent-encoding the original
octets.
I am really not sure how to interpret this. I have been reading
'%' in the grammar productions as the octet 0b00100101, and I have
written my parser that way; but that is probably backwards thinking.
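For concreteness, the octet-level reading I mean is roughly this
(just a sketch, not the actual code from my library):

    import Data.Char (ord, isHexDigit, digitToInt)
    import Data.Word (Word8)

    -- A percent escape stands for one raw data octet; any other
    -- character stands for its US-ASCII octet. No character decoding
    -- happens at this layer.
    decodeOctets :: String -> Maybe [Word8]
    decodeOctets []       = Just []
    decodeOctets ('%':h:l:rest)
      | isHexDigit h && isHexDigit l =
          fmap (fromIntegral (digitToInt h * 16 + digitToInt l) :)
               (decodeOctets rest)
    decodeOctets ('%':_)  = Nothing  -- malformed escape
    decodeOctets (c:rest) =
      fmap (fromIntegral (ord c) :) (decodeOctets rest)

So decodeOctets "foo%2Fbar" yields the octets for "foo/bar", and any
interpretation of those octets as text is left to a later stage.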
> ...let's say we have the path segments ["foo", "bar/baz"] and we wish to use
> them in the path info of a URI. Because / is a special character it must be
> percent encoded as %2F. So, the path info for the url would be:
>
> foo/bar%2Fbaz
>
> If we had the path segments, ["foo","bar","baz"], however that would be
> encoded as:
>
> foo/bar/baz
>
> Now let's look at decoding the path. If we simply decode the percent encoded
> characters and give the user a ByteString, then both urls will decode to:
>
> pack "foo/bar/baz"
>
> Which is incorrect. ["foo", "bar/baz"] and ["foo","bar","baz"] represent
> different paths. The percent encoding there is required to distinguish
> between the two unique paths.
I read the section on paths differently: a path is a sequence of
bytes in which slash runs are not permitted, among other rules.
However, re-reading the section, much is made of hierarchical
data and path normalization; it really seems your interpretation
is the correct one. I tried it out with cURL, for example:
http://www.ietf.org/rfc%2Frfc3986.txt # 404 Not Found
http://www.ietf.org/rfc/rfc3986.txt # 200 OK
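A rough sketch of that segment-level handling, with hypothetical
helpers rather than anything from Network.URI or my own library:

    import Data.Char (chr, isHexDigit, digitToInt)
    import Data.List (intercalate)

    -- Escape '/' (and '%') within each segment, then join with '/'.
    encodeSegment :: String -> String
    encodeSegment = concatMap escape
      where
        escape '/' = "%2F"
        escape '%' = "%25"
        escape c   = [c]

    encodePath :: [String] -> String
    encodePath = intercalate "/" . map encodeSegment

    -- Split on '/' first, and only then unescape each segment, so
    -- "foo/bar%2Fbaz" and "foo/bar/baz" stay distinct.
    decodePath :: String -> [String]
    decodePath = map unescape . splitOn '/'
      where
        splitOn d s = case break (== d) s of
          (seg, [])       -> [seg]
          (seg, _ : rest) -> seg : splitOn d rest
        unescape ('%':h:l:rest)
          | isHexDigit h && isHexDigit l =
              chr (digitToInt h * 16 + digitToInt l) : unescape rest
        unescape (c:rest) = c : unescape rest
        unescape []       = []

With these, encodePath ["foo", "bar/baz"] gives "foo/bar%2Fbaz" while
encodePath ["foo", "bar", "baz"] gives "foo/bar/baz", and decodePath
recovers the original segment lists from both.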
My recently released URL parser/pretty-printer is actually wrong
in its handling of paths and, when corrected, will only amount to
a parser for URLs encoded in US-ASCII and supersets thereof.
--
Jason Dusek
pgp // solidsnack // C1EBC57DC55144F35460C8DF1FD4C6C1FED18A2B