[Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?

Sun Mar 11 19:33:27 CET 2012

2012/3/11 Jeremy Shaw <jeremy at n-heptane.com>:
> Also, URIs are not defined in terms of octets.. but in terms
> of characters.  If you write a URI down on a piece of paper --
> what octets are you using?  None.. it's some scribbles on a
> paper. It is the characters that are important, not the bit
> representation.

Well, to quote one example from RFC 3986:

  2.1.  Percent-Encoding

   A percent-encoding mechanism is used to represent a data octet in a
   component when that octet's corresponding character is outside the
   allowed set or is being used as a delimiter of, or within, the
   component.

The syntax of URIs is a mechanism for describing data octets,
not Unicode code points. Describing URIs in terms of Unicode
code points is at variance with the RFC.
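
To make the quoted passage concrete, here is a minimal sketch of
percent-encoding as a function from octets to URI characters. The
name pctEncode is hypothetical (it is not part of Network.URI), and
for simplicity it leaves only the RFC's "unreserved" set unescaped:

```haskell
import qualified Data.ByteString as B
import Data.Char (chr, intToDigit, isAlphaNum, isAscii, toUpper)
import Data.Word (Word8)

-- Hypothetical helper: render raw octets as URI characters,
-- escaping every octet outside RFC 3986's "unreserved" set.
pctEncode :: B.ByteString -> String
pctEncode = concatMap enc . B.unpack
  where
    enc :: Word8 -> String
    enc w
      | isUnreserved c = [c]
      | otherwise      = ['%', hex (w `div` 16), hex (w `mod` 16)]
      where
        c = chr (fromIntegral w)

    -- One uppercase hex digit for a value in the range 0..15.
    hex :: Word8 -> Char
    hex = toUpper . intToDigit . fromIntegral

    -- unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
    isUnreserved :: Char -> Bool
    isUnreserved ch = isAscii ch && (isAlphaNum ch || ch `elem` "-._~")
```

The input is a sequence of octets; nothing in the function needs to
know which character encoding, if any, those octets came from.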

> If you render a URI in a utf-8 encoded document versus a
> utf-16 encoded document.. the octets will be different, but
> the meaning will be the same. Because it is the characters
> that are important. For a URI Text would be a more compact
> representation than String.. but ByteString is a bit dodgy
> since it is not well defined what those bytes represent.
> (though if you use a newtype wrapper around ByteString to
> declare that it is Ascii, then that would be fine).

That is all well and good for what a URI is parsed from and
what it is serialized to; but once parsed, the major components
of a URI are all octets, pure and simple. Take the "host" part
of the authority:

  host        = IP-literal / IPv4address / reg-name
  ...
  reg-name    = *( unreserved / pct-encoded / sub-delims )

The reg-name production is enough to show that, once the host
portion is parsed and its pct-encoded triplets are decoded, it
could contain any bytes whatever. ByteString is the only correct
representation for a parsed host and userinfo, as well as a
parsed path, query or fragment.
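
As a rough illustration of that point, here is a sketch of decoding
a component that has already been split out of a URI. The name
pctDecode is hypothetical, not an existing library function, and the
sketch assumes the input is ASCII and does not check which literal
characters the grammar allows in a given component:

```haskell
import qualified Data.ByteString as B
import Data.Char (digitToInt, isHexDigit, ord)
import Data.Word (Word8)

-- Hypothetical helper: turn the characters of a parsed component
-- into the octets they denote. Returns Nothing on a malformed
-- escape or a non-ASCII character.
pctDecode :: String -> Maybe B.ByteString
pctDecode = fmap B.pack . go
  where
    go :: String -> Maybe [Word8]
    go [] = Just []
    go ('%' : h : l : rest)
      | isHexDigit h && isHexDigit l =
          (fromIntegral (16 * digitToInt h + digitToInt l) :) <$> go rest
    go ('%' : _) = Nothing            -- truncated or non-hex escape
    go (c : rest)
      | ord c < 128 = (fromIntegral (ord c) :) <$> go rest
      | otherwise   = Nothing         -- not a URI character
```

With this, pctDecode "%FF%FE" yields the two octets 0xFF 0xFE, which
are not valid UTF-8, so a result type of Text or String would have
to either reject the input or guess at an encoding.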

--
Jason Dusek
pgp  ///  solidsnack  1FD4C6C1 FED18A2B