[Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?

Sun Mar 11 22:39:43 CET 2012

2012/3/11 Brandon Allbery <allbery.b at gmail.com>:
> On Sun, Mar 11, 2012 at 14:33, Jason Dusek <jason.dusek at gmail.com> wrote:
> > The syntax of URIs is a mechanism for describing data octets,
> > not Unicode code points. It is at variance to describe URIs in
> > terms of Unicode code points.
>
> You might want to take a glance at RFC 3492, though.

RFC 3492 covers Punycode, an approach to internationalized
domain names. The relationship of RFC 3986 to the restrictions
on the syntax of host names, as given by the DNS, is not simple.
On the one hand, we have:

  This specification does not mandate a particular registered
  name lookup technology and therefore does not restrict the
  syntax of reg-name beyond what is necessary for
  interoperability.

The production for reg-name is very liberal about allowable
octets:

  reg-name    = *( unreserved / pct-encoded / sub-delims )
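
To make the production concrete, here is a rough sketch of it as a
predicate over a String. This is my own illustration, not anything
exported by Network.URI; the names isUnreserved, isSubDelim and
isRegName are made up, and the character classes are copied from
sections 2.2 and 2.3 of RFC 3986.

```haskell
import Data.Char (isAlphaNum, isAscii, isHexDigit)

-- unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
isUnreserved :: Char -> Bool
isUnreserved c = (isAscii c && isAlphaNum c) || c `elem` "-._~"

-- sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
isSubDelim :: Char -> Bool
isSubDelim c = c `elem` "!$&'()*+,;="

-- reg-name = *( unreserved / pct-encoded / sub-delims )
-- (hypothetical helper: True when the whole string matches the production)
isRegName :: String -> Bool
isRegName [] = True
isRegName ('%' : h1 : h2 : rest)
  | isHexDigit h1 && isHexDigit h2 = isRegName rest  -- pct-encoded
  | otherwise = False
isRegName (c : rest)
  | c /= '%' && (isUnreserved c || isSubDelim c) = isRegName rest
  | otherwise = False

main :: IO ()
main = mapM_ (print . isRegName)
  [ "example.com"           -- True
  , "b%C3%BCcher.example"   -- True: non-ASCII octets, percent-encoded
  , "b\252cher.example"     -- False: raw non-ASCII character
  , "bad%zz"                -- False: malformed escape
  ]
```

Note that the grammar itself accepts any octet behind a percent sign;
nothing at this level says the octets have to mean anything.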

However, we also have:

  The reg-name syntax allows percent-encoded octets in order to
  represent non-ASCII registered names in a uniform way that is
  independent of the underlying name resolution technology.
  Non-ASCII characters must first be encoded according to
  UTF-8...

So the argument for representing reg-names as Text is pretty
strong: under these rules, the only data a reg-name can represent
is, indeed, a sequence of Unicode code points.
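
A minimal sketch of what that would look like, assuming the text and
bytestring packages: undo the percent-encoding to get octets, then
insist that the octets are valid UTF-8. The functions regNameOctets
and regNameText are hypothetical names of mine, not part of any
existing library, and the sketch does not check that unescaped
characters are in unreserved or sub-delims.

```haskell
import qualified Data.ByteString as B
import Data.Char (digitToInt, isAscii, isHexDigit, ord)
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')
import Data.Word (Word8)

-- Turn a reg-name into raw octets: "%XX" becomes one octet and any
-- other ASCII character stands for itself. Nothing on a malformed
-- escape or a raw non-ASCII character.
regNameOctets :: String -> Maybe [Word8]
regNameOctets [] = Just []
regNameOctets ('%' : h1 : h2 : rest)
  | isHexDigit h1 && isHexDigit h2 =
      (fromIntegral (16 * digitToInt h1 + digitToInt h2) :)
        <$> regNameOctets rest
regNameOctets (c : rest)
  | c /= '%' && isAscii c = (fromIntegral (ord c) :) <$> regNameOctets rest
  | otherwise = Nothing

-- Decode the octets as UTF-8; invalid sequences yield Nothing.
regNameText :: String -> Maybe T.Text
regNameText s = do
  octets <- regNameOctets s
  either (const Nothing) Just (decodeUtf8' (B.pack octets))

main :: IO ()
main = do
  print (regNameText "b%C3%BCcher.example")  -- Just "b\252cher.example"
  print (regNameText "%FF.example")          -- Nothing: 0xFF is not valid UTF-8
```

The second case is the interesting one: it matches the reg-name
grammar, but under the UTF-8 rule quoted above it does not name
anything.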

--
Jason Dusek
pgp // solidsnack // C1EBC57DC55144F35460C8DF1FD4C6C1FED18A2B