[Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?
Jason Dusek
jason.dusek at gmail.com
Sun Mar 11 22:39:43 CET 2012
2012/3/11 Brandon Allbery <allbery.b at gmail.com>:
> On Sun, Mar 11, 2012 at 14:33, Jason Dusek <jason.dusek at gmail.com> wrote:
> > The syntax of URIs is a mechanism for describing data octets,
> > not Unicode code points. It is at variance to describe URIs in
> > terms of Unicode code points.
>
> You might want to take a glance at RFC 3492, though.
RFC 3492 covers Punycode, an approach to internationalized
domain names. The relationship of RFC 3986 to the restrictions
on the syntax of host names, as given by the DNS, is not simple.
On the one hand, we have:

    This specification does not mandate a particular registered
    name lookup technology and therefore does not restrict the
    syntax of reg-name beyond what is necessary for
    interoperability.

The production for reg-name is very liberal about allowable
octets:

    reg-name = *( unreserved / pct-encoded / sub-delims )

However, we also have:

    The reg-name syntax allows percent-encoded octets in order to
    represent non-ASCII registered names in a uniform way that is
    independent of the underlying name resolution technology.
    Non-ASCII characters must first be encoded according to
    UTF-8...

The argument for representing reg-names as Text is pretty strong
since the only representable data under these rules is, indeed,
Unicode code points.
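
To make that concrete, here is a minimal sketch of what "reg-name as
Text" could look like. It is not how Network.URI does it; the names
unescape and regNameToText are made up for illustration, and it
assumes the bytestring and text packages are available. It expands
percent-escapes to octets and then insists that the octets are valid
UTF-8. It does not check that the remaining characters fall within
unreserved / sub-delims.

```haskell
module Main where

import qualified Data.ByteString as B
import Data.Char (digitToInt, isHexDigit)
import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')
import Data.Word (Word8)

-- | Expand percent-escapes into raw octets (hypothetical helper).
-- Returns Nothing on a malformed escape or a literal non-ASCII character.
unescape :: String -> Maybe [Word8]
unescape [] = Just []
unescape ('%' : h : l : rest)
  | isHexDigit h && isHexDigit l =
      (fromIntegral (16 * digitToInt h + digitToInt l) :) <$> unescape rest
unescape ('%' : _) = Nothing -- '%' not followed by two hex digits
unescape (c : rest)
  | c < '\x80' = (fromIntegral (fromEnum c) :) <$> unescape rest
  | otherwise = Nothing -- non-ASCII must arrive percent-encoded

-- | Decode a reg-name to Text (hypothetical helper), rejecting octet
-- sequences that are not valid UTF-8.
regNameToText :: String -> Maybe Text
regNameToText s = do
  octets <- unescape s
  either (const Nothing) Just (decodeUtf8' (B.pack octets))

main :: IO ()
main = do
  -- %C3%BC is the UTF-8 encoding of U+00FC, so this decodes to a Just.
  print (T.unpack <$> regNameToText "b%C3%BCcher.example")
  -- 0xFF never occurs in well-formed UTF-8, so this is Nothing.
  print (T.unpack <$> regNameToText "%FF.example")
```

Under the rule quoted above, every reg-name that survives this
function is a sequence of code points, which is why Text looks like
a reasonable carrier for that one component, even if the rest of
the URI is better treated as octets.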
--
Jason Dusek
pgp // solidsnack // C1EBC57DC55144F35460C8DF1FD4C6C1FED18A2B