[Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?

Mon Mar 12 04:05:58 CET 2012

2012/3/11 Thedward Blevins <thedward at barsoom.net>:
> On Sun, Mar 11, 2012 at 13:33, Jason Dusek <jason.dusek at gmail.com> wrote:
> > The syntax of URIs is a mechanism for describing data octets,
> > not Unicode code points. It is at variance to describe URIs in
> > terms of Unicode code points.
>
> This claim is at odds with the RFC you quoted:
>
> 2. Characters
>
> The URI syntax provides a method of encoding data, presumably for the sake
> of identifying a resource, as a sequence of characters. The URI characters
> are, in turn, frequently encoded as octets for transport or presentation.
> This specification does not mandate any particular character encoding for
> mapping between URI characters and the octets used to store or transmit
> those characters.
>
> (Emphasis is mine)
>
> The RFC is specifically agnostic about serialization. I generally agree that
> there are a lot of places where ByteString should be used, but I'm not
> convinced this is one of them.

Hi Thedward,

I am CC'ing the list since you raise a good point that, I think,
reflects on the discussion broadly. It is true that the intent of
the spec is to allow encoding of characters and not of bytes: I
misread its intent, attending only to the productions. But due
to the way URIs interact with character encoding, a general URI
parser is constrained to work with ByteStrings, just the same.

The RFC "...does not mandate any particular character encoding
for mapping between URI characters and the octets used to store
or transmit those characters..." and Section 1.2.1 allows that
the encoding may depend on the scheme:

   In local or regional contexts and with improving technology, users
   might benefit from being able to use a wider range of characters;
   such use is not defined by this specification.  Percent-encoded
   octets (Section 2.1) may be used within a URI to represent characters
   outside the range of the US-ASCII coded character set if this
   representation is allowed by the scheme or by the protocol element in
   which the URI is referenced.

It seems possible for any octet, 0x00..0xFF, to show up in a
URI, and it is only after parsing the scheme that we can say
whether the octet belongs there or not. Thus a general URI
parser can only go as far as splitting into components and
percent decoding before handing off to scheme specific
validation rules (but that's a big help already!). I've
implemented a parser under these principles that handles
specifically URLs:

  http://hackage.haskell.org/package/URLb
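
To make the "split, then percent-decode, then hand off" idea concrete,
here is a minimal sketch over strict ByteStrings. It is not the URLb
API; the names splitScheme, percentDecode and hexVal are hypothetical
and chosen only for illustration. It assumes nothing about the
character encoding of the decoded octets, which is the point: that
decision is left to scheme-specific code.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- | Split "scheme:rest" at the first colon (0x3A). Hypothetical helper;
-- it does not validate the characters of the scheme itself.
splitScheme :: B.ByteString -> Maybe (B.ByteString, B.ByteString)
splitScheme bs = case B.break (== 0x3A) bs of
  (s, rest) | not (B.null s), not (B.null rest) -> Just (s, B.drop 1 rest)
  _ -> Nothing

-- | Percent-decode one already-split component into raw octets.
-- Returns Nothing on a malformed escape such as "%G1" or a trailing "%".
-- The result is deliberately a ByteString: any octet 0x00..0xFF may appear.
percentDecode :: B.ByteString -> Maybe B.ByteString
percentDecode = fmap B.pack . go . B.unpack
  where
    go [] = Just []
    go (0x25 : hi : lo : rest) = do   -- 0x25 is '%'
      h <- hexVal hi
      l <- hexVal lo
      (h * 16 + l :) <$> go rest
    go (0x25 : _) = Nothing           -- '%' without two following octets
    go (w : rest) = (w :) <$> go rest

-- | Value of a single ASCII hex digit, if it is one.
hexVal :: Word8 -> Maybe Word8
hexVal w
  | w >= 0x30 && w <= 0x39 = Just (w - 0x30)       -- '0'..'9'
  | w >= 0x41 && w <= 0x46 = Just (w - 0x41 + 10)  -- 'A'..'F'
  | w >= 0x61 && w <= 0x66 = Just (w - 0x61 + 10)  -- 'a'..'f'
  | otherwise = Nothing
```

Note that decoding is applied per component, after splitting; decoding
the whole URI first would turn an escaped delimiter such as %2F into a
real one. An input like "%FF" decodes to the single octet 0xFF, which
is not valid UTF-8 on its own, so a Text-based result type would have
to either reject it or guess.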

Although the intent of the spec is to represent characters, I
contend it does not succeed in doing so. Is it wise to assume
more semantics than are actually there? The Internet and UNIX
are full of broken junk; but faithful representation would seem
to be better than idealization for those occasions where we must
deal with them. I'm not sure the assumption of "textiness"
really helps much in practice since the Universal Character Set
contains control codes and bidi characters -- data that isn't
really text.

--
Jason Dusek
pgp // solidsnack // C1EBC57DC55144F35460C8DF1FD4C6C1FED18A2B