[Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?

Graham Klyne GK at ninebynine.org
Wed Mar 14 10:05:23 CET 2012


I only just noticed this discussion.  Essentially, I think you have arrived at 
the right conclusion regarding URIs.

For more background, the IRI document makes interesting reading in this context: 
http://tools.ietf.org/html/rfc3987; esp. sections 2, 2.1.

The IRI is defined in terms of Unicode characters, which themselves may be 
described/referenced in terms of their code points, but the character encoding 
is not prescribed.

In practice, I think systems are increasingly using UTF-8 for transmitting IRIs 
and URIs, and using either UTF-8 or UTF-16 for internal storage.  There is still 
a legacy of ISO-8859-1 being defined as the default charset for HTML (cf. 
http://www.w3.org/International/O-HTTP-charset for further discussion).
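
To make the "decode first, then parse" point concrete, here is a minimal
sketch rather than a definitive implementation. It assumes the text,
bytestring and network-uri packages are available. The helper name
parseUtf8URI and the example.org URI are made up for illustration; only
parseURI and uriPath come from Network.URI.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Network.URI (URI, parseURI, uriPath)

-- Hypothetical helper: treat the octets as UTF-8, and only then parse
-- the resulting characters as a URI.
parseUtf8URI :: B.ByteString -> Maybe URI
parseUtf8URI bytes =
  case TE.decodeUtf8' bytes of
    Left _  -> Nothing                   -- not valid UTF-8 at all
    Right t -> parseURI (T.unpack t)     -- the parser sees characters, not octets

main :: IO ()
main = do
  let uriText = T.pack "http://example.org/a%20b"  -- illustrative URI
      utf8    = TE.encodeUtf8 uriText
      utf16   = TE.encodeUtf16LE uriText
  -- The same characters yield different octet sequences per encoding.
  print (utf8 == utf16)                             -- False
  -- Decoding with the right encoding recovers the URI.
  print (fmap uriPath (parseUtf8URI utf8))          -- Just "/a%20b"
  -- Wrong assumption: the UTF-16 octets decode as UTF-8 only because the
  -- interleaved NULs are valid characters, and the result is not a URI.
  print (fmap uriPath (parseUtf8URI utf16))         -- Nothing
  -- Decoding the same octets as UTF-16LE works again.
  print (fmap uriPath (parseURI (T.unpack (TE.decodeUtf16LE utf16))))
                                                    -- Just "/a%20b"
```

Note that the percent-escape %20 survives parsing unchanged: it is part of
the character sequence, and is only turned back into an octet when a
component is unescaped.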


On 14/03/2012 06:43, Jason Dusek wrote:
> 2012/3/12 Jeremy Shaw<jeremy at n-heptane.com>:
>> On Sun, Mar 11, 2012 at 1:33 PM, Jason Dusek<jason.dusek at gmail.com>  wrote:
>>> Well, to quote one example from RFC 3986:
>>>   2.1.  Percent-Encoding
>>>    A percent-encoding mechanism is used to represent a data octet in a
>>>    component when that octet's corresponding character is outside the
>>>    allowed set or is being used as a delimiter of, or within, the
>>>    component.
>> Right. This describes how to convert an octet into a sequence of characters,
>> since the only thing that can appear in a URI is sequences of characters.
>>> The syntax of URIs is a mechanism for describing data octets,
>>> not Unicode code points. It is at variance to describe URIs in
>>> terms of Unicode code points.
>> Not sure what you mean by this. As the RFC says, a URI is defined entirely
>> by the identity of the characters that are used. There is definitely no
>> single, correct byte sequence for representing a URI. If I give you a
>> sequence of bytes and tell you it is a URI, the only way to decode it is to
>> first know what encoding the byte sequence represents: ASCII, UTF-16, etc.
>> Once you have decoded the byte sequence into a sequence of characters, only
>> then can you parse the URI.
> Mr. Shaw,
> Thanks for taking the time to explain all this. It's really
> helped me to understand a lot of parts of the URI spec a lot
> better. I have deprecated my module in the latest release
>    http://hackage.haskell.org/package/URLb-0.0.1
> because a URL parser working on bytes instead of characters
> stands out to me now as a confused idea.
> --
> Jason Dusek
> pgp  ///  solidsnack  1FD4C6C1 FED18A2B
