[Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?
Graham Klyne
GK at ninebynine.org
Wed Mar 14 10:05:23 CET 2012
Hi,
I only just noticed this discussion. Essentially, I think you have arrived at
the right conclusion regarding URIs.
For more background, the IRI document makes interesting reading in this context:
http://tools.ietf.org/html/rfc3987; esp. sections 2, 2.1.
The IRI is defined in terms of Unicode characters, which themselves may be
described/referenced in terms of their code points, but the character encoding
is not prescribed.
In practice, I think systems are increasingly using UTF-8 for transmitting IRIs
and URIs, and using either UTF-8 or UTF-16 for internal storage. There is still
a legacy of ISO-8859-1 being defined as the default charset for HTML (cf.
http://www.w3.org/International/O-HTTP-charset for further discussion).
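
To make the character/octet distinction concrete, here is a minimal sketch of
percent-encoding a single URI component, assuming UTF-8 is the chosen character
encoding. The names utf8, unreserved and percentEncode are my own illustrations,
not functions from Network.URI, and the code uses only the base library:

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (intToDigit, isAlphaNum, isAscii, ord, toUpper)
import Data.Word (Word8)

-- | UTF-8 octets for one Unicode character (hypothetical helper).
-- Surrogate code points are not rejected here; a real encoder should.
utf8 :: Char -> [Word8]
utf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6,  cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6,  cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- Leading bits of the code point, placed in the first octet.
    hi s   = fromIntegral (n `shiftR` s)
    -- Continuation octet: 10xxxxxx carrying six bits of the code point.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- | The RFC 3986 "unreserved" set, which never needs escaping.
unreserved :: Char -> Bool
unreserved c = isAscii c && (isAlphaNum c || c `elem` "-._~")

-- | Percent-encode the data of a single component. Reserved characters
-- such as '/' are escaped too, so this is not suitable for a whole URI.
percentEncode :: String -> String
percentEncode = concatMap enc
  where
    enc c
      | unreserved c = [c]
      | otherwise    = concatMap escape (utf8 c)
    escape w = ['%', hex (w `shiftR` 4), hex (w .&. 0x0F)]
    hex d    = toUpper (intToDigit (fromIntegral d))
```

With these definitions, percentEncode "caf\xE9" gives "caf%C3%A9". Under
ISO-8859-1 the same character would be the single octet E9, which is why the
encoding has to be agreed separately from the URI syntax itself.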
#g
--
On 14/03/2012 06:43, Jason Dusek wrote:
> 2012/3/12 Jeremy Shaw<jeremy at n-heptane.com>:
>> On Sun, Mar 11, 2012 at 1:33 PM, Jason Dusek<jason.dusek at gmail.com> wrote:
>>> Well, to quote one example from RFC 3986:
>>>
>>> 2.1. Percent-Encoding
>>>
>>> A percent-encoding mechanism is used to represent a data octet in a
>>> component when that octet's corresponding character is outside the
>>> allowed set or is being used as a delimiter of, or within, the
>>> component.
>>
>> Right. This describes how to convert an octet into a sequence of characters,
>> since the only thing that can appear in a URI is sequences of characters.
>>
>>> The syntax of URIs is a mechanism for describing data octets,
>>> not Unicode code points. It is at variance to describe URIs in
>>> terms of Unicode code points.
>>
>>
>> Not sure what you mean by this. As the RFC says, a URI is defined entirely
>> by the identity of the characters that are used. There is definitely no
>> single, correct byte sequence for representing a URI. If I give you a
>> sequence of bytes and tell you it is a URI, the only way to decode it is to
>> first know what encoding the byte sequence represents: ASCII, UTF-16, etc.
>> Once you have decoded the byte sequence into a sequence of characters, only
>> then can you parse the URI.
>
> Mr. Shaw,
>
> Thanks for taking the time to explain all this. It's really
> helped me to understand a lot of parts of the URI spec a lot
> better. I have deprecated my module in the latest release
>
> http://hackage.haskell.org/package/URLb-0.0.1
>
> because a URL parser working on bytes instead of characters
> stands out to me now as a confused idea.
>
> --
> Jason Dusek
> pgp /// solidsnack 1FD4C6C1 FED18A2B