[Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?

Jeremy Shaw jeremy at n-heptane.com
Mon Mar 12 05:18:44 CET 2012


Argh. Email fail.

Hopefully this time I have managed to reply-all to the list *and* keep the
unicode properly intact.

Sorry about any duplicates you may have received.

On Sun, Mar 11, 2012 at 1:33 PM, Jason Dusek <jason.dusek at gmail.com> wrote:

> 2012/3/11 Jeremy Shaw <jeremy at n-heptane.com>:
> > Also, URIs are not defined in terms of octets.. but in terms
> > of characters.  If you write a URI down on a piece of paper --
> > what octets are you using?  None.. it's some scribbles on a
> > paper. It is the characters that are important, not the bit
> > representation.
>


To quote RFC1738:

   URLs are sequences of characters, i.e., letters, digits, and special
   characters. A URLs may be represented in a variety of ways: e.g., ink
   on paper, or a sequence of octets in a coded character set. The
   interpretation of a URL depends only on the identity of the
   characters used.


> Well, to quote one example from RFC 3986:
>
>  2.1.  Percent-Encoding
>
>   A percent-encoding mechanism is used to represent a data octet in a
>   component when that octet's corresponding character is outside the
>   allowed set or is being used as a delimiter of, or within, the
>   component.
>

Right. This describes how to convert an octet into a sequence of
characters, since the only thing that can appear in a URI is sequences of
characters.


> The syntax of URIs is a mechanism for describing data octets,
> not Unicode code points. It is at variance to describe URIs in
> terms of Unicode code points.


Not sure what you mean by this. As the RFC says, a URI is defined entirely
by the identity of the characters that are used. There is definitely no
single, correct byte sequence for representing a URI. If I give you a
sequence of bytes and tell you it is a URI, the only way to decode it is to
first know what encoding the byte sequence represents.. ascii, utf-16, etc.
Once you have decoded the byte sequence into a sequence of characters, only
then can you parse the URI.
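
To make that concrete, here is a minimal Haskell sketch using only base. The
names encodeAscii and encodeUtf16BE are made up for illustration, and both
helpers assume ASCII-range characters only, which is all a URI may contain
anyway. The same character sequence gives two different octet sequences:

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: one octet per character.
-- Assumes every character is in the ASCII range.
encodeAscii :: String -> [Word8]
encodeAscii = map (fromIntegral . ord)

-- Hypothetical helper: two octets per character, high octet first.
-- Assumes every character is in the ASCII range, so the high octet is 0.
encodeUtf16BE :: String -> [Word8]
encodeUtf16BE = concatMap (\c -> [0, fromIntegral (ord c)])

-- One URI, as a sequence of characters.
uriChars :: String
uriChars = "http://a/b"
```

Here encodeAscii uriChars and encodeUtf16BE uriChars are different lists of
octets, yet both stand for the same URI. Without being told which encoding
was used, the receiver cannot get back to the characters.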


> > If you render a URI in a utf-8 encoded document versus a
> > utf-16 encoded document.. the octets will be different, but
> > the meaning will be the same. Because it is the characters
> > that are important. For a URI Text would be a more compact
> > representation than String.. but ByteString is a bit dodgy
> > since it is not well defined what those bytes represent.
> > (though if you use a newtype wrapper around ByteString to
> > declare that it is Ascii, then that would be fine).
>
> This is all fine well and good for what a URI is parsed from
> and what it is serialized too; but once parsed, the major
> components of a URI are all octets, pure and simple.
>

Not quite. We cannot, for example, change uriPath to be a ByteString and
decode any percent encoded characters for the user, because that would
change the meaning of the path and break applications.

For example, let's say we have the path segments ["foo", "bar/baz"] and we
wish to use them in the path info of a URI. Because / is a special
character it must be percent encoded as %2F. So, the path info for the url
would be:

 foo/bar%2Fbaz

If we had the path segments, ["foo","bar","baz"], however that would be
encoded as:

 foo/bar/baz

Now let's look at decoding the path. If we simply decode the percent
encoded characters and give the user a ByteString then both urls will
decode to:

 pack "foo/bar/baz"

Which is incorrect. ["foo", "bar/baz"] and ["foo","bar","baz"] represent
different paths. The percent encoding there is required to distinguish
between the two unique paths.
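
Here is a rough Haskell sketch of that problem, using String instead of
ByteString to stay within base. All of the function names are invented for
this example, and encodeSegment is deliberately simplified: it only escapes
'/' and '%', whereas a real encoder would escape every character outside the
allowed set.

```haskell
import Data.Char (chr)
import Data.List (intercalate)
import Numeric (readHex)

-- Simplified: escape only '/' and '%' inside a single path segment.
encodeSegment :: String -> String
encodeSegment = concatMap esc
  where
    esc '/' = "%2F"
    esc '%' = "%25"
    esc c   = [c]

-- Join the encoded segments with literal '/' delimiters.
encodePath :: [String] -> String
encodePath = intercalate "/" . map encodeSegment

-- Replace every well-formed %XX escape with the character it names.
percentDecode :: String -> String
percentDecode ('%' : a : b : rest)
  | [(n, "")] <- readHex [a, b] = chr n : percentDecode rest
percentDecode (c : rest) = c : percentDecode rest
percentDecode [] = []

-- Split on literal '/' characters.
splitOnSlash :: String -> [String]
splitOnSlash s = case break (== '/') s of
  (seg, [])       -> [seg]
  (seg, _ : rest) -> seg : splitOnSlash rest

-- Wrong order: decode the whole path, then split. The %2F is lost.
decodeThenSplit :: String -> [String]
decodeThenSplit = splitOnSlash . percentDecode

-- Right order: split on the delimiters first, then decode each segment.
splitThenDecode :: String -> [String]
splitThenDecode = map percentDecode . splitOnSlash
```

With these definitions, decodeThenSplit (encodePath ["foo", "bar/baz"]) gives
three segments, the same as for ["foo","bar","baz"], while splitThenDecode
gives back the original two segments.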

Let's look at another example. Let's say we want to encode the path
segments:

 ["I❤λ"]

How do we do that?

Well.. the RFCs do not mandate a specific way. While a URL is a sequence of
characters -- the set of allowed characters is pretty restricted. So, we must
use some application specific way to transform that string into something
that is allowed in a uri path. We could do it by converting all characters
to their unicode character numbers (in hex) like:

 "u0049u2764u03BB"

Since the string now only contains acceptable characters, we can easily
convert it to a valid uri path. Later when someone requests that url, our
application can convert it back to a unicode character sequence.

Of course, no one actually uses that method. The commonly used (and I
believe, officially endorsed, but not required) method is a bit more
complicated.

 1. first we take the string "I❤λ" and utf-8 encode it to get an octet
sequence:

   49 e2 9d a4 ce bb

 2. next we percent encode the bytes to get *back* to a character sequence
(such as a String, Text, or Ascii)

 "I%E2%9D%A4%CE%BB"

So, that is the character sequence that would appear in the URI. *But* we do
not yet have octets that we can transmit over the internet. We only have a
sequence of characters. We must now convert those characters into octets.
For example, let's say we put the url as an 'href' in an <a> tag in a web
page that is UTF-16 encoded.

 3. Now we must convert the character sequence to a (big endian) utf-16
octet sequence:

 00 49 00 25 00 45 00 32 00 25 00 39 00 44 00 25 00 41 00 34 00 25 00 43 00
45 00 25 00 42 00 42

 So those are the octets that actually get embedded in the utf-16 encoded
.html document and transmitted over the net.

 4. the browser then decodes the utf-16 web page and gets back the sequence
of characters:

 "I%E2%9D%A4%CE%BB"

 Note that here the browser has a sequence of characters -- we know nothing
about how those characters are represented internally by the browser. If the
browser was written in Haskell it might be a String or Text.

 Now let's say the browser wants to request the URL. It *must* encode the
url as ASCII (as per the spec).

 5. So, the browser encodes the string as the octet sequence

  49 25 45 32 25 39 44 25 41 34 25 43 45 25 42 42

 6. The server can now decode that sequence of octets back into a sequence
of characters:

 "I%E2%9D%A4%CE%BB"

  Now, the low-level Network.URI library cannot really do much more than
that, because it does not know what those octets are really supposed to
mean (see the / example above).

 7. the application specific code, however, knows that it should now first
split the path on any / characters to get

  ["I%E2%9D%A4%CE%BB"]

 8. next it should percent decode each path segment to get a ByteString
sequence:

   49 e2 9d a4 ce bb

 9. And now it can utf-8 decode that octet sequence to get a unicode
character sequence:

  I❤λ
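
For what it's worth, steps 1-2 and 7-9 can be sketched in Haskell roughly
like this. It uses [Word8] in place of ByteString and a hand-rolled UTF-8
codec so that it only needs base. All of the names are made up for the
example, and utf8Decode assumes well-formed input and does no validation:

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, intToDigit, isAlphaNum, ord, toUpper)
import Data.List (intercalate)
import Data.Word (Word8)
import Numeric (readHex)

-- Step 1: UTF-8 encode one character.
utf8Char :: Char -> [Word8]
utf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    hi s   = fromIntegral (n `shiftR` s)
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Step 2: percent encode every octet that is not an unreserved character.
percentEncode :: [Word8] -> String
percentEncode = concatMap esc
  where
    esc w
      | isUnreserved ch = [ch]
      | otherwise       = ['%', hex (w `shiftR` 4), hex (w .&. 0x0F)]
      where ch = chr (fromIntegral w)
    hex d = toUpper (intToDigit (fromIntegral d))
    isUnreserved ch = ch < '\x80' && (isAlphaNum ch || ch `elem` "-._~")

encodePath :: [String] -> String
encodePath = intercalate "/" . map (percentEncode . concatMap utf8Char)

-- Step 7: split the path on literal '/' characters.
splitOnSlash :: String -> [String]
splitOnSlash s = case break (== '/') s of
  (seg, [])       -> [seg]
  (seg, _ : rest) -> seg : splitOnSlash rest

-- Step 8: percent decode one segment into octets (input is ASCII characters).
percentDecode :: String -> [Word8]
percentDecode ('%' : a : b : rest)
  | [(n, "")] <- readHex [a, b] = n : percentDecode rest
percentDecode (c : rest) = fromIntegral (ord c) : percentDecode rest
percentDecode [] = []

-- Step 9: UTF-8 decode. Assumes well-formed input; no validation is done.
utf8Decode :: [Word8] -> String
utf8Decode [] = []
utf8Decode (b : bs)
  | b < 0x80  = chr (fromIntegral b) : utf8Decode bs
  | b < 0xE0  = multi 1 (b .&. 0x1F)
  | b < 0xF0  = multi 2 (b .&. 0x0F)
  | otherwise = multi 3 (b .&. 0x07)
  where
    multi k lead =
      let (conts, rest) = splitAt k bs
          step acc x = (acc `shiftL` 6) .|. fromIntegral (x .&. 0x3F)
      in chr (foldl step (fromIntegral lead) conts) : utf8Decode rest

decodePath :: String -> [String]
decodePath = map (utf8Decode . percentDecode) . splitOnSlash
```

So encodePath ["I❤λ"] produces the character sequence "I%E2%9D%A4%CE%BB",
and decodePath applied to that sequence returns the original segment. Note
that decodePath splits before it percent decodes, for the reason given in the
/ example above.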

So... the basic gist is that if you have unicode characters embedded in an
html document, they will generally be encoded *three* different times. (First
the unicode characters are converted to a utf-8 byte sequence, then the
byte sequence is percent encoded, and then the percent encoded character
sequence is encoded as another byte sequence). But, applications can choose
to use other methods as well.

In terms of applicability to the URI type.. uriPath :: ByteString
definitely does not work. It is possible that uriPath :: [ByteString] might
work... assuming / is the only special character we need to worry about in
the uriPath. But, doing all the breaking on '/' and the percent decoding
may not be required for many applications. So, choosing to always do the
extra work raises some concerns.

Also, even with uriPath :: [ByteString], we are losing some information.
The browser is free to percent encode characters -- even if it is not
required. For example the browser could request:

 "hello"

Or it could request:

 "%68%65%6c%6c%6f"

In this case the *meaning* is the same. So, doing the decoding is less
problematic. But I wonder if there might still be cases where we want
to distinguish between those two requests?
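
A small Haskell sketch of what gets lost, again with String standing in for
ByteString. The names percentDecode and sameAfterDecoding are invented for
the example:

```haskell
import Data.Char (chr)
import Numeric (readHex)

-- Replace every well-formed %XX escape with the character it names.
percentDecode :: String -> String
percentDecode ('%' : a : b : rest)
  | [(n, "")] <- readHex [a, b] = chr n : percentDecode rest
percentDecode (c : rest) = c : percentDecode rest
percentDecode [] = []

-- Two requested paths are indistinguishable once decoded.
sameAfterDecoding :: String -> String -> Bool
sameAfterDecoding a b = percentDecode a == percentDecode b
```

The raw character sequences "hello" and "%68%65%6c%6c%6f" are not equal, but
sameAfterDecoding returns True for them. An application that only ever sees
the decoded form can no longer tell which one the browser sent.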

hope this helps.