Haskell Platform Proposal: add the 'text' library

Wed Oct 20 20:35:12 EDT 2010

On Wed, 2010-10-20 at 11:11 -0400, Tyson Whitehead wrote:
> On October 19, 2010 19:35:33 Duncan Coutts wrote:
> > Right, that's a very common misunderstanding of Unicode. A Unicode
> > code point (type Char) does not correspond 1:1 with the human notion
> > of a character. It would be nice if it did, but unfortunately it is
> > not something we can ignore. Because of this it is better not to think
> > of operations on individual Chars but on short sequences of Chars. In
> > any case, when processing text (even ASCII where Chars do match
> > characters) many of the most common operations that you want are
> > substring not element based.
> 
> I read the wikipedia article on code points, but still do not feel I have a 
> firm grasp as to what exactly you are referring to.
> 
> If you have a few minutes, would you mind providing a short example to clarify 
> this with a specific example (e.g., a specific code point that gives issues with 
> a 1:1 model and what those issues are).

Combining characters are the major one. These are things like accents,
but some other languages have many more of them. Most of the European
languages have all-in-one code points that combine the base character
with the extra mark (because those already existed in previous
character sets). For example, "é" can be written either as the single
code point U+00E9 or as 'e' followed by the combining acute accent
U+0301. For many other languages, though, the canonical form is made up
of multiple code points (and not necessarily just 2).

So if you're searching for a particular "character" then searching for a
single Char is not sufficient; you need to search for a short sequence
of Chars.
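
To make that concrete, here is a minimal sketch. It uses plain String
and Data.List from base rather than the text library, purely to keep it
self-contained; the names `precomposed` and `decomposed` are mine, for
illustration only. Both strings display as "café", but they contain
different Chars, so a single-Char search finds the accented letter in
one and misses it in the other.

```haskell
import Data.List (isInfixOf)

-- "café" with the accented letter as one precomposed code point, U+00E9.
precomposed :: String
precomposed = "caf\x00E9"

-- "café" with the accented letter as 'e' followed by
-- U+0301 COMBINING ACUTE ACCENT.
decomposed :: String
decomposed = "cafe\x0301"

main :: IO ()
main = do
  -- Same text to a human reader, different numbers of Chars.
  print (length precomposed, length decomposed)   -- (4,5)
  -- Searching for the single Char U+00E9 only finds one of the two forms.
  print ('\x00E9' `elem` precomposed)             -- True
  print ('\x00E9' `elem` decomposed)              -- False
  -- Finding the decomposed form needs a substring search.
  print ("e\x0301" `isInfixOf` decomposed)        -- True
  -- A search for plain 'e' matches the base letter of the decomposed form,
  -- even though no unaccented "e" is visible in the text.
  print ('e' `elem` decomposed)                   -- True
```

A search that treats the two spellings as equal would have to normalise
both the text and the pattern to the same form first, which this sketch
does not attempt.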

Duncan