Haskell Platform Proposal: add the 'text' library
Axel.Simon at in.tum.de
Wed Oct 20 15:45:44 EDT 2010
On Oct 20, 2010, at 19:44, Ian Lynagh wrote:
> Johan wrote:
>> If you process a string code point by code point you might mistakenly
>> confuse a plain "a" (A) with a "å" (A-RING *or* A + COMBINING RING).
> But when characters and codepoints are 1:1, you /can/ process code
> point by code point.
> Am I missing something?
AFAIK there are scripts that have so many combinations that Unicode
does not have a single codepoint for each character. In Arabic you
can have one of 5 vowel signs on each of the 28 letters, but Unicode
does not provide 5*28 codepoints for the combinations. That is
probably the reason for having these combining characters.
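To make the combining-character point concrete, here is a minimal
sketch using Data.Text. The names precomposed, decomposed,
sameCodePoints and lengths are made up for illustration; only T.pack,
T.length and the Eq instance come from the library.

```haskell
import qualified Data.Text as T

-- Precomposed form: the single codepoint U+00E5
-- (LATIN SMALL LETTER A WITH RING ABOVE).
precomposed :: T.Text
precomposed = T.pack ['\x00E5']

-- Decomposed form: 'a' followed by U+030A (COMBINING RING ABOVE).
decomposed :: T.Text
decomposed = T.pack ['a', '\x030A']

-- Both are meant to display as the same character, but equality on
-- Text compares codepoint sequences, so this is False.
sameCodePoints :: Bool
sameCodePoints = precomposed == decomposed

-- T.length counts codepoints (Char values), not user-perceived
-- characters, so the two forms have lengths 1 and 2.
lengths :: (Int, Int)
lengths = (T.length precomposed, T.length decomposed)
```

Loading this in GHCi and evaluating sameCodePoints and lengths shows
that one visible character can correspond to one or two codepoints.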
Mac OS tries to split characters into as many codepoints as
possible, whereas Windows tries to merge them as much as possible. I
don't think there is a good semantics for replace without knowing which
normal form you're working in. Normally, search/replace and sorting
on Unicode are specialized algorithms that cannot be reduced to simple
substitutions or permutations.
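As a rough illustration of why replace depends on the normal form,
the sketch below runs the same codepoint-level T.replace over a
composed and a decomposed spelling of one word. The helper
replaceAWithO and the sample inputs are hypothetical; the assumption
is that the input is used as given, with no normalization step.

```haskell
import qualified Data.Text as T

-- Replace every codepoint 'a' by 'o', working purely on codepoints.
replaceAWithO :: T.Text -> T.Text
replaceAWithO = T.replace (T.pack "a") (T.pack "o")

-- The same word twice: once with the precomposed U+00E5, once with
-- 'a' followed by U+030A (COMBINING RING ABOVE).
composedInput :: T.Text
composedInput = T.pack ['b', '\x00E5', 'r']

decomposedInput :: T.Text
decomposedInput = T.pack ['b', 'a', '\x030A', 'r']

-- The composed input contains no codepoint 'a', so it is unchanged.
composedResult :: T.Text
composedResult = replaceAWithO composedInput

-- In the decomposed input the base letter is replaced and the ring
-- now sits on the 'o': 'b', 'o', U+030A, 'r'.
decomposedResult :: T.Text
decomposedResult = replaceAWithO decomposedInput
```

Two inputs that look identical to the user give different results,
so a codepoint-level replace only has a clear meaning once the
caller has fixed the normal form.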
So I suggest just providing functions on codepoints and letting the
user struggle with the rest.