Better casing functions (German ß, etc.)

박신환 ndospark320 at naver.com
Wed Jul 11 06:59:37 UTC 2018


Haskell currently provides only the 'simple' `Char`-to-`Char` casing functions (as specified by Unicode), namely `toUpper`, `toLower` and `toTitle`.
 
So to convert the case of a `String`, one is expected to write `fmap toUpper`, etc. But this gives wrong results in several cases.
 
Case 1. German ß (Eszett) 
 
'ß' (U+00DF, Latin Small Letter Sharp S) is a lowercase letter, but Unicode doesn't specify a 'simple' uppercase counterpart for it.
That is because its uppercase counterpart is not a single character but two characters, "SS".
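
A minimal sketch of the difference, assuming a hypothetical helper `toUpperString` (not part of `base`) that expands 'ß' and falls back to the simple mapping for every other character:

```haskell
import Data.Char (toUpper)

-- Hypothetical helper, not part of base: uppercase a String,
-- expanding U+00DF to "SS" and using the simple mapping otherwise.
toUpperString :: String -> String
toUpperString = concatMap upperChar
  where
    upperChar '\x00DF' = "SS"          -- 'ß' has no single-character uppercase
    upperChar c        = [toUpper c]

-- The simple per-character mapping leaves the 'ß' unchanged.
simpleResult :: String
simpleResult = map toUpper "straße"

-- The String-to-String helper can grow the string: "STRASSE".
fullResult :: String
fullResult = toUpperString "straße"
```

The point is only that the result can be longer than the input, which a `Char`-to-`Char` function cannot express.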
 
Case 2. Turkish İ and ı
Rather than the common 'I' and 'i' case pair, Turkish has two pairs: the dotted pair 'İ' (U+0130) and 'i', and the dotless pair 'I' and 'ı' (U+0131). The correct mapping therefore depends on the language of the text, which a `Char`-to-`Char` function cannot know.
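
One possible shape for this is a casing function that takes the language as an argument. The sketch below is an assumption, not an existing API: the `CaseLocale` type and the functions `toUpperIn` and `toLowerIn` are hypothetical names, and it ignores the combining dot above (U+0307) that a complete Turkish mapping would also have to handle.

```haskell
import Data.Char (toLower, toUpper)

-- Hypothetical locale tag; base has no such type.
data CaseLocale = DefaultLocale | Turkish
  deriving (Eq, Show)

-- Hypothetical locale-aware uppercase.
toUpperIn :: CaseLocale -> String -> String
toUpperIn DefaultLocale = map toUpper
toUpperIn Turkish       = map up
  where
    up 'i' = '\x0130'   -- dotted small i maps to dotted capital İ
    up c   = toUpper c  -- dotless ı (U+0131) already maps to 'I'

-- Hypothetical locale-aware lowercase.
toLowerIn :: CaseLocale -> String -> String
toLowerIn DefaultLocale = map toLower
toLowerIn Turkish       = map low
  where
    low 'I' = '\x0131'  -- dotless capital I maps to dotless small ı
    low c   = toLower c -- dotted İ (U+0130) already maps to 'i'
```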
 
Case 3. Greek Σ (Sigma) 
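Greek 'Σ' (U+03A3) must be lowercased to the final form 'ς' (U+03C2), rather than the normal 'σ' (U+03C3), when it ends a word (for example, when it is followed by whitespace). So the mapping depends on the surrounding characters.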
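
A rough sketch of such a context-dependent lowercase follows. The name `toLowerSigma` is hypothetical, and the rule used here (preceded by a letter and not followed by a letter, tested with `isAlpha`) is a simplification of the condition Unicode actually specifies.

```haskell
import Data.Char (isAlpha, toLower)

-- Hypothetical helper: lowercase a String, choosing the final sigma
-- when 'Σ' is preceded by a letter and not followed by one.
-- This is a simplified approximation of the Unicode rule.
toLowerSigma :: String -> String
toLowerSigma = go False
  where
    -- The Bool records whether the previous character was a letter.
    go _ [] = []
    go prevLetter (c:cs)
      | c == '\x03A3' && prevLetter && not (nextIsLetter cs)
                  = '\x03C2' : go True cs          -- final sigma 'ς'
      | otherwise = toLower c : go (isAlpha c) cs  -- simple mapping

    nextIsLetter (d:_) = isAlpha d
    nextIsLetter []    = False
```

With this, "ΟΔΟΣ" becomes "οδος", while a 'Σ' at the start of a word still becomes 'σ'.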
 
Case 4. Greek iota subscript (Ypogegrammeni)
Greek 'capital' letters with an iota subscript (for example, 'ᾈ' (U+1F88)) are the 'simple' uppercase counterparts of the corresponding lowercase letters, but they are actually treated as titlecase characters. For example, the actual uppercase counterpart of 'ᾀ' (U+1F80) is "ἈΙ" (U+1F08 U+0399), that is, with an actual capital iota instead of the iota subscript.
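
To illustrate why uppercase and titlecase need separate `String` functions here, a small sketch with two hypothetical helpers, `fullUpper` and `fullTitle`. It handles only the one letter from the example and treats its argument as a single word.

```haskell
import Data.Char (toLower, toTitle, toUpper)

-- Hypothetical full uppercase: the iota subscript becomes a capital iota.
-- Only the alpha-with-psili letters from the example are special-cased.
fullUpper :: String -> String
fullUpper = concatMap up
  where
    up c
      | c == '\x1F80' || c == '\x1F88' = "\x1F08\x0399"  -- "ἈΙ"
      | otherwise                      = [toUpper c]

-- Hypothetical titlecase of a single word: the first character gets
-- the simple titlecase mapping, the rest is lowercased.
fullTitle :: String -> String
fullTitle []     = []
fullTitle (c:cs) = toTitle c : map toLower cs
```

Here `fullTitle` keeps the single character 'ᾈ' (U+1F88) for an initial 'ᾀ', while `fullUpper` yields the two-character "ἈΙ".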
 
Case 5. Precomposed letters without upper/lowercase counterpart 
For example, 'ΐ' (U+0390) doesn't have a precomposed uppercase counterpart. It must effectively be mapped to "Ϊ́" (U+03AA U+0301).

In summary, we need more elaborate casing functions that are `String`-to-`String`.
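
As a rough idea of what such a function could look like, here is a table-driven sketch. The names `specialUpper` and `toUpperFull` are hypothetical, and the table holds only the context-free examples from this message (cases 1, 4 and 5). A real implementation would take its data from Unicode and would also need the locale and context handling of cases 2 and 3.

```haskell
import Data.Char (toUpper)
import Data.Maybe (fromMaybe)

-- Hypothetical table of characters whose uppercase form is a String.
-- Only the examples from this message are listed.
specialUpper :: [(Char, String)]
specialUpper =
  [ ('\x00DF', "SS")             -- 'ß'
  , ('\x1F80', "\x1F08\x0399")   -- alpha with psili and iota subscript
  , ('\x0390', "\x03AA\x0301")   -- iota with dialytika and tonos
  ]

-- Hypothetical String-to-String uppercase: consult the table first,
-- then fall back to the simple Char-to-Char mapping.
toUpperFull :: String -> String
toUpperFull = concatMap upper
  where
    upper c = fromMaybe [toUpper c] (lookup c specialUpper)
```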

Bibliography:
    The Unicode Standard Version 11.0 – Core Specification, Section 5.18.

