[Haskell-i18n] Unicode in source
Glynn Clements
glynn.clements@virgin.net
Wed, 21 Aug 2002 22:55:29 +0100
Sven Moritz Hallberg wrote:
> > (aside: aren't there problems with Unicode not being a fixed-width
> > character set? Some characters are expected to combine with others to
> > form a glyph, there are multiple versions of some characters with
> > different widths, there are several widths of space, etc.)
>
> I think (...) these issues should not pose a problem.
>
> variable-width characters:
> Unicode specifically doesn't say anything about the glyph representation
> of the characters. So it is reasonable to assume there will be
> fixed-width unicode character sets. Remember that even our latin
> alphabet has characters of different width (i vs. w) which we just
> somehow manage to fit into glyphs of the same width. If one's editor
> would really use a variable-width font he'll already have the problem
> with ASCII.

For fonts which aren't restricted to Western alphabets, there are two
common interpretations of "fixed width".

One interpretation is that all glyphs are exactly the same width, so
even "narrow" characters ("l", "i", "1") are as wide as the widest CJK
characters. Many users will dislike such fonts; apart from looking
rather odd, they also waste screen space.

The other interpretation is that all glyphs have widths which are an
integral number of "columns". Western (Latin, Cyrillic, Greek)
characters are a single column wide, while CJK characters are
typically two columns wide. The (Unix98) wcwidth() function can be
used to obtain the width (in columns) of a given wide character
(wchar_t) in the current locale.
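
The column-based interpretation can be sketched roughly as follows.
This is an illustration in Python rather than a description of what
wcwidth() does: it is locale-independent, it uses the Unicode East
Asian Width property as a stand-in, and it assumes that "ambiguous"
characters (which include Greek and Cyrillic letters) count as one
column. The names char_columns and text_columns are hypothetical.

```python
import unicodedata

def char_columns(ch: str) -> int:
    """Approximate width of one character in columns."""
    # Assumption: only East Asian Wide (W) and Fullwidth (F) characters
    # occupy two columns; everything else occupies one.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1

def text_columns(text: str) -> int:
    """Total width of a string in columns under the same assumption."""
    return sum(char_columns(ch) for ch in text)
```

Under these assumptions, text_columns("a\u4e2db") gives 4: one column
each for "a" and "b", two for the CJK ideograph U+4E2D.
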
> composition characters:
> I think we should interpret each character in the source as exactly one
> and leave any possible composition to the level of editing tools. The
> way I imagine the use of these composition characters is, for instance,
> as keyboard input to an editor which then composes them into a single
> char before writing anything to a file. I'd say this issue belongs to
> the domain of text processing.
Character I/O functions should probably ignore composition, i.e.
LATIN_SMALL_LETTER_A + COMBINING_ACUTE_ACCENT should appear as two
separate characters to the application.
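
To make the distinction concrete, here is a small Python sketch
(the helper names are hypothetical). It shows that the decomposed
sequence really is two code points as seen by the application, and
that composing them into one is a separate, explicit normalization
step rather than something character I/O has to do.

```python
import unicodedata

def code_point_names(text: str) -> list:
    """Unicode names of the individual code points in text."""
    return [unicodedata.name(ch) for ch in text]

def compose(text: str) -> str:
    """Explicitly compose base + combining sequences (NFC)."""
    return unicodedata.normalize("NFC", text)

decomposed = "a\u0301"  # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT
print(len(decomposed))               # two code points
print(code_point_names(decomposed))
print(len(compose(decomposed)))      # one code point, U+00E1
```
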

However, layout will only "work" if the compiler (or is it a
preprocessor?) counts columns using the same algorithm as the editor.
If the editor shows a composition sequence as a single character cell,
the compiler needs to treat it as a single column for the purposes of
layout.
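
One possible column-counting rule that would agree with such an
editor, sketched in Python: this is only an assumption about what
"the same algorithm" might look like. It treats any character with a
non-zero canonical combining class as occupying no cell of its own,
and it ignores tabs, wide characters, and combining marks whose class
is zero. The name layout_column is hypothetical.

```python
import unicodedata

def layout_column(line: str, index: int) -> int:
    """Zero-based display column of line[index].

    A base character and the combining marks that follow it are
    counted together as a single cell.
    """
    return sum(1 for ch in line[:index] if unicodedata.combining(ch) == 0)

decomposed = "a\u0301 = 1"  # 'a' + COMBINING ACUTE ACCENT, then " = 1"
composed = "\u00e1 = 1"     # precomposed U+00E1, then " = 1"

# The "=" sits at a different code point index in each string,
# but at the same display column.
print(layout_column(decomposed, decomposed.index("=")))
print(layout_column(composed, composed.index("=")))
```

With this rule both lines report column 2 for the "=", so the
indentation seen by the compiler would match what the editor displays
regardless of whether the source was stored composed or decomposed.
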
--
Glynn Clements <glynn.clements@virgin.net>