[Haskell-cafe] Re: String vs ByteString

Ketil Malde ketil at malde.org
Wed Aug 18 04:24:37 EDT 2010


John Millikin <jmillikin at gmail.com> writes:

> The reason many Japanese and Chinese users reject UTF-8 isn't due to 
> space constraints (UTF-8 and UTF-16 are roughly equal), it's because
> they reject Unicode itself. 

Probably because they don't think it's complicated enough¹?

> Shift-JIS and the various Chinese encodings both contain Han
> characters which are missing from Unicode, either due to the Han
> unification or simply were not considered important enough to include

Surely there's enough space left?  I seem to remember some Han
characters outside of the BMP, so I would have guessed this is an
argument from back in the UCS-2 days.
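
The point about Han characters beyond the BMP can be checked with a small
sketch. It uses only the Python standard library, and the variable names are
illustrative rather than anything from this thread. It takes U+20000, the
first code point of CJK Unified Ideographs Extension B, and counts its
encoded size. A strict UCS-2 implementation, limited to 16-bit code points,
could not represent this character at all.

```python
import unicodedata

# U+20000 is the first code point of CJK Unified Ideographs Extension B,
# which lies in plane 2, outside the Basic Multilingual Plane (U+0000..U+FFFF).
han_outside_bmp = "\U00020000"

code_point = ord(han_outside_bmp)
name = unicodedata.name(han_outside_bmp)
utf8_bytes = len(han_outside_bmp.encode("utf-8"))
utf16_bytes = len(han_outside_bmp.encode("utf-16-le"))  # "-le" avoids a BOM

print(f"U+{code_point:05X} {name}")
print("outside BMP:", code_point > 0xFFFF)
print("UTF-8 bytes:", utf8_bytes)
print("UTF-16 bytes (surrogate pair):", utf16_bytes)
```

Both UTF-8 and UTF-16 need four bytes here, so for characters outside the
BMP neither encoding has a size advantage.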

(BTW, on a long train ride, I brought the Linear B alphabet, and
practiced writing notes to my kids.  So Linear B isn't entirely useless
:-)

From casual browsing of Wikipedia, the current status in CJK-land seems
to be something like this:

China: GB2312 and its successor GB18030
Taiwan, Macao, and Hong Kong: Big5
Japan: Shift-JIS
Korea: EUC-KR

It is interesting that some of these provide a lot fewer characters than
Unicode.  Another feature of several of them is that ASCII and e.g. kana
scripts take up one byte, and ideograms take up two, which correlates
with the expected width of the glyphs.
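
The byte widths can be compared with a rough sketch along these lines. It
assumes Python's built-in codecs are a fair stand-in for the encodings listed
above. The sample characters and the helper name byte_width are chosen here
for illustration. Note that in Shift-JIS only the half-width katakana are
single-byte; full-width kana take two bytes like the ideograms.

```python
# Sample characters: one per class of glyph discussed above.
samples = {
    "ASCII letter": "A",
    "half-width katakana": "\uff76",  # HALFWIDTH KATAKANA LETTER KA
    "full-width hiragana": "\u3042",  # HIRAGANA LETTER A
    "Han ideogram": "\u6f22",         # the "Han" of "Han character"
}

# Codec names as registered in the Python standard library.
encodings = ["shift_jis", "euc_kr", "big5", "gb18030", "utf-8", "utf-16-le"]


def byte_width(ch, encoding):
    """Return the encoded length of ch in bytes, or None if the
    encoding has no representation for it."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


for label, ch in samples.items():
    widths = {enc: byte_width(ch, enc) for enc in encodings}
    print(f"{label:22}", widths)
```

With these codecs, Shift-JIS gives one byte for the ASCII letter and the
half-width katakana and two bytes for the hiragana and the ideogram, while
UTF-8 needs three bytes for each of the three non-ASCII samples. A result of
None means the codec cannot encode that character.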

Several of the pages indicate that Unicode, and mainly UTF-8, is
gradually taking over.

-k

¹ Those who remember Emacs in the MULE days will know what I mean.
-- 
If I haven't seen further, it is by standing in the footprints of giants

