Unicode support

Karlsson Kent - keka keka@im.se
Tue, 9 Oct 2001 22:54:42 +0200


> -----Original Message-----
> From: Ketil Malde [mailto:ketil@ii.uib.no]
...
> > But as I said: they will not go away now, they are too 
> firmly established.
> 
> Yep.  But it appears that the "right" choice for external encoding
> scheme would be UTF-8.

You're free to use any one of UTF-8, UTF-16(BE/LE), or UTF-32(BE/LE),
and you should be prepared to receive any of them from others.
In some contexts, such as e-mail, only UTF-8 can be used for Unicode
(unless the content is base64-encoded).
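
For instance, here is a minimal Java sketch (the sample string is just
an illustration) of how differently the three encoding schemes lay out
the same text, including the zero bytes that UTF-16 and UTF-32
introduce:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class EncodingDemo {
        public static void main(String[] args) {
            String s = "A\u03BB";  // 'A' (U+0041) and Greek lambda (U+03BB)
            dump("UTF-8   ", s.getBytes(StandardCharsets.UTF_8));    // 41 CE BB
            dump("UTF-16BE", s.getBytes(StandardCharsets.UTF_16BE)); // 00 41 03 BB
            dump("UTF-32BE", s.getBytes(Charset.forName("UTF-32BE")));
                                                  // 00 00 00 41 00 00 03 BB
        }
        static void dump(String label, byte[] bytes) {
            StringBuilder sb = new StringBuilder(label + ":");
            for (byte b : bytes) sb.append(String.format(" %02X", b));
            System.out.println(sb);
        }
    }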

> >> When not limited to ASCII, at least it avoids zero bytes and other
> >> potential problems.  UTF-16 will among other things, be full of
> >> NULLs.
> 
> > Yes, and so what?
> 
> So, I can use it for file names,

Millions of people do already, including me, and most of them don't
even know it.  (The file system must be designed for that, of course,
but at least two commonly used file systems use UTF-16 for *all* [long]
file names: NTFS and HFS+.  There is another file system, UFS, that
uses UTF-8, with the names in normal form D(!), for all file names.
(If the *standard* C file API is used, some kind of conversion is
triggered.) Those file systems got it right; many other file systems
are at a loss when it comes to even that simple level of I18N,
rendering non-pure-ASCII file names essentially useless, or at least
unreliable.)
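
(To see what normal form D means in practice, here is a small Java
sketch using java.text.Normalizer; the precomposed 'é' decomposes into
'e' plus a combining acute accent:)

    import java.text.Normalizer;

    public class NfdDemo {
        public static void main(String[] args) {
            String nfc = "\u00E9";  // 'é' as one code point (normal form C)
            String nfd = Normalizer.normalize(nfc, Normalizer.Form.NFD);
            // NFD stores it as 'e' (U+0065) + combining acute (U+0301),
            // the form such a file system keeps on disk.
            System.out.println(nfc.length());  // 1
            System.out.println(nfd.length());  // 2
        }
    }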

> in regular expressions, 

If the system interpreting the RE is UTF-16 enabled, yes, of course.
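
(In today's Java, for instance, the regex engine works directly on
UTF-16 strings and treats a surrogate pair as a single code point; a
minimal sketch:)

    public class RegexDemo {
        public static void main(String[] args) {
            String s = "\uD835\uDD38";  // U+1D538, a supplementary-plane letter
            System.out.println(s.length());      // 2 UTF-16 code units
            System.out.println(s.matches("."));  // true: '.' matches the
                                                 // whole surrogate pair
            System.out.println(s.matches("\\p{Lu}"));  // true: it is an
                                                       // uppercase letter
        }
    }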

> and in
> whatever legacy

No: in modern systems. One of the side effects of the popularity of XML
is that support for both UTF-8 and UTF-16 (also as external encodings)
is growing...

By the way, Java source code can be in UTF-8 or UTF-16, as well as in
legacy encodings. Unfortunately the compiler has to be told the
encoding via a command-line parameter (javac -encoding); having the
source files self-declare their encoding would be much better (compare
XML).
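
(Below is a rough sketch, in Java, of the kind of self-declaration one
could do instead: sniff the byte-order mark, the way XML parsers begin
their auto-detection. The helper is hypothetical, not any real
compiler's code.)

    public class BomSniffer {
        // Hypothetical helper: guess an encoding from the first bytes of
        // a source file (its byte-order mark, if any), falling back to a
        // caller-supplied default when no BOM is present.
        static String sniff(byte[] head, String fallback) {
            if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF)
                return "UTF-8";
            if (head.length >= 2 && (head[0] & 0xFF) == 0xFE
                    && (head[1] & 0xFF) == 0xFF)
                return "UTF-16BE";
            if (head.length >= 2 && (head[0] & 0xFF) == 0xFF
                    && (head[1] & 0xFF) == 0xFE)
                return "UTF-16LE";
            return fallback;  // no BOM: back to out-of-band declaration
        }
    }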

> applications that expect textual data.

> > So will a file filled with image data, video clips, or plainly a
> > list of raw integers dumped to file (not formatted as strings).
> 
> But none of these pretend to be text!

How is that relevant?  If you're going to do anything "higher-level"
with text, you have to know the encoding; otherwise you'll get lots of
more or less hidden bugs.  Have you ever had any experience with the
legacy "multibyte" encodings used for Chinese/Japanese/etc.? In many of
them, a byte that looks like an ASCII letter need not be one at all: it
may just be the second byte in the representation of a non-ASCII
character.  If you assume every "A" byte is an "A" (and interpret it in
some special way, say as part of a command name), you're in trouble!
Often hard-to-find trouble.  No one who argues that one can take text
in any "ASCII extension" and look only at the (apparent) ASCII,
treating everything else as some arbitrary extension that never affects
the processing, seems to be aware of the details of those encodings.
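
(A concrete Java illustration of the trap: in Shift_JIS, one of those
legacy encodings, the second byte of the common kanji '表' (U+8868) is
0x5C, the same byte as the ASCII backslash:)

    public class SjisTrap {
        public static void main(String[] args) throws Exception {
            byte[] b = "\u8868".getBytes("Shift_JIS");
            System.out.printf("%02X %02X%n", b[0] & 0xFF, b[1] & 0xFF); // 95 5C
            // A byte-oriented scanner that treats every 0x5C byte as '\'
            // (an escape character, a path separator, ...) misfires here:
            for (byte x : b)
                if (x == 0x5C) System.out.println("false ASCII '\\' found");
        }
    }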

By the way, video clips (and images) can and do have Unicode (UTF-16?)
text as components (e.g. subtitles).

> > True.  But implementing normalisation, or case mapping for 
> that matter,
> > is non-trivial too.  In practice, the additional complexity with
> > UTF-16 seems small. 
> 
> All right, but if there are no real advantages, why bother?

Efficiency (and backwards compatibility) is claimed by people who work
much more "in the trenches" with this than I do, and I have no quarrel
with that.
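
(Case mapping really is non-trivial, as the quoted text says; two
well-known Java data points: uppercasing can change a string's length,
and it can depend on the locale:)

    import java.util.Locale;

    public class CaseDemo {
        public static void main(String[] args) {
            System.out.println("\u00DF".toUpperCase(Locale.ROOT)); // "SS"
            System.out.println("i".toUpperCase(new Locale("tr")));
                // prints dotted capital I (U+0130), not plain ASCII "I"
        }
    }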

		Kind regards
		/kent k
