Unicode support

Karlsson Kent - keka keka@im.se
Tue, 9 Oct 2001 22:54:42 +0200


> -----Original Message-----
> From: Ketil Malde [mailto:ketil@ii.uib.no]
...
> > But as I said: they will not go away now, they are too 
> firmly established.
> 
> Yep.  But it appears that the "right" choice for external encoding
> scheme would be UTF-8.

You're free to use any one of UTF-8, UTF-16(BE/LE), or UTF-32(BE/LE),
and you should be prepared to receive any of them from others.
In some contexts, such as e-mail, only UTF-8 can be used for Unicode
(unless the content is base64-encoded).
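
For instance, here is a minimal Java sketch (the sample string is just
an illustration) of how differently the three encoding schemes lay out
the same text, including the zero bytes that UTF-16 and UTF-32
introduce:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class EncodingDemo {
        public static void main(String[] args) {
            String s = "A\u03BB";  // 'A' (U+0041) and Greek lambda (U+03BB)
            dump("UTF-8   ", s.getBytes(StandardCharsets.UTF_8));    // 41 CE BB
            dump("UTF-16BE", s.getBytes(StandardCharsets.UTF_16BE)); // 00 41 03 BB
            dump("UTF-32BE", s.getBytes(Charset.forName("UTF-32BE")));
                                                  // 00 00 00 41 00 00 03 BB
        }
        static void dump(String label, byte[] bytes) {
            StringBuilder sb = new StringBuilder(label + ":");
            for (byte b : bytes) sb.append(String.format(" %02X", b));
            System.out.println(sb);
        }
    }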

> >> When not limited to ASCII, at least it avoids zero bytes and other
> >> potential problems.  UTF-16 will among other things, be full of
> >> NULLs.
> 
> > Yes, and so what?
> 
> So, I can use it for file names,

Millions of people do already, including me, and most of them don't
even know it.  (The file system must be designed for that, of course,
but at least two commonly used file systems use UTF-16 for *all* [long]
file names: NTFS and HFS+.  There is another file system, UFS, that
uses UTF-8, with the names in normal form D(!), for all file names.
(If the *standard* C file API is used, some kind of conversion is
triggered.) Those file systems got it right; many other file systems
are at a loss when it comes to even that simple level of I18N,
rendering non-pure-ASCII file names essentially useless, or at least
unreliable.)
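
(To see what normal form D means in practice, here is a small Java
sketch using java.text.Normalizer; the precomposed 'é' decomposes into
'e' plus a combining acute accent:)

    import java.text.Normalizer;

    public class NfdDemo {
        public static void main(String[] args) {
            String nfc = "\u00E9";  // 'é' as one code point (normal form C)
            String nfd = Normalizer.normalize(nfc, Normalizer.Form.NFD);
            // NFD stores it as 'e' (U+0065) + combining acute (U+0301),
            // the form such a file system keeps on disk.
            System.out.println(nfc.length());  // 1
            System.out.println(nfd.length());  // 2
        }
    }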

> in regular expressions, 

If the system interpreting the RE is UTF-16 enabled, yes, of course.
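
(In today's Java, for instance, the regex engine works directly on
UTF-16 strings and treats a surrogate pair as a single code point; a
minimal sketch:)

    public class RegexDemo {
        public static void main(String[] args) {
            String s = "\uD835\uDD38";  // U+1D538, a supplementary-plane letter
            System.out.println(s.length());      // 2 UTF-16 code units
            System.out.println(s.matches("."));  // true: '.' matches the
                                                 // whole surrogate pair
            System.out.println(s.matches("\\p{Lu}"));  // true: it is an
                                                       // uppercase letter
        }
    }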

> and in
> whatever legacy

No: in modern systems. One of the side effects of the popularity of XML
is that support for both UTF-8 and UTF-16 (also as external encodings)
is growing...

By the way, Java source code can be in UTF-8 or UTF-16, as well as in
legacy encodings. Unfortunately the compiler has to be told the
encoding via a command-line parameter (javac -encoding); having the
source files self-declare their encoding would be much better (compare
XML).
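
(Below is a rough sketch, in Java, of the kind of self-declaration one
could do instead: sniff the byte-order mark, the way XML parsers begin
their auto-detection. The helper is hypothetical, not any real
compiler's code.)

    public class BomSniffer {
        // Hypothetical helper: guess an encoding from the first bytes of
        // a source file (its byte-order mark, if any), falling back to a
        // caller-supplied default when no BOM is present.
        static String sniff(byte[] head, String fallback) {
            if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF)
                return "UTF-8";
            if (head.length >= 2 && (head[0] & 0xFF) == 0xFE
                    && (head[1] & 0xFF) == 0xFF)
                return "UTF-16BE";
            if (head.length >= 2 && (head[0] & 0xFF) == 0xFF
                    && (head[1] & 0xFF) == 0xFE)
                return "UTF-16LE";
            return fallback;  // no BOM: back to out-of-band declaration
        }
    }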

> applications that expect textual data.

> > So will a file filled with image data, video clips, or plainly a
> > list of raw integers dumped to file (not formatted as strings).
> 
> But none of these pretend to be text!

How is that relevant?  If you're going to do anything "higher-level"
with text, you have to know the encoding; otherwise you'll get lots of
more or less hidden bugs.  Have you ever had any experience with the
legacy "multibyte" encodings used for Chinese/Japanese/etc.? In many of
them, a byte that looks like an ASCII letter need not be one at all: it
may just be the second byte in the representation of a non-ASCII
character.  If you assume every "A" byte is an "A" (and interpret it in
some special way, say as part of a command name), you're in trouble!
Often hard-to-find trouble.  No one who argues that one can take text
in any "ASCII extension" and look only at the (apparent) ASCII,
treating everything else as some arbitrary extension that never affects
the processing, seems to be aware of the details of those encodings.
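
(A concrete Java illustration of the trap: in Shift_JIS, one of those
legacy encodings, the second byte of the common kanji '表' (U+8868) is
0x5C, the same byte as the ASCII backslash:)

    public class SjisTrap {
        public static void main(String[] args) throws Exception {
            byte[] b = "\u8868".getBytes("Shift_JIS");
            System.out.printf("%02X %02X%n", b[0] & 0xFF, b[1] & 0xFF); // 95 5C
            // A byte-oriented scanner that treats every 0x5C byte as '\'
            // (an escape character, a path separator, ...) misfires here:
            for (byte x : b)
                if (x == 0x5C) System.out.println("false ASCII '\\' found");
        }
    }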

By the way, video clips (and images) can and do have Unicode (UTF-16?)
text as components (e.g. subtitles).

> > True.  But implementing normalisation, or case mapping for 
> that matter,
> > is non-trivial too.  In practice, the additional complexity with
> > UTF-16 seems small. 
> 
> All right, but if there are no real advantages, why bother?

Efficiency (and backwards compatibility) is claimed by people who work
much more "in the trenches" with this than I do, and I have no quarrel
with that.
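
(Case mapping really is non-trivial, as the quoted text says; two
well-known Java data points: uppercasing can change a string's length,
and it can depend on the locale:)

    import java.util.Locale;

    public class CaseDemo {
        public static void main(String[] args) {
            System.out.println("\u00DF".toUpperCase(Locale.ROOT)); // "SS"
            System.out.println("i".toUpperCase(new Locale("tr")));
                // prints dotted capital I (U+0130), not plain ASCII "I"
        }
    }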

		Kind regards
		/kent k
