Date: Sat, 21 May 2005 16:31:07 +0100
From: Richard Dawe
To: djgpp-workers AT delorie DOT com
Subject: Re: wchar_t implementation and multibyte encoding
In-Reply-To: <200505192107.j4JL77xn003535@speedy.ludd.ltu.se>

Hello.

ams AT ludd DOT ltu DOT se wrote:

> According to Eli Zaretskii:
>
>>> From:
>>> Date: Sat, 14 May 2005 05:00:39 +0200 (CEST)
>>>
>>> Let's say we decide to encode Unicode in wchar_t, which is the only
>>> sane choice today.
>>
>> What exactly do you mean by ``encode Unicode in wchar_t''? Do you
>> mean we will store Unicode codepoints in there? If so, it's a mistake
>> to call this ``an encoding'', since encoding means you transform
>> Unicode codepoints to some other form, like UTF-8 or cp1250.
>
> I don't know the terminology, or find it confusing. I mean we put
> Unicode values in it, just like we put ASCII values in the type char.
>
> I wrote the previous mail thinking that wchar_t was int. Now I've
> looked and found it to be unsigned short.
>
> That's one thing that has to change.
>
>> Alternatively, perhaps you meant UTF-16 or some such, which is indeed
>> an encoding. But then it's not fixed-size, which is generally
>> inappropriate for wchar_t.
>
> No. I mean the Unicode encoding, which defines the range 0-0x10ffff.
> Are you telling me that that isn't an encoding?
>
> In your terminology, perhaps I want to say, "let's use Unicode
> codepoints".
>
> But logically (to me) _that_ _is_ a certain encoding. Weird
> terminology.
[snip]

You're confusing the codepoint, which is the numbering of characters,
symbols, etc., with how you represent them. The codepoints are
abstract.

When you talk about "Unicode encoding", this is UTF-32, a mapping of
each codepoint in the range 0-0x10FFFF to a 32-bit integer. That may
not seem like an encoding, but it is, because of endianness in the
encoded data.

UTF-8 encodes the codepoints into 1 to 4 bytes (up to 6 in the
original design), depending on the codepoint. The ASCII codepoints
happen to be representable using a single byte in UTF-8.

The Unicode FAQ is pretty helpful:

    http://www.cl.cam.ac.uk/~mgk25/unicode.html

Specifically:

    http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

Bye, Rich =]

--
Richard Dawe [ http://homepages.nildram.co.uk/~phekda/richdawe/ ]

"You can't evaluate a man by logic alone." -- McCoy, "I, Mudd",
Star Trek