Date: Sat, 21 May 2005 16:31:07 +0100
From: Richard Dawe
To: djgpp-workers AT delorie DOT com
Subject: Re: wchar_t implementation and multibyte encoding
In-Reply-To: <200505192107.j4JL77xn003535@speedy.ludd.ltu.se>

Hello.

ams AT ludd DOT ltu DOT se wrote:

> According to Eli Zaretskii:
>
>>> From:
>>> Date: Sat, 14 May 2005 05:00:39 +0200 (CEST)
>>>
>>> Let's say we decide to encode Unicode in wchar_t, which is the only
>>> sane choice today.
>>
>> What exactly do you mean by ``encode Unicode in wchar_t''? Do you
>> mean we will store Unicode codepoints in there? If so, it's a mistake
>> to call this ``an encoding'', since encoding means you transform
>> Unicode codepoints to some other form, like UTF-8 or cp1250.
>
> I don't know the terminology, or find it confusing. I mean we put
> Unicode values in it, just like we put ASCII values in the type char.
>
> I wrote the previous mail thinking that wchar_t was int. Now I've
> looked and found it to be unsigned short.
>
> That's one thing that has to change.
>
>> Alternatively, perhaps you meant UTF-16 or some such, which is indeed
>> an encoding. But then it's not fixed-size, which is generally
>> inappropriate for wchar_t.
>
> No. I mean the Unicode encoding, which defines the range 0-0x10ffff.
> Are you telling me that that isn't an encoding?
>
> In your terminology, perhaps I want to say, "let's use Unicode
> codepoints".
>
> But logically (to me) _that_ _is_ a certain encoding. Weird
> terminology.
[snip]

You're confusing the codepoint, which is the numbering of characters,
symbols, etc., with how you represent them. The codepoints are
abstract.

When you talk about "Unicode encoding", this is UTF-32, a mapping of
each codepoint in the range 0-0x10FFFF to a 32-bit integer. That may
not seem like an encoding, but it is, because of endianness in the
encoded data.

UTF-8 encodes the codepoints into 1 to 4 bytes (up to 6 in the
original design), depending on the codepoint. The ASCII codepoints
happen to be representable using a single byte in UTF-8.

The Unicode FAQ is pretty helpful:

    http://www.cl.cam.ac.uk/~mgk25/unicode.html

Specifically:

    http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

Bye, Rich =]

--
Richard Dawe [ http://homepages.nildram.co.uk/~phekda/richdawe/ ]

"You can't evaluate a man by logic alone." -- McCoy, "I, Mudd",
Star Trek