From: ams AT ludd DOT ltu DOT se
Message-Id: <200505192107.j4JL77xn003535@speedy.ludd.ltu.se>
Subject: Re: wchar_t implementation and multibyte encoding
In-Reply-To: <01c55989$Blat.v2.4$ebacf720@zahav.net.il> "from Eli Zaretskii at May 15, 2005 11:07:52 pm"
To: djgpp-workers AT delorie DOT com
Date: Thu, 19 May 2005 23:07:06 +0200 (CEST)
Reply-To: djgpp-workers AT delorie DOT com

According to Eli Zaretskii:
> > From:
> > Date: Sat, 14 May 2005 05:00:39 +0200 (CEST)
> >
> > Let say we decide to encode Unicode in wchar_t, which is the only
> > sane choice today.
>
> What exactly do you mean by ``encode Unicode in wchar_t''?  Do you
> mean we will store Unicode codepoints in there?  If so, it's a mistake
> to call this ``an encoding'', since encoding means you transform
> Unicode codepoints to some other form, like UTF-8 or cp1250.

I don't know the terminology. Or find it confusing. I mean we put
Unicode values in it, just like we put ASCII values in the type char.

I wrote the previous mail thinking that wchar_t was int. Now I've
looked and found it to be unsigned short. That's one thing that has to
change.

> Alternatively, perhaps you meant UTF-16 or some such, which is indeed
> an encoding.  But then it's not fixed-size, which is generally
> inappropriate for wchar_t.

No. I mean Unicode encoding, which defines the range 0-0x10ffff. Are
you telling me that that isn't an encoding? In your terminology,
perhaps I want to say, "let's use Unicode codepoints".
But logically (to me) _that_ _is_ a certain encoding. Weird terminology.

> If you do mean we shall store Unicode codepoints, we should decide how
> wide will they be.

21 bits, which probably will mean 32 bits.

> Currently, wchar_t is a 16-bit data type, which is
> enough only for the BMP.  Personally, I think that supporting the BMP
> is good enough for us, but if we decide otherwise, we will have to go

I disagree.

> for a wider type (an incompatible change).

Obviously, given the incorrect choice today. Either we do this
properly, or we don't do it at all.

...
> > Now if those functions (isw*()) should return different results
> > depending on locale, the sizes explode. So I hope not.
>
> Download the Unicode character database from the above Web site and
> look there.  I don't think the locale changes anything, but I might be
> wrong.

Well, if my locale is English, are Arabic or Chinese characters
printable? They might be, but I'm not sure that my hardware and/or OS
will be able to show them to me. So should iswprint() tell me 1 or 0?
I.e. are those functions reporting the abstract idea of the character
or what the system is able to display?

Hohum, writing that made me look up iswprint() in the standard. Then
my eyes fell upon iswpunct(), whose description says "... tests for
any printing wide character that is one of a locale-specific set of
punctuation wide characters...". So at least some of those functions
_are_ locale-dependent. Actually, almost the only one of those isw*()
whose description doesn't mention "locale-specific" is iswprint()!

Sooooo, that seems to imply that not only will the tables and/or
functions be huge, there will be an endless number of them as well. Or
do we get away from locale-specificness by using Unicode?


Right,

						MartinS
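
P.S. A minimal sketch of the width argument above, to make the numbers
concrete. This is not DJGPP code: the helper names bits_needed() and
fits_in_16_bits() and the two macros are made up for illustration, and
uint16_t merely stands in for the current unsigned short wchar_t.

```c
#include <assert.h>
#include <stdint.h>

#define UNICODE_MAX 0x10FFFFu   /* highest Unicode code point */
#define BMP_MAX     0xFFFFu     /* highest code point in the BMP */

/* Number of bits needed to represent v (0 needs 0 bits). */
static unsigned bits_needed(uint32_t v)
{
    unsigned n = 0;

    while (v != 0) {
        n++;
        v >>= 1;
    }
    return n;
}

/* Nonzero if code point cp survives being stored in a 16-bit type,
   which is what a 16-bit wchar_t would have to do with it. */
static int fits_in_16_bits(uint32_t cp)
{
    assert(cp <= UNICODE_MAX);  /* only defined for valid code points */
    return (uint32_t)(uint16_t)cp == cp;
}
```

With these, bits_needed(UNICODE_MAX) is 21 and bits_needed(BMP_MAX) is
16. A BMP character such as U+20AC fits in 16 bits, while something
outside the BMP such as U+1D11E does not, so it cannot be stored as a
single 16-bit wchar_t value.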