X-Authentication-Warning: delorie.com: mail set sender to djgpp-workers-bounces using -f
Date: Sun, 15 May 2005 23:07:52 +0300
From: "Eli Zaretskii"
Sender: halo1 AT zahav DOT net DOT il
To: djgpp-workers AT delorie DOT com
Message-ID: <01c55989$Blat.v2.4$ebacf720@zahav.net.il>
Content-Transfer-Encoding: 7BIT
Content-Type: text/plain; charset=ISO-8859-1
X-Mailer: emacs 22.0.50 (via feedmail 8 I) and Blat ver 2.4
In-reply-to: <200505140300.j4E30drm024968@speedy.ludd.ltu.se> (ams AT ludd DOT ltu DOT se)
Subject: Re: wchar_t implementation and multibyte encoding
References: <200505140300 DOT j4E30drm024968 AT speedy DOT ludd DOT ltu DOT se>
Reply-To: djgpp-workers AT delorie DOT com
Errors-To: nobody AT delorie DOT com
X-Mailing-List: djgpp-workers AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com
Precedence: bulk

> From:
> Date: Sat, 14 May 2005 05:00:39 +0200 (CEST)
>
> Let say we decide to encode Unicode in wchar_t, which is the only
> sane choice today.

What exactly do you mean by ``encode Unicode in wchar_t''?  Do you
mean we will store Unicode codepoints in there?  If so, it's a mistake
to call this ``an encoding'', since encoding means you transform
Unicode codepoints to some other form, like UTF-8 or cp1250.

Alternatively, perhaps you meant UTF-16 or some such, which is indeed
an encoding.  But then it's not fixed-size, which is generally
inappropriate for wchar_t.

If you do mean we shall store Unicode codepoints, we should decide how
wide they will be.  Currently, wchar_t is a 16-bit data type, which is
enough only for the BMP.  Personally, I think that supporting the BMP
is good enough for us, but if we decide otherwise, we will have to go
for a wider type (an incompatible change).

> Then the functions iswalnum(), iswalpha(), etc. are either going to be
> implemented as:
>
> 1. switch() and many, many case:'s,
>
> 2. if( 0 <= char <= 31 ) { return 0 }
>    if( 32 <= char <= 126 ) { return 1 }
>    if( ... )
>    ..., or
>
> 3. tables as isalnum(), isalpha(), etc. are today.
>
>
> 1 and 2: A lot of code. If anything I think gcc extended case x ... y:
> can come in useful, so I prefer 1 over 2.
>
> 3: An enourmous table. As Unicode has the range 0 - 0x10ffff, we are
> talking about more than 1MB!

See above wrt the range of the codepoints.

Anyway, I suggest that we don't reinvent the wheel, but instead look
at the Unicode Consortium Web site (www.unicode.org) and at glibc.
I'm certain they have some good suggestions.

> Now if those functions (isw*()) should return different results
> depending on locale, the sizes explode. So I hope not.

Download the Unicode character database from the above Web site and
look there.  I don't think the locale changes anything, but I might be
wrong.

> With regard to which multibyte encoding we should use, I strongly
> prefer UTF-8.

Right.
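
To make the distinction between a codepoint and its encodings concrete,
here is a minimal sketch in C.  It is an illustration only, not code
from DJGPP or glibc: the names utf8_encode and utf16_encode are
hypothetical, and the codepoint is passed as a uint32_t rather than a
wchar_t precisely because a 16-bit wchar_t cannot hold values above the
BMP.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode codepoint as UTF-8.
   Returns the number of bytes written to out (1 to 4), or 0 if cp
   is a surrogate or lies beyond 0x10FFFF. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogates are not characters */
    if (cp < 0x80) {                    /* 7 bits: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 11 bits: 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* 16 bits: 3 bytes, rest of the BMP */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* 21 bits: 4 bytes, beyond the BMP */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: encode one Unicode codepoint as UTF-16.
   Returns the number of 16-bit units written to out (1 or 2), or 0
   if cp is a surrogate or lies beyond 0x10FFFF. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;
    if (cp < 0x10000) {                 /* BMP: the unit is the codepoint */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits left, split 10 + 10 */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For a BMP codepoint such as 0x20AC, utf16_encode returns 1 and the
single unit equals the codepoint, so a 16-bit wchar_t can hold it
directly.  For 0x1D11E it returns 2 (a surrogate pair), which is the
sense in which UTF-16 is not fixed-size; utf8_encode returns 3 and 4
bytes for the same two codepoints.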