X-Authentication-Warning: delorie.com: mail set sender to djgpp-workers-bounces using -f
Date: Sun, 15 May 2005 15:58:09 -0600
From: Brian Inglis
Subject: Re: wchar_t implementation and multibyte encoding
In-reply-to: <200505140300.j4E30drm024968@speedy.ludd.ltu.se>
To: djgpp-workers AT delorie DOT com
Message-id:
Organization: Systematic Software
MIME-version: 1.0
X-Mailer: Forte Agent 1.93/32.576 English (American)
Content-type: text/plain; charset=us-ascii
References: <200505140300 DOT j4E30drm024968 AT speedy DOT ludd DOT ltu DOT se>
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by delorie.com id j4FLwXmM024942
Reply-To: djgpp-workers AT delorie DOT com
Errors-To: nobody AT delorie DOT com
X-Mailing-List: djgpp-workers AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com
Precedence: bulk

On Sat, 14 May 2005 05:00:39 +0200 (CEST), ams AT ludd DOT ltu DOT se wrote:

>Hello.
>
>I've been thinking about this a little. Let's say we decide to encode
>Unicode in wchar_t, which is the only sane choice today.
>
>Then the functions iswalnum(), iswalpha(), etc. are either going to be
>implemented as:
>
>1. switch() and many, many case:'s,
>
>2. if( 0 <= char <= 31 ) { return 0 }
>   if( 32 <= char <= 126 ) { return 1 }
>   if( ... )
>   ..., or
>
>3. tables as isalnum(), isalpha(), etc. are today.
>
>
>1 and 2: A lot of code. If anything I think gcc extended case x ... y:
>can come in useful, so I prefer 1 over 2.
>
>3: An enormous table. As Unicode has the range 0 - 0x10ffff, we are
>talking about more than 1MB!

Combine the ideas: a sorted table of wide char ranges with the same
properties (start code, end code, type properties), and a binary
search to find the range. That avoids lots of hard-coded tests.

>Now if those functions (isw*()) should return different results
>depending on locale, the sizes explode. So I hope not.

I thought the point was that all locales coexist, so there is no need
to switch, but the type properties need to be extended for Han
characters, and maybe others. It has been a while since I read up on
the Unicode support, but ISTR some document(s) on extended properties
and tests.

>With regard to which multibyte encoding we should use, I strongly
>prefer UTF-8.

Agreed; but for wide chars, there is a choice of UCS-2, which still
has many multi-char encodings, or UCS-4, which has few, perhaps
ignorable. We might need to allow extensibility to set properties in
the private use areas, or could we just ignore them?

>Opinions?

--
Thanks. Take care, Brian Inglis
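
P.S. A minimal sketch of the range-table idea above, in C. Everything
named here (the WC_* property bits, struct wc_range, wc_props(),
my_iswalpha(), my_iswdigit()) is made up for illustration, and the
table holds only a handful of sample ranges; a real table would
presumably be generated from the Unicode data files. Treating Han
ideographs as "alpha" is also an assumption on my part.

```c
#include <stddef.h>

/* Hypothetical property bits; the names are illustrative only. */
enum {
    WC_ALPHA = 1 << 0,
    WC_DIGIT = 1 << 1,
    WC_SPACE = 1 << 2
};

/* One entry per run of consecutive codes sharing the same properties. */
struct wc_range {
    unsigned long first;  /* start code, inclusive */
    unsigned long last;   /* end code, inclusive */
    unsigned props;       /* OR of WC_* bits */
};

/* Must stay sorted by 'first', with no overlapping ranges.
   Sample entries only, nowhere near full Unicode coverage. */
static const struct wc_range wc_table[] = {
    { 0x0009, 0x000D, WC_SPACE },  /* TAB .. CR */
    { 0x0020, 0x0020, WC_SPACE },  /* SPACE */
    { 0x0030, 0x0039, WC_DIGIT },  /* 0 .. 9 */
    { 0x0041, 0x005A, WC_ALPHA },  /* A .. Z */
    { 0x0061, 0x007A, WC_ALPHA },  /* a .. z */
    { 0x0391, 0x03A1, WC_ALPHA },  /* Greek capital Alpha .. Rho */
    { 0x4E00, 0x9FA5, WC_ALPHA }   /* part of the CJK ideograph block */
};

/* Binary search for the range containing c; 0 means "no properties". */
static unsigned wc_props(unsigned long c)
{
    size_t lo = 0;
    size_t hi = sizeof wc_table / sizeof wc_table[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (c < wc_table[mid].first)
            hi = mid;               /* c lies below this range */
        else if (c > wc_table[mid].last)
            lo = mid + 1;           /* c lies above this range */
        else
            return wc_table[mid].props;
    }
    return 0;
}

/* Hypothetical stand-ins for iswalpha() and iswdigit(). */
static int my_iswalpha(unsigned long c)
{
    return (wc_props(c) & WC_ALPHA) != 0;
}

static int my_iswdigit(unsigned long c)
{
    return (wc_props(c) & WC_DIGIT) != 0;
}
```

Each lookup costs about log2(number of ranges) comparisons, and the
table grows with the number of property runs rather than with the
0 - 0x10ffff code range, so it should stay far below the 1MB flat
table. All the isw*() functions can share the one table by testing
different bits.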