Mail Archives: djgpp-workers/2005/05/15/17:58:51
On Sat, 14 May 2005 05:00:39 +0200 (CEST), ams AT ludd DOT ltu DOT se wrote:
>Hello.
>
>I've been thinking about this a little. Let say we decide to encode
>Unicode in wchar_t, which is the only sane choice today.
>
>Then the functions iswalnum(), iswalpha(), etc. are either going to be
>implemented as:
>
>1. switch() and many, many case:'s,
>
>2. if( 0 <= char <= 31 ) { return 0 }
> if( 32 <= char <= 126 ) { return 1 }
> if( ... )
> ..., or
>
>3. tables as isalnum(), isalpha(), etc. are today.
>
>
>1 and 2: A lot of code. If anything I think gcc extended case x ... y:
>can come in useful, so I prefer 1 over 2.
>
>3: An enormous table. As Unicode has the range 0 - 0x10ffff, we are
>talking about more than 1MB!
Combine the ideas: a table of wide char ranges sharing the same
properties -- start code, end code, type flags -- plus a binary search
to find the range containing a character. That avoids lots of
hard-coded tests and keeps the table small, since runs of consecutive
code points with identical properties collapse to one entry.
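A minimal sketch of that approach in C (the range data below is a tiny
illustrative subset, not real Unicode property tables, and the names
wc_props(), my_iswalpha() etc. are hypothetical, not existing DJGPP
functions):

```c
#include <stddef.h>
#include <wchar.h>

/* Property bits; a real implementation would mirror the ctype flags. */
enum { WF_ALPHA = 1, WF_DIGIT = 2, WF_SPACE = 4 };

typedef struct {
    wchar_t lo, hi;        /* inclusive range of code points */
    unsigned short flags;  /* shared property bits for the range */
} wc_range;

/* Must be sorted by lo and non-overlapping for the binary search. */
static const wc_range wc_table[] = {
    { 0x0030, 0x0039, WF_DIGIT },  /* ASCII digits       */
    { 0x0041, 0x005A, WF_ALPHA },  /* A-Z                */
    { 0x0061, 0x007A, WF_ALPHA },  /* a-z                */
    { 0x00C0, 0x00D6, WF_ALPHA },  /* Latin-1 (partial)  */
    { 0x4E00, 0x9FFF, WF_ALPHA },  /* CJK ideographs     */
};

static unsigned short wc_props(wchar_t wc)
{
    size_t lo = 0, hi = sizeof wc_table / sizeof wc_table[0];
    while (lo < hi) {
        size_t mid = (lo + hi) / 2;
        if (wc < wc_table[mid].lo)
            hi = mid;               /* search lower half  */
        else if (wc > wc_table[mid].hi)
            lo = mid + 1;           /* search upper half  */
        else
            return wc_table[mid].flags;  /* inside range  */
    }
    return 0;  /* not listed: no properties */
}

int my_iswalpha(wchar_t wc) { return (wc_props(wc) & WF_ALPHA) != 0; }
int my_iswdigit(wchar_t wc) { return (wc_props(wc) & WF_DIGIT) != 0; }
```

The full Unicode table would run to a few thousand ranges, so lookups
cost roughly a dozen comparisons instead of a megabyte-sized flat array.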
>Now if those functions (isw*()) should return different results
>depending on locale, the sizes explode. So I hope not.
I thought the point was that all locales coexist, so no need to
switch, but the type properties need to be extended for Han
characters, and maybe others.
Been a while since I read up on the Unicode support, but ISTR some
document(s) on extended properties and tests.
>With regard to which multibyte encoding we should use, I strongly
>prefer UTF-8.
Agreed; but for wide chars, there is a choice of UCS-2, which still
needs many multi-unit encodings, or UCS-4, which has few, perhaps
ignorable. We might need to allow extensibility to set properties in
the private use areas, or could just ignore them?
>Opinions?
--
Thanks. Take care, Brian Inglis