Mail Archives: djgpp-workers/2005/05/15/16:11:02
> From: <ams AT ludd DOT ltu DOT se>
> Date: Sat, 14 May 2005 05:00:39 +0200 (CEST)
>
> Let's say we decide to encode Unicode in wchar_t, which is the only
> sane choice today.
What exactly do you mean by ``encode Unicode in wchar_t''? Do you
mean we will store Unicode codepoints in there? If so, it's a mistake
to call this ``an encoding'', since encoding means you transform
Unicode codepoints to some other form, like UTF-8 or cp1250.
Alternatively, perhaps you meant UTF-16 or some such, which is indeed
an encoding. But then it's not fixed-size, which is generally
inappropriate for wchar_t.
If you do mean we shall store Unicode codepoints, we should decide how
wide they will be. Currently, wchar_t is a 16-bit data type, which is
enough only for the BMP. Personally, I think that supporting the BMP
is good enough for us, but if we decide otherwise, we will have to go
for a wider type (an incompatible change).
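
To make the distinction concrete, here is a minimal sketch of why UTF-16
is not fixed-size while a BMP codepoint fits in one 16-bit unit. The
function names are hypothetical and are not part of any existing DJGPP
interface.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only. */

/* Nonzero if the codepoint lies in the BMP, i.e. fits in 16 bits. */
static int cp_in_bmp(uint32_t cp)
{
    return cp <= 0xFFFF;
}

/* Encode one codepoint as UTF-16.  Writes 1 or 2 units to out[] and
   returns the number written, or 0 if cp is a surrogate or is beyond
   U+10FFFF. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogates are not characters */
    if (cp > 0x10FFFF)
        return 0;                       /* outside the Unicode range */
    if (cp <= 0xFFFF) {
        out[0] = (uint16_t)cp;          /* BMP: one unit, value unchanged */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

A codepoint outside the BMP needs two units, so a 16-bit wchar_t holding
UTF-16 would no longer be one character per element.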
> Then the functions iswalnum(), iswalpha(), etc. are either going to be
> implemented as:
>
> 1. switch() and many, many case:'s,
>
> 2. if( 0 <= char <= 31 ) { return 0 }
> if( 32 <= char <= 126 ) { return 1 }
> if( ... )
> ..., or
>
> 3. tables as isalnum(), isalpha(), etc. are today.
>
>
> 1 and 2: A lot of code. If anything I think gcc extended case x ... y:
> can come in useful, so I prefer 1 over 2.
>
> 3: An enormous table. As Unicode has the range 0 - 0x10ffff, we are
> talking about more than 1MB!
See above wrt the range of the codepoints.
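
For reference, options 1 and 2 quoted above could look roughly like the
sketch below. It covers only the ASCII range and uses hypothetical
function names. Note that a chained comparison such as 0 <= c <= 31 does
not test a range in C, so each bound has to be written out; the
case x ... y form is a GNU C extension, not standard C.

```c
/* Hypothetical names; ASCII-only illustration of the two code styles. */

/* Option 2: explicit range tests, each bound written out. */
static int demo_isprint_ascii(unsigned int c)
{
    if (c <= 31)
        return 0;               /* control characters */
    if (c >= 32 && c <= 126)
        return 1;               /* printable ASCII */
    return 0;                   /* DEL and everything above: not handled */
}

/* Option 1: switch with GNU C case ranges (gcc extension). */
static int demo_isalnum_ascii(unsigned int c)
{
    switch (c) {
    case '0' ... '9':
    case 'A' ... 'Z':
    case 'a' ... 'z':
        return 1;
    default:
        return 0;
    }
}
```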
Anyway, I suggest that we don't reinvent the wheel, but instead look at
the Unicode consortium Web site (www.unicode.org) and in glibc. I'm
certain they have some good suggestions.
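
One commonly used way to keep a table-based approach small is a
two-stage table: the high bits of the codepoint select a block, and
blocks with identical contents are stored only once. The sketch below
assumes a 16-bit (BMP-only) wchar_t and blocks of 256 codepoints. All
names are hypothetical, and the demo data covers only ASCII letters and
digits; real tables would be generated from the Unicode character
database.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SHIFT 8
#define BLOCK_SIZE  (1u << BLOCK_SHIFT)     /* 256 codepoints per block */

#define CLS_ALPHA 0x01
#define CLS_DIGIT 0x02

/* stage1[cp >> 8] gives the index of a block in stage2.  Every range of
   the BMP without demo data shares block 0, which is all zeros. */
static uint8_t stage1[0x10000u >> BLOCK_SHIFT];
static uint8_t stage2[2][BLOCK_SIZE];

/* Fill the demo tables at run time (a real build would emit them as
   constant data from a generator script). */
static void demo_tables_init(void)
{
    unsigned c;

    memset(stage1, 0, sizeof stage1);
    memset(stage2, 0, sizeof stage2);
    stage1[0] = 1;                          /* U+0000..U+00FF -> block 1 */
    for (c = '0'; c <= '9'; c++)
        stage2[1][c] = CLS_DIGIT;
    for (c = 'A'; c <= 'Z'; c++)
        stage2[1][c] = CLS_ALPHA;
    for (c = 'a'; c <= 'z'; c++)
        stage2[1][c] = CLS_ALPHA;
}

/* Two lookups: block index first, then the class bits inside the block. */
static int demo_iswalnum(uint16_t wc)
{
    uint8_t cls = stage2[stage1[wc >> BLOCK_SHIFT]][wc & (BLOCK_SIZE - 1)];
    return (cls & (CLS_ALPHA | CLS_DIGIT)) != 0;
}
```

With this layout the demo uses 256 + 2 * 256 bytes instead of one byte
for each of the 65536 BMP codepoints; how well real data compresses
depends on how many distinct blocks the character database produces.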
> Now if those functions (isw*()) should return different results
> depending on locale, the sizes explode. So I hope not.
Download the Unicode character database from the above Web site and
look there. I don't think the locale changes anything, but I might be
wrong.
> With regard to which multibyte encoding we should use, I strongly
> prefer UTF-8.
Right.
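
As an illustration of what the multibyte side would involve, here is a
minimal sketch of encoding one BMP codepoint as UTF-8. It assumes a
16-bit wchar_t holding a Unicode codepoint; the function name is
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one BMP codepoint as UTF-8.  Writes 1 to 3
   bytes to out[] and returns the count, or 0 for a surrogate value. */
static size_t utf8_encode_bmp(uint16_t cp, unsigned char out[3])
{
    if (cp < 0x80) {                    /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogates are not encodable */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));          /* 1110xxxx */
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));  /* 10xxxxxx */
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));         /* 10xxxxxx */
    return 3;
}
```

ASCII passes through unchanged as single bytes, and no BMP codepoint
needs more than three bytes.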