Mail Archives: djgpp-workers/2005/05/15/16:11:02

X-Authentication-Warning: delorie.com: mail set sender to djgpp-workers-bounces using -f
Date: Sun, 15 May 2005 23:07:52 +0300
From: "Eli Zaretskii" <eliz AT gnu DOT org>
Sender: halo1 AT zahav DOT net DOT il
To: djgpp-workers AT delorie DOT com
Message-ID: <01c55989$Blat.v2.4$ebacf720@zahav.net.il>
X-Mailer: emacs 22.0.50 (via feedmail 8 I) and Blat ver 2.4
In-reply-to: <200505140300.j4E30drm024968@speedy.ludd.ltu.se>
(ams AT ludd DOT ltu DOT se)
Subject: Re: wchar_t implementation and multibyte encoding
References: <200505140300 DOT j4E30drm024968 AT speedy DOT ludd DOT ltu DOT se>
Reply-To: djgpp-workers AT delorie DOT com
Errors-To: nobody AT delorie DOT com
X-Mailing-List: djgpp-workers AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com

> From: <ams AT ludd DOT ltu DOT se>
> Date: Sat, 14 May 2005 05:00:39 +0200 (CEST)
> 
> Let's say we decide to encode Unicode in wchar_t, which is the only
> sane choice today.

What exactly do you mean by ``encode Unicode in wchar_t''?  Do you
mean we will store Unicode codepoints in there?  If so, it's a mistake
to call this ``an encoding'', since encoding means you transform
Unicode codepoints to some other form, like UTF-8 or cp1250.

Alternatively, perhaps you meant UTF-16 or some such, which is indeed
an encoding.  But then it's not fixed-size, which is generally
inappropriate for wchar_t.

If you do mean we shall store Unicode codepoints, we should decide how
wide they will be.  Currently, wchar_t is a 16-bit data type, which is
enough only for the BMP.  Personally, I think that supporting the BMP
is good enough for us, but if we decide otherwise, we will have to go
for a wider type (an incompatible change).
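
To make the fixed-size point concrete, here is a minimal sketch of
UTF-16 encoding.  The function name utf16_encode is hypothetical, not
something in DJGPP's libc.  It shows that a codepoint inside the BMP
fits in one 16-bit unit, while anything above it needs a surrogate
pair, i.e. two units.

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only (not a DJGPP API).
   Encodes codepoint cp as UTF-16 into out[] and returns the number of
   16-bit units used: 1 inside the BMP, 2 above it, 0 if cp is not a
   valid Unicode scalar value. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)   /* surrogates are not characters */
        return 0;
    if (cp > 0x10FFFF)                  /* beyond the Unicode range */
        return 0;
    if (cp < 0x10000) {                 /* BMP: one unit, stored as-is */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits left to distribute */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

With a 16-bit wchar_t, a codepoint such as U+1D11E therefore cannot
live in a single wchar_t; it would have to be spread over two of them.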

> Then the functions iswalnum(), iswalpha(), etc. are either going to be
> implemented as:
> 
> 1. switch() and many, many case:'s,
> 
> 2. if( 0 <= c && c <= 31 ) { return 0; }
>    if( 32 <= c && c <= 126 ) { return 1; }
>    if( ... )
>    ..., or
> 
> 3. tables as isalnum(), isalpha(), etc. are today.
> 
> 
> 1 and 2: A lot of code. If anything I think gcc extended case x ... y:
> can come in useful, so I prefer 1 over 2.
> 
> 3: An enormous table. As Unicode has the range 0 - 0x10ffff, we are
> talking about more than 1MB!

See above wrt the range of the codepoints.

Anyway, I suggest that we don't reinvent the wheel, but instead look at
the Unicode consortium Web site (www.unicode.org) and at glibc.  I'm
certain they have some good suggestions.
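
For what it's worth, there is a middle ground between a huge switch
and a flat table: a sorted table of codepoint ranges searched with
bsearch().  The sketch below is only an illustration of that idea; I
am not claiming this is what glibc or the Unicode consortium do.  The
names wrange, alpha_ranges and sketch_iswalpha are made up, and the
table is a tiny, deliberately incomplete sample (ASCII, part of
Latin-1, basic Greek), not real classification data.

```c
#include <stdlib.h>

/* Hypothetical names throughout; the ranges are an incomplete sample. */
struct wrange { unsigned long first, last; };

/* Must stay sorted by 'first' and must not overlap. */
static const struct wrange alpha_ranges[] = {
    { 0x0041, 0x005A },   /* A-Z */
    { 0x0061, 0x007A },   /* a-z */
    { 0x00C0, 0x00D6 },   /* Latin-1 letters, stops before U+00D7 */
    { 0x00D8, 0x00F6 },   /* Latin-1 letters, stops before U+00F7 */
    { 0x00F8, 0x00FF },
    { 0x0391, 0x03A1 },   /* Greek capitals Alpha..Rho */
    { 0x03A3, 0x03A9 },   /* Greek capitals Sigma..Omega */
    { 0x03B1, 0x03C9 }    /* Greek small alpha..omega */
};

/* bsearch() comparator: the key is a codepoint, the element a range. */
static int cmp_range(const void *key, const void *elem)
{
    unsigned long c = *(const unsigned long *)key;
    const struct wrange *r = (const struct wrange *)elem;

    if (c < r->first) return -1;
    if (c > r->last)  return 1;
    return 0;
}

/* Nonzero if c falls inside one of the sample ranges. */
int sketch_iswalpha(unsigned long c)
{
    return bsearch(&c, alpha_ranges,
                   sizeof alpha_ranges / sizeof alpha_ranges[0],
                   sizeof alpha_ranges[0], cmp_range) != NULL;
}
```

The table grows with the number of ranges rather than with the number
of codepoints, and the lookup is logarithmic in the number of ranges.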

> Now if those functions (isw*()) should return different results
> depending on locale, the sizes explode. So I hope not.

Download the Unicode character database from the above Web site and
look there.  I don't think the locale changes anything, but I might be
wrong.
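
The main file in that database, UnicodeData.txt, has one line per
codepoint with semicolon-separated fields: the codepoint in hex, the
character name, the general category (Lu, Ll, Nd, ...), and so on.  A
rough sketch of pulling out the first and third fields follows.  The
function name parse_ucd_line is hypothetical, and the sketch ignores
the special "First>"/"Last>" entries the file uses for large ranges.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
   Parses one UnicodeData.txt line such as
     0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
   Stores the codepoint in *cp and the two-letter general category in
   category[].  Returns 0 on success, -1 on a malformed line. */
int parse_ucd_line(const char *line, unsigned long *cp, char category[3])
{
    const char *p, *q;
    char *end;

    *cp = strtoul(line, &end, 16);      /* field 0: codepoint in hex */
    if (end == line || *end != ';')
        return -1;

    p = strchr(end + 1, ';');           /* skip field 1: the name */
    if (p == NULL)
        return -1;
    p++;                                /* start of field 2 */

    q = strchr(p, ';');                 /* end of field 2 */
    if (q == NULL || q - p != 2)        /* categories are two letters */
        return -1;

    category[0] = p[0];
    category[1] = p[1];
    category[2] = '\0';
    return 0;
}
```

A generator script or program along these lines could turn the
database into whatever tables the isw*() functions end up using.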

> With regard to which multibyte encoding we should use, I strongly
> prefer UTF-8.

Right.
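
For reference, converting a codepoint to UTF-8 is a few lines of bit
shuffling.  This is a minimal sketch, not a proposal for the actual
wctomb() implementation; the name utf8_encode is hypothetical.

```c
/* Hypothetical helper, for illustration only (not a DJGPP API).
   Encodes codepoint cp as UTF-8 into out[] and returns the number of
   bytes written (1 to 4), or 0 if cp is a surrogate or out of range. */
int utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                    /* 7 bits: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* 16 bits: the rest of the BMP */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* 21 bits: beyond the BMP */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Note that everything in the BMP needs at most 3 bytes; the 4-byte form
is only used for codepoints that a 16-bit wchar_t cannot hold anyway.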
