Mail Archives: djgpp-workers/2005/05/15/17:58:51

X-Authentication-Warning: delorie.com: mail set sender to djgpp-workers-bounces using -f
Date: Sun, 15 May 2005 15:58:09 -0600
From: Brian Inglis <Brian DOT Inglis AT SystematicSw DOT ab DOT ca>
Subject: Re: wchar_t implementation and multibyte encoding
In-reply-to: <200505140300.j4E30drm024968@speedy.ludd.ltu.se>
To: djgpp-workers AT delorie DOT com
Message-id: <gehf81hstsq40b8aq409d2nqnh4du2ufd1@4ax.com>
Organization: Systematic Software
MIME-version: 1.0
X-Mailer: Forte Agent 1.93/32.576 English (American)
References: <200505140300 DOT j4E30drm024968 AT speedy DOT ludd DOT ltu DOT se>
X-MIME-Autoconverted: from quoted-printable to 8bit by delorie.com id j4FLwXmM024942
Reply-To: djgpp-workers AT delorie DOT com
Errors-To: nobody AT delorie DOT com
X-Mailing-List: djgpp-workers AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com

On Sat, 14 May 2005 05:00:39 +0200 (CEST), ams AT ludd DOT ltu DOT se wrote:

>Hello.
>
>I've been thinking about this a little. Let's say we decide to encode
>Unicode in wchar_t, which is the only sane choice today.
>
>Then the functions iswalnum(), iswalpha(), etc. are either going to be
>implemented as:
>
>1. switch() and many, many case:'s,
>
>2. if (0 <= c && c <= 31) { return 0; }
>   if (32 <= c && c <= 126) { return 1; }
>   if( ... )
>   ..., or
>
>3. tables as isalnum(), isalpha(), etc. are today.
>
>
>1 and 2: A lot of code. If anything, I think gcc's extended
>case x ... y: syntax could come in useful, so I prefer 1 over 2.
>
>3: An enormous table. As Unicode has the range 0 - 0x10ffff, we are
>talking about more than 1 MB!

Combine the ideas: a table of wide-char ranges that share the same
properties (start code, end code, type properties), with a binary
search to find the range. That avoids lots of hard-coded tests.
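
A minimal sketch of such a range table with a binary-search lookup, in
C. Everything named here (the WC_* property bits, struct wc_range,
wc_props(), the my_isw*() wrappers) is hypothetical and not taken from
any DJGPP source; the table holds only a handful of illustrative runs,
where real data would presumably be generated from the Unicode
Character Database. Code points are passed as unsigned long so the
sketch does not depend on the width of wchar_t.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical property bits; not from any existing header. */
enum {
    WC_ALPHA = 1u << 0,
    WC_DIGIT = 1u << 1,
    WC_SPACE = 1u << 2
};

/* One run of consecutive code points that share the same properties. */
struct wc_range {
    unsigned long first;  /* first code point in the run */
    unsigned long last;   /* last code point in the run, inclusive */
    unsigned props;       /* OR of the WC_* bits */
};

/* Illustrative data only. Must be sorted by 'first' with no overlaps. */
static const struct wc_range wc_table[] = {
    { 0x0020ul, 0x0020ul, WC_SPACE },  /* space */
    { 0x0030ul, 0x0039ul, WC_DIGIT },  /* 0-9 */
    { 0x0041ul, 0x005Aul, WC_ALPHA },  /* A-Z */
    { 0x0061ul, 0x007Aul, WC_ALPHA },  /* a-z */
    { 0x4E00ul, 0x9FA5ul, WC_ALPHA }   /* a run of CJK unified ideographs */
};

#define WC_TABLE_LEN (sizeof wc_table / sizeof wc_table[0])

/* Debug aid: binary search is only valid on sorted, disjoint ranges. */
static void wc_table_check(void)
{
    size_t i;
    for (i = 0; i < WC_TABLE_LEN; i++) {
        assert(wc_table[i].first <= wc_table[i].last);
        if (i > 0)
            assert(wc_table[i - 1].last < wc_table[i].first);
    }
}

/* Binary search for the range containing cp; 0 if no range matches. */
static unsigned wc_props(unsigned long cp)
{
    size_t lo = 0;
    size_t hi = WC_TABLE_LEN;  /* one past the last candidate */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < wc_table[mid].first)
            hi = mid;
        else if (cp > wc_table[mid].last)
            lo = mid + 1;
        else
            return wc_table[mid].props;
    }
    return 0;
}

static int my_iswalpha(unsigned long cp) { return (wc_props(cp) & WC_ALPHA) != 0; }
static int my_iswdigit(unsigned long cp) { return (wc_props(cp) & WC_DIGIT) != 0; }
static int my_iswspace(unsigned long cp) { return (wc_props(cp) & WC_SPACE) != 0; }
```

Each isw*() test then costs one lookup of O(log n) in the number of
ranges, and the table grows with the number of runs rather than with
the size of the code space.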

>Now if those functions (isw*()) should return different results
>depending on locale, the sizes explode. So I hope not.

I thought the point was that all locales coexist, so there is no need
to switch, but the type properties need to be extended for Han
characters, and maybe others.
It has been a while since I read up on the Unicode support, but ISTR
some document(s) on extended properties and tests.

>With regard to which multibyte encoding we should use, I strongly
>prefer UTF-8.

Agreed; but for wide chars, there is a choice of UCS-2, which still has
many multi-char encodings, or UCS-4, which has few, perhaps ignorable.
We might need to allow extensibility to set properties in the private
use areas, or could we just ignore them?
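
To make the width trade-off concrete, here is a small sketch in C. It
assumes the multi-unit sequences in question are UTF-16 surrogate
pairs, which is only one reading of the point above: with 16-bit units,
every code point above U+FFFF takes two units, while a 32-bit unit
always holds one code point. The function names utf16_encode() and
is_private_use() are hypothetical; the private use ranges are the three
defined by Unicode.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one code point as 16-bit units.
   Returns the number of units written (1 or 2), or 0 if cp is not a
   valid Unicode scalar value. */
static size_t utf16_encode(unsigned long cp, unsigned short out[2])
{
    if (cp > 0x10FFFFul)
        return 0;                      /* beyond the Unicode range */
    if (cp >= 0xD800ul && cp <= 0xDFFFul)
        return 0;                      /* surrogate values are not characters */
    if (cp < 0x10000ul) {
        out[0] = (unsigned short)cp;   /* fits in one 16-bit unit */
        return 1;
    }
    cp -= 0x10000ul;                   /* 20 bits remain */
    out[0] = (unsigned short)(0xD800ul + (cp >> 10));     /* high surrogate */
    out[1] = (unsigned short)(0xDC00ul + (cp & 0x3FFul)); /* low surrogate */
    assert(out[0] >= 0xD800 && out[0] <= 0xDBFF);         /* internal invariant */
    return 2;
}

/* True if cp lies in one of the three Unicode private use ranges. */
static int is_private_use(unsigned long cp)
{
    return (cp >= 0xE000ul   && cp <= 0xF8FFul)
        || (cp >= 0xF0000ul  && cp <= 0xFFFFDul)
        || (cp >= 0x100000ul && cp <= 0x10FFFDul);
}
```

A classification routine could call is_private_use() first and either
report no properties for those code points or consult an
application-supplied table; which of the two is right is the open
question here.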

>Opinions?

-- 
Thanks. Take care, Brian Inglis
