Mail Archives: djgpp-workers/2005/05/15/06:54:06

X-Authentication-Warning: delorie.com: mail set sender to djgpp-workers-bounces using -f
From: <ams AT ludd DOT ltu DOT se>
Message-Id: <200505140300.j4E30drm024968@speedy.ludd.ltu.se>
Subject: wchar_t implementation and multibyte encoding
To: DJGPP-WORKERS <djgpp-workers AT delorie DOT com>
Date: Sat, 14 May 2005 05:00:39 +0200 (CEST)
X-Mailer: ELM [version 2.4ME+ PL78 (25)]
MIME-Version: 1.0
X-ltu-MailScanner-Information: Please contact the ISP for more information
X-ltu-MailScanner: Found to be clean
X-MailScanner-From: ams AT ludd DOT ltu DOT se
Reply-To: djgpp-workers AT delorie DOT com

Hello.

I've been thinking about this a little. Let's say we decide to encode
Unicode in wchar_t, which is the only sane choice today.

Then the functions iswalnum(), iswalpha(), etc. are either going to be
implemented as:

1. switch() and many, many case:'s,

2. if( 0 <= c && c <= 31 ) { return 0; }
   if( 32 <= c && c <= 126 ) { return 1; }
   if( ... )
   ..., or

3. tables as isalnum(), isalpha(), etc. are today.


1 and 2: A lot of code. If anything, I think gcc's extended
case x ... y: syntax could come in useful, so I prefer 1 over 2.
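
For illustration, option 1 with gcc's case ranges might look roughly
like the sketch below. The name my_iswprint_ascii is made up, and only
the two ASCII ranges from option 2 are covered; a real isw*() function
would need the Unicode ranges as well.

```c
#include <wchar.h>

/* Hypothetical sketch: classifies only the ASCII ranges from option 2.
   "case low ... high:" is a GCC extension, not standard C; the spaces
   around the "..." are required. */
int my_iswprint_ascii(wint_t wc)
{
  switch (wc)
  {
    case 0 ... 31:      /* control characters */
      return 0;
    case 32 ... 126:    /* printable ASCII */
      return 1;
    default:            /* everything else is left unclassified here */
      return 0;
  }
}
```

Each additional Unicode range would become one more case label, so the
amount of code grows with the number of ranges, not with the number of
code points.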

3: An enormous table. As Unicode has the range 0 - 0x10ffff, we are
talking about more than 1MB, even at one byte per code point!
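
The arithmetic behind that estimate can be sketched as follows. The
helper flat_table_bytes and the macro name are made up for
illustration, and the entry size is an assumption; the real size
depends on how many flag bits each entry needs.

```c
#include <stddef.h>

/* Code points 0 .. 0x10ffff inclusive. */
#define UNICODE_CODE_POINTS 0x110000UL

/* Hypothetical helper: size in bytes of one flat lookup table with
   one entry per code point. */
size_t flat_table_bytes(size_t bytes_per_entry)
{
  return UNICODE_CODE_POINTS * bytes_per_entry;
}
```

With one byte per entry this gives 1114112 bytes, which is just over
1MB (1048576 bytes); wider entries scale that up linearly.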


Now if those functions (isw*()) should return different results
depending on locale, the sizes explode. So I hope not.


With regard to which multibyte encoding we should use, I strongly
prefer UTF-8.


Opinions?


Right,

						MartinS
