From: <ams AT ludd DOT ltu DOT se>
Message-Id: <200505192107.j4JL77xn003535@speedy.ludd.ltu.se>
Subject: Re: wchar_t implementation and multibyte encoding
In-Reply-To: <01c55989$Blat.v2.4$ebacf720@zahav.net.il> "from Eli Zaretskii at
 May 15, 2005 11:07:52 pm"
To: djgpp-workers AT delorie DOT com
Date: Thu, 19 May 2005 23:07:06 +0200 (CEST)
X-Mailer: ELM [version 2.4ME+ PL78 (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII
Reply-To: djgpp-workers AT delorie DOT com
Errors-To: nobody AT delorie DOT com
X-Mailing-List: djgpp-workers AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com
Precedence: bulk

According to Eli Zaretskii:
> > From: <ams AT ludd DOT ltu DOT se>
> > Date: Sat, 14 May 2005 05:00:39 +0200 (CEST)
> > 
> > Let's say we decide to encode Unicode in wchar_t, which is the only
> > sane choice today.
> 
> What exactly do you mean by ``encode Unicode in wchar_t''?  Do you
> mean we will store Unicode codepoints in there?  If so, it's a mistake
> to call this ``an encoding'', since encoding means you transform
> Unicode codepoints to some other form, like UTF-8 or cp1250.

I don't know the terminology. Or find it confusing. I mean we put
Unicode values in it, just like we put ASCII values in the type char.

I wrote the previous mail, thinking that wchar_t was int. Now I've
looked and found it to be unsigned short.

That's one thing that has to change.

> Alternatively, perhaps you meant UTF-16 or some such, which is indeed
> an encoding.  But then it's not fixed-size, which is generally
> inappropriate for wchar_t.

No. I mean Unicode encoding, which defines the range 0-0x10ffff.
Are you telling me that that isn't an encoding?

In your terminology, perhaps I want to say, "let's use Unicode
codepoints".

But logically (to me) _that_ _is_ a certain encoding. Weird
terminology.

> If you do mean we shall store Unicode codepoints, we should decide how
> wide will they be.

21 bits, which probably will mean 32 bits.

>  Currently, wchar_t is a 16-bit data type, which is
> enough only for the BMP.  Personally, I think that supporting the BMP
> is good enough for us, but if we decide otherwise, we will have to go

I disagree.

> for a wider type (an incompatible change).

Obviously, given the incorrect choice today.

Either we do this properly. Or we don't do it at all.

...
> > Now if those functions (isw*()) should return different results
> > depending on locale, the sizes explode. So I hope not.
> 
> Download the Unicode character database from the above Web site and
> look there.  I don't think the locale changes anything, but I might be
> wrong.

Well, if my locale is English, are Arabic or Chinese characters
printable? They might be, but I'm not sure that my hardware and/or OS
will be able to show them to me. So should iswprint() tell me 1 or 0?
I.e., do those functions report on the abstract character itself or on
what the system is actually able to display?

Hohum, writing that made me look up iswprint() in the standard. Then
my eyes fell upon iswpunct(), whose description says "... tests for any
printing wide character that is one of a locale-specific set of
punctuation wide characters...". So at least some of those functions
_are_ locale-dependent. Actually, almost the only one of those isw*()
functions whose description doesn't say "locale-specific" is iswprint()!

Sooooo, that seems to imply that not only will the tables and/or
functions be huge, there will be an endless number of them as well.

Or do we get away from locale-specificness by using Unicode?


Right,

						MartinS