Mail Archives: cygwin/2011/01/29/13:12:28

X-Recipient: archive-cygwin AT delorie DOT com

X-Spam-Check-By: sourceware.org

Date: Sat, 29 Jan 2011 19:12:02 +0100

From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>

To: cygwin AT cygwin DOT com, bug-gnu-libiconv AT gnu DOT org

Subject: Re: Bug in libiconv?

Message-ID: <20110129181202.GA26611@calimero.vinschen.de>

Reply-To: cygwin AT cygwin DOT com, bug-gnu-libiconv AT gnu DOT org

Mail-Followup-To: cygwin AT cygwin DOT com, bug-gnu-libiconv AT gnu DOT org

References: <201101282312 DOT 50298 DOT bruno AT clisp DOT org> <20110129123014 DOT GA8671 AT calimero DOT vinschen DOT de> <4D442DDA DOT 4050807 AT redhat DOT com>

MIME-Version: 1.0

In-Reply-To: <4D442DDA.4050807@redhat.com>

User-Agent: Mutt/1.5.21 (2010-09-15)

Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm

List-Id: <cygwin.cygwin.com>

List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>

List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>

List-Archive: <http://sourceware.org/ml/cygwin/>

List-Post: <mailto:cygwin AT cygwin DOT com>

List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>

Sender: cygwin-owner AT cygwin DOT com

Mail-Followup-To: cygwin AT cygwin DOT com

Delivered-To: mailing list cygwin AT cygwin DOT com

[Duplicate message to honor the missing CC of bug-gnu-libiconv AT gnu DOT org]

On Jan 29 08:10, Eric Blake wrote:
> On 01/29/2011 05:30 AM, Corinna Vinschen wrote:
> >> But when characters outside the basic plane, such as
> >> U+12345 (CUNEIFORM SIGN URU TIMES KI), are encoded by 2 consecutive wchar_t
> >> values, values of type wchar_t don't correspond to ISO/IEC 10646 characters.
> >> (Or maybe I'm underestimating what "coded representations" means...?)
> > 
> > I don't read that from your above quote.  The core is that the *type*
> > wchar_t is a *coded* *representation* of the characters defined in
> > 10646.  At no point it says that a single wchar_t value must represent a
> > single character from 10646.  So I take it that UTF-16 is a valid, coded
> > representation of the characters from 10646.
> 
> POSIX is clear that wchar_t must be wide enough so that 1 wchar_t is one
> character.  Which limits a 2-byte wchar_t to just the Unicode basic
> plane.  There's nothing cygwin can do about this other than break LOTS
> of ABI to support a 4-byte wchar_t to supply all of Unicode.
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html#tag_06_03
> 
> "All wide-character codes in a given process consist of an equal number
> of bits. This is in contrast to characters, which can consist of a
> variable number of bytes. The byte or byte sequence that represents a
> character can also be represented as a wide-character code.
> Wide-character codes thus provide a uniform size for manipulating text
> data."
> 
> So, using UTF-16 surrogate encodings for characters outside the basic
> plane violates POSIX, but it's the best we can do for those characters.

Right, and we discussed this already on this list.  Or the developer
list, I don't remember.  Maybe we should have stuck to the base plane
and only used UCS-2 to be more POSIX compatible.  I have to admit that
I was more interested in getting all (or as much as possible) of Unicode
working than in following POSIX to the last word in this regard.  And I
wanted to make sure that East Asian users would get all of the
characters they use, and there *are* CJK ideographs in the 0x2xxxx plane.
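
To make the trade-off concrete, here is a minimal sketch of how a code
point outside the base plane becomes two 16-bit units in UTF-16.  It is
written in Python purely for illustration; the helper name
to_utf16_units is made up for this example and is not part of Cygwin,
newlib, or libiconv.  The inputs are U+12345 from the quoted example and
U+20000, the first CJK ideograph in plane 2.

```python
def to_utf16_units(cp):
    """Return the UTF-16 code units (as a list of ints) for one code point.

    Hypothetical helper for illustration only.
    """
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp < 0x10000:
        # Base plane: one 16-bit unit, identical to UCS-2.
        return [cp]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    v = cp - 0x10000
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


for cp in (0x00E9, 0x12345, 0x20000):
    units = to_utf16_units(cp)
    # Cross-check against Python's built-in UTF-16 encoder.
    encoded = chr(cp).encode("utf-16-be")
    assert encoded == b"".join(u.to_bytes(2, "big") for u in units)
    print("U+%04X -> %s" % (cp, " ".join("%04X" % u for u in units)))
```

With a 2-byte wchar_t, U+12345 therefore occupies two wchar_t values
(D808 DF45) and U+20000 occupies D840 DC00, while anything in the base
plane still fits in one.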

However, the POSIX definition doesn't contradict what I said about the
definition of __STDC_ISO_10646__ as far as I'm concerned.

> Someday when gcc has better support for C+1x 16- and 32-bit characters
> (regardless of the sizing of wchar_t), then we can add all the new
> 32-bit character APIs that use Unicode unimpeded, without breaking
> existing ones that use wchar_t.

Yeah, that's what I'm waiting for as well.  But for the time being,
I'm confident that we have the best compromise possible at the time.
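
Until such 32-bit APIs exist, code that needs whole code points from a
string of 16-bit wchar_t values has to pair up surrogates itself.  A
rough sketch of that step, again in Python for illustration only: the
function name is hypothetical, and passing unpaired surrogates through
unchanged is just one possible policy, not what any particular library
does.

```python
def utf16_units_to_code_points(units):
    """Combine UTF-16 code units (ints) into Unicode code points.

    Hypothetical helper; unpaired surrogates are passed through as-is.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Valid pair: rebuild the 20-bit offset and add 0x10000.
            low = units[i + 1]
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out


# 'A' followed by the surrogate pair for U+12345.
print([hex(cp) for cp in utf16_units_to_code_points([0x41, 0xD808, 0xDF45])])
```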


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
