X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Sat, 29 Jan 2011 19:12:02 +0100
From: Corinna Vinschen
To: cygwin AT cygwin DOT com, bug-gnu-libiconv AT gnu DOT org
Subject: Re: Bug in libiconv?
Message-ID: <20110129181202.GA26611@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com, bug-gnu-libiconv AT gnu DOT org
Mail-Followup-To: cygwin AT cygwin DOT com, bug-gnu-libiconv AT gnu DOT org
References: <201101282312 DOT 50298 DOT bruno AT clisp DOT org> <20110129123014 DOT GA8671 AT calimero DOT vinschen DOT de> <4D442DDA DOT 4050807 AT redhat DOT com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <4D442DDA.4050807@redhat.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
Precedence: bulk
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

[Duplicate message to honor the missing CC of bug-gnu-libiconv AT gnu DOT org]

On Jan 29 08:10, Eric Blake wrote:
> On 01/29/2011 05:30 AM, Corinna Vinschen wrote:
> >> But when characters outside the basic plane, such as
> >> U+12345 (CUNEIFORM SIGN URU TIMES KI), are encoded by 2 consecutive wchar_t
> >> values, values of type wchar_t don't correspond to ISO/IEC 10646 characters.
> >> (Or maybe I'm underestimating what "coded representations" means...?)
> >
> > I don't read that from your above quote.  The core is that the *type*
> > wchar_t is a *coded* *representation* of the characters defined in
> > 10646.  At no point it says that a single wchar_t value must represent a
> > single character from 10646.  So I take it that UTF-16 is a valid, coded
> > representation of the characters from 10646.
>
> POSIX is clear that wchar_t must be wide enough so that 1 wchar_t is one
> character.
> Which limits a 2-byte wchar_t to just the Unicode basic
> plane.  There's nothing cygwin can do about this other than break LOTS
> of ABI to support a 4-byte wchar_t to supply all of Unicode.
>
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html#tag_06_03
>
> "All wide-character codes in a given process consist of an equal number
> of bits. This is in contrast to characters, which can consist of a
> variable number of bytes. The byte or byte sequence that represents a
> character can also be represented as a wide-character code.
> Wide-character codes thus provide a uniform size for manipulating text
> data."
>
> So, using UTF-16 surrogate encodings for characters outside the basic
> plane violates POSIX, but it's the best we can do for those characters.

Right, and we discussed this already on this list.  Or the developer
list, I don't remember.  Maybe we should have stuck to the base plane
and only used UCS-2 to be more POSIX compatible.  I have to admit that I
was more interested in getting all (or as much as possible) of Unicode
working than in following POSIX to the last word in this regard.  I also
wanted to make sure that East Asian users would get all of the
characters they use, and there *are* CJK ideographs in the 0x2xxxx plane.

However, the POSIX definition doesn't contradict what I said about the
definition of __STDC_ISO_10646__ as far as I'm concerned.

> Someday when gcc has better support for C+1x 16- and 32-bit characters
> (regardless of the sizing of wchar_t), then we can add all the new
> 32-bit character APIs that use Unicode unimpeded, without breaking
> existing ones that use wchar_t.

Yeah, that's what I'm waiting for as well.  But for the time being, I'm
confident that we have the best compromise possible at the time.
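
To make the surrogate encoding discussed above concrete, here is a
minimal sketch in C of how a code point outside the basic plane ends up
as two 16-bit units.  It is only an illustration, not code from Cygwin
or libiconv: the function name utf16_encode is made up for this example,
and it uses uint16_t rather than wchar_t so that it does not depend on
the platform's wchar_t width.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Cygwin or libiconv API).
 * Encodes one Unicode code point as UTF-16.
 * Returns the number of 16-bit units written to out (1 or 2),
 * or 0 if cp is not a valid Unicode scalar value. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    /* Reject values beyond U+10FFFF and the surrogate range itself. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    /* Basic plane: the code point fits in a single 16-bit unit. */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Supplementary planes: split the remaining 20 bits into
     * a high surrogate (top 10 bits) and a low surrogate (low 10 bits). */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
    return 2;
}
```

For U+12345 this yields the two units 0xD808 0xDF45, which is exactly the
case where one character no longer corresponds to one 16-bit value.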

Corinna

--
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple