Date: Sat, 29 Jan 2011 19:12:31 +0100
From: Corinna Vinschen
To: cygwin AT cygwin DOT com, bug-gnu-libiconv AT gnu DOT org
Subject: Re: Bug in libiconv?
Message-ID: <20110129181231.GC1057@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com, bug-gnu-libiconv AT gnu DOT org
References: <201101282312 DOT 50298 DOT bruno AT clisp DOT org>
    <20110129123014 DOT GA8671 AT calimero DOT vinschen DOT de>
    <4D442DDA DOT 4050807 AT redhat DOT com>
    <20110129160157 DOT GA1057 AT calimero DOT vinschen DOT de>
    <4D444CAC DOT 2010300 AT redhat DOT com>
In-Reply-To: <4D444CAC.2010300@redhat.com>

On Jan 29 10:21, Eric Blake wrote:
> On 01/29/2011 09:01 AM, Corinna Vinschen wrote:
> >> So, using UTF-16 surrogate encodings for characters outside the basic
> >> plane violates POSIX, but it's the best we can do for those characters.
> >
> > Right, and we discussed this already on this list.  Or the developer
> > list, I don't remember.  Maybe we should have stick to the base plane
> > and only use UCS-2 to be more POSIX compatible.
>
> The burden is on the application, not on cygwin.  If the application
> wants POSIX behavior, then they obey __STDC_ISO_10646__ and use ONLY
> characters from the basic plane (no surrogates), at which point their
> use of wchar_t fits the POSIX definition (one wchar_t per character).
> The moment they pass a surrogate, they are no longer honoring the
> restriction documented by __STDC_ISO_10646__ so they are no longer under
> the rules of POSIX, and then cygwin can do whatever it wants (and in

Erm... hang on.  __STDC_ISO_10646__ and the POSIX requirement are two
different beasts.  I still think that __STDC_ISO_10646__ does not
restrict a 2 byte wchar_t to UCS-2.  Per the definition UTF-16 is a
valid coded representation of characters from ISO/IEC 10646.

So, to say it with your words, the moment applications pass a surrogate,
they are no longer under the rules of POSIX, but they still honor the
restriction documented by __STDC_ISO_10646__.

However, *usually* an application shouldn't really notice that a
surrogate has been used, at least as long as they only manipulate entire
strings.

> this case, QoI demands that we honor surrogates to the best of our
> ability for full UTF-16 support, and you can have multi-wchar_t
> characters just as you already have multi-byte UTF-8 char characters).
> In other words, cygwin IS being POSIX-compliant by advertising only the
> Unicode 4.0 character set in the __STDC_ISO_10646__, while still
> supporting Unicode 5.2 (should we upgrade to Unicode 6.0?) as an
> extension when you no longer care about POSIX.
>
> > However, the POSIX definition doesn't contradict what I said about the
> > definition of __STDC_ISO_10646__ as far as I'm concerned.
>
> Yep - I think we're in violent agreement :)

Hmm, I'm not quite sure, see above.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
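
For illustration only, here is a minimal sketch of the "one wchar_t per
character" point discussed above.  It assumes 16-bit code units as with
Cygwin's 2 byte wchar_t, but uses uint16_t so that it also behaves the same
where wchar_t is wider.  The names unit16, count_units and count_chars are
hypothetical and not part of Cygwin, newlib, or libiconv.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 2 byte wchar_t; not a Cygwin type. */
typedef uint16_t unit16;

static int is_high_surrogate(unit16 u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(unit16 u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Number of 16-bit units before the terminating 0, i.e. what a
   wcslen-style function reports for a 2 byte wchar_t string. */
size_t count_units(const unit16 *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Number of characters, counting a well-formed surrogate pair as one.
   An unpaired surrogate is counted as one character of its own. */
size_t count_chars(const unit16 *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (*s != 0) {
        /* s[1] is safe to read: s[0] is non-zero, so at worst s[1]
           is the terminator. */
        if (is_high_surrogate(s[0]) && is_low_surrogate(s[1]))
            s += 2;
        else
            s += 1;
        n++;
    }
    return n;
}
```

For a string of basic plane characters both functions return the same
number, which is the case where one unit per character holds.  For "A"
followed by U+1D11E (units 0x0041 0xD834 0xDD1E) the unit count is 3 while
the character count is 2.  Code that only copies or compares entire strings
never sees the difference.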