Date: Sat, 29 Jan 2011 19:12:31 +0100
From: Corinna Vinschen <corinna-cygwin@cygwin.com>
To: cygwin@cygwin.com, bug-gnu-libiconv@gnu.org
Subject: Re: Bug in libiconv?
Message-ID: <20110129181231.GC1057@calimero.vinschen.de>
Reply-To: cygwin@cygwin.com, bug-gnu-libiconv@gnu.org
Mail-Followup-To: cygwin@cygwin.com, bug-gnu-libiconv@gnu.org
References: <201101282312.50298.bruno@clisp.org> <20110129123014.GA8671@calimero.vinschen.de> <4D442DDA.4050807@redhat.com> <20110129160157.GA1057@calimero.vinschen.de> <4D444CAC.2010300@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <4D444CAC.2010300@redhat.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Mailing-List: contact cygwin-help@cygwin.com; run by ezmlm
Precedence: bulk
Sender: cygwin-owner@cygwin.com
Mail-Followup-To: cygwin@cygwin.com

On Jan 29 10:21, Eric Blake wrote:
> On 01/29/2011 09:01 AM, Corinna Vinschen wrote:
> >> So, using UTF-16 surrogate encodings for characters outside the basic
> >> plane violates POSIX, but it's the best we can do for those characters.
> > 
> > Right, and we discussed this already on this list.  Or the developer
> > list, I don't remember.  Maybe we should have stuck to the base plane
> > and only use UCS-2 to be more POSIX compatible.
> 
> The burden is on the application, not on cygwin.  If the application
> wants POSIX behavior, then they obey __STDC_ISO_10646__ and use ONLY
> characters from the basic plane (no surrogates), at which point their
> use of wchar_t fits the POSIX definition (one wchar_t per character).
> The moment they pass a surrogate, they are no longer honoring the
> restriction documented by __STDC_ISO_10646__ so they are no longer under
> the rules of POSIX, and then cygwin can do whatever it wants (and in

Erm... hang on.  __STDC_ISO_10646__ and the POSIX requirement are two
different beasts.  I still think that __STDC_ISO_10646__ does not
restrict a 2-byte wchar_t to UCS-2.  Per the definition, UTF-16 is a
valid coded representation of characters from ISO/IEC 10646.
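
To illustrate what "coded representation" means here, the following is
a minimal sketch of the standard UTF-16 arithmetic, written in Python
purely for brevity.  It is not Cygwin or newlib code, and the function
name to_utf16_units is made up for this example.  It maps one ISO/IEC
10646 code point to either one 16-bit unit (basic plane) or a surrogate
pair (everything above U+FFFF).

```python
def to_utf16_units(code_point):
    """Return the 16-bit code units that encode one code point in UTF-16.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the ISO/IEC 10646 code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not characters")

    # Basic plane: one code unit, identical to UCS-2.
    if code_point < 0x10000:
        return [code_point]

    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 | (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 | (offset & 0x3FF)   # low (trailing) surrogate
    return [high, low]
```

So a 16-bit wchar_t can still represent every 10646 character this way;
what it cannot do is represent every character in a single wchar_t.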

So, to put it in your words, the moment applications pass a surrogate,
they are no longer under the rules of POSIX, but they still honor the
restriction documented by __STDC_ISO_10646__.

However, *usually* an application shouldn't really notice that a
surrogate has been used, at least as long as it only manipulates entire
strings.
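
A rough sketch of that distinction, again in Python for brevity and
with made-up helper names (utf16_units, count_code_points,
is_well_formed).  It assumes the string is held as a sequence of 16-bit
UTF-16 code units, as with a 2-byte wchar_t.  Whole-string operations
such as concatenation keep surrogate pairs intact, while per-unit
operations such as counting or reversing are where the pairs become
visible.

```python
def utf16_units(text):
    """Encode a Python string as a list of 16-bit UTF-16 code units."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def count_code_points(units):
    """Count characters by skipping low (trailing) surrogates."""
    return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)


def is_well_formed(units):
    """Check that every surrogate is part of a proper high/low pair."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate without a preceding high surrogate.
            return False
        else:
            i += 1
    return True


units = utf16_units("a\U0001F600b")   # 'a', U+1F600, 'b'

# Whole-string operation: concatenation stays well-formed.
joined = units + units

# Per-unit operation: reversing the units splits the pair.
reversed_units = units[::-1]
```

Here units has four elements but only three characters, joined is
still well-formed, and reversed_units is not, because the low surrogate
now precedes the high one.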

> this case, QoI demands that we honor surrogates to the best of our
> ability for full UTF-16 support, and you can have multi-wchar_t
> characters just as you already have multi-byte UTF-8 char characters).
> In other words, cygwin IS being POSIX-compliant by advertising only the
> Unicode 4.0 character set in the __STDC_ISO_10646__, while still
> supporting Unicode 5.2 (should we upgrade to Unicode 6.0?) as an
> extension when you no longer care about POSIX.
> 
> > However, the POSIX definition doesn't contradict what I said about the
> > definition of __STDC_ISO_10646__ as far as I'm concerned.
> 
> Yep - I think we're in violent agreement :)

Hmm, I'm not quite sure, see above.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat