Date: Wed, 2 Feb 2011 13:21:02 +0100
From: Corinna Vinschen
To: cygwin AT cygwin DOT com, bug-gnulib AT gnu DOT org, bug-coreutils AT gnu DOT org
Subject: Re: 16-bit wchar_t on Windows and Cygwin
Message-ID: <20110202122102.GD2675@calimero.vinschen.de>
In-Reply-To: <20110202121442.GC2675@calimero.vinschen.de>

On Feb  2 13:14, Corinna Vinschen wrote:
> On Feb  2 12:29, Bruno Haible wrote:
> > Hello Eric,
> >
> > > ... POSIX requires that 1 wchar_t corresponds to 1 character
> > > ...
> > > > What consequences does this have?
> > > >
> > > > 1) All code that uses the functions from <wctype.h> (wide character
> > > >    classification and mapping) or wcwidth() malfunctions on strings that
> > > >    contain Unicode characters outside the BMP, i.e. outside the range
> > > >    U+0000..U+FFFF.
> > >
> > > Not necessarily.  Such code falls outside of POSIX, but it may still be
> > > a well-behaved extension if given sane behavior for how to deal with
> > > surrogates.
> >
> > No. Code that uses <wctype.h> and wcwidth() is written precisely according
> > to POSIX.
> > The problem is that this code cannot work correctly when wchar_t[]
> > is in UTF-16 encoding. There simply is no way to define these functions
> > in a reasonable way for surrogates.
> >
> > For example:
> >   U+1031E = 0xD800 0xDF1E is a letter (iswalpha should be true)
> >   U+10320 = 0xD800 0xDF20 is not a letter (iswalpha should be false)
> >   U+1D31E = 0xD834 0xDF1E is not a letter (iswalpha should be false)
> >   U+1D320 = 0xD834 0xDF20 is not a letter (iswalpha should be false)
> >   U+1D71E = 0xD835 0xDF1E is a letter (iswalpha should be true)
> >   U+1D720 = 0xD835 0xDF20 is a letter (iswalpha should be true)
> > There is no way that a system can provide this information through a
> > function 'iswalpha' that takes a single wchar_t argument.
>
> iswalpha takes wint_t, not wchar_t.  Since sizeof (wint_t) is 4 bytes,
> the function can return the correct value, provided that the application
> converts the UTF-16 surrogate pair to UTF-32 before calling iswalpha.

And, please note the wording in SUSv4, for instance in
http://calimero.vinschen.de/susv4/functions/iswalpha.html

  The wc argument is a wint_t, the value of which the application shall
                       ^^^^^^                         ^^^^^^^^^^^
  ensure is a wide-character code corresponding to a valid character in
  the current locale, or equal to the value of the macro WEOF. If the
  argument has any other value, the behavior is undefined.

I don't see any words in that which would disallow converting UTF-16
wchar_t surrogate pairs to a wint_t UTF-32 value before calling one of
the wctype functions.  Just like you have to be careful not to call the
ctype functions with a signed char.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
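
For illustration, the conversion described above (combine a UTF-16
surrogate pair into one UTF-32 value, then pass that value to iswalpha
as a wint_t) could look roughly like the sketch below.  The helper
names is_high_surrogate, is_low_surrogate, surrogates_to_utf32 and
utf16_isalpha are hypothetical; they are not part of Cygwin, newlib or
gnulib.  The sketch assumes that wint_t is at least 32 bits wide, as
stated above for Cygwin, and it uses uint16_t for the UTF-16 code units
so that it also compiles on systems where wchar_t is 32 bits.

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helpers for illustration only. */

/* True if u is the first (high) half of a surrogate pair. */
int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

/* True if u is the second (low) half of a surrogate pair. */
int is_low_surrogate(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a valid high/low surrogate pair into a UTF-32 code point. */
uint32_t surrogates_to_utf32(uint16_t hi, uint16_t lo)
{
    return 0x10000u + (((uint32_t)hi - 0xD800u) << 10)
                    + ((uint32_t)lo - 0xDC00u);
}

/* Classify the character that starts at s[0] in a UTF-16 buffer of
   len code units.  *used receives the number of code units consumed
   (0 for an empty buffer, otherwise 1 or 2).
   Assumption: wint_t can hold any value up to 0x10FFFF. */
int utf16_isalpha(const uint16_t *s, size_t len, size_t *used)
{
    if (len == 0) {
        *used = 0;
        return 0;
    }
    if (len >= 2 && is_high_surrogate(s[0]) && is_low_surrogate(s[1])) {
        /* Supplementary character: convert the pair before the call. */
        *used = 2;
        return iswalpha((wint_t)surrogates_to_utf32(s[0], s[1])) != 0;
    }
    *used = 1;
    if (is_high_surrogate(s[0]) || is_low_surrogate(s[0]))
        return 0;   /* unpaired surrogate: not a valid character */
    return iswalpha((wint_t)s[0]) != 0;
}
```

With the pairs from the example above, surrogates_to_utf32 (0xD800,
0xDF1E) yields 0x1031E and surrogates_to_utf32 (0xD835, 0xDF20) yields
0x1D720.  Whether iswalpha then reports a letter depends on the current
locale and on the character tables of the C library in use.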