Mail Archives: cygwin/2011/02/02/07:15:25

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Wed, 2 Feb 2011 13:14:42 +0100
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com, bug-gnulib AT gnu DOT org, bug-coreutils AT gnu DOT org
Subject: Re: 16-bit wchar_t on Windows and Cygwin
Message-ID: <20110202121442.GC2675@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com, bug-gnulib AT gnu DOT org, bug-coreutils AT gnu DOT org
Mail-Followup-To: cygwin AT cygwin DOT com, bug-gnulib AT gnu DOT org, bug-coreutils AT gnu DOT org
References: <201101310304 DOT 42975 DOT bruno AT clisp DOT org> <4D46EA2B DOT 1010307 AT redhat DOT com> <201102021229 DOT 04623 DOT bruno AT clisp DOT org>
MIME-Version: 1.0
In-Reply-To: <201102021229.04623.bruno@clisp.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On Feb  2 12:29, Bruno Haible wrote:
> Hello Eric,
> 
> > ... POSIX requires that 1 wchar_t corresponds to 1 character
> > ...
> > > What consequences does this have?
> > > 
> > >   1) All code that uses the functions from <wctype.h> (wide character
> > >      classification and mapping) or wcwidth() malfunctions on strings that
> > >      contain Unicode characters outside the BMP, i.e. outside the range
> > >      U+0000..U+FFFF.
> > 
> > Not necessarily.  Such code falls outside of POSIX, but it may still be
> > a well-behaved extension if given sane behavior for how to deal with
> > surrogates.
> 
> No. Code that uses <wctype.h> and wcwidth() is written precisely according
> to POSIX. The problem is that this code cannot work correctly when wchar_t[]
> is in UTF-16 encoding. There simply is no way to define these functions
> in a reasonable way for surrogates.
> 
> For example:
>   U+1031E = 0xD800 0xDF1E   is a letter (iswalpha should be true)
>   U+10320 = 0xD800 0xDF20   is not a letter (iswalpha should be false)
>   U+1D31E = 0xD834 0xDF1E   is not a letter (iswalpha should be false)
>   U+1D320 = 0xD834 0xDF20   is not a letter (iswalpha should be false)
>   U+1D71E = 0xD835 0xDF1E   is a letter (iswalpha should be true)
>   U+1D720 = 0xD835 0xDF20   is a letter (iswalpha should be true)
> There is no way that a system can provide this information through a
> function 'iswalpha' that takes a single wchar_t argument.

iswalpha takes wint_t, not wchar_t.  Since sizeof (wint_t) is 4 bytes on
Cygwin, the function can return the correct value, provided that the
application converts the UTF-16 surrogate pair to UTF-32 before calling
iswalpha.
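
A minimal sketch of that conversion, for illustration only.  The helper
names (is_high_surrogate, is_low_surrogate, combine_surrogates,
is_alpha_utf16) are hypothetical and not part of Cygwin or POSIX, and the
code assumes a platform where wint_t is wide enough to hold a full code
point.  Whether iswalpha then answers correctly for a given character
still depends on the C library and the current locale.

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helpers: classify a single UTF-16 code unit. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a valid surrogate pair into one UTF-32 code point. */
static uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    return 0x10000u + ((uint32_t)(hi - 0xD800u) << 10) + (uint32_t)(lo - 0xDC00u);
}

/*
 * Classify the character starting at s[0].  Stores the number of UTF-16
 * units consumed (1 or 2) in *used.  Assumes len >= 1 and that wint_t can
 * represent values up to 0x10FFFF.  An unpaired surrogate is passed
 * through unchanged, so the result for it is whatever iswalpha returns.
 */
static int is_alpha_utf16(const uint16_t *s, size_t len, size_t *used)
{
    uint32_t cp = s[0];
    *used = 1;
    if (is_high_surrogate(s[0]) && len >= 2 && is_low_surrogate(s[1])) {
        cp = combine_surrogates(s[0], s[1]);
        *used = 2;
    }
    return iswalpha((wint_t)cp) != 0;
}
```

With the pairs from the example above, combine_surrogates(0xD800, 0xDF1E)
yields 0x1031E and combine_surrogates(0xD835, 0xDF20) yields 0x1D720, so
iswalpha sees one distinct value per character rather than a shared low
surrogate.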

> We agree that it is a bug. And it is caused by
>   - the fact that Cygwin's wchar_t[] encoding is UTF-16, and
>   - there is no way to define the <wctype.h> POSIX functions sanely in this
>     setting, and

See above.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

