Mail Archives: cygwin/2011/01/31/13:22:35

X-Recipient: archive-cygwin AT delorie DOT com

X-Spam-Check-By: sourceware.org

Date: Mon, 31 Jan 2011 19:22:10 +0100

From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>

To: cygwin AT cygwin DOT com, bug-gnulib AT gnu DOT org, bug-coreutils AT gnu DOT org

Subject: Re: 16-bit wchar_t on Windows and Cygwin

Message-ID: <20110131182210.GL1057@calimero.vinschen.de>

Reply-To: cygwin AT cygwin DOT com, bug-gnulib AT gnu DOT org, bug-coreutils AT gnu DOT org

Mail-Followup-To: cygwin AT cygwin DOT com, bug-gnulib AT gnu DOT org, bug-coreutils AT gnu DOT org

References: <201101310304 DOT 42975 DOT bruno AT clisp DOT org> <4D46EA2B DOT 1010307 AT redhat DOT com>

MIME-Version: 1.0

In-Reply-To: <4D46EA2B.1010307@redhat.com>

User-Agent: Mutt/1.5.21 (2010-09-15)

Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm

List-Id: <cygwin.cygwin.com>

List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>

List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>

List-Archive: <http://sourceware.org/ml/cygwin/>

List-Post: <mailto:cygwin AT cygwin DOT com>

List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>

Sender: cygwin-owner AT cygwin DOT com

Mail-Followup-To: cygwin AT cygwin DOT com

Delivered-To: mailing list cygwin AT cygwin DOT com

On Jan 31 09:58, Eric Blake wrote:
> >   2) Code that uses mbrtowc() or wcrtomb() is also likely to malfunction.
> >      On Cygwin >= 1.7 mbrtowc() and wcrtomb() are implemented in an intelligent
> >      but somewhat surprising way: wcrtomb() may return 0, that is, produce no
> >      output bytes when it consumes a wchar_t.
> 
> >   Now with a Chinese character outside the BMP:
> >   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
> >         1       4
> >   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> >         3       6
> > 
> >   On Cygwin 1.7.5 (with LANG=C.UTF-8 and 'wc' from GNU coreutils 8.5):
> > 
> >   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
> >         1       5
> >   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> >         2       7
> >
> >   So both the number of characters and the number of words are counted
> >   wrong as soon as non-BMP characters occur.
> >
> 
> Does this represent a bug in cygwin's mbrtowc routines that could be
> fixed by cygwin?
> 
> Or, does this represent a bug in coreutils for using mbrtowc one
> character at a time instead of something like mbsrtowcs to do bulk
> conversions?

Just to clarify a bit.  This has been discussed on the cygwin-developer
mailing list back in 2009.  The original code which handled UTF-16
surrogates always wrote at least 1 byte to the destination UTF-8 string.
However, the problem is that Windows filenames may contain lone
(unpaired) surrogates, even though the filename is usually interpreted
as UTF-16.

So the current code returns 0 bytes for the first surrogate half and
only writes the full UTF-8 sequence after the second surrogate half has
been evaluated.  In the case where a lone high surrogate is still
pending, but the low surrogate is missing, we can just write out the
high surrogate in CESU-8 encoding.  This would not have been possible if
we had already written the first byte of the UTF-8 string.  Lone low
surrogates are written as a CESU-8 sequence immediately, so they are
nothing to worry about.
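
The deferral described above can be sketched as a small state machine.
This is an illustration only, not Cygwin's actual code: the names
u16_state, u16_feed and u16_flush are hypothetical, and it assumes the
caller provides a 6-byte output buffer so that a pending lone high
surrogate and the following character can be emitted together.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state: a buffered high surrogate, or 0 if none. */
typedef struct { uint16_t pending_high; } u16_state;

static int is_high(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Encode a code point as 1..4 bytes.  A surrogate value passed in here
   comes out as its 3-byte (CESU-8 style) form. */
static size_t put_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Feed one UTF-16 unit.  Returns the number of bytes written to out
   (room for 6 bytes required).  0 means the unit was a high surrogate
   and has only been buffered. */
static size_t u16_feed(u16_state *st, uint16_t unit, unsigned char *out)
{
    size_t n = 0;

    assert(st != NULL && out != NULL);
    if (st->pending_high) {
        uint16_t hi = st->pending_high;
        st->pending_high = 0;
        if (is_low(unit)) {
            /* Valid pair: emit the complete 4-byte UTF-8 sequence. */
            uint32_t cp = 0x10000 + (((uint32_t)hi - 0xD800) << 10)
                          + ((uint32_t)unit - 0xDC00);
            return put_utf8(cp, out);
        }
        /* Low half is missing: emit the lone high surrogate first. */
        n = put_utf8(hi, out);
    }
    if (is_high(unit)) {
        st->pending_high = unit;   /* defer until the next unit arrives */
        return n;
    }
    /* BMP character or lone low surrogate: written immediately. */
    return n + put_utf8(unit, out + n);
}

/* At end of input, emit a high surrogate that is still pending. */
static size_t u16_flush(u16_state *st, unsigned char *out)
{
    size_t n = 0;

    if (st->pending_high) {
        n = put_utf8(st->pending_high, out);
        st->pending_high = 0;
    }
    return n;
}
```

Feeding 0xD844 and then 0xDE34 (the UTF-16 form of the character used
in the wc example quoted above) yields 0 bytes for the first call and
the four bytes f0 a1 88 b4 for the second.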

As for wctomb/wcrtomb returning 0:  Even if this looks like a bit of a
stretch, this should not be a problem per POSIX.  A return value of 0
from wctomb/wcrtomb has no special meaning(*).  Even in the case where
the incoming wide char is L'\0', the resulting \0 is written and 1 is
returned.  Since 0 bytes have been written to the destination string,
returning 0 is perfectly valid.  If a calling function misinterprets the
return value of 0 as an error or EOF, it's not a bug in wctomb/wcrtomb.
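
For callers, a loop along the following lines copes with a 0 return.
It is a minimal sketch using only standard C; the helper name
convert_wide and its buffer-size convention are assumptions made for
this example, and it does not attempt to deal with conversion state
that is still pending after the last input character.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert n wide characters from ws into out
   (capacity cap bytes).  Returns the total number of bytes written,
   or (size_t)-1 on an encoding error or if out is too small. */
static size_t convert_wide(const wchar_t *ws, size_t n, char *out, size_t cap)
{
    mbstate_t st;
    char tmp[MB_LEN_MAX];
    size_t total = 0;
    size_t i;

    assert(ws != NULL && out != NULL);
    memset(&st, 0, sizeof st);          /* initial conversion state */
    for (i = 0; i < n; i++) {
        size_t r = wcrtomb(tmp, ws[i], &st);
        if (r == (size_t)-1)
            return (size_t)-1;          /* the only error indication */
        if (r > cap - total)
            return (size_t)-1;          /* output buffer too small */
        /* r == 0 is neither an error nor EOF: the wide character was
           consumed and simply produced no output bytes yet. */
        memcpy(out + total, tmp, r);
        total += r;
    }
    return total;
}
```

The point is that the loop tests only for (size_t)-1 and otherwise adds
whatever count comes back, so a wchar_t that produces no bytes does not
end the conversion early.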

For the original discussion, see
http://cygwin.com/ml/cygwin-developers/2009-09/msg00065.html

Corinna

(*) http://pubs.opengroup.org/onlinepubs/9699919799/functions/wcrtomb.html

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
