Date: Mon, 31 Jan 2011 19:22:10 +0100
From: Corinna Vinschen
To: cygwin AT cygwin DOT com, bug-gnulib AT gnu DOT org, bug-coreutils AT gnu DOT org
Subject: Re: 16-bit wchar_t on Windows and Cygwin

On Jan 31 09:58, Eric Blake wrote:
> > 2) Code that uses mbrtowc() or wcrtomb() is also likely to malfunction.
> > On Cygwin >= 1.7 mbrtowc() and wcrtomb() is implemented in an intelligent
> > but somewhat surprising way: wcrtomb() may return 0, that is, produce no
> > output bytes when it consumes a wchar_t.
> >
> > Now with a chinese character outside the BMP:
> > $
> > 1 4
> > $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> > 3 6
> >
> > On Cygwin 1.7.5 (with LANG=C.UTF-8 and 'wc' from GNU coreutils 8.5):
> >
> > $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
> > 1 5
> > $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> > 2 7
> >
> > So both the number of characters and the number of words are counted
> > wrong as soon as non-BMP characters occur.
> >
>
> Does this represent a bug in cygwin's mbrtowc routines that could be
> fixed by cygwin?
>
> Or, does this represent a bug in coreutils for using mbrtowc one
> character at a time instead of something like mbsrtowcs to do bulk
> conversions?

Just to clarify a bit. This was discussed on the cygwin-developers
mailing list back in 2009.

The original code which handled UTF-16 surrogates always wrote at least
1 byte to the destination UTF-8 string. However, the problem is that
Windows filenames may contain lone (unpaired) surrogates, even though
the filename is usually interpreted as UTF-16. So the current code
returns 0 bytes for the first surrogate half and only writes the full
UTF-8 sequence after the second surrogate half has been evaluated. In
the case where a high surrogate is still pending, but the low surrogate
is missing, we can just write out the high surrogate in CESU-8 encoding.
This would not have been possible if we had already written the first
byte of the UTF-8 string. Lone low surrogates are written as a CESU-8
sequence immediately, so they are nothing to worry about.

As for wctomb/wcrtomb returning 0: Even if this looks like kind of a
stretch, this should not be a problem per POSIX. A return value of 0
from wctomb/wcrtomb has no special meaning(*). Even in the case where
the incoming wide char is L'\0', the resulting \0 is written and 1 is
returned. Since 0 bytes have been written to the destination string,
returning 0 is perfectly valid. If a calling function misinterprets the
return value of 0 as an error or EOF, it's not a bug in wctomb/wcrtomb.

For the original discussion, see
http://cygwin.com/ml/cygwin-developers/2009-09/msg00065.html


Corinna


(*) http://pubs.opengroup.org/onlinepubs/9699919799/functions/wcrtomb.html

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
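
The surrogate handling described in this message can be sketched roughly
as follows. This is only an illustration under stated assumptions, not
Cygwin's actual code: the names u16_state, put_bmp and u16_to_utf8 are
hypothetical, and the sketch takes explicit UTF-16 code units instead of
going through wcrtomb() and mbstate_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state: a pending high surrogate, or 0 if none. */
typedef struct { uint16_t pending_high; } u16_state;

/* Write a value below 0x10000 as 1 to 3 bytes. A surrogate value passed
   here comes out as its 3-byte CESU-8 form. */
static size_t put_bmp(unsigned char *out, uint32_t cp)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Convert one UTF-16 code unit. 'out' needs room for 6 bytes (a flushed
   lone high surrogate plus the current unit). Returns the number of bytes
   written; 0 means a high surrogate was stored and nothing was emitted. */
size_t u16_to_utf8(unsigned char *out, uint16_t wc, u16_state *st)
{
    size_t n = 0;

    assert(out != NULL && st != NULL);

    if (st->pending_high != 0) {
        uint16_t hi = st->pending_high;
        st->pending_high = 0;
        if (wc >= 0xDC00 && wc <= 0xDFFF) {
            /* Pair complete: emit the full 4-byte UTF-8 sequence. */
            uint32_t cp = 0x10000
                + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(wc - 0xDC00));
            out[0] = (unsigned char)(0xF0 | (cp >> 18));
            out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
        /* Low surrogate missing: flush the lone high surrogate as CESU-8.
           This works only because no byte was emitted for it earlier. */
        n = put_bmp(out, hi);
    }

    if (wc >= 0xD800 && wc <= 0xDBFF) {
        /* High surrogate: remember it, emit nothing for it yet. */
        st->pending_high = wc;
        return n;
    }

    /* BMP character or lone low surrogate: written immediately. */
    return n + put_bmp(out + n, wc);
}
```

A caller of such a function has to treat a return value of 0 as "nothing
to copy yet" and keep going, rather than as an error or end of input.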