Message-ID: <4D46EA2B.1010307@redhat.com>
Date: Mon, 31 Jan 2011 09:58:19 -0700
From: Eric Blake
To: Bruno Haible
CC: bug-gnulib AT gnu DOT org, cygwin, bug-coreutils
Subject: Re: 16-bit wchar_t on Windows and Cygwin
In-Reply-To: <201101310304.42975.bruno@clisp.org>

[adding cygwin and coreutils for a wc issue]

On 01/30/2011 07:04 PM, Bruno Haible wrote:
> Hi,
>
> It is known for a long time that on native Windows, the wchar_t[] encoding on
> strings is UTF-16. [1] Now, Corinna Vinschen has confirmed that it is the same
> for Cygwin >= 1.7. [2]

POSIX requires that 1 wchar_t corresponds to 1 character; so any use of
surrogates to get the full benefit of UTF-16 falls outside the bounds of
POSIX.
At which point, the POSIX definitions of those functions no longer
apply, and we can try to make the various wc* functions behave as
smartly as possible (as is the case with Cygwin), where those smarts are
only needed when you use surrogate pairs. If cygwin's approach is
correct, then maybe the thing to do is codify those smarts for all
implementations with 16-bit wchar_t as an extension to POSIX that all
gnulib clients can rely on, and thus minimize the #ifdefs in such
clients.

> What consequences does this have?
>
> 1) All code that uses the functions from (wide character
>    classification and mapping) or wcwidth() malfunctions on strings that
>    contains Unicode characters outside the BMP, i.e. outside the range
>    U+0000..U+FFFF.

Not necessarily. Such code falls outside of POSIX, but it may still be a
well-behaved extension if given sane behavior for how to deal with
surrogates.

> 2) Code that uses mbrtowc() or wcrtomb() is also likely to malfunction.
>    On Cygwin >= 1.7 mbrtowc() and wcrtomb() is implemented in an intelligent
>    but somewhat surprising way: wcrtomb() may return 0, that is, produce no
>    output bytes when it consumes a wchar_t.
> Now with a chinese character outside the BMP:
> $
> 1 4
> $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> 3 6
>
> On Cygwin 1.7.5 (with LANG=C.UTF-8 and 'wc' from GNU coreutils 8.5):
>
> $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
> 1 5
> $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> 2 7
>
> So both the number of characters and the number of words are counted
> wrong as soon as non-BMP characters occur.
>

Does this represent a bug in cygwin's mbrtowc routines that could be
fixed by cygwin? Or, does this represent a bug in coreutils for using
mbrtowc one character at a time instead of something like mbsrtowcs to
do bulk conversions?
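
To make the unit-versus-character distinction concrete, here is a
minimal sketch of counting characters in a wchar_t array while pairing
up surrogates. It is not taken from cygwin or coreutils; the helper
names next_code_point and count_code_points are hypothetical, and the
policy of passing a lone surrogate through as one unit is an assumption.
Surrogate values never occur in valid UTF-32 data, so the same code
should behave identically when wchar_t is 32 bits wide, with no #ifdef.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* UTF-16 surrogate ranges; these are fixed by Unicode, not by any libc. */
static int is_high_surrogate(uint32_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint32_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/*
 * Hypothetical helper: decode one code point from the n units at s.
 * Stores the code point in *cp and returns the number of wchar_t units
 * consumed (1 or 2).  A lone surrogate is passed through as one unit
 * (an assumption; a stricter caller might reject it instead).
 */
size_t next_code_point(const wchar_t *s, size_t n, uint32_t *cp)
{
    assert(s != NULL && cp != NULL && n > 0);
    uint32_t hi = (uint32_t)s[0];
    if (n >= 2 && is_high_surrogate(hi)) {
        uint32_t lo = (uint32_t)s[1];
        if (is_low_surrogate(lo)) {
            /* Combine the pair into a code point above U+FFFF. */
            *cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
            return 2;
        }
    }
    *cp = hi;
    return 1;
}

/* Count code points, not wchar_t units, in s[0..n). */
size_t count_code_points(const wchar_t *s, size_t n)
{
    size_t count = 0;
    uint32_t cp;
    for (size_t i = 0; i < n; i += next_code_point(s + i, n - i, &cp))
        count++;
    return count;
}
```

For the five units 'a', 0xD844, 0xDE34, 'b', '\n' (the UTF-16 form of
the 'a\xf0\xa1\x88\xb4b\n' input quoted above), count_code_points
returns 4, and the middle pair decodes to U+21234.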
And if we decide that cygwin's extensions are sane, how much harder is
it to characterize what a program must do to be portable to both 16-bit
and 32-bit wchar_t if it is guaranteed the same behavior for all hosts
of the same-size wchar_t? In other words, would it really require that
many #ifdefs in coreutils to portably and simultaneously support both
sizes of wchar_t?

> I'm more in favour of overriding wchar_t and all functions that depend on it -
> like we did successfully for the socket functions.
>
> In practice, this would mean that on Windows (both native Windows and
> Cygwin >= 1.7) the use of a 'wchar_t' module will
>   - override wchar_t to be 32 bits, like in glibc,
>   - cause functions from mbrtowc() to wcwidth() to be overridden. Since the
>     corresponding system functions are unusable, the replacements will use the
>     modules from libunistring (such as unictype/ctype-alnum and uniwidth/width).

That's a lot of overriding, for anything that uses wchar_t in its API,
and throws out a lot of what cygwin already provides. It also means
that compiler primitives, like L"xyz", which result in 16-bit wchar_t
arrays, will be unusable with your 32-bit wchar_t override. In other
words, I don't think it's a good idea to be doing that.

C1x will be adding compiler support for mandatory char16_t and char32_t
types for UTF-16 and UTF-32 data, independently of whether wchar_t is
16-bit or 32-bit; maybe the better thing is to proactively start
providing the new interfaces in that will result from C1x adoption (and
convert GNU programs to use this rather than wchar_t for character
operations), although without compiler support for u"" and U"" (and
even u8""), we are no better than ditching compiler support for L"" if
you force a wchar_t size override.

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1516.pdf lists:

7.27 Unicode utilities
1 The header declares types and functions for manipulating Unicode
characters.
2 The types declared are mbstate_t (described in 7.29.1) and size_t
(described in 7.19);
char16_t
which is an unsigned integer type used for 16-bit characters and is the
same type as uint_least16_t (described in 7.20.1.2); and
char32_t
which is an unsigned integer type used for 32-bit characters and is the
same type as uint_least32_t (also described in 7.20.1.2).

mbrtoc16
c16rtomb
mbrtoc32
c32rtomb

but no variants for replacing wprintf and friends (convert to multibyte
and use printf and friends instead).

-- 
Eric Blake   eblake AT redhat DOT com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org
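
As an illustration of how the two types in the N1516 excerpt relate,
here is a minimal sketch that turns one 32-bit character value into one
or two 16-bit units. It is not one of the C1x functions: mbrtoc16,
c16rtomb, mbrtoc32 and c32rtomb all convert to or from the locale's
multibyte encoding. The helper name encode_utf16 is hypothetical. The
sketch uses uint_least16_t and uint_least32_t from <stdint.h>, which the
excerpt says are the same types as char16_t and char32_t, so it does not
assume the new header is available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode code point cp as UTF-16 into out[0..1].
 * Returns the number of 16-bit units written (1 or 2), or 0 if cp is
 * not a valid Unicode scalar value (a surrogate, or above U+10FFFF).
 */
size_t encode_utf16(uint_least32_t cp, uint_least16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        /* BMP character: one unit, same value. */
        out[0] = (uint_least16_t)cp;
        return 1;
    }
    /* Non-BMP character: split the 20-bit offset into a surrogate pair. */
    cp -= 0x10000;
    out[0] = (uint_least16_t)(0xD800 + (cp >> 10));
    out[1] = (uint_least16_t)(0xDC00 + (cp & 0x3FF));
    return 2;
}
```

For U+21234, the character behind the bytes \xf0\xa1\x88\xb4 in the wc
examples, this writes the two units 0xD844 and 0xDE34, which is why a
16-bit wchar_t cannot hold it in a single element.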