Message-ID: <4D46EA2B.1010307@redhat.com>
Date: Mon, 31 Jan 2011 09:58:19 -0700
From: Eric Blake <eblake AT redhat DOT com>
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.13) Gecko/20101209 Fedora/3.1.7-0.35.b3pre.fc14 Lightning/1.0b3pre Mnenhy/0.8.3 Thunderbird/3.1.7
MIME-Version: 1.0
To: Bruno Haible <bruno AT clisp DOT org>
CC: bug-gnulib AT gnu DOT org, cygwin <cygwin AT cygwin DOT com>,
        bug-coreutils <bug-coreutils AT gnu DOT org>
Subject: Re: 16-bit wchar_t on Windows and Cygwin
References: <201101310304 DOT 42975 DOT bruno AT clisp DOT org>
In-Reply-To: <201101310304.42975.bruno@clisp.org>
OpenPGP: url=http://people.redhat.com/eblake/eblake.gpg
Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="------------enig94CF3FEB4BA742E2A08505A3"
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com

--------------enig94CF3FEB4BA742E2A08505A3
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

[adding cygwin and coreutils for a wc issue]

On 01/30/2011 07:04 PM, Bruno Haible wrote:
> Hi,
>=20
> It is known for a long time that on native Windows, the wchar_t[] encodin=
g on
> strings is UTF-16. [1] Now, Corinna Vinschen has confirmed that it is the=
 same
> for Cygwin >=3D 1.7. [2]

POSIX requires that 1 wchar_t corresponds to 1 character; so any use of
surrogates to get the full benefit of UTF-16 falls outside the bounds of
POSIX.  At which point, the POSIX definition of those functions no
longer apply, and we can (try) to make the various wc* functions try to
behave as smartly as possible (as is the case with Cygwin); where those
smarts are only needed when you use surrogate pairs.  If cygwin's
approach is correct, then maybe the thing to do is codify those smarts
for all implementations with 16-bit wchar_t as an extension to POSIX
that all gnulib clients can rely on, and thus minimize the #ifdefs in
such clients.

> What consequences does this have?
>=20
>   1) All code that uses the functions from <wctype.h> (wide character
>      classification and mapping) or wcwidth() malfunctions on strings that
>      contains Unicode characters outside the BMP, i.e. outside the range
>      U+0000..U+FFFF.

Not necessarily.  Such code falls outside of POSIX, but it may still be
a well-behaved extension if given sane behavior for how to deal with
surrogates.

>   2) Code that uses mbrtowc() or wcrtomb() is also likely to malfunction.
>      On Cygwin >=3D 1.7 mbrtowc() and wcrtomb() is implemented in an inte=
lligent
>      but somewhat surprising way: wcrtomb() may return 0, that is, produc=
e no
>      output bytes when it consumes a wchar_t.

>   Now with a chinese character outside the BMP:
>   $=20=09
>         1       4
>   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
>         3       6
>=20
>   On Cygwin 1.7.5 (with LANG=3DC.UTF-8 and 'wc' from GNU coreutils 8.5):
>=20
>   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
>         1       5
>   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
>         2       7
>
>   So both the number of characters and the number of words are counted
>   wrong as soon as non-BMP characters occur.
>

Does this represent a bug in cygwin's mbrtowc routines that could be
fixed by cygwin?

Or, does this represent a bug in coreutils for using mbrtowc one
character at a time instead of something like mbsrtowcs to do bulk
conversions?

And if we decide that cygwin's extensions are sane, how much harder is
it to characterize what a program must do to be portable to both 16-bit
and 32-bit wchar_t if they are guaranteed the same behavior for all
hosts of the same-size wchar_t?  In other words, would it really require
that many #ifdefs in coreutils to portably and simultaneously support
both sizes of wchar_t?

> I'm more in favour of overriding wchar_t and all functions that depend on=
 it -
> like we did successfully for the socket functions.
>=20
> In practice, this would mean that on Windows (both native Windows and
> Cygwin >=3D 1.7) the use of a 'wchar_t' module will
>   - override wchar_t to be 32 bits, like in glibc,
>   - cause functions from mbrtowc() to wcwidth() to be overridden. Since t=
he
>     corresponding system functions are unusable, the replacements will us=
e the
>     modules from libunistring (such as unictype/ctype-alnum and uniwidth/=
width).

That's a lot of overriding, for anything that uses wchar_t in its API,
and throws out a lot of what cygwin already provides.  It also means
that compiler primitives, like L"xyz", which result in 16-bit wchar_t
arrays, will be unusable with your 32-bit wchar_t override.  In other
words, I don't think it's a good idea to be doing that.

C1x will be adding compiler support for mandatory char16_t and char32_t
types for UTF-16 and UTF-32 data, independently of whether wchar_t is
16-bit or 32-bit; maybe the better thing is to proactively start
providing the new interfaces in <uchar.h> that will result from C1x
adoption (and convert GNU programs to use this rather than wchar_t for
character operations), although without compiler support for u"" and U""
(and even u8""), we are no better than ditching compiler support for L""
if you force a wchar_t size override.

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1516.pdf lists:

7.27 Unicode utilities <uchar.h>
1 The header <uchar.h> declares types and functions for manipulating Unicode
 characters.
2 The types declared are mbstate_t (described in 7.29.1) and size_t
(described in
 7.19);
char16_t
which is an unsigned integer type used for 16-bit characters and is the
same type as
uint_least16_t (described in 7.20.1.2); and
char32_t
which is an unsigned integer type used for 32-bit characters and is the
same type as
uint_least32_t (also described in 7.20.1.2).

mbrtoc16
c16rtomb
mbrtoc32
c32rtomb

but no variants for replacing wprintf and friends (convert to multibyte
and use printf and friends instead).

--=20
Eric Blake   eblake AT redhat DOT com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org


--------------enig94CF3FEB4BA742E2A08505A3
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Public key at http://people.redhat.com/eblake/eblake.gpg
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/

iQEcBAEBCAAGBQJNRuorAAoJEKeha0olJ0Nq75oH/RpS/V6+I5kdmDbm3JNIQeS5
SwN7b6/jhycI9Hs5y/MvjSfo0auhwstLyGPutmqtDTAnJ3TRjO/NDUshuBo3vDMg
6jLLzYwqKRAyEFMmSpLygON8UIgrAScJxb5gEmRwzW1m6Y4zZojfVDpO/qRmhXfJ
y+9rSgDhpU4ex3Pevg9IuGFHVNh11ClNEFm96cJjFYLK46zQXyGaY6UrZO6CkcYf
bVwzLD5nWx3btYi75XdBppPvx1hA9q6e291BrAgf6IU1zhq76TX9k9D9HZIu7FEh
bv8gDkYy/T5FCF4+qo2/TtOvAX3H9kbkwPUziH8lQ+fcbbt5euRvCbM/HjkfSN0=
=m8Gr
-----END PGP SIGNATURE-----

--------------enig94CF3FEB4BA742E2A08505A3--