MIME-Version: 1.0
Date: Tue, 28 Jul 2009 12:37:35 +0100
Message-ID: <416096c60907280437ie8febfme33c238431fa7da8@mail.gmail.com>
Subject: wchar_t width (was: bug in mbrtowc?)
From: Andy Koppe <andy DOT koppe AT gmail DOT com>
To: cygwin AT cygwin DOT com
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com

2009/7/28 Corinna Vinschen:
>> Trouble is, the hack will also only work correctly if the whole UTF-8
>> sequence for the non-BMP character is passed at once. If you pass the
>> bytes one-by-one instead, and assuming the bug above wasn't there,
>> you'd get this:
>
> Yes, I know.  The real trouble is, I don't know how that can be fixed
> in a still sort-of-POSIXy way.

The way I'd suggested is sort-of-POSIXy, but perhaps not enough,
because apps that check the mbrtowc() return code (and not the written
wc) against zero will interpret a low surrogate as string end. An
alternative might be to just return an error when there's no compliant
way to return the low surrogate. Do you think either of these is
worth pursuing?
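
To make the first concern concrete, here is a minimal sketch of the
kind of caller loop in question. The name count_wide_chars is made up
for illustration and not taken from any real program. It looks only at
the mbrtowc() return value, so a scheme that returns 0 while delivering
a low surrogate would make it stop as if it had hit the terminating
null:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: counts the wide characters decoded from the
   first n bytes of s.  Returns (size_t)-1 on an invalid or incomplete
   sequence. */
size_t count_wide_chars(const char *s, size_t n)
{
    mbstate_t st;
    size_t count = 0;

    assert(s != NULL);
    memset(&st, 0, sizeof st);      /* start in the initial shift state */

    while (n > 0) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, s, n, &st);

        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;      /* invalid or truncated sequence */
        if (r == 0)
            break;                  /* interpreted as: decoded L'\0' */

        s += r;
        n -= r;
        count++;
    }
    return count;
}
```

A caller that tested wc against L'\0' instead of testing r would not
be fooled, but code like the above is perfectly valid POSIX.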


> Typical POSIX code doesn't know about
> UTF-16 and expects the wchar returned to be complete.

Indeed. In fact glibc explicitly guarantees that wchar_t is 32 bits wide:

"for GNU systems wchar_t is always 32 bits wide and, therefore,
capable of representing all UCS-4 values and, therefore, covering all
of ISO 10646. Some Unix systems define wchar_t as a 16-bit type and
thereby follow Unicode very strictly. This definition is perfectly
fine with the standard, but it also means that to represent all
characters from Unicode and ISO 10646 one has to use UTF-16 surrogate
characters, which is in fact a multi-wide-character encoding. But
resorting to multi-wide-character encoding contradicts the purpose of
the wchar_t type."

Code written against that guarantee assumes one wchar_t per character,
which fails with Cygwin's 16-bit wchar_t, so complaints will increase
as non-BMP characters become more widespread. Windows itself of course
supports them reasonably well through UTF-16 surrogates. Another
possible issue is that with wchar_t being 32 bits wide and Unicode
only actually taking 21 bits, apps might be tempted to use the
remaining bits for private purposes.
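
For reference, the surrogate mechanism mentioned above can be sketched
as follows. The function name utf16_encode is hypothetical; the
arithmetic is the standard UTF-16 encoding, which also shows why code
points never need more than 21 bits (the maximum is 0x10FFFF):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-16 encoding of code point cp to
   out and returns the number of 16-bit units written (1 or 2), or 0
   if cp is not a valid Unicode scalar value. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != 0);

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range or a surrogate */

    if (cp < 0x10000) {                 /* BMP: one unit is enough */
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```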

Therefore I think long-term Cygwin's wchar_t will need to change to 32
bits for Linux compatibility. Of course that would require major,
ABI-breaking changes:

- Introduce a separate type for representing UTF-16, e.g. "vchar_t",
because 'v' is half a 'w' ;)
- Replace wchar_t with vchar_t throughout include/w32api
- Convert between wchar_t and vchar_t when calling Win32 functions
- Change internal Cygwin strings to vchar_t where that reduces the
number of necessary conversions
- Adapt Cygwin programs that directly invoke Win32 functions
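
As a rough sketch of what the conversion step might look like for
strings coming back from Win32, assuming a 16-bit vchar_t as proposed
above: wide32_t stands in for a future 32-bit wchar_t, and both the
type names and vchar_to_wide32 are hypothetical, not existing Cygwin
API. Replacing unpaired surrogates with U+FFFD is just one possible
policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t vchar_t;   /* hypothetical UTF-16 code unit type */
typedef uint32_t wide32_t;  /* stand-in for a 32-bit wchar_t */

/* Converts n UTF-16 units from src into code points in dst, which
   must have room for n entries.  Returns the number of code points
   written.  Unpaired surrogates become U+FFFD. */
size_t vchar_to_wide32(const vchar_t *src, size_t n, wide32_t *dst)
{
    size_t i = 0, out = 0;

    assert(src != NULL && dst != NULL);

    while (i < n) {
        uint32_t u = src[i++];

        if (u >= 0xD800 && u <= 0xDBFF && i < n &&
            src[i] >= 0xDC00 && src[i] <= 0xDFFF) {
            /* valid high/low pair: combine into one code point */
            u = 0x10000 + ((u - 0xD800) << 10) + (src[i++] - 0xDC00u);
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            u = 0xFFFD;                 /* unpaired surrogate */
        }
        dst[out++] = u;
    }
    return out;
}
```

The opposite direction, needed before calling into Win32, would have
to allocate up to two vchar_t per input character.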

Cygwin 2.1 anyone?

Andy
