Date: Tue, 28 Jul 2009 12:37:35 +0100
Message-ID: <416096c60907280437ie8febfme33c238431fa7da8@mail.gmail.com>
Subject: wchar_t width (was: bug in mbrtowc?)
From: Andy Koppe
To: cygwin AT cygwin DOT com

2009/7/28 Corinna Vinschen:
>> Trouble is, the hack will also only work correctly if the whole UTF-8
>> sequence for the non-BMP character is passed at once. If you pass the
>> bytes one-by-one instead, and assuming the bug above wasn't there,
>> you'd get this:
>
> Yes, I know.  The real trouble is, I don't know how that can be fixed
> in a still sort-of-POSIXy way.

The way I'd suggested is sort-of-POSIXy, but perhaps not enough,
because apps that check the mbrtowc() return code (and not the written
wc) against zero will interpret a low surrogate as string end. An
alternative might be to just return an error when there's no compliant
way to return the low surrogate. Do you think either of these is worth
pursuing?

> Typical POSIX code doesn't know about
> UTF-16 and expects the wchar returned to be complete.

Indeed. In fact glibc explicitly guarantees that wchar_t is 32 bits
wide:

"for GNU systems wchar_t is always 32 bits wide and, therefore, capable
of representing all UCS-4 values and, therefore, covering all of ISO
10646. Some Unix systems define wchar_t as a 16-bit type and thereby
follow Unicode very strictly. This definition is perfectly fine with
the standard, but it also means that to represent all characters from
Unicode and ISO 10646 one has to use UTF-16 surrogate characters,
which is in fact a multi-wide-character encoding. But resorting to
multi-wide-character encoding contradicts the purpose of the wchar_t
type."

This mismatch will mean increasing complaints about Cygwin as the use
of non-BMP characters becomes more widespread. Windows itself of course
supports them reasonably well through UTF-16 surrogates.

Another possible issue is that with wchar_t being 32 bits wide and
Unicode only actually taking 21 bits, apps might be tempted to use the
remaining bits for private purposes.

Therefore I think long-term Cygwin's wchar_t will need to change to 32
bits for Linux compatibility. Of course that would require major,
ABI-breaking changes:

- Introduce a separate type for representing UTF-16, e.g. "vchar_t",
  because 'v' is half a 'w' ;)
- Replace wchar_t with vchar_t throughout include/w32api
- Convert between wchar_t and vchar_t when calling Win32 functions
- Change internal Cygwin strings to vchar_t where that reduces the
  number of necessary conversions
- Adapt Cygwin programs that directly invoke Win32 functions

Cygwin 2.1 anyone?

Andy
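
To make the surrogate issue and the proposed wchar_t/vchar_t conversion concrete, here is a minimal sketch of converting between 32-bit code points and UTF-16 code units. It is only an illustration, not existing Cygwin code: "vchar_t" is the type name floated above, the function names vchar_encode() and vchar_decode() are made up for this example, and fixed-width integer types stand in for wchar_t so the sketch does not depend on the platform's wchar_t width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type from the proposal: one UTF-16 code unit. */
typedef uint16_t vchar_t;

/* Encode one code point as UTF-16 (hypothetical helper).
   Returns the number of units written (1 or 2), or 0 if cp is not a
   valid Unicode scalar value. */
static size_t vchar_encode(uint32_t cp, vchar_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                   /* out of range, or a surrogate value */
    if (cp < 0x10000) {
        out[0] = (vchar_t)cp;       /* BMP character: a single unit */
        return 1;
    }
    cp -= 0x10000;                  /* 20 significant bits remain */
    out[0] = (vchar_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (vchar_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}

/* Decode one code point from up to n UTF-16 units (hypothetical helper).
   Returns the number of units consumed (1 or 2), or 0 on empty input
   or an unpaired surrogate. */
static size_t vchar_decode(const vchar_t *in, size_t n, uint32_t *cp)
{
    assert(in != NULL && cp != NULL);
    if (n == 0)
        return 0;
    if (in[0] < 0xD800 || in[0] > 0xDFFF) {
        *cp = in[0];                /* not a surrogate: BMP character */
        return 1;
    }
    if (in[0] <= 0xDBFF && n >= 2 && in[1] >= 0xDC00 && in[1] <= 0xDFFF) {
        /* high surrogate followed by low surrogate */
        *cp = 0x10000 + (((uint32_t)(in[0] - 0xD800) << 10)
                         | (uint32_t)(in[1] - 0xDC00));
        return 2;
    }
    return 0;                       /* lone or misordered surrogate */
}
```

For example, U+1F600 encodes to the two units 0xD83D 0xDE00. That is the case a 16-bit wchar_t cannot hand back from a single mbrtowc() call, and the case where a high surrogate arriving without its partner has to be reported as an error or deferred.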