Mail Archives: cygwin/2011/01/29/13:12:28
[Duplicate message to honor the missing CC of bug-gnu-libiconv AT gnu DOT org]
On Jan 29 08:10, Eric Blake wrote:
> On 01/29/2011 05:30 AM, Corinna Vinschen wrote:
> >> But when characters outside the basic plane, such as
> >> U+12345 (CUNEIFORM SIGN URU TIMES KI), are encoded by 2 consecutive wchar_t
> >> values, values of type wchar_t don't correspond to ISO/IEC 10646 characters.
> >> (Or maybe I'm underestimating what "coded representations" means...?)
> >
> > I don't read that from your above quote. The core is that the *type*
> > wchar_t is a *coded* *representation* of the characters defined in
> > 10646. At no point does it say that a single wchar_t value must represent a
> > single character from 10646. So I take it that UTF-16 is a valid, coded
> > representation of the characters from 10646.
>
> POSIX is clear that wchar_t must be wide enough so that 1 wchar_t is one
> character. Which limits a 2-byte wchar_t to just the Unicode basic
> plane. There's nothing cygwin can do about this other than break LOTS
> of ABI to support a 4-byte wchar_t to supply all of Unicode.
>
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html#tag_06_03
>
> "All wide-character codes in a given process consist of an equal number
> of bits. This is in contrast to characters, which can consist of a
> variable number of bytes. The byte or byte sequence that represents a
> character can also be represented as a wide-character code.
> Wide-character codes thus provide a uniform size for manipulating text
> data."
>
> So, using UTF-16 surrogate encodings for characters outside the basic
> plane violates POSIX, but it's the best we can do for those characters.

Right, and we discussed this already on this list. Or the developer
list, I don't remember. Maybe we should have stuck to the base plane
and only used UCS-2 to be more POSIX compatible. I have to admit that
I was more interested in getting all (or as much as possible) of Unicode
working than in following POSIX to the last word in this regard. And I
wanted to make sure that East Asian users would get all of the
characters they use, and there *are* CJK ideographs in the 0x2xxxx plane.
However, the POSIX definition doesn't contradict what I said about the
definition of __STDC_ISO_10646__ as far as I'm concerned.
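
To make the surrogate encoding concrete, here is a minimal sketch in C of
how a code point outside the basic plane, such as the U+12345 mentioned
above, is split into two 16-bit units and joined again. The helper names
utf16_encode and utf16_decode_pair are made up for illustration; they are
not Cygwin or newlib functions, and uint16_t merely stands in for a
2-byte wchar_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
   Returns the number of 16-bit units written (1 or 2), or 0 if cp is
   not a valid scalar value (above U+10FFFF or itself a surrogate). */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        /* Basic plane: one unit, identical to UCS-2. */
        out[0] = (uint16_t)cp;
        return 1;
    }
    /* Supplementary planes: 20 bits split across two surrogates. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate  */
    return 2;
}

/* Hypothetical helper: combine a high/low surrogate pair back into
   the code point it represents. */
static uint32_t utf16_decode_pair(uint16_t hi, uint16_t lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF);
    assert(lo >= 0xDC00 && lo <= 0xDFFF);
    return 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                      | (uint32_t)(lo - 0xDC00));
}
```

For U+12345 this yields the two units 0xD808 0xDF45, so code that counts
16-bit units sees two "wide characters" where there is only one 10646
character. That is exactly the mismatch with the POSIX wording.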

> Someday when gcc has better support for C1x 16- and 32-bit characters
> (regardless of the sizing of wchar_t), then we can add all the new
> 32-bit character APIs that use Unicode unimpeded, without breaking
> existing ones that use wchar_t.

Yeah, that's what I'm waiting for as well. But for the time being,
I'm confident that we have the best compromise currently possible.

Corinna
--
Corinna Vinschen Please, send mails regarding Cygwin to
Cygwin Project Co-Leader cygwin AT cygwin DOT com
Red Hat
--
Problem reports: http://cygwin.com/problems.html
FAQ: http://cygwin.com/faq/
Documentation: http://cygwin.com/docs.html
Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple