Date: Wed, 1 May 2002 10:53:05 +0200
From: Pavel Tsekov
Organization: Syntrex, Inc.
Message-ID: <10340394103.20020501105305@syntrex.com>
To: "Robert Collins"
CC: "Gary R. Van Sickle" , "Cygwin-Apps"
Subject: Re[2]: libgetopt++ and setup and libstdc++

Hello Robert,

Wednesday, May 01, 2002, 10:22:03 AM, you wrote:

>> -----Original Message-----
>> From: Gary R. Van Sickle [mailto:g DOT r DOT vansickle AT worldnet DOT att DOT net]
>> Sent: Monday, April 29, 2002 5:39 AM

>> > Except that widechar != unicode. WCHAR is still a 0 terminated
>> > string, but Unicode strings are not 0 terminated.
>>
>> Sure they are. A Unicode '\0' == 0x0000 (regardless of your
>> byte order ;-)).
>>

Zero terminated strings (C style strings) have nothing to do with the
basic_string template class. basic_string can contain any character,
including \0. It's much the same as the STL vector. The WCHAR here
specifies the size of storage of a single character...

I.e. you can have

typedef basic_string SomeStrangeCharString;

RC> Read http://www.unicode.org/unicode/uni2book/ch05.pdf section 5.2.
RC> Also read http://www.unicode.org/unicode/uni2book/ch02.pdf which does
RC> note that nul(U+0000) can be used as a string terminator.

RC> Then http://www.unicode.org/unicode/reports/tr17/
RC> "C and C++ char* APIs use serialized bytes, which could represent a
RC> variety of different character maps, including ISO Latin 1, UTF-8,
RC> Windows 1252, as well as compound character maps such as Shift-JIS or
RC> 2022-JP. A byte API could also handle UTF-16BE or UTF-16LE, which are
RC> serialized forms of Unicode. However, these APIs must allow for the
RC> existence of any byte value, and typically use memcpy plus length
RC> instead of strcpy for manipulating strings." (which is possibly
RC> referring to a non-wchar_t aware strcpy, not sure here).

RC> Anyway, things like UTF-8 can confuse the heck out of c-libraries
RC> because of their multi-byte nature, where
RC> a) a NULL may be part way through a character, not terminating, and
RC> b) a NULL may be illegal at a given point, and the previous partial
RC> character is invalid.

RC> Finally, note that Unicode requires 21 bits of storage, so a 16 bit WCHAR
RC> will still involve multi-byte sequences.

Quote from "The C++ Programming Language":

"A wide character - that is, an object of type wchar_t ($4.3) - is like a
char, except that it takes up two or more bytes."

RC> Does the newlib && lib-gcc and libstdc++ string correctly
RC> understand unicode (and what representation does it use?). Does it use
RC> the same as Win32 WCHAR does?

>> > (See the NT kernel defines for
>> > UNICODE_STRING to see how unicode strings are represented.).

Btw I read somewhere else that Windows does not support the full
Japanese character set, but only the most used characters.
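
To make two of the points above concrete, here is a minimal sketch (not
taken from setup or libgetopt++; the function names are made up for
illustration). It assumes a standard C++ library and shows that a
basic_string keeps its own length and so can hold an embedded '\0', while a
C-style API such as strlen stops at the first one, and that a code point
above U+FFFF needs two 16-bit units when stored as UTF-16.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: build a std::string with a '\0' in the middle.
// The (pointer, length) constructor copies exactly `length` characters
// and does not stop at the embedded terminator.
inline std::string make_embedded_nul_string() {
    const char raw[] = {'a', 'b', '\0', 'c', 'd'};
    return std::string(raw, sizeof raw);
}

// Hypothetical helper: the length as a C-style API sees it.
// strlen counts only up to the first '\0'.
inline std::size_t c_style_length(const std::string& s) {
    return std::strlen(s.c_str());
}

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Code points below U+10000 fit in one unit; the rest need a surrogate
// pair. Surrogate code points passed as input are not rejected here.
inline std::vector<std::uint16_t> to_utf16(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // highest valid Unicode code point
    std::vector<std::uint16_t> out;
    if (cp < 0x10000) {
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        cp -= 0x10000;  // 20 bits remain, split 10/10
        out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}
```

With these, make_embedded_nul_string().size() is 5 while
c_style_length() of the same string is 2, and to_utf16(0x1D11E) yields the
two units 0xD834 0xDD1E. Whether a given wchar_t is 16 or 32 bits wide is
platform dependent, so the sketch uses uint16_t explicitly rather than
wchar_t.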