Date: Mon, 4 Jan 2010 19:01:23 +0000
Message-ID: <416096c61001041101g6237acd4jb96567856bbde111@mail.gmail.com>
In-Reply-To: <4B42029C.2020503@towo.net>
References: <2ECEEFBE44B2488C840CA73169D69A6D AT LeakyCauldron> <416096c61001040429m62b7d93cm5badf57619a8aea0 AT mail DOT gmail DOT com> <4B42029C DOT 2020503 AT towo DOT net>
Subject: Re: Cygwin 1.7.1 sprintf() with format string having 8th bit set
From: Andy Koppe
To: cygwin AT cygwin DOT com

2010/1/4 Thomas Wolff:
> My assumption has been that *printf should be byte-transparent unless where
> it uses explicit wide character arguments.

What's that assumption based on?

> After all, legacy applications that do not care about locales at all may
> legitimately assume this since a C char [] is a byte sequence;

Erm, the meaning of a byte sequence is up to each function.

> this is not affected by the legacy casual usage of the word "character"
> referring to a char value which does not automatically imply "wide
> character".

There is no casual usage of "byte" and "character" in the POSIX
standard. See
http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap03.html.
In particular:

3.84 Byte: An individually addressable unit of data storage that is
exactly an octet, used to store a character or a portion of a
character; see also Character. A byte is composed of a contiguous
sequence of 8 bits.
The least significant bit is called the "low-order" bit; the most
significant is called the "high-order" bit.

3.87 Character: A sequence of one or more bytes representing a single
graphic symbol or control code.

3.92 Character String: A contiguous sequence of characters terminated
by and including the first null byte.

3.367 String: A contiguous sequence of bytes terminated by and
including the first null byte.

(And yep, a lot of confusion would go away if the 'char' type was
called 'byte' instead, but of course that's out of the question.)

> In that thread, someone had originally confused char * with wchar [] - the
> issue resolves cleanly if these are properly distinguished.
>
> Comments on the EILSEQ clause from that thread:
>>
>> > It's talking about "characters" rather than "bytes" there, which I
>> > think does leave the behaviour for invalid bytes undefined,

That sentence had nothing to do with EILSEQ. Here it is in its original
context:

"I couldn't find specific text about invalid bytes in the POSIX printf
spec, but it does say the following:

"The format is a character string, beginning and ending in its initial
shift state, if any. The format is composed of zero or more
directives: ordinary characters, which are simply copied to the output
stream, and conversion specifications, each of which shall result in
the fetching of zero or more arguments."

It's talking about "characters" rather than "bytes" there, which I
think does leave the behaviour for invalid bytes undefined, so
newlib's printf implementation is in its rights to just stop
processing the string at one of those."

To emphasise this again, the printf spec explicitly says that "the
format is a *character* string".

> I don't think there is such a thing like an invalid multibyte character in a
> char [] unless it is being interpreted with a multi-byte function, that's
> what e.g. the mb* functions are for.

Well, you're wrong. See the definition of 'character'.
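
As a rough illustration of the byte/character distinction in the
definitions quoted above, here is a sketch in C. It is not newlib's
code: count_utf8_chars is a made-up helper, it assumes the character
set is UTF-8, and it does only simplified validation (it does not
reject overlong forms or surrogates).

```c
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns the number of UTF-8 characters in the string s, or -1 if s
 * contains bytes that do not form a character. Simplified check:
 * overlong encodings and surrogates are not rejected.
 */
static long count_utf8_chars(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    long count = 0;

    while (*p != '\0') {
        size_t trail; /* number of continuation bytes expected */

        if (*p < 0x80)
            trail = 0;                      /* ASCII */
        else if (*p >= 0xC2 && *p <= 0xDF)
            trail = 1;                      /* 2-byte lead */
        else if (*p >= 0xE0 && *p <= 0xEF)
            trail = 2;                      /* 3-byte lead */
        else if (*p >= 0xF0 && *p <= 0xF4)
            trail = 3;                      /* 4-byte lead */
        else
            return -1;  /* stray continuation byte or invalid lead byte */

        p++;
        while (trail-- > 0) {
            /* The null terminator also fails this test, so a
               truncated sequence cannot run past the end. */
            if ((*p & 0xC0) != 0x80)
                return -1;
            p++;
        }
        count++;
    }
    return count;
}
```

Under these assumptions, "abc" and "\xC3\xA4" "bc" each count as three
characters, whereas "\xE4" "bc" (a lone Latin-1 a-umlaut byte) gives
-1: it is a perfectly good string, but not a character string in UTF-8.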

> In a legacy application, especially in an sprintf which may not even be
> intended for printing, there is no intent to apply a multi-byte
> interpretation. This is over-imposing semantics on a basic C type.

No, it's necessary for printf to work correctly with all character
sets. For example, the second byte in a double-byte SJIS character can
actually be the same as the ASCII code for '%'. Hence, if printf
blindly copied bytes until encountering a '%', it would not be
possible to print such characters.

> So I do not agree that printf is right here, and if it were, the third line
> in the example would have had to fail as well, actually.

Including invalid bytes in the format string is undefined behaviour.
Anything can happen. And what likely happened is that the compiler
replaced the third sprintf call with strcpy (which is specified on
strings rather than character strings).

The real discussion to be had here is whether "C" should continue to
mean UTF-8 or return to ASCII for the sake of Linux compatibility. See
http://cygwin.com/ml/cygwin-developers/2009-12/msg00112.html for that.

Andy

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple