Message-ID: <4B42029C.2020503@towo.net>
Date: Mon, 04 Jan 2010 16:00:44 +0100
From: Thomas Wolff
To: cygwin AT cygwin DOT com
Subject: Re: Cygwin 1.7.1 sprintf() with format string having 8th bit set
In-Reply-To: <416096c61001040429m62b7d93cm5badf57619a8aea0@mail.gmail.com>

Andy Koppe wrote:
> 2010/1/4 Joseph Quinsey
>
>> In Cygwin 1.7.1, sprintf() with the format string having an 8th bit set
>> appears to be broken. Sample code (where I've indicated the backslashes in
>> the comments, in case they are stripped out by the mailer):
>>
>> #include <stdio.h>
>>
>> int main (void)
>> {
>>   unsigned char foo[30] = "";
>>   unsigned char bar[30] = "";
>>   unsigned char xxx[30] = "";
>>   sprintf (foo, "\100%s", "ABCD"); /* this is backslash one zero zero */
>>   sprintf (bar, "\300%s", "ABCD"); /* this is backslash three zero zero */
>>   sprintf (xxx, "\300ABCD"); /* this is backslash three zero zero */
>>   printf ("%d %d %d %d %d\n", foo[0],foo[1],foo[2],foo[3],foo[4]);
>>   printf ("%d %d %d %d %d\n", bar[0],bar[1],bar[2],bar[3],bar[4]);
>>   printf ("%d %d %d %d %d\n", xxx[0],xxx[1],xxx[2],xxx[3],xxx[4]);
>>   return 0;
>> }
>>
>> gives:
>>
>> 64 65 66 67 68
>> 0 0 0 0 0
>> 192 65 66 67 68
>>
>> The second line of the output should be the same as the third.
>>
>
> The issue here is that the character set of the "C" locale in Cygwin
> 1.7 is UTF-8 and that the \300 on its own is an invalid UTF-8 byte.

My assumption has been that *printf should be byte-transparent except
where it uses explicit wide-character arguments. After all, legacy
applications that do not care about locales at all may legitimately
assume this, since a C char [] is a byte sequence. That is not affected
by the casual legacy usage of the word "character" for a char value,
which does not automatically imply "wide character".

Reading http://www.opengroup.org/onlinepubs/9699919799/functions/fprintf.html:

    [EILSEQ]
        A wide-character code that does not correspond to a valid
        character has been detected.

This explicitly refers to "wide characters", which are mentioned
elsewhere in that document only as argument values for the %lc and %ls
conversions. I don't think it needs to, or even should, be interpreted
as referring to the format string.

> To get well-defined behaviour, you need to invoke setlocale(LC_CTYPE,
> ...) with the appropriate locale.
>
> See the thread at http://cygwin.com/ml/cygwin/2009-12/msg00980.html
> for more on this.
>

In that thread, someone had originally confused char * with wchar_t [] -
the issue resolves cleanly if these are properly distinguished. Comments
on the EILSEQ clause from that thread:

> > It's talking about "characters" rather than "bytes" there, which I
> > think does leave the behaviour for invalid bytes undefined,

No, to be picky, it's talking about "wide character codes" and "valid
characters".

> It's actually well-defined - non-characters in the format string MUST make
> printf fail.

I strongly disagree here; I claim it is not well-defined at all.

> The issue wasn't with wide characters, but invalid multibyte chars.
> But anyway, we're agreed that printf is right to bail out.

I don't think there is such a thing as an invalid multibyte character in
a char [] unless it is being interpreted with a multibyte function;
that's what e.g. the mb* functions are for. In a legacy application,
especially in an sprintf which may not even be intended for printing,
there is no intent to apply a multibyte interpretation. Doing so imposes
semantics on a basic C type that it does not have. So I do not agree that
printf is right here, and if it were, the third line in the example
would have had to fail as well.

Thomas
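
A minimal sketch of one way to sidestep the question in application code,
not something proposed in the thread: keep the high-bit byte out of the
format string and pass it as a %c argument instead, so that no multibyte
interpretation of the format string can apply to it. The helper name
format_with_lead_byte is hypothetical, and the sketch assumes the
implementation copies %c and %s arguments as plain bytes; that has not
been checked against Cygwin 1.7.1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not from the thread): writes the single byte `lead`
   followed by the string `text` into `out`. The format string itself is
   plain ASCII; the high-bit byte travels as a %c argument.
   Returns the number of bytes written (excluding the terminating NUL),
   or -1 if snprintf fails or the result does not fit in `size` bytes. */
static int format_with_lead_byte(char *out, size_t size,
                                 unsigned char lead, const char *text)
{
    assert(out != NULL && text != NULL);

    int n = snprintf(out, size, "%c%s", lead, text);
    if (n < 0 || (size_t)n >= size) {
        return -1; /* conversion error or truncated output */
    }
    return n;
}
```

Called as format_with_lead_byte(bar, sizeof bar, 0300, "ABCD"), this
should leave the bytes 192 65 66 67 68 in bar on such an implementation,
matching the third line of the output in the original example.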