Date: Tue, 29 Dec 2009 06:17:56 +0000
From: Andy Koppe
To: rodmedina AT cantv DOT net, cygwin AT cygwin DOT com
Subject: Re: gcc4[1.7] printf treats differently a string constant and a character array

2009/12/28 Andy Koppe:
> 2009/12/28 Rodrigo Medina:
>> Hi,
>> I am moving from cygwin-1.5 and gcc3.4 to cygwin1.7 and gcc4.
>> Some simple programs of mine fail.
>>
>> I am using LC_ALL=es_VE.ISO-8859-15.
>>
>> I have reduced the problem to this example
>>
>> --------------
>> #include <stdio.h>
>> main()
>> {
>> static char* line1 =
>> " This letter has an accent -->á, this one has no accent -->a\n\n";
>> static char* line2 = " ***** another line ******\n\n";
>> static char* line3 =
>> " These letters have an accent -->Ã¡, these ones have no accent -->A!\n\n";
>> static char* line4 =
>> " This letter has an accent -->Ã, this one has no accent -->A\n\n";
>>  printf(" This letter has an accent -->á, this one has no accent -->a\n\n");
>>  printf(line2);
>>  printf("%d %d %d\n\n",line1[29],line1[30],line1[31]);
>>  printf(line1);
>>  printf(line2);
>>  printf(" These letters have an accent -->Ã¡, these ones have no accent -->A!\n\n");
>>  printf(line2);
>>  printf("%d %d %d %d\n\n",line3[32],line3[33],line3[34],line3[35]);
>>  printf(line3);
>>  printf(line2);
>>  printf(" This letter has an accent -->Ã, this one has no accent -->A\n\n");
>>  printf(line2);
>>  printf("%d %d %d\n\n",line4[29],line4[30],line4[31]);
>>  printf(line4);
>>  printf(line2);
>>  printf(" ----- END ------");
>> }
>> ----------------
>>
>> My output is:
>>
>>  This letter has an accent -->á, this one has no accent -->a
>>
>>  ***** another line ******
>>
>> 62 -31 44
>>
>>  This letter has an accent --> ***** another line ******
>>
>>  These letters have an accent -->Ã¡, these ones have no accent -->A!
>>
>>  ***** another line ******
>>
>> 62 -61 -95 44
>>
>>  These letters have an accent -->Ã¡, these ones have no accent -->A!
>>
>>  ***** another line ******
>>
>>  This letter has an accent -->Ã, this one has no accent -->A
>>
>>  ***** another line ******
>>
>> 62 -61 44
>>
>>  This letter has an accent --> ***** another line ******
>>
>>  ----- END ------
>>
>> As you can see the output of printf(string_constant) is what
>> I expected. The output of printf(char_array) is truncated at the
>> non-ASCII character.
>
> Reproduced. Looking at the compiler's assembly output, some of the
> printf() calls are replaced by calls to puts(), and those do work
> correctly, whereas the remaining printf() calls with accented
> characters misbehave. So printf()'s handling of non-ASCII characters
> needs a closer look.

Ah, the problem actually is that your program is missing a call to
setlocale(LC_CTYPE, "") to switch to the locale and character set
specified in the environment. In fact, since your program contains
hard-coded ISO-8859-15 strings, you should probably do
setlocale(LC_CTYPE, ".ISO-8859-15").

Without a setlocale call, programs use the "C" locale, and on Cygwin
1.7 that implies the UTF-8 character set. Those single accented
ISO-8859-15 characters are invalid when interpreted as UTF-8, so
printf halts there. The accented character pairs like "Ã¡", meanwhile,
happen to be valid UTF-8, so they get through.

I couldn't find specific text about invalid bytes in the POSIX printf
spec, but it does say the following:

"The format is a character string, beginning and ending in its initial
shift state, if any. The format is composed of zero or more
directives: ordinary characters, which are simply copied to the output
stream, and conversion specifications, each of which shall result in
the fetching of zero or more arguments."

It's talking about "characters" rather than "bytes" there, which I
think does leave the behaviour for invalid bytes undefined, so
newlib's printf implementation is within its rights to just stop
processing the string at one of those.

Andy