Mail Archives: cygwin/2009/12/29/01:18:09

In-Reply-To: <416096c60912281437o16aec4cct8b64b7518d9a9a1@mail.gmail.com>
References: <380-2200912128193944786 AT cantv DOT net> <416096c60912281437o16aec4cct8b64b7518d9a9a1 AT mail DOT gmail DOT com>
Date: Tue, 29 Dec 2009 06:17:56 +0000
Message-ID: <416096c60912282217h57cf311h6af5d98ff9580f0@mail.gmail.com>
Subject: Re: gcc4[1.7] printf treats differently a string constant and a character array
From: Andy Koppe <andy DOT koppe AT gmail DOT com>
To: rodmedina AT cantv DOT net, cygwin AT cygwin DOT com

2009/12/28 Andy Koppe:
> 2009/12/28 Rodrigo Medina:
>> Hi,
>> I am moving from cygwin-1.5 and gcc3.4 to cygwin1.7 and gcc4.
>> Some simple programs of mine fail.
>>
>> I am using LC_ALL=es_VE.ISO-8859-15.
>>
>> I have reduced the problem to this example
>>
>> --------------
>> #include <stdio.h>
>> main()
>> {
>> static char* line1 =
>> " This letter has an accent -->á, this one has no accent -->a\n\n";
>> static char* line2 = " ***** another line ******\n\n";
>> static char* line3 =
>> " These letters have an accent -->Ã¡, these ones have no accent -->A!\n\n";
>> static char* line4 =
>> " This letter has an accent -->Ã, this one has no accent -->A\n\n";
>>  printf(" This letter has an accent -->á, this one has no accent -->a\n\n");
>>  printf(line2);
>>  printf("%d %d %d\n\n",line1[29],line1[30],line1[31]);
>>  printf(line1);
>>  printf(line2);
>>  printf(" These letters have an accent -->Ã¡, these ones have no accent -->A!\n\n");
>>  printf(line2);
>>  printf("%d %d %d %d\n\n",line3[32],line3[33],line3[34],line3[35]);
>>  printf(line3);
>>  printf(line2);
>>  printf(" This letter has an accent -->Ã, this one has no accent -->A\n\n");
>>  printf(line2);
>>  printf("%d %d %d\n\n",line4[29],line4[30],line4[31]);
>>  printf(line4);
>>  printf(line2);
>>  printf(" ----- END ------");
>> }
>> ----------------
>>
>> My output is:
>>
>>  This letter has an accent -->á, this one has no accent -->a
>>
>>  ***** another line ******
>>
>> 62 -31 44
>>
>>  This letter has an accent --> ***** another line ******
>>
>>  These letters have an accent -->Ã¡, these ones have no accent -->A!
>>
>>  ***** another line ******
>>
>> 62 -61 -95 44
>>
>>  These letters have an accent -->Ã¡, these ones have no accent -->A!
>>
>>  ***** another line ******
>>
>>  This letter has an accent -->Ã, this one has no accent -->A
>>
>>  ***** another line ******
>>
>> 62 -61 44
>>
>>  This letter has an accent --> ***** another line ******
>>
>>  ----- END ------
>>
>> As you can see, the output of printf(string_constant) is what
>> I expected. The output of printf(char_array) is truncated at the non-ASCII
>> character.
>
> Reproduced. Looking at the compiler's assembly output, some of the
> printf() calls are replaced by calls to puts(), and those do work
> correctly, whereas the remaining printf() calls with accented
> characters misbehave. So printf()'s handling of non-ASCII characters
> needs a closer look.

Ah, the problem actually is that your program is missing a call to
setlocale(LC_CTYPE, "") to switch to the locale and character set
specified in the environment. In fact, since your program contains
hard-coded ISO-8859-15 strings, you should probably do
setlocale(LC_CTYPE, "<whatever>.ISO-8859-15").

Without a setlocale call, programs use the "C" locale, and on Cygwin
1.7 that implies the UTF-8 character set. Those single accented
ISO-8859-15 characters are invalid when interpreted as UTF-8, so
printf halts there. The accented character pairs like "Ã¡", meanwhile,
happen to be valid UTF-8, so they get through.

I couldn't find specific text about invalid bytes in the POSIX printf
spec, but it does say the following: "The format is a character
string, beginning and ending in its initial shift state, if any. The
format is composed of zero or more directives: ordinary characters,
which are simply copied to the output stream, and conversion
specifications, each of which shall result in the fetching of zero or
more arguments."

It's talking about "characters" rather than "bytes" there, which I
think does leave the behaviour for invalid bytes undefined, so
newlib's printf implementation is within its rights to just stop
processing the string at one of those.

Andy
