Date: Mon, 24 Jan 2011 16:41:58 +0100
From: Corinna Vinschen
To: cygwin AT cygwin DOT com
Subject: Bug in libiconv?
Message-ID: <20110124154158.GA15279@calimero.vinschen.de>

Hi Chuck, hi everyone else,

In a twisted turn of events, I'm trying to get the orphaned catgets
package to work correctly on Cygwin 1.7.  As you might know, the package
is derived from the glibc package.

Apart from other portability issues of this *very* glibc-centric piece
of code, I found a problem which appears to point to two bugs in
Cygwin's libiconv2.  For some reason, the iconv conversion seems to be
overly dependent on the usage of setlocale, and the value returned in
the last parameter (outbytesleft) appears to be incorrect if the output
codeset is "WCHAR_T".
Here's a simple testcase:

==== SNIP ====
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <locale.h>
#include <wchar.h>
#include <iconv.h>

/* Open a UTF-8 -> WCHAR_T conversion descriptor, exit on failure. */
iconv_t
open_iconv ()
{
  iconv_t cd_towcp = iconv_open ("WCHAR_T", "UTF-8");
  if (cd_towcp == (iconv_t) -1)
    {
      fprintf (stderr, "iconv_open: %d <%s>\n", errno, strerror (errno));
      exit (1);
    }
  return cd_towcp;
}

/* Convert input and print what is left over in both buffers. */
void
run_iconv (iconv_t cd_towcp, char *input)
{
  wchar_t out[256];
  char *inbuf = input;
  size_t inbytesleft = strlen (inbuf);
  char *outbuf = (char *) out;
  size_t outbytesleft = sizeof (out);
  size_t ret = iconv (cd_towcp, &inbuf, &inbytesleft, &outbuf, &outbytesleft);
  if (ret == (size_t) -1)
    fprintf (stderr, "iconv: %d <%s>\n", errno, strerror (errno));
  printf ("in = <%s>, inbuf = <%s>, inbytesleft = %zu, outbytesleft = %zu\n",
          input, inbuf, inbytesleft, outbytesleft);
}

int
main ()
{
  iconv_t cd_towcp;
  char *finnish = "Liian pitk\303\244 sana"; // Umlaut-a

  /* iconv_open in the "C" locale */
  setlocale (LC_ALL, "C");
  cd_towcp = open_iconv ();
  setlocale (LC_ALL, "C");
  run_iconv (cd_towcp, finnish);
  setlocale (LC_ALL, "C.UTF-8");
  run_iconv (cd_towcp, finnish);
  iconv_close (cd_towcp);

  /* iconv_open in the "C.UTF-8" locale */
  setlocale (LC_ALL, "C.UTF-8");
  cd_towcp = open_iconv ();
  setlocale (LC_ALL, "C");
  run_iconv (cd_towcp, finnish);
  setlocale (LC_ALL, "C.UTF-8");
  run_iconv (cd_towcp, finnish);
  iconv_close (cd_towcp);

  return 0;
}
==== SNAP ====

Here are the important details:

- The input string is a fixed Finnish UTF-8 sentence containing a single
  non-ASCII char.

- The testcase always calls setlocale before calling iconv_open(), then
  calls setlocale again before calling iconv().
- So the application tests the conversion of a UTF-8 string to a WCHAR_T
  string in four combinations of the current locale, in this order:

  - iconv_open "C", iconv "C"
  - iconv_open "C", iconv "C.UTF-8"
  - iconv_open "C.UTF-8", iconv "C"
  - iconv_open "C.UTF-8", iconv "C.UTF-8"

Here's what happens on Linux:

  $ gcc -g -o ic ic.c
  $ ./ic
  in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960
  in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960
  in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960
  in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960

Here's what happens on Cygwin:

  $ gcc -g -o ic ic.c -liconv
  $ ./ic
  iconv: 138
  in = <Liian pitkä sana>, inbuf = <ä sana>, inbytesleft = 7, outbytesleft = 492
  iconv: 138
  in = <Liian pitkä sana>, inbuf = <ä sana>, inbytesleft = 7, outbytesleft = 492
  iconv: 138
  in = <Liian pitkä sana>, inbuf = <ä sana>, inbytesleft = 7, outbytesleft = 492
  in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 480

So, AFAICS, there are two problems:

- Even though iconv_open has been called explicitly with "UTF-8" as
  input codeset, the conversion still depends on the current application
  codeset.  That doesn't make sense.

- Even though the last parameter to iconv is defined in bytes, the value
  of outbytesleft after the conversion is the number of remaining
  wchar_t's, not the number of remaining bytes.  That's contrary to what
  POSIX defines, see
  http://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv.html

Is this analysis correct?  Is there by any chance a newer version of
libiconv2 which does not have these problems?


Thanks,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
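
To make the units in the second point easier to check, here is a minimal
sketch that reports how many bytes of the output buffer a single iconv
call consumed.  The helper name utf8_to_wchar_bytes_used is hypothetical;
it is not part of the testcase or of libiconv, and it only uses the
standard iconv_open/iconv/iconv_close calls.  Since POSIX defines
outbytesleft in bytes, the result should be the number of converted
characters times sizeof (wchar_t) on the platform in question.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string to wchar_t
   and store the number of output *bytes* consumed in *bytes_used.
   Returns 0 on success, otherwise the errno value of the failing call. */
int
utf8_to_wchar_bytes_used (const char *input, wchar_t *out, size_t outsize,
                          size_t *bytes_used)
{
  assert (input != NULL && out != NULL && bytes_used != NULL);

  iconv_t cd = iconv_open ("WCHAR_T", "UTF-8");
  if (cd == (iconv_t) -1)
    return errno;

  /* iconv takes a char **, but it does not write to the input buffer. */
  char *inbuf = (char *) input;
  size_t inbytesleft = strlen (input);
  char *outbuf = (char *) out;
  size_t outbytesleft = outsize;
  int err = 0;

  if (iconv (cd, &inbuf, &inbytesleft, &outbuf, &outbytesleft) == (size_t) -1)
    err = errno;

  /* Bytes written so far, also valid after a partial conversion. */
  *bytes_used = outsize - outbytesleft;

  iconv_close (cd);
  return err;
}
```

Called as utf8_to_wchar_bytes_used (finnish, out, sizeof (out), &used),
a fully converted 16-character test string should give
used == 16 * sizeof (wchar_t), so the expected figure differs between
platforms with a 2-byte and a 4-byte wchar_t.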