Mail Archives: cygwin/2011/01/24/10:42:32

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Mon, 24 Jan 2011 16:41:58 +0100
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Bug in libiconv?
Message-ID: <20110124154158.GA15279@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
MIME-Version: 1.0
User-Agent: Mutt/1.5.21 (2010-09-15)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

Hi Chuck,
hi everyone else,


In a twisted turn of events, I'm trying to get the orphaned catgets
package to work correctly on Cygwin 1.7.  As you might know, the package
is derived from the glibc package.  Apart from other portability issues
of this *very* glibc-centric piece of code, I found a problem which
appears to point to two bugs in Cygwin's libiconv2.

For some reason, the iconv conversion seems to be overly dependent on
the usage of setlocale, and the value returned in the last parameter
(outbytesleft) appears to be incorrect if the output codeset is "WCHAR_T".

Here's a simple testcase:

==== SNIP ====
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <iconv.h>
#include <locale.h>
#include <wchar.h>

iconv_t
open_iconv ()
{
  iconv_t cd_towcp = iconv_open ("WCHAR_T", "UTF-8");
  if (cd_towcp == (iconv_t) -1)
    {
      fprintf (stderr, "iconv_open: %d <%s>\n", errno, strerror (errno));
      exit (1);
    }
  return cd_towcp;
}

void
run_iconv (iconv_t cd_towcp, char *input)
{
  wchar_t out[256];

  char *inbuf = input;
  size_t inbytesleft = strlen (inbuf);
  char *outbuf = (char *) out;
  size_t outbytesleft = sizeof (out);
  size_t ret = iconv (cd_towcp, &inbuf, &inbytesleft, &outbuf, &outbytesleft);
  if (ret == (size_t) -1)
    fprintf (stderr, "iconv: %d <%s>\n", errno, strerror (errno));
  printf ("in = <%s>, inbuf = <%s>, inbytesleft = %zu, outbytesleft = %zu\n",
	  input, inbuf, inbytesleft, outbytesleft);
}

int
main ()
{
  iconv_t cd_towcp;
  char *finnish = "Liian pitk\303\244 sana";  // Umlaut-a
  
  setlocale (LC_ALL, "C");
  cd_towcp = open_iconv ();
  setlocale (LC_ALL, "C");
  run_iconv (cd_towcp, finnish);
  setlocale (LC_ALL, "C.UTF-8");
  run_iconv (cd_towcp, finnish);
  iconv_close (cd_towcp);
  
  setlocale (LC_ALL, "C.UTF-8");
  cd_towcp = open_iconv ();
  setlocale (LC_ALL, "C");
  run_iconv (cd_towcp, finnish);
  setlocale (LC_ALL, "C.UTF-8");
  run_iconv (cd_towcp, finnish);
  iconv_close (cd_towcp);

  return 0;
}
==== SNAP ====

Here are the important details:

- The input string is a fixed Finnish UTF-8 sentence containing a
  single non-ASCII character.

- The testcase always calls setlocale before calling iconv_open(),
  then calls setlocale again before calling iconv().

- So the application tries to convert a UTF-8 string to WCHAR_T in four
  combinations of the current locale, in this order:

  - iconv_open "C",       iconv "C"
  - iconv_open "C",       iconv "C.UTF-8"
  - iconv_open "C.UTF-8", iconv "C"
  - iconv_open "C.UTF-8", iconv "C.UTF-8"

Here's what happens in Linux:

  $ gcc -g -o ic ic.c
  $ ./ic
  in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960
  in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960
  in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960
  in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960

Here's what happens on Cygwin:

  $ gcc -g -o ic ic.c -liconv
  $ ./ic
  iconv: 138 <Invalid or incomplete multibyte or wide character>
  in = <Liian pitkä sana>, inbuf = <ä sana>, inbytesleft = 7, outbytesleft = 492
  iconv: 138 <Invalid or incomplete multibyte or wide character>
  in = <Liian pitkä sana>, inbuf = <ä sana>, inbytesleft = 7, outbytesleft = 492
  iconv: 138 <Invalid or incomplete multibyte or wide character>
  in = <Liian pitkä sana>, inbuf = <ä sana>, inbytesleft = 7, outbytesleft = 492
  in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 480

So, AFAICS, there are two problems:

  - Even though the conversion descriptor has been opened explicitly
    with "UTF-8" as the input codeset, the conversion still depends on
    the current application codeset.  That doesn't make sense.

  - Even though the last parameter to iconv is defined in bytes, the
    value of outbytesleft after the conversion is the number of remaining
    wchar_t's, not the number of remaining bytes.  That's contrary to what
    POSIX defines, see
    http://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv.html
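
For anyone who wants to cross-check the byte accounting, here is a
minimal sketch of how the number of converted wide characters can be
derived from outbytesleft, assuming iconv behaves as POSIX describes
(the value counts bytes).  The helper name utf8_to_wchars is made up for
this illustration and is not part of the testcase above.  Keep in mind
that sizeof (wchar_t) differs between platforms, so the size of the
output buffer in bytes, and with it the expected outbytesleft, differs
as well.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not part of the original testcase: convert a
   UTF-8 string to wchar_t and return the number of wchar_t units
   written, or (size_t) -1 on error.  Assumes that iconv reports
   *outbytesleft in bytes, as POSIX specifies. */
size_t
utf8_to_wchars (const char *input, wchar_t *out, size_t out_count)
{
  iconv_t cd = iconv_open ("WCHAR_T", "UTF-8");
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  /* iconv takes a non-const pointer but does not modify the input. */
  char *inbuf = (char *) input;
  size_t inbytesleft = strlen (input);
  char *outbuf = (char *) out;
  size_t outbytes = out_count * sizeof (wchar_t);  /* buffer size in bytes */
  size_t outbytesleft = outbytes;

  size_t ret = iconv (cd, &inbuf, &inbytesleft, &outbuf, &outbytesleft);
  iconv_close (cd);
  if (ret == (size_t) -1)
    return (size_t) -1;

  /* Bytes consumed in the output buffer, divided by the unit size. */
  return (outbytes - outbytesleft) / sizeof (wchar_t);
}
```

Called with the Finnish string from the testcase and a 256-element
wchar_t buffer, a conforming iconv should yield one wide character per
input character.  On Cygwin the program has to be linked with -liconv,
as above.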

Is this analysis correct?  Is there by any chance a newer version of
libiconv2 which does not have these problems?


Thanks,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
