From: Bruno Haible <bruno AT clisp DOT org>
To: cygwin AT cygwin DOT com, Corinna Vinschen <corinna-cygwin AT cygwin DOT com>, Charles Wilson <cygwin AT cwilson DOT fastmail DOT fm>, bug-gnu-libiconv AT gnu DOT org
Subject: Re: Bug in libiconv?
Date: Fri, 28 Jan 2011 23:12:48 +0100
Message-Id: <201101282312.50298.bruno@clisp.org>

Hi Corinna and Chuck,

Please CC the bug-gnu-libiconv mailing list when discussing possible bugs
in GNU libiconv.

Replying to <http://www.cygwin.com/ml/cygwin/2011-01/msg00292.html>:
> the application tests to convert a UTF-8 to WCHAR_T string in four
> combinations of the current locale, in this order:
>
> - iconv_open "C", iconv "C"
> - iconv_open "C", iconv "C.UTF-8"
> - iconv_open "C.UTF-8", iconv "C"
> - iconv_open "C.UTF-8", iconv "C.UTF-8"
>
> Here's what happens in Linux:
>
> $ gcc -g -o ic ic.c
> $ ./ic
> in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960
> in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960
> in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960
> in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960
>
> Here's what happens on Cygwin:
>
> $ gcc -g -o ic ic.c -liconv
> $ ./ic
> iconv: 138 <Invalid or incomplete multibyte or wide character>
> in = <Liian pitkä sana>, inbuf = <ä sana>, inbytesleft = 7, outbytesleft = 492
> iconv: 138 <Invalid or incomplete multibyte or wide character>
> in = <Liian pitkä sana>, inbuf = <ä sana>, inbytesleft = 7, outbytesleft = 492
> iconv: 138 <Invalid or incomplete multibyte or wide character>
> in = <Liian pitkä sana>, inbuf = <ä sana>, inbytesleft = 7, outbytesleft = 492
> in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 480

On glibc systems, the encoding "WCHAR_T" is equivalent to "UCS-4" with
machine dependent endianness and alignment. In particular it is independent
of the locale. That explains the first set of results.

In libiconv, on systems which don't define __STDC_ISO_10646__, the encoding
"WCHAR_T" is equivalent to wchar_t[], that is, dependent on the locale.
Changing the locale encoding after allocating an iconv_t from or to "WCHAR_T"
yields undefined behaviour. That explains the second set of results.

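To illustrate the constraint on "WCHAR_T" descriptors, here is a minimal sketch in C. It assumes a POSIX iconv() that accepts the encoding names "UTF-8" and "WCHAR_T" (glibc and GNU libiconv both do); the helper name utf8_to_wchar is hypothetical and not part of either library. The descriptor is opened, used, and closed within one call, so the locale in effect at iconv_open() time is the same one in effect at iconv() time.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not part of libiconv): converts a NUL-terminated
   UTF-8 string to wchar_t[].  Returns the number of wchar_t units written,
   excluding the terminator, or (size_t) -1 on failure.
   The descriptor lives only for the duration of this call, so the locale
   cannot change between iconv_open() and iconv() unless another thread
   calls setlocale() concurrently. */
size_t utf8_to_wchar(const char *in, wchar_t *out, size_t out_units)
{
    assert(in != NULL && out != NULL);
    if (out_units == 0)
        return (size_t) -1;

    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t) -1)
        return (size_t) -1;      /* not a descriptor: nothing to close */

    char *inbuf = (char *) in;   /* iconv() does not modify the input bytes */
    size_t inbytesleft = strlen(in);
    char *outbuf = (char *) out;
    /* Reserve one unit for the terminating L'\0'. */
    size_t outbytesleft = (out_units - 1) * sizeof(wchar_t);

    size_t rc = iconv(cd, &inbuf, &inbytesleft, &outbuf, &outbytesleft);
    iconv_close(cd);             /* cd is a valid descriptor here */
    if (rc == (size_t) -1)
        return (size_t) -1;

    size_t written = (size_t) ((wchar_t *) outbuf - out);
    out[written] = L'\0';
    return written;
}
```

On Cygwin the program would be linked with -liconv, as in the quoted test run. A caller that needs a long-lived descriptor would instead have to make sure the locale encoding stays unchanged for as long as the descriptor is in use.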
Replying to <http://www.cygwin.com/ml/cygwin/2011-01/msg00299.html>:
> I defined __STDC_ISO_10646__ for Cygwin 1.7.8 yesterday.

What is the Cygwin wchar_t[] encoding? Is it UTF-16, like on Win32?
The documentation is silent about it. I had expected to find some word about
it in <http://cygwin.com/cygwin-api/compatibility.html#std-susv4> or
<http://cygwin.com/cygwin-api/std-notes.html>.

In any case, sizeof (wchar_t) == 2. I don't think defining __STDC_ISO_10646__
is compliant with ISO C 99 in this situation. ISO C 99 section 6.10.8.(2) says:

  __STDC_ISO_10646__ An integer constant of the form yyyymmL (for example,
  199712L), intended to indicate that values of type wchar_t are the
  coded representations of the characters defined by ISO/IEC 10646, along
  with all amendments and technical corrigenda as of the specified year and
  month.

But when characters outside the basic plane, such as U+12345 (CUNEIFORM SIGN
URU TIMES KI), are encoded by 2 consecutive wchar_t values, values of type
wchar_t don't correspond to ISO/IEC 10646 characters. (Or maybe I'm
underestimating what "coded representations" means...?)

Replying to <http://www.cygwin.com/ml/cygwin/2011-01/msg00357.html>:
> #if __STDC_ISO_10646__ || ((defined _WIN32 || defined __WIN32__) && !defined __CYGWIN__)
> This should be
> ...
> #if __STDC_ISO_10646__ || defined _WIN32 || defined __WIN32__ || defined __CYGWIN__

That makes sense if Cygwin guarantees that from now on and in the future,
the wchar_t encoding will always be UTF-16. Is this the case?

Replying to <http://www.cygwin.com/ml/cygwin/2011-01/msg00299.html>:
> Why on earth is libiconv on Cygwin using Windows functions in some
> places?

So that I could reuse essentially the same code on Cygwin as on native Win32.
Charles has submitted a patch on this topic to bug-gnulib; I will handle it.

> the old cygwin_conv_to_posix_path function as well.

Is cygwin_conv_to_posix_path deprecated? Does it introduce limitations of
some kind?

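To make the earlier point about characters outside the basic plane concrete, the following is a small sketch in C of the standard UTF-16 encoding rule. The function name utf16_encode is hypothetical and is not taken from Cygwin or libiconv. It shows that U+12345 needs two 16-bit units, which is why a 2-byte wchar_t cannot hold every ISO/IEC 10646 character in a single value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one Unicode code point as UTF-16.
   Returns the number of 16-bit units written to out (1 or 2), or 0 if cp
   is not a valid Unicode scalar value (beyond U+10FFFF or a surrogate). */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        /* Basic plane: one unit, numerically equal to the code point. */
        out[0] = (uint16_t) cp;
        return 1;
    }
    /* Supplementary planes: split the 20-bit offset into a surrogate pair. */
    cp -= 0x10000;
    out[0] = (uint16_t) (0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For U+12345 this yields the pair 0xD808, 0xDF45; neither unit on its own is the coded representation of an ISO/IEC 10646 character.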
> The usage of a fixed table instaed of the charset.alias file in
> libcharset/lib/localcharset.c, function get_charset_aliases() is
> not good, not good at all.

The alternative is to have this table stored in a file charset.alias; but
then every package that includes the module 'localcharset' from gnulib (that
is, libiconv, gettext, coreutils, and many others) will want to modify this
file during "make install". And this causes a lot of headaches to packaging
systems. Therefore, on platforms which have widely used packaging systems
(Linux, MacOS X, Cygwin), it's better to avoid the need for this file.

Additionally, on Win32 systems relocatability is a must, and the code to
compute the location of charset.alias from the location of libiconv.dll
would be overkill.

Replying to <http://www.cygwin.com/ml/cygwin/2011-01/msg00303.html>:
> It looks like there's been some bitrot with respect
> to some of the "&& !CYGWIN" guards on WIN32. Both libiconv and gettext,
> IIRC, jump thru hoops to ensure that [_]*WIN32 is defined for both
> "regular" win32 and for cygwin...which means defined(CYGWIN) guards are
> necessary.

The reason for these "&& !defined __CYGWIN__" clauses is that - at least in
Cygwin 1.5.x - gcc has an option that will define _WIN32 or __WIN32__. So
_WIN32 || __WIN32__ may evaluate to true on Cygwin, or it may evaluate to
false, depending on how the compiler is invoked. Since I don't want libiconv
or gettext to be compiled in two possible ways on Cygwin, I add
"&& !defined __CYGWIN__".

Neither libiconv nor gettext defines or undefines _WIN32 or __WIN32__. But
they are prepared for either setting.

Replying to <http://www.cygwin.com/ml/cygwin/2011-01/msg00332.html>:
> there ARE still bugs in libiconv on Cygwin -- specifically:
> - Even though iconv_open has been opened explicitely with "UTF-8" as
>   input string, the conversion still depends on the current application
>   codeset. That doesn't make sense.

If the other argument to iconv_open is "CHAR" or "WCHAR_T", hence locale
dependent, and you change the locale in between, the result is undefined
behaviour.

> - 'iconv_close ((iconv_t) -1);' crashes the application with a SEGV.

It's not a bug. From POSIX:2008
<http://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv_open.html>
you can infer that (iconv_t) -1 is not a "conversion descriptor". It's the
value that iconv_open() returns to signal an error, nothing more.

From
<http://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv_close.html>
you can see that the argument of iconv_close() has to be a conversion
descriptor. From the ERRORS section in the same page you can see that
iconv_close() is not required to catch a faulty argument. Note the word
"may", not "shall".

Bruno