Mail Archives: cygwin/2009/06/02/16:55:06

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Tue, 2 Jun 2009 22:54:40 +0200
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line
Message-ID: <20090602205440.GF23519@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
References: <e2480c70905281131u37651a2eoba946637bd414516 AT mail DOT gmail DOT com> <4A1EF2CE DOT 2060509 AT sidefx DOT com> <3f0ad08d0905290813m39999f81q918e94e3c960eb3f AT mail DOT gmail DOT com> <4A200287 DOT 8030403 AT sidefx DOT com> <3f0ad08d0905290852xe41338alfda89c622f92f677 AT mail DOT gmail DOT com> <4A200BC0 DOT 9010704 AT sidefx DOT com> <e2480c70905291142o2bcc65ccw2287d175dbd09dd5 AT mail DOT gmail DOT com> <4A204149 DOT 2050009 AT sidefx DOT com> <e2480c70905291337g6c8bcca7xd0baba79c84629db AT mail DOT gmail DOT com> <4A2051E5 DOT 6060600 AT sidefx DOT com>
MIME-Version: 1.0
In-Reply-To: <4A2051E5.6060600@sidefx.com>
User-Agent: Mutt/1.5.19 (2009-02-20)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On May 29 17:21, Edward Lam wrote:
>
> Alexey Borzenkov wrote:
> > No, the bug is not that it gets wrong number of arguments. In fact,
> > Windows has no concept of arguments, only C runtime does, which parses
> > the command line. If command line is truncated, then C runtime will
> > have missing arguments when it tries to parse it.
>
> Sorry, I had meant to comment on this previously but hit send too soon.
>
> I think the problem I'm running into is:
> - I give cygwin 1.7's bash a string that is in my system default code page.
> - cygwin 1.7 thinks the string is actually UTF-8 and tries to convert it  
> as UTF-8 into UTF-16, resulting in a truncated command line that is  
> passed to child process.
>
> Here's some more investigation:
>
> $ cat bug.c
> #include <stdio.h>
>
> int wmain(int argc, wchar_t *argv[], wchar_t *envp[])
> {
>     int i;
>     for (i = 0; i < argc; i++)
>         wprintf(L"%d: %s\n", i, argv[i]);
>     return 0;
> }
>
> ... and compiled using MSVC ....
>
> $ ./bug arg1 "before `cat copyright.txt` after" arg3
> 0: E:\cygwin1.7\tmp\bug.exe
> 1: arg1
> 2: before
>
> So note that even with what seems to be a Unicode-aware child process,  
> I'm still getting a truncated command line. In fact, calling  
> GetCommandLineW() directly seems to give a truncated command line
> as well.

The question is, what do you expect?  I know, you expect that it "just
works", but that's not as easy as you might assume, unfortunately.

Let's assume you're doing all this in a Windows console.  The character
we're talking about is a singlebyte or multibyte character with the
value 0xa9.  What exactly is this character 0xa9?

- It's the "Copyright" sign in Windows codepage 1252, the default GUI
  (ANSI) codepage for many western languages and, incidentally, in
  ISO-8859-1 and ISO-8859-15.  The Unicode value of this character is
  0xa9.

- It's the "reverse not sign" in Windows codepage 437, the default
  console (OEM) codepage on US systems.  The Unicode value is 0x2310.

- It's the "Registered trademark" sign in Windows codepage 850, the
  default OEM codepage for several Western European languages
  (French, German, Italian, ...).  The Unicode value is 0xae.

- It's the Cyrillic capital letter IE in Windows codepage 855, the
  default OEM codepage for languages using Cyrillic characters.  The
  Unicode value is 0x0415.

You get the idea.  The character 0xa9 has no meaning in itself.  It only
has a meaning when you consider the character set or codepage in which
you use this character.
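
To make the list above concrete, here is a minimal sketch that decodes the
single byte 0xa9 under each of the four codepages.  It uses Python's
standard codecs rather than the Windows conversion functions, so it only
illustrates the mapping tables and not what Cygwin does internally.  The
names RAW, CODEPAGES and interpret are made up for this example.

```python
import unicodedata

# The single byte under discussion.
RAW = b"\xa9"

# The codepages named in the message, spelled as Python codec names.
CODEPAGES = ["cp1252", "cp437", "cp850", "cp855"]


def interpret(raw, codepage):
    """Decode one byte in the given codepage; return (code point, Unicode name)."""
    ch = raw.decode(codepage)
    return ord(ch), unicodedata.name(ch)


# Same byte, four different characters depending on the codepage.
results = {cp: interpret(RAW, cp) for cp in CODEPAGES}

for cp, (code_point, name) in results.items():
    print(f"{cp}: U+{code_point:04X} {name}")
```

The byte never changes.  Only the table used to look it up does.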

When converting this character to UTF-16, the converting function has to
know the charset in which the character has been given.  The problem is,
how is Cygwin supposed to know in which codepage or charset the
character has been created?  In your case it's even more weird.  How is
anybody supposed to know that the file which consists of the single byte
0xa9 has *any* meaning at all?  Why should it be the copyright sign, of
all things?

Cygwin now defaults to UTF-8.  In UTF-8 a lone byte with the value 0xa9
is invalid, since it can only occur as a continuation byte following a
lead byte.  The function which converts the command line fails on this
invalid sequence.  Whether this is good or bad is another question, but
the fact is, Cygwin doesn't know what to do with
this value in the first place.  It doesn't know anything about the
charset used to generate the character with the value 0xa9.  So, even if
you take Cygwin out of the picture, if you create a console application
which writes the multibyte character with value 0xa9 to the console, it
will in all likelihood not be the copyright sign.  If you're printing on
a US system, the default console codepage is 437 and you get the reverse
not sign.  If you call `chcp 1252' and print again, you get the
copyright sign.
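
As a rough illustration of why the conversion fails, the sketch below runs
a stand-in for the command line through a strict decoder.  This is Python's
decoder, not Cygwin's conversion code, and it does not reproduce the
truncation.  It only shows that a lone 0xa9 is rejected as UTF-8 while the
same bytes are accepted as CP1252.  COMMAND_LINE and try_decode are
hypothetical names.

```python
# Hypothetical stand-in for the command line: "before ", raw byte 0xa9, " after".
COMMAND_LINE = b"before \xa9 after"


def try_decode(raw, charset):
    """Return the decoded text, or None if raw is not valid in charset."""
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        return None


# 0xa9 is a UTF-8 continuation byte with no lead byte, so strict decoding fails.
as_utf8 = try_decode(COMMAND_LINE, "utf-8")

# The same bytes are perfectly valid CP1252, where 0xa9 is the copyright sign.
as_cp1252 = try_decode(COMMAND_LINE, "cp1252")

# In UTF-8 the copyright sign is the two-byte sequence 0xc2 0xa9.
valid_utf8 = try_decode(b"before \xc2\xa9 after", "utf-8")

print(as_utf8, as_cp1252, valid_utf8, sep="\n")
```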

The bottom line is, whatever default we use, we're screwed in some way,
because it will cause inconvenience for one part of the users and help
the others.  That was already the case for the old
CYGWIN=codepage:{oem|ansi} environment variable setting.

If we default to the OEM charset, you will not get the expected result
for characters created using the ANSI codepage and get problems
interacting with applications using the ANSI codepage.

If we default to the ANSI codepage, you will have the same problem, just
upside down.  In both cases you will have even more problems if you
start using characters not available in your default codepage.
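
Both problems can be sketched in a few lines, again with Python's codecs
standing in for the Windows codepages.  The first helper writes text in one
codepage and reads the bytes back in another.  The second checks whether a
character exists in a codepage at all.  The helper names are invented for
this example.

```python
def reinterpret(text, written_as, read_as):
    """Encode text in one codepage, then decode the same bytes in another."""
    return text.encode(written_as).decode(read_as)


def representable(text, codepage):
    """Return True if every character of text exists in the codepage."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


# A copyright sign written as ANSI (CP1252) but read as OEM (CP437)
# turns into the reversed not sign.
ansi_read_as_oem = reinterpret("\u00a9", "cp1252", "cp437")

# Cyrillic capital IE is not available in CP1252, but it is in CP855 and UTF-8.
ie_in_cp1252 = representable("\u0415", "cp1252")
ie_in_cp855 = representable("\u0415", "cp855")
ie_in_utf8 = representable("\u0415", "utf-8")

print(ansi_read_as_oem, ie_in_cp1252, ie_in_cp855, ie_in_utf8)
```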

If we default to UTF-8, Cygwin has no problem working with any Unicode
character, but you will have to take some care when interacting
with Windows applications when using non-ASCII characters.  In your case,
in which only you know that 0xa9 is meant to be the copyright char, you
should tell Cygwin which charset you want to use.  Try setting LANG to
en_US.CP1252.  Your example should work then.
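
The idea behind that setting can be sketched as follows, assuming the usual
POSIX form language[_territory][.codeset][@modifier] for LANG.  This is not
Cygwin's implementation.  The helpers codeset_from_lang and
decode_with_lang are hypothetical, and falling back to UTF-8 when no
codeset is given is an assumption made for the example.

```python
def codeset_from_lang(lang, default="UTF-8"):
    """Extract the codeset from language[_territory][.codeset][@modifier]."""
    without_modifier = lang.split("@", 1)[0]
    if "." in without_modifier:
        return without_modifier.split(".", 1)[1]
    # Assumption for this sketch: no codeset given means UTF-8.
    return default


def decode_with_lang(raw, lang):
    """Decode raw bytes using the charset named in a LANG-style string."""
    return raw.decode(codeset_from_lang(lang))


# With the charset stated explicitly, 0xa9 is unambiguous.
text = decode_with_lang(b"before \xa9 after", "en_US.CP1252")
print(text)
```

With "en_US.UTF-8" instead, the same call raises UnicodeDecodeError, which
matches the failure described above.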


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
Problem reports:       http://cygwin.com/problems.html
Documentation:         http://cygwin.com/docs.html
FAQ:                   http://cygwin.com/faq/
