Date: Tue, 2 Jun 2009 22:54:40 +0200
From: Corinna Vinschen
To: cygwin AT cygwin DOT com
Subject: Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line
Message-ID: <20090602205440.GF23519@calimero.vinschen.de>
In-Reply-To: <4A2051E5.6060600@sidefx.com>

On May 29 17:21, Edward Lam wrote:
>
> Alexey Borzenkov wrote:
> > No, the bug is not that it gets wrong number of arguments. In fact,
> > Windows has no concept of arguments, only C runtime does, which parses
> > the command line. If command line is truncated, then C runtime will
> > have missing arguments when it tries to parse it.
>
> Sorry, I had meant to comment on this previously but hit send too soon.
>
> I think the problem I'm running into is:
> - I give cygwin 1.7's bash a string that is in my system default code page.
> - cygwin 1.7 thinks the string is actually UTF-8 and tries to convert it
>   as UTF-8 into UTF-16, resulting in a truncated command line that is
>   passed to child process.
>
> Here's some more investigation:
>
> $ cat bug.c
> #include <stdio.h>
>
> int wmain(int argc, wchar_t *argv[], wchar_t *envp[])
> {
>     int i;
>     for (i = 0; i < argc; i++)
>         wprintf(L"%d: %s\n", i, argv[i]);
>     return 0;
> }
>
> ... and compiled using MSVC ....
>
> $ ./bug arg1 "before `cat copyright.txt` after" arg3
> 0: E:\cygwin1.7\tmp\bug.exe
> 1: arg1
> 2: before
>
> So note that even when I'm seems to be an UNICODE-AWARE child process,
> I'm still getting a truncated command line. In fact, call
> GetCommandLineW() directly seems to give a truncated command line
> as well.

The question is, what do you expect? I know, you expect that it "just
works", but that's not as easy as you might assume, unfortunately.

Let's assume you're doing all this in a Windows console. The character
we're talking about is a singlebyte or multibyte character with the
value 0xa9. What exactly is this character 0xa9?

- It's the "Copyright" sign in Windows codepage 1252, the default GUI
  (ANSI) codepage for many western languages and, incidentally, in
  ISO-8859-1 and ISO-8859-15. The Unicode value of this character
  is 0xa9.

- It's the "reversed not sign" in Windows codepage 437, the default
  console (OEM) codepage on US systems. The Unicode value is 0x2310.

- It's the "Registered trademark" sign in Windows codepage 850, the
  default OEM codepage for a couple of western European languages
  (French, German, Italian, ...). The Unicode value is 0xae.

- It's the Cyrillic capital letter IE in Windows codepage 855, the
  default OEM codepage for languages using Cyrillic characters. The
  Unicode value is 0x0415.

You get the idea. The character 0xa9 has no meaning in itself. It only
has a meaning when you consider the character set or codepage in which
you use this character. When converting this character to UTF-16, the
converting function has to know the charset in which the character has
been given.

The problem is, how is Cygwin supposed to know in which codepage or
charset the character has been created?
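To make the list above concrete, here is a minimal C sketch of the same
four mappings as a lookup table. It only restates the values given
above; the names `cp_entry`, `byte_a9_table` and `a9_to_unicode` are
made up for illustration and are not part of Cygwin or the Windows API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry per codepage named in this mail: the Unicode code point
   that the single byte 0xa9 maps to in that codepage. */
struct cp_entry {
    const char *codepage;   /* label used for the lookup, e.g. "CP1252" */
    unsigned int unicode;   /* code point the byte 0xa9 stands for */
};

static const struct cp_entry byte_a9_table[] = {
    { "CP1252", 0x00a9 },   /* copyright sign */
    { "CP437",  0x2310 },   /* reversed not sign */
    { "CP850",  0x00ae },   /* registered sign */
    { "CP855",  0x0415 },   /* Cyrillic capital letter IE */
};

/* Hypothetical helper: returns the code point for byte 0xa9 in the
   named codepage, or 0 if the codepage is not in the table above. */
unsigned int a9_to_unicode(const char *codepage)
{
    size_t i;

    assert(codepage != NULL);
    for (i = 0; i < sizeof byte_a9_table / sizeof byte_a9_table[0]; i++)
        if (strcmp(byte_a9_table[i].codepage, codepage) == 0)
            return byte_a9_table[i].unicode;
    return 0;   /* unknown codepage: the byte alone tells us nothing */
}
```

Calling a9_to_unicode("CP437") returns 0x2310 while
a9_to_unicode("CP1252") returns 0xa9: same byte, different character.
Without the codepage argument there is nothing to look up.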
In your case it's even more weird. How is anybody supposed to know that
the file which consists of the single byte 0xa9 has *any* meaning at
all? Why should it be the copyright sign, of all things?

Cygwin now defaults to UTF-8. In UTF-8, a lone byte with the value 0xa9
is not a valid character; it can only occur as a continuation byte
inside a multibyte sequence. The conversion function which converts the
command line fails due to this invalid byte. Whether this is good or
bad is another problem, but the fact is, Cygwin doesn't know what to do
with this value in the first place. It doesn't know anything about the
charset used to generate the character with the value 0xa9.

So, even if you take Cygwin out of the picture, if you create a console
application which writes the multibyte character with value 0xa9 to the
console, it will in all likelihood not be the copyright sign. If you're
printing on a US system, the default console codepage is 437 and you
get the reversed not sign. If you call `chcp 1252' and print again, you
get the copyright sign.

The bottom line is, whatever default we use, we're screwed in some way,
because it will cause inconvenience for one part of the users and help
the others. That was already the case for the old
CYGWIN=codepage:{oem|ansi} environment variable setting.

If we default to the OEM charset, you will not get the expected result
for characters created using the ANSI codepage and get problems
interacting with applications using the ANSI codepage.

If we default to the ANSI codepage, you will have the same problem,
just upside down.

In both cases you will have even more problems if you start using
characters not available in your default codepage.

If we default to UTF-8, Cygwin has no problem working with any Unicode
character, but you will have to take some care when interacting with
Windows applications when using non-ASCII characters.

In your case, in which only you know that 0xa9 is meant to be the
copyright char, you should tell Cygwin which charset you want to use.
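For illustration, here is a rough C sketch of why a lone 0xa9 cannot be
taken as UTF-8. This is not Cygwin's conversion code; the function
names `utf8_seq_len` and `utf8_is_valid` are invented for this example,
and the check is structural only (lead byte plus continuation bytes).

```c
#include <assert.h>
#include <stddef.h>

/* Returns the length of the UTF-8 sequence announced by a lead byte,
   or 0 if the byte cannot start a sequence at all. */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80)
        return 1;                       /* plain ASCII */
    if (b >= 0xc2 && b <= 0xdf)
        return 2;
    if (b >= 0xe0 && b <= 0xef)
        return 3;
    if (b >= 0xf0 && b <= 0xf4)
        return 4;
    return 0;   /* 0x80..0xc1 and 0xf5..0xff are never lead bytes */
}

/* Returns 1 if buf[0..len) is structurally valid UTF-8, 0 otherwise.
   Simplification: it does not reject every overlong or surrogate
   encoding, which a real converter would have to do as well. */
int utf8_is_valid(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        int n = utf8_seq_len(buf[i]);
        int k;

        if (n == 0 || (size_t)n > len - i)
            return 0;                   /* bad lead byte or truncated */
        for (k = 1; k < n; k++)
            if ((buf[i + k] & 0xc0) != 0x80)
                return 0;               /* not a continuation byte */
        i += (size_t)n;
    }
    return 1;
}
```

A buffer holding just 0xa9 is rejected, because 0xa9 is a continuation
byte with no lead byte in front of it. The copyright sign encoded as
UTF-8 is the two bytes 0xc2 0xa9, and that buffer passes.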
Try setting LANG to en_US.CP1252. Your example should work then.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat