Date: Wed, 03 Jun 2009 09:18:36 -0400
From: Edward Lam
To: cygwin AT cygwin DOT com
Subject: Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

Corinna Vinschen wrote:
> On May 29 17:21, Edward Lam wrote:
>>
>> I think the problem I'm running into is:
>> - I give cygwin 1.7's bash a string that is in my system default code page.
>> - cygwin 1.7 thinks the string is actually UTF-8 and tries to convert it as
>>   UTF-8 into UTF-16, resulting in a truncated command line that is
>>   passed to child process.
>
> The question is, what do you expect?  I know, you expect that it
> "just works", but that's not as easy as you might assume,
> unfortunately.

Yes, Alexey and I had a lengthy argument on this thread already.
Disagreements on the default LANG behaviour notwithstanding, I think that
it still should NOT truncate; it should substitute something else for the
invalid character instead. Here's a quote from Alexey previously on this
thread:

"In my opinion: truncation is a bug (should use replacement character, or
fail exec altogether), expecting utf-8 is not"

Wikipedia has several suggestions on how to handle invalid UTF-8 byte
sequences (http://en.wikipedia.org/wiki/UTF-8). Personally, I favor the
rule that uses the replacement character.

> You get the idea.  The character 0xa9 has no meaning in itself.  It
> only has a meaning when you consider the character set or codepage in
> which you use this character.
...
> How is anybody supposed to know that the file which consists
> of the single byte 0xa9 has *any* meaning at all?  Why should it be
> the copyright sign, of all things?

What I was attempting to do was to have NO conversion. In the real case
where I ran into this, "bug.exe" was the one that properly interpreted
what the byte 0xA9 meant from the command line. Yes, I know there are
several workarounds.

> If we default to the ANSI codepage, you will have the same problem,
> just upside down.  In both cases you will have even more problems if
> you start using characters not available in your default codepage.

This is where I disagreed with Alexey. What we're really arguing here is
which default will run into the fewest problems for the most common
usage. This is subjective, of course.

-Edward
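
To make the truncate-versus-replace distinction concrete, here is a minimal
Python sketch. It is only an illustration of the two policies discussed in
this thread, not Cygwin's actual conversion code (which is C and which I have
not reproduced here). The argument string and the function names are
hypothetical, and Windows-1252 is just an assumed example of a system default
code page.

```python
# Hypothetical command-line argument: ASCII text plus the lone byte 0xA9.
# On its own, 0xA9 is a UTF-8 continuation byte, so this is not valid UTF-8.
RAW_ARG = b"caf\xa9.txt"


def decode_truncating(data: bytes) -> str:
    """Keep only the bytes before the first invalid UTF-8 sequence.

    This mimics the symptom reported in the thread (a truncated command
    line); it is not a claim about how Cygwin implements the conversion.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first undecodable byte.
        return data[:exc.start].decode("utf-8")


def decode_replacing(data: bytes) -> str:
    """Substitute U+FFFD for each invalid sequence and keep going."""
    return data.decode("utf-8", errors="replace")


def decode_with_code_page(data: bytes, code_page: str) -> str:
    """Interpret the bytes in an explicitly named code page instead."""
    return data.decode(code_page)


if __name__ == "__main__":
    print(repr(decode_truncating(RAW_ARG)))
    print(repr(decode_replacing(RAW_ARG)))
    # Assumed code page; the same byte means something else in other ones.
    print(repr(decode_with_code_page(RAW_ARG, "cp1252")))
```

With replacement, the child process still sees the rest of the argument and a
visible marker where the bad byte was. With truncation, everything after the
bad byte is silently lost. The last call shows Corinna's point as well: the
byte only becomes a copyright sign once a specific code page is chosen.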