Message-ID: <4A26AB1D.1090404@sidefx.com>
Date: Wed, 03 Jun 2009 12:55:57 -0400
From: Edward Lam
To: cygwin AT cygwin DOT com
Subject: Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

Corinna Vinschen wrote:
> On Jun 3 12:02, Christopher Faylor wrote:
>> On Wed, Jun 03, 2009 at 04:27:55PM +0200, Corinna Vinschen wrote:
>>> On Jun 3 09:18, Edward Lam wrote:
>>>> Corinna Vinschen wrote:
>>>>> The question is, what do you expect? [...]
>>>> [...]
>>>> Wikipedia has several suggestions on how to handle invalid UTF-8 byte
>>>> sequences (http://en.wikipedia.org/wiki/UTF-8). Personally, I favor the
>>>> rule that uses the replacement character.
>>> Chris implemented using the invalid code point solution. The discussion
>>> in http://www.mail-archive.com/linux-utf8 AT nl DOT linux DOT org/msg00080.html
>>> supports this solution. What's missing so far is the way back, from
>>> an invalid single second half of a surrogate pair in the 0xDCxx range
>>> back to the correct byte value. I'm just looking into that.
>> The way back was not, AFAIK, needed for Cygwin programs. I don't think
>> there is a valid way back for Windows programs.
>
> The way back is not needed for the argv handling in Cygwin, but it
> gets necessary if you converted to UTF-16 in other circumstances.
> It's not much of a problem since the way back is a no-brainer, in
> contrast to the conversion to UTF-16.

What is the current state of affairs in cygwin 1.7.0-48? Is the invalid
code point solution currently being used when converting the command
line to UTF-16 when spawning non-cygwin processes?

What I'm trying to understand is where the command line truncation is
taking place, in the parent or child process. If the truncation is
happening in the child process because of the invalid code point, then
perhaps we should consider using the replacement character solution when
spawning non-cygwin child processes. IMHO, having a bad character is
better than having a truncated command line. At least, the problem
(invalid UTF-8) then becomes more obvious.

-Edward
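
To make the two options discussed above concrete, here is a minimal sketch in Python. It is an illustration only and is not Cygwin's conversion code. It assumes that Python's built-in `surrogateescape` error handler is a fair stand-in for the "invalid code point" approach (an undecodable byte 0xNN becomes the lone low surrogate U+DCNN) and that `replace` stands in for the replacement character approach (U+FFFD). The function names are hypothetical.

```python
# Illustration only: Python's codec error handlers stand in for the two
# strategies from this thread. Function names are hypothetical.

def decode_invalid_code_point(raw: bytes) -> str:
    """Map each undecodable byte 0xNN to the lone low surrogate U+DCNN."""
    return raw.decode("utf-8", errors="surrogateescape")

def encode_way_back(text: str) -> bytes:
    """The way back: turn each lone U+DCNN into the original byte 0xNN."""
    return text.encode("utf-8", errors="surrogateescape")

def decode_replacement(raw: bytes) -> str:
    """Map each undecodable byte to U+FFFD; the byte value is lost."""
    return raw.decode("utf-8", errors="replace")

# 0xE9 on its own is not a valid UTF-8 sequence.
raw = b"abc\xe9def"

escaped = decode_invalid_code_point(raw)
replaced = decode_replacement(raw)

# As UTF-16, the escaped form carries an unpaired surrogate (0xDCE9), which
# a strict consumer may reject; the replaced form is ordinary valid text.
escaped_utf16 = escaped.encode("utf-16-le", errors="surrogatepass")
replaced_utf16 = replaced.encode("utf-16-le")

print(ascii(escaped), ascii(replaced))
print(encode_way_back(escaped) == raw)
```

The trade-off matches the one raised in the thread: the invalid code point form round-trips to the original bytes but is not well-formed UTF-16, while the replacement character form is well-formed but cannot be converted back.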