Mail Archives: cygwin/2010/02/20/03:00:52

Date: Sat, 20 Feb 2010 08:00:35 +0000
Message-ID: <416096c61002200000r549264c4tfdf46a9b71700bc@mail.gmail.com>
Subject: Re: 1.7.1: unable to run the a bash script resides in chinese path using: c:\cygwin\bin\bash --login script.
From: Andy Koppe <andy DOT koppe AT gmail DOT com>
To: cygwin AT cygwin DOT com

Hongyi Zhao:
>>Looks like there's some sort of GBK vs UTF-8 mixup going on, because
>>'鏂版煡鏂囩尞' is the same byte sequence in GBK as '新查文献' is in UTF-8:
>>\xE6\x96\xB0\xE6\x9F\xA5\xE6\x96\x87\xE7\x8C\xAE
>
> Could you please give me some hints on the tools
> used by you to obtain this conclusion?

That was just a hunch based on the length of the two strings, and I
confirmed it by pasting the strings into mintty running a utility for
echoing keycodes, switching charset as appropriate.
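
The same check can also be scripted. A minimal sketch in Python, assuming its built-in "gbk" codec matches the GBK tables Windows uses for these characters; the variable names are only illustrative:

```python
# The directory name from this thread and its UTF-8 byte sequence.
name = "新查文献"
utf8_bytes = name.encode("utf-8")

# Reading those same bytes as GBK yields the garbled name.
mojibake = utf8_bytes.decode("gbk")

# Show the shared byte sequence in hex, then both readings of it.
print(" ".join(f"{b:02X}" for b in utf8_bytes))
print(name, "->", mojibake)
print(len(name), "characters in UTF-8,", len(mojibake), "characters in GBK")
```

The length difference is what prompted the hunch: each of these characters takes three bytes in UTF-8 but GBK consumes two bytes per character, so four characters turn into six.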

Anyway, I had a look into why the dosfilewarning prints the wrong
filename: it calls small_sprintf to print the message, and
small_sprintf uses the ANSI version of WriteFile to write to
STD_ERROR_HANDLE, so it ends up interpreting a UTF-8 string as GBK.
Seems sys_mbstowcs and WriteFileW are needed there.


> The actual directory name is '新查文献'.
>>
>>Do you know what the encoding of your batch file is?
>
> GB2312
>
>> And have you got
>>any locale variables (LC_ALL, LC_CTYPE, LANG) set when invoking it?
>
> I use the following settings in the same batch file:
>
> set LC_ALL=zh_CN.UTF-8
> set LC_CTYPE="zh_CN.UTF-8"
> set LANG=zh_CN.UTF-8

Thanks for the info. (Btw, setting all three locale variables is
overkill; just LC_ALL or LANG will do. Doesn't make a difference here
though.)
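
To illustrate why one variable is enough, here is a rough Python sketch of the usual POSIX lookup order for the character-type category. The function name effective_ctype and the "C" fallback are assumptions made for the example, not Cygwin's actual code or default:

```python
def effective_ctype(env):
    """Return the locale name that would govern character encoding."""
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:  # unset or empty variables are skipped
            return value
    return "C"  # placeholder default for this sketch


all_three = {
    "LC_ALL": "zh_CN.UTF-8",
    "LC_CTYPE": "zh_CN.UTF-8",
    "LANG": "zh_CN.UTF-8",
}
only_lang = {"LANG": "zh_CN.UTF-8"}

print(effective_ctype(all_three))
print(effective_ctype(only_lang))
```

Both calls resolve to the same locale, which is why the extra variables make no difference here.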


>>>>@echo off
>>>>C:\cygwin\bin\bash --login "%~dp0myscript"
>>>
>>> I've found something even stranger: if I change the batch file to the
>>> following form, it runs smoothly:
>>>
>>> @echo off
>>> C:\cygwin\bin\bash --login %~dp0myscript
>>>
>>> The quotation marks in the former are there to deal with whitespace
>>> in myscript's pathname, though this is a relatively rare
>>> case. But in the latter form, if there is whitespace in
>>> myscript's pathname, the batch file will fail to run.
>>
>>Hmm, perhaps the argument mangling at program startup is using the
>>ANSI codepage (i.e. GBK in this case) when it should be using UTF-8?
>
> But if I convert my batch file to UTF-8 (without BOM, CR/LF line
> endings), I get the following error:
>
> /usr/bin/bash:
> "F:/zhaohs/Desktop/鏂版煡鏂囩尞/RestoreName4Elsevier.sh": No such
> file or directory

That's easily explained: batch files are assumed (by Windows) to be
encoded in the OEM codepage, which on your system will be the same as
the ANSI codepage, i.e. GBK (aka CP936). So if your batch file
actually contains UTF-8, you get mojibake.
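
A small Python sketch of that effect, under the assumption that the batch file's bytes are simply decoded as CP936. The helper as_read_by_cmd is hypothetical and only stands in for what Windows does:

```python
def as_read_by_cmd(batch_bytes, oem_codepage="cp936"):
    """Decode batch file bytes as text in the OEM codepage (assumed CP936)."""
    return batch_bytes.decode(oem_codepage)


path = "F:/zhaohs/Desktop/新查文献/RestoreName4Elsevier.sh"

# The same path as it would be stored on disk under each file encoding.
saved_as_gb2312 = path.encode("gb2312")
saved_as_utf8 = path.encode("utf-8")

# GB2312 is a subset of GBK, so this file is read back intact.
print(as_read_by_cmd(saved_as_gb2312) == path)

# The UTF-8 file is misread, giving the path seen in the bash error.
print(as_read_by_cmd(saved_as_utf8))
```

The ASCII parts of the path survive either way, since both encodings agree on them; only the Chinese directory name is corrupted.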

I'm stumped though as to why your example fails with quotation marks
when it works without them, especially considering that bash actually
reports the correct filename.

> /usr/bin/bash: "F:\zhaohs\Desktop\新查文献\myscript": No such file or
> directory

Andy

