Mail Archives: cygwin/2010/02/20/03:00:52
Hongyi Zhao:
>>Looks like there's some sort of GBK vs UTF-8 mixup going on, because
>>'鏂版煡鏂囩尞' is the same byte sequence in GBK as '新查文献' is in UTF-8:
>>\xE6\x96\xB0\xE6\x9F\xA5\xE6\x96\x87\xE7\x8C\xAE
>
> Could you please give me some hints on the tools
> used by you to obtain this conclusion?
That was just a hunch based on the length of the two strings, and I
confirmed it by pasting the strings into mintty running a utility for
echoing keycodes, switching charset as appropriate.
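
The same check can be reproduced without a terminal. A minimal Python
sketch, assuming Python's built-in gbk codec agrees with Windows CP936
for these characters (the variable names are illustrative):

```python
# UTF-8 bytes of the real directory name, as listed above.
utf8_bytes = bytes.fromhex("E696B0E69FA5E69687E78CAE")

# UTF-8 bytes of the garbled name that showed up in the error message.
garbled_utf8 = bytes.fromhex("E98F82E78988E785A1E98F82E59BA9E5B09E")

correct = utf8_bytes.decode("utf-8")     # what the bytes mean as UTF-8
misread = utf8_bytes.decode("gbk")       # the same bytes misread as GBK
garbled = garbled_utf8.decode("utf-8")   # the name as it was reported

print(len(correct), len(misread))        # character counts of each reading
print(misread == garbled)                # does the misreading match the report?
print(garbled.encode("gbk") == utf8_bytes)
```

The twelve bytes decode to four characters as UTF-8 (three bytes each)
but to six as GBK (two bytes each), which is where the hunch about the
string lengths comes from.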
Anyway, I had a look into why the dosfilewarning prints the wrong
filename: it calls small_sprintf to print the message, and
small_sprintf uses the ANSI version of WriteFile to write to
STD_ERROR_HANDLE, so it ends up interpreting a UTF-8 string as GBK.
Seems sys_mbstowcs and WriteFileW are needed there.
> The actual directory name is '新查文献'.
>>
>>Do you know what the encoding of your batch file is?
>
> GB2312
>
>> And have you got
>>any locale variables (LC_ALL, LC_CTYPE, LANG) set when invoking it?
>
> I use the following settings in the same batch file:
>
> set LC_ALL=zh_CN.UTF-8
> set LC_CTYPE="zh_CN.UTF-8"
> set LANG=zh_CN.UTF-8
Thanks for the info. (Btw, setting all three locale variables is
overkill; just LC_ALL or LANG will do. Doesn't make a difference here
though.)
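
For reference, POSIX takes the character-type locale from the first of
LC_ALL, LC_CTYPE and LANG that is set to a non-empty value. A small
Python sketch of that lookup order (effective_ctype is a made-up helper
for illustration, not a Cygwin function):

```python
def effective_ctype(env):
    """Return the locale name that governs character encoding.

    Checks LC_ALL, then LC_CTYPE, then LANG, skipping unset or empty
    values. Returns None if none is set, in which case the
    implementation's default locale applies.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return None

# With all three set, only LC_ALL is consulted.
all_three = {
    "LC_ALL": "zh_CN.UTF-8",
    "LC_CTYPE": "zh_CN.UTF-8",
    "LANG": "zh_CN.UTF-8",
}
print(effective_ctype(all_three))

# LANG alone gives the same result.
print(effective_ctype({"LANG": "zh_CN.UTF-8"}))
```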
>>>>@echo off
>>>>C:\cygwin\bin\bash --login "%~dp0myscript"
>>>
>>> I've found an even stranger thing: if I change the batch file into the
>>> following form, then it runs smoothly:
>>>
>>> @echo off
>>> C:\cygwin\bin\bash --login %~dp0myscript
>>>
>>> The quotation marks in the former are there to deal with whitespace
>>> appearing in myscript's pathname, though this is a relatively rare
>>> case. But in the latter case, if there is whitespace in
>>> myscript's pathname, the batch file will fail to run.
>>
>>Hmm, perhaps the argument mangling at program startup is using the
>>ANSI codepage (i.e. GBK in this case) when it should be using UTF-8?
>
> But if I convert my batch file to UTF-8 format (without BOM, CR/LF line
> endings), I get the following error:
>
> /usr/bin/bash:
> "F:/zhaohs/Desktop/鏂版煡鏂囩尞/RestoreName4Elsevier.sh": No such
> file or directory
That's easily explained: batch files are assumed (by Windows) to be
encoded in the OEM codepage, which on your system will be the same as
the ANSI codepage, i.e. GBK (aka CP936). So if your batch file
actually contains UTF-8, you get mojibake.
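
To illustrate, here is a rough Python sketch of the difference between
saving the batch file in the codepage Windows assumes and saving it as
UTF-8. It assumes the system codepage is CP936, as on your machine;
batch_bytes is a hypothetical helper, and the path is the expanded one
from your error message rather than the %~dp0 form:

```python
def batch_bytes(lines, codepage="cp936"):
    """Encode batch-file lines with CR/LF endings in the given codepage."""
    return ("\r\n".join(lines) + "\r\n").encode(codepage)

lines = [
    "@echo off",
    r'C:\cygwin\bin\bash --login "F:\zhaohs\Desktop\新查文献\myscript"',
]

as_cp936 = batch_bytes(lines)           # bytes in the codepage Windows assumes
as_utf8 = batch_bytes(lines, "utf-8")   # bytes a UTF-8 editor would save

# Either way, the bytes get interpreted as CP936:
seen_ok = as_cp936.decode("cp936")
seen_bad = as_utf8.decode("cp936")

print("新查文献" in seen_ok)    # is the directory name intact?
print("新查文献" in seen_bad)   # is it intact after the UTF-8 save?
```

Only the non-ASCII part of the path is affected; the ASCII portions are
encoded identically in both codepages, so they come through unchanged.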
I'm stumped though as to why your example works without quotation marks
when it doesn't with them, especially considering that bash actually
reports the correct filename.
> /usr/bin/bash: "F:\zhaohs\Desktop\新查文献\myscript": No such file or
> directory
Andy