Mail Archives: cygwin/2009/05/09/11:12:53

To: cygwin AT cygwin DOT com
From: Lenik <lenik AT bodz DOT net>
Subject: Re: Cygwin programs doesn't support non-ASCII filenames
Date: Sat, 09 May 2009 23:12:06 +0800
Lines: 103
Message-ID: <gu46gf$3tf$1@ger.gmane.org>
References: <gu2u4o$f2i$3 AT ger DOT gmane DOT org> <20090509100231 DOT GR21324 AT calimero DOT vinschen DOT de>
Mime-Version: 1.0
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1b3pre) Gecko/20090223 Thunderbird/3.0b2
In-Reply-To: <20090509100231.GR21324@calimero.vinschen.de>

(This mail is encoded in utf-8)

On 2009-5-9 18:02, Corinna Vinschen wrote:
> [Repeated and additional question.  I accidentally sent this as PM.
>   Sorry about that.  Let's keep this on the list, please]
>
> On May  9 11:43, Lenik wrote:
>> (My system locale is zh_CN)
>
> What ANSI codepage is that?
>
> And what OEM codepage uses the console Window by default?
`chcp' shows that the codepage is 937.
I don't know what the difference is between the ANSI codepage and the
OEM codepage.

>
>> 1, test path
>>      >>>  set LANG=&  cygpath -am .
>>      C:/Profiles/Shecti/??????
>>
>>      >>>  set LANG=zh_CN.GBK&  cygpath -am .
>>      C:/Profiles/Shecti/??????
>>
>>      >>>  set LANG=C&  cygpath -am .
>>      C:/Profiles/Shecti/×ÀÃæ
>
> Can you please give us the exact name of the directory in either
> UTF-8 or UTF-16 notation?
The directory name is two Chinese characters, encoded as:
GB2312: d7 c0 c3 e6
UTF-8: e6 a1 8c e9 9d a2
Unicode: \u684c \u9762
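
As a cross-check of the byte values above, here is a minimal Python
sketch. It is not part of the original test; Python and its built-in
"gb2312" and "latin-1" codecs are assumed purely for illustration. It
also suggests that the LANG=C output quoted above is just the GB2312
bytes displayed as Latin-1.

```python
# The directory name from the report: U+684C U+9762
name = "\u684c\u9762"

gb2312_bytes = name.encode("gb2312")
utf8_bytes = name.encode("utf-8")

print(gb2312_bytes.hex(" "))  # GB2312 byte sequence, space separated
print(utf8_bytes.hex(" "))    # UTF-8 byte sequence, space separated

# Interpret the GB2312 bytes as Latin-1, one character per byte,
# and compare with the four characters printed under LANG=C.
mojibake = gb2312_bytes.decode("latin-1")
print(mojibake == "\u00d7\u00c0\u00c3\u00e6")
```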

>
>> 2, the `test' utility
>>      >>>  set LANG=&  bash -c "D=$(cygpath -am .); if [ -d $D ]; then echo
>> ok $D; else echo fail $D; fi"
>>      fail C:/Profiles/Shecti/??????
>
> What you're actually testing here all the time is cygpath in the first
> place.  If you stop using cygpath, start a bash shell and use the Cygwin
> commands with the paths in POSIX notation, you would have much less
> trouble.  Cygwin is a POSIX emulation layer, after all.
>
Well, I test the pathnames with cygpath because I want to get the
absolute path, so that the Chinese characters are included in the test;
I can't type these characters in the console window. The second reason
is that I have associated the .sh file type with bash, as:
   .sh=C:\lam\sys\cygwin-1.7\bin\bash -c "$(cygpath -u '%0') %*"

Here is a new test that doesn't use cygpath:
     C:\Profiles\Shecti> set LANG=& bash -c "cat 你好"
     cat: 你好: No such file or directory

     C:\Profiles\Shecti> set LANG=zh_CN.GB2312& bash -c "cat 你好"
     cat: 你好: No such file or directory

     C:\Profiles\Shecti> set LANG=zh_CN.GBK& bash -c "cat 你好"
     123

     C:\Profiles\Shecti> set LANG=zh_CN.UTF-8& bash -c "cat 你好"
     123

     C:\Profiles\Shecti> set LANG=& bash -c "d 你好"
     /mnt/c/Profiles/Shecti/你好 doesn't exist!

     C:\Profiles\Shecti> set LANG=zh_CN.GBK& bash -c "d 你好"
     /mnt/c/Profiles/Shecti/你好 doesn't exist!

     C:\Profiles\Shecti> set LANG=zh_CN.UTF-8& bash -c "d 你好"
     /mnt/c/Profiles/Shecti/你好 doesn't exist!

`d' gives the same result whatever LANG is set to. This shows that
`cat' (from coreutils) handles the locale well when LANG is set to GBK
or UTF-8, while `d' doesn't.

> If you give me the above information I'll look into fixing cygpath.
>
>>      The GB2312 charset is a subset of GBK charset, and the characters `
>> ??????' is included in GB2312 charset. So in this example, GB2312 SHOULD
>> WORK.
>
> Sorry, no.  It's documented that GBK is supported, GB2312 isn't.  From
> what I read about GB2312 it's not actually a subset of GBK in terms
> of character definitions, it's just a subset in terms of supported
> characters.  AFAICS, GB2312 uses chars < 0x7f in multibyte sequences
> which is not feasible for Cygwin.  We could support EUC-CN, which
> seems to be another way to encode GB2312 chars, but I'm not exactly
> willing to add that now.  I'd rather stabilize what we have now and
> add further charset support in a later, official 1.7 release.
>
> So you can use LANG=zh_CN.GBK, but not LANG=zh_CN.GB2312.  It's just
> treated as invalid input.  Better: Use LANG=zh_CN.UTF-8.
>
Yes, GB2312 is a subset in terms of supported characters. Is there any
way to find out the default locale of the current Cygwin installation?
In my tests, `unset LANG' and `set LANG=zh_CN.GB2312' give the same
results, so I thought that GB2312 was the default locale.
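
To illustrate the subset relationship, here is a small Python sketch.
It relies on Python's own "gb2312" and "gbk" codecs, which may not
match the tables Cygwin uses, and the traditional character in it is a
hypothetical example that does not come from my tests.

```python
simplified = "桌面你好"  # characters used in the tests in this thread
gbk_only = "龍"          # traditional character, assumed example only

# Characters in GB2312 are expected to encode to the same bytes in GBK.
same_bytes = simplified.encode("gb2312") == simplified.encode("gbk")

# A character outside GB2312 should be rejected by the gb2312 codec...
try:
    gbk_only.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

# ...but still encode as a two-byte sequence in GBK.
gbk_ok = len(gbk_only.encode("gbk")) == 2

print(same_bytes, in_gb2312, gbk_ok)
```

If that holds for the Cygwin tables as well, data that is really
GB2312 could be handled under LANG=zh_CN.GBK without conversion.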

I'd like to use UTF-8 too, but I won't chcp to 65001: that would
introduce a lot of new problems when deploying to customers' machines,
because most programs and files in the real world are encoded in
GB2312.

Lenik


