To: cygwin AT cygwin DOT com
From: Lenik
Subject: Re: Cygwin programs doesn't support non-ASCII filenames
Date: Sat, 09 May 2009 23:12:06 +0800
In-Reply-To: <20090509100231.GR21324@calimero.vinschen.de>

(This mail is encoded in utf-8)

On 2009-5-9 18:02, Corinna Vinschen wrote:
> [Repeated and additional question. I accidentally sent this as PM.
> Sorry about that. Let's keep this on the list, please]
>
> On May 9 11:43, Lenik wrote:
>> (My system locale is zh_CN)
>
> What ANSI codepage is that?
>
> And what OEM codepage uses the console Window by default?

`chcp' shows that the codepage is 937.

I don't know the difference between the ANSI codepage and the OEM codepage.

>
>> 1, test path
>>
>>> set LANG=& cygpath -am .
>> C:/Profiles/Shecti/??????
>>
>>
>>> set LANG=zh_CN.GBK& cygpath -am .
>> C:/Profiles/Shecti/??????
>>
>>
>>> set LANG=C& cygpath -am .
>> C:/Profiles/Shecti/×ÀÃæ
>
> Can you please give us the exact name of the directory in either
> UTF-8 or UTF-16 notation?
The two Chinese characters are encoded as follows:

GB2312:  d7 c0 c3 e6
UTF-8:   e6 a1 8c e9 9d a2
Unicode: \u684c \u9762

>
>> 2, the `test' utility
>>
>>> set LANG=& bash -c "D=$(cygpath -am .); if [ -d $D ]; then echo
>> ok $D; else echo fail $D; fi"
>> fail C:/Profiles/Shecti/??????
>
> What you're actually testing here all the time is cygpath in the first
> place. If you stop using cygpath, start a bash shell and use the Cygwin
> commands with the paths in POSIX notation, you would have much less
> trouble. Cygwin is a POSIX emulation layer, after all.
>

Well, I tested the pathnames with cygpath because I wanted an absolute
path, so that the Chinese characters would be included in the test. I
can't type these characters in the console window.

The second reason is that I have associated the .sh file type with bash,
like this:

.sh=C:\lam\sys\cygwin-1.7\bin\bash -c "$(cygpath -u '%0') %*"

Here is a new test that doesn't use cygpath:

C:\Profiles\Shecti> set LANG=& bash -c "cat 你好"
cat: 你好: No such file or directory

C:\Profiles\Shecti> set LANG=zh_CN.GB2312& bash -c "cat 你好"
cat: 你好: No such file or directory

C:\Profiles\Shecti> set LANG=zh_CN.GBK& bash -c "cat 你好"
123

C:\Profiles\Shecti> set LANG=zh_CN.UTF-8& bash -c "cat 你好"
123

C:\Profiles\Shecti> set LANG=& bash -c "d 你好"
/mnt/c/Profiles/Shecti/你好 doesn't exist!

C:\Profiles\Shecti> set LANG=zh_CN.GBK& bash -c "d 你好"
/mnt/c/Profiles/Shecti/你好 doesn't exist!

C:\Profiles\Shecti> set LANG=zh_CN.UTF-8& bash -c "d 你好"
/mnt/c/Profiles/Shecti/you好 doesn't exist!

`d' gives the same result in every case. This shows that `cat' (from
coreutils) handles the locale correctly when LANG is set to GBK or
UTF-8, while `d' does not.

> If you give me the above information I'll look into fixing cygpath.
>
>> The GB2312 charset is a subset of GBK charset, and the characters `
>> ??????' is included in GB2312 charset. So in this example, GB2312 SHOULD
>> WORK.
>
> Sorry, no. It's documented that GBK is supported, GB2312 isn't.
> From what I read about GB2312 it's not actually a subset of GBK in terms
> of character definitions, it's just a subset in terms of supported
> characters. AFAICS, GB2312 uses chars < 0x7f in multibyte sequences
> which is not feasible for Cygwin. We could support EUC-CN, which
> seems to be another way to encode GB2312 chars, but I'm not exactly
> willing to add that now. I'd rather stabilize what we have now and
> add further charset support in a later, official 1.7 release.
>
> So you can use LANG=zh_CN.GBK, but not LANG=zh_CN.GB2312. It's just
> treated as invalid input. Better: Use LANG=zh_CN.UTF-8.
>

Yes, GB2312 is a subset in terms of supported characters.

Is there any way to find out the default locale of the current Cygwin
installation? In my tests, `unset LANG' and `set LANG=zh_CN.GB2312' gave
the same results, so I assumed that GB2312 was the default locale.

I'd like to use UTF-8 too, but I won't chcp to 65001: that would
introduce a lot of new problems when deploying to customers' machines,
since most programs and files in the real world are encoded in GB2312.

Lenik
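
As a cross-check of the byte values quoted in this message, here is a minimal Python sketch. It is not part of the original exchange: it assumes Python's standard `gb2312`, `gbk`, `utf-8` and `latin-1` codecs, and it rebuilds the directory name from the \u684c \u9762 code points given above. The helper name `hexdump` is made up for illustration. The last step suggests why the LANG=C output looked like ×ÀÃæ: those appear to be the GB2312 bytes read one byte at a time as Latin-1.

```python
# Directory name from the message, written via its Unicode code points.
name = "\u684c\u9762"

def hexdump(data: bytes) -> str:
    """Format bytes as space-separated lowercase hex pairs (illustrative helper)."""
    return " ".join(f"{b:02x}" for b in data)

gb2312_bytes = name.encode("gb2312")
gbk_bytes = name.encode("gbk")
utf8_bytes = name.encode("utf-8")

print("GB2312:", hexdump(gb2312_bytes))  # d7 c0 c3 e6
print("GBK:   ", hexdump(gbk_bytes))     # same bytes for these two characters
print("UTF-8: ", hexdump(utf8_bytes))    # e6 a1 8c e9 9d a2

# Interpret the GB2312 bytes as Latin-1, one character per byte.
# The resulting code points are U+00D7 U+00C0 U+00C3 U+00E6, i.e. × À Ã æ.
mojibake = gb2312_bytes.decode("latin-1")
print("Latin-1 view:", " ".join(f"U+{ord(c):04X}" for c in mojibake))
```

For these two characters the GBK and GB2312 byte sequences are identical, which fits the observation that the file is found with LANG=zh_CN.GBK.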