Mail Archives: cygwin/2009/05/09/11:44:44

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Sat, 9 May 2009 17:44:00 +0200
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: Cygwin programs doesn't support non-ASCII filenames
Message-ID: <20090509154400.GS21324@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
References: <gu2u4o$f2i$3 AT ger DOT gmane DOT org> <20090509100231 DOT GR21324 AT calimero DOT vinschen DOT de> <gu46gf$3tf$1 AT ger DOT gmane DOT org>
MIME-Version: 1.0
In-Reply-To: <gu46gf$3tf$1@ger.gmane.org>
User-Agent: Mutt/1.5.19 (2009-02-20)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On May  9 23:12, Lenik wrote:
> (This mail is encoded in utf-8)
>
> On 2009-5-9 18:02, Corinna Vinschen wrote:
>> [Repeated and additional question.  I accidentally sent this as PM.
>>   Sorry about that.  Let's keep this on the list, please]
>>
>> On May  9 11:43, Lenik wrote:
>>> (My system locale is zh_CN)
>>
>> What ANSI codepage is that?
>>
>> And which OEM codepage does the console window use by default?
> `chcp' shows codepage is 937

937?!?  Per MSDN there's no codepage 937, only codepage 936, which
is used as both the ANSI and the OEM codepage for the Chinese language.
Depending on where you look, it's called either GBK or GB2312.  However,
it looks like GBK is the more accurate name.

> I don't know what's difference between ANSI codepage and OEM codepage.

Well, basically ANSI is the codepage used by Windows GUI tools, and OEM
is the codepage used by the Windows console by default.  A full
explanation would go beyond the scope of this mailing list.  And it
doesn't actually affect you since, as I wrote above, CP 936 is used
for both.
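
If you want to check the ANSI and OEM codepages yourself rather than
rely on chcp, here is a minimal sketch in native Windows Python.  It
assumes the Win32 functions GetACP, GetOEMCP and GetConsoleCP are reached
through ctypes; the helper name windows_codepages is made up for this
example, and on any non-Windows platform it simply returns None.

```python
import ctypes
import sys


def windows_codepages():
    """Return the ANSI, OEM and console input codepages, or None off Windows."""
    if sys.platform != "win32":
        # ctypes.windll only exists in native Windows Python.
        return None
    kernel32 = ctypes.windll.kernel32
    return {
        "ansi": kernel32.GetACP(),         # codepage used by GUI tools
        "oem": kernel32.GetOEMCP(),        # default console codepage
        "console": kernel32.GetConsoleCP() # what chcp reports (0 if no console)
    }


print(windows_codepages())
```

On a zh_CN system one would expect both the ANSI and the OEM value to be
936, but that depends on the machine's configuration.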

>> Can you please give us the exact name of the directory in either
>> UTF-8 or UTF-16 notation?
> The two chinese characters encoding in:
> GB2312: d7 c0 c3 e6
> UTF-8: e6 a1 8c e9 9d a2
> Unicode: \u684c \u9762

Thanks, I'll use that for testing next week.
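
For anybody who wants to reproduce the byte sequences quoted above, a
small Python sketch is enough.  It relies only on Python's built-in
"gbk" and "utf-8" codecs; the variable names are just for illustration.

```python
# The two code points given in the report: U+684C U+9762.
name = "\u684c\u9762"


def hexdump(data):
    """Format bytes as space-separated lowercase hex pairs."""
    return " ".join(f"{b:02x}" for b in data)


gbk_bytes = name.encode("gbk")
utf8_bytes = name.encode("utf-8")

print("GBK:  ", hexdump(gbk_bytes))   # d7 c0 c3 e6
print("UTF-8:", hexdump(utf8_bytes))  # e6 a1 8c e9 9d a2
```

The same two characters take four bytes in GBK and six in UTF-8, so a
program has to know which charset is in effect before it can interpret
a filename passed on the command line.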

>     C:\Profiles\Shecti> set LANG=& bash -c "cat ??????"
>     cat: ??????: No such file or directory
>
>     C:\Profiles\Shecti> set LANG=zh_CN.GBK& bash -c "cat ??????"
>     123
>
>     C:\Profiles\Shecti> set LANG=zh_CN.UTF-8& bash -c "cat ??????"
>     123
>
>     C:\Profiles\Shecti> set LANG=& bash -c "d ??????"
>     /mnt/c/Profiles/Shecti/?????? doesn't exist!
>
>     C:\Profiles\Shecti> set LANG=zh_CN.GBK& bash -c "d ??????"
>     /mnt/c/Profiles/Shecti/?????? doesn't exist!
>
>     C:\Profiles\Shecti> set LANG=zh_CN.UTF-8& bash -c "d ??????"
>     /mnt/c/Profiles/Shecti/?????? doesn't exist!
>
> The same result: it shows that `cat' from coreutils supports the locale
> well, while `d' doesn't.

OK, but that's not Cygwin's problem; the d tool itself would need an
update at some point.  OTOH, what you're doing is a bit borderline.
When you start this stuff from cmd, you have to enter the filename in
the notation valid for the locale in which the application runs.  For d,
which only works in the C locale, you would have to give the pathname
using the SO/UTF-8 sequences.  Right now I have no idea if there's a
workaround for that, but keep in mind that we're at the beginning of
real native language support.  Unfortunately it's all a bit more
complicated than on non-Windows systems, given the UTF-16-ness of the
underlying system.
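
To give an idea what such an SO/UTF-8 sequence looks like, here is a
rough Python sketch.  It ASSUMES that each non-ASCII character is
written as one ASCII SO byte (0x0e) followed by the UTF-8 bytes of that
character; the function name so_utf8_escape is made up, and the
using-specialnames section of the User's Guide is the authority on the
exact rule.

```python
SO = b"\x0e"  # ASCII Shift Out (Ctrl-N)


def so_utf8_escape(name):
    """Illustrative only: prefix the UTF-8 bytes of every non-ASCII
    character with SO.  This mirrors an assumed rule, not Cygwin's code."""
    out = bytearray()
    for ch in name:
        if ord(ch) < 0x80:
            out += ch.encode("ascii")        # plain ASCII passes through
        else:
            out += SO + ch.encode("utf-8")   # assumed: SO + UTF-8 bytes
    return bytes(out)


print(so_utf8_escape("\u684c\u9762"))
```

Typing such a sequence by hand in cmd is obviously not practical, which
is why this is a borderline case.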

>> So you can use LANG=zh_CN.GBK, but not LANG=zh_CN.GB2312.  It's just
>> treated as invalid input.  Better: Use LANG=zh_CN.UTF-8.
>>
> Yes, GB2312 is a subset in terms of supported characters. Is there
> any way to know the default locale of the current cygwin installation?
> From the test I found that `unset LANG' and `set LANG=zh_CN.GB2312' give
> the same results, so I thought that GB2312 is the default locale.

The default locale is "C", as demanded by POSIX.  Everything else is
the responsibility of the application.  Please read
http://cygwin.com/1.7/cygwin-ug-net/setup-locale.html
and
http://cygwin.com/1.7/cygwin-ug-net/using-specialnames.html#pathnames-unusual

> And, I'd like to use UTF-8 too, but I won't chcp to 65001, this will  
> introduce a lot of new problems when deploy to customers' machines.  
> while most programs and files are encoded in GB2312 in the real world.

Cygwin 1.7 doesn't require you to use chcp.  Since all internal file I/O
and console I/O in Cygwin uses UTF-16, and the conversion from a
single-byte or multibyte charset to UTF-16 is done in Cygwin itself, the
console codepage has no meaning for Cygwin.  However, in your examples
above it does matter, since you enter the filenames while running in cmd,
and cmd of course *does* rely on the console codepage.
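
The idea of the conversion can be sketched in a few lines of Python.
This is not Cygwin's code, just a model of it under the assumption that
the charset named in LANG decides how the bytes are decoded; the
function name to_utf16_units is hypothetical.

```python
def to_utf16_units(raw, charset):
    """Decode raw bytes with charset and return the UTF-16 code units."""
    data = raw.decode(charset).encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


gbk_input = bytes.fromhex("d7c0c3e6")       # bytes under LANG=zh_CN.GBK
utf8_input = bytes.fromhex("e6a18ce99da2")  # bytes under LANG=zh_CN.UTF-8

print([hex(u) for u in to_utf16_units(gbk_input, "gbk")])
print([hex(u) for u in to_utf16_units(utf8_input, "utf-8")])
```

Both inputs end up as the same two UTF-16 code units, 0x684c and 0x9762,
which is what the underlying system sees regardless of the console
codepage.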


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
Problem reports:       http://cygwin.com/problems.html
Documentation:         http://cygwin.com/docs.html
FAQ:                   http://cygwin.com/faq/
