Mail Archives: cygwin/2009/05/29/17:22:26

Date: Sat, 30 May 2009 01:22:07 +0400
Message-ID: <e2480c70905291422t43baf9bet83ed23759f6b0265@mail.gmail.com>
Subject: Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line
From: Alexey Borzenkov <snaury AT gmail DOT com>
To: cygwin AT cygwin DOT com

On Sat, May 30, 2009 at 1:04 AM, Edward Lam <edward AT sidefx DOT com> wrote:
> Alexey Borzenkov wrote:
>> It might be safe for you, but not for other people. If you have a
>> Russian default codepage and ever need to work with Chinese/Japanese
>> filenames and cygwin uses default codepage for filesystem operations
>> (as in 1.5 right now), then you are really screwed. In my opinion
>> utf-8 is a silver bullet here, and I'm very glad it went that way.
> I must be missing something here. Suppose you have a default Russian code
> page, with LANG unset (ie. cygwin 1.7 uses UTF-8). Now, if you're using any
> non-Unicode, non-CodePage aware, native application to create a Russian
> filename, isn't Windows going to convert the filename from the Russian code
> page into UTF-16 for storage in NTFS? If that is the case, and then you do
> an ls from cygwin 1.7, aren't you going to get the wrong filename displayed?
> ie. interoperability with non-Unicode, non-CodePage aware native
> applications will be broken for you too with the current default cygwin 1.7
> behaviour.
>
> Or is this not a case that you care about, and you *only* use cygwin
> applications?

No, it is precisely that I care about both ends of interoperability.
Here is a hypothetical situation:

for filename in *; do
  someprogram "$filename"
done

Here, when I use Russian Windows and don't have LANG set (or have
LANG=en_US.UTF-8), filename will be a UTF-8 multibyte string, so
Russian as well as European/Chinese/Japanese filenames will all be
valid. Now there are three possibilities:

1) someprogram is a cygwin application: then $filename must be passed
as is, without any conversions.
2) someprogram is a Unicode application: then it will receive a correct
Unicode argument.
3) someprogram is an ANSI application: then Windows (not cygwin) will
convert its Unicode arguments to the system codepage (cp1251 for
Russian), and any character that can't be encoded will be replaced
with a question mark. This is solely someprogram's limitation.
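
To illustrate case 3, here is a minimal Python sketch of the lossy
conversion. It only approximates what Windows does: Python's cp1251
codec with errors="replace" stands in for the system's Unicode-to-ANSI
conversion, and the function name to_ansi_argument and the sample
filenames are made up for illustration.

```python
def to_ansi_argument(arg: str, codepage: str = "cp1251") -> bytes:
    """Approximate how a Unicode argument reaches an ANSI program.

    Assumption: characters missing from the codepage become "?",
    which is what Python's "replace" error handler does when encoding.
    """
    return arg.encode(codepage, errors="replace")


# Hypothetical filenames: one Russian, one Japanese.
russian = "отчёт.txt"
japanese = "日本語.txt"

ansi_ru = to_ansi_argument(russian)
ansi_ja = to_ansi_argument(japanese)

# The Russian name survives, because cp1251 can encode every character.
print(ansi_ru.decode("cp1251"))

# The Japanese name does not: each unencodable character is now "?".
print(ansi_ja.decode("cp1251"))
```

The Russian name comes back unchanged, while the Japanese characters
are lost before the ANSI program ever sees them.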

All I'm trying to say is that on Windows (since WinNT) arguments are
always in Unicode. It just so happens that when an ANSI application
calls another ANSI application with a sequence of bytes, that sequence
is first converted to Unicode and then back to ANSI, so you get the
same sequence of bytes. But the arguments are always characters, not
bytes.
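
The round trip can be sketched in Python as well. This is only a
model, assuming both ANSI applications run under the same system
codepage (cp1251 here); the sample filename is hypothetical, and the
cp1252 step is added just to show what happens when the codepages
differ.

```python
# Bytes as an ANSI application on a Russian system would produce them.
original = "привет.txt".encode("cp1251")

# Caller side: the byte sequence is converted to Unicode characters.
as_unicode = original.decode("cp1251")

# Callee side: an ANSI callee gets the characters converted back to bytes.
round_tripped = as_unicode.encode("cp1251")

# Same codepage on both sides, so the bytes come out identical.
print(round_tripped == original)

# The same bytes read under another codepage are different characters,
# which is why the argument is really text and not a byte string.
misread = original.decode("cp1252")
print(misread != as_unicode)
```

Because the conversion goes through characters, the bytes only match
when the two codepages agree.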

