Date: Thu, 14 May 2009 17:35:58 +0400
Subject: [1.7] Problem with national characters in directory names when using UTF-8 charset
From: Alexey Borzenkov <snaury@gmail.com>
To: cygwin@cygwin.com

There is something strange going on with national characters in
directory names when using Cygwin 1.7 with UTF-8. Here's a sample
session:

# test.rb
# -*- coding: utf-8 -*-
filename = File.expand_path("test.txt")
puts filename
puts File.open(filename) { |f| f.read }

# test.txt
This is a test

C:\cygwin\home\aborzenkov> set LANG=en_US.UTF-8

C:\cygwin\home\aborzenkov> mkdir проверка

C:\cygwin\home\aborzenkov> copy test.rb проверка

C:\cygwin\home\aborzenkov> copy test.txt проверка

C:\cygwin\home\aborzenkov> C:\cygwin\bin\ruby проверка/test.rb
/usr/bin/ruby: No such file or directory -- проверка/test.rb (LoadError)

C:\cygwin\home\aborzenkov> C:\cygwin\bin\cat проверка/test.txt
This is a test

C:\cygwin\home\aborzenkov> cd проверка

C:\cygwin\home\aborzenkov\проверка> C:\cygwin\bin\ruby test.rb
/home/aborzenkov/▒??▒N?▒??▒??▒?ч▒N?▒??▒?°/test.txt
This is a test

C:\cygwin\home\aborzenkov\проверка> C:\cygwin\bin\cat test.txt
/usr/bin/cat: test.txt: No such file or directory

C:\cygwin\home\aborzenkov\проверка> C:\cygwin\bin\ls -al
/usr/bin/ls: cannot open directory .: No such file or directory
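
The two access patterns exercised in the session above (a non-ASCII
directory name inside a path, and a non-ASCII current directory) can
be condensed into one Ruby script. This is only a rough sketch: the
helper name check_access and the use of a scratch directory from
Dir.mktmpdir are assumptions made for illustration, not part of the
original session, and it needs a Ruby recent enough to have String#b.
On a system without the problem both reads succeed and no \016 bytes
show up in the working directory.

```ruby
# -*- coding: utf-8 -*-
require "tmpdir"

# Hypothetical helper: create <scratch>/<dirname>/test.txt, then read the
# file back in the two ways used in the session.
def check_access(dirname)
  Dir.mktmpdir do |root|
    dir = File.join(root, dirname)
    Dir.mkdir(dir)
    File.open(File.join(dir, "test.txt"), "w") { |f| f.write("This is a test\n") }

    results = {}
    Dir.chdir(root) do
      # Pattern 1: the non-ASCII name is part of a relative path.
      results[:via_path] = File.read(File.join(dirname, "test.txt"))
    end
    Dir.chdir(dir) do
      # Pattern 2: the current directory itself has the non-ASCII name.
      results[:via_cwd] = File.read("test.txt")
      # Count SO (\016) bytes in the reported working directory.
      results[:so_bytes] = Dir.pwd.b.count("\x0E")
    end
    results
  end
end

# Any exception raised here (for example Errno::ENOENT) points at the
# access pattern that fails.
p check_access("проверка")
```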

Why is it that some commands can't accept Russian characters in
file names, yet work inside Russian-named directories, while others
can open files via Russian paths but can't work inside Russian-named
directories? It seems extremely weird to me. :-/ Also, I'm wondering
about this discrepancy:

C:\cygwin\home\aborzenkov> C:\cygwin\bin\ruby /bin/irb
irb(main):001:0> Dir.chdir("проверка")
=> 0
irb(main):002:0> File.expand_path("*")
=> "/home/aborzenkov/\320\277\321\200\320\276\320\262\320\265\321\200\320\272\320\260/*"

C:\cygwin\home\aborzenkov\проверка> C:\cygwin\bin\ruby /bin/irb
irb(main):001:0> File.expand_path("*")
=> "/home/aborzenkov/\016\320\277\016\321\200\016\320\276\016\320\262\016\320\265\016\321\200\016\320\272\016\320\260/*"

Notice how the same current directory gives different results for
File.expand_path in Ruby, depending on whether the Cygwin session did
the chdir into the Russian directory itself or was started inside it.
If I understood the Cygwin documentation correctly, \016 is supposed
to appear only for characters that cannot be represented in the
current charset (which is UTF-8), yet in the second case it appears
all over the place. The same thing happens with bash, for example: it
shows garbled pwd output when started from within the Russian
directory, yet works fine when I chdir to that directory manually.
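
To make the difference between the two irb results concrete, here is
a small diagnostic sketch in Ruby. It counts the \016 (SO, 0x0E)
bytes in a path and shows what remains once they are dropped. The
helper names so_count and strip_so are hypothetical, and the sketch
assumes a Ruby recent enough to have String#b. It only inspects the
bytes; it does not explain or fix whatever produces them.

```ruby
# -*- coding: utf-8 -*-

# The \016 byte seen in the second irb result.
SO = "\x0E".b

# Hypothetical helper: number of SO bytes in a path.
def so_count(path)
  path.b.count(SO)
end

# Hypothetical helper: drop every SO byte and reinterpret the remaining
# bytes as UTF-8.
def strip_so(path)
  path.b.delete(SO).force_encoding("UTF-8")
end

# Path copied from the second irb session.
garbled = "/home/aborzenkov/\016\320\277\016\321\200\016\320\276\016\320\262" \
          "\016\320\265\016\321\200\016\320\272\016\320\260/*"

puts so_count(garbled)        # one SO byte per Cyrillic character: 8
cleaned = strip_so(garbled)
puts cleaned.valid_encoding?  # true
puts cleaned                  # /home/aborzenkov/проверка/*
```

With the SO bytes removed, the second result is byte-for-byte the
same as the first one, so the only difference between the two
sessions is the \016 prefix in front of each Cyrillic character.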

What's going on?


