Mail Archives: cygwin/2010/03/16/11:58:04

Reply-To: <dbyron AT dbyron DOT com>
From: "David Byron" <dbyron AT dbyron DOT com>
To: "'Andy Koppe'" <andy DOT koppe AT gmail DOT com>, <cygwin AT cygwin DOT com>
References: <493F5820D3F64434A76F433604C79D4A AT pleaset> <416096c61003160019p24e58433x4a969c0f99068fa6 AT mail DOT gmail DOT com>
Subject: RE: filenames with characters that have the high bit set
Date: Tue, 16 Mar 2010 09:58:02 -0700
Message-ID: <6C05DF4D85804B3A865E7FE549B0475E@pleaset>
In-Reply-To: <416096c61003160019p24e58433x4a969c0f99068fa6@mail.gmail.com>

> > And my ~/.inputrc contains:
> >
> > set meta-flag on
> > set convert-meta off
> > set input-meta on
> > set output-meta on
>
> Makes plenty of sense. But note that meta-flag is a synonym for
> input-meta, so you can remove one of them.

I was just following the instructions at
http://cygwin.com/faq/faq-nochunks.html#faq.using.unicode

> > $ echo $LC_ALL
> > en_US
>
> Hang on, where did that come from?

When my cygwin.bat has set LANG=en_US.UTF-8, I get LANG=en_US.UTF-8 and
LC_ALL=en_US in bash.  When my cygwin.bat doesn't set LANG, I get
LC_ALL=en_US and LANG isn't set.

> LC_ALL overrides any other locale variables including
> LANG. Specifying a locale without a charset means that
> Cygwin 1.7.1 looks up your ANSI codepage. Assuming you're
> on a US system, this means you're getting CP1252, not
> UTF-8. (Note besides: Cygwin 1.7.2 changes to a
> Linux-compatible scheme for locales without explicit
> charset instead, where you'd get ISO-8859-1 instead.)
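
A minimal sketch of the precedence described in the quoted text, using the
standard POSIX order (LC_ALL, then the category variable, then LANG). The
function name effective_ctype is hypothetical, and the fallback to "C" when
nothing is set is an assumption; it only illustrates the lookup order, not
how Cygwin picks a charset for a locale that doesn't name one.

```shell
# Print the locale name that wins for LC_CTYPE.
# Arguments: the values of LC_ALL, LC_CTYPE and LANG (empty string = unset).
effective_ctype() {
    if [ -n "$1" ]; then
        echo "$1"          # LC_ALL overrides everything else
    elif [ -n "$2" ]; then
        echo "$2"          # then the category-specific variable
    elif [ -n "$3" ]; then
        echo "$3"          # then LANG
    else
        echo "C"           # assumed default when nothing is set
    fi
}

# The situation from this thread: LC_ALL=en_US hides LANG=en_US.UTF-8.
effective_ctype "en_US" "" "en_US.UTF-8"
# After "unset LC_ALL", LANG takes effect.
effective_ctype "" "" "en_US.UTF-8"
```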

I unset LC_ALL and...

> > $ ls foo<tab>
> >
> > adds the actual accented character to the command line
> > (whether set show-all-if-ambiguous on is in ~/.inputrc
> > or not). Then I press return and ls prints the
> > filename.

Now ls foo<tab> adds the actual accented character to the command line, but
when I press return I get:

ls: cannot access foo<a gray box>: No such file or directory

When I pipe the error message to od -c, the gray box is octal 351 (0xE9).

> > Then if I go through command history and change "ls" to
> > "test -f" and add the "; echo $?" I get the right answer
> > from test.

I still get the right answer from test -f, when using the shell builtin.
/usr/bin/test tells me the file doesn't exist.

I think I can make the above more clear with the steps below.

> The \x18 scheme is only used for codepoints that can not be
> represented in the selected character set, yet U+00E9 can be
> represented CP1252. By definition, any Unicode codepoint can be
> represented in UTF-8, so the \x18 scheme is never used when that is
> selected.
>
> To enable C-style backslash interpretation, you need to use
> $'...' quoting.

I now see that the bash man page explains this.  I must have missed it the
first time.  Adding the above paragraphs, with some examples (where \x18 is
needed and where it isn't), to
http://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-unusual
would have gotten me farther before posting.

> > $ touch "\x18"; echo $?
> > 0
>
> Have a look in your root directory. There should be a file
> called x18 there.

I don't see anything in my Cygwin root (/), but I do see x18 in the root of
my C drive.  Thanks.

> > Can someone give me a hand coming up with a command line
> > where I can build up filenames that contain characters
> > that have the high bit set (as well as any non-ascii
> > character really)?
>
> Just type them in. The 'US International' keyboard layout might be
> useful here. See
> http://en.wikipedia.org/wiki/Keyboard_layout#US-International.
>
> Otherwise, use $'...', and lose the unnecessary \x18s.

And finally here are the steps that illustrate what's going on.

$ touch $'\x18'; echo $?
0

ls shows a file named up-arrow (0x18):

$ ls $'\x18' | od -c
0000000 030  \n
0000002

but tab completion displays the same name differently:

$ ls<tab>
^X

which seems inconsistent.

Now for more interesting tests.  In an empty directory:

$ touch foo$'\xC3\xA9'; echo $?
0

$ touch bar$'\xE9'; echo $?
0

$ ls | od -c
0000000   b   a   r 351  \n   f   o   o 303 251  \n
0000013

$ ls foo<tab> (displays foo with an accented e)
ls: <gray box>: No such file or directory

$ ls bar<tab> (displays bar with an accented e)
bar<gray box>

$ ls bar$'\xE9'
bar<gray box>

$ ls foo$'\xE9'
ls: <gray box>: No such file or directory

where <gray box> is octal 351 (0xE9)

$ ls foo$'\xC3\xA9'
foo<accented e>

$ ls bar$'\xC3\xA9'
ls: cannot access bar<accented e>: No such file or directory

where <accented e> is octal 303 251 (0xC3A9)

All of the above more or less makes sense, though it seems like both \xE9
and \xC3\xA9 should be able to find both foo and bar.
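
To make the relationship between the two byte sequences concrete, here is a
small sketch, assuming iconv and od are installed. It decodes each sequence
from its own charset and prints the resulting UTF-16BE code units. The helper
name code_units is hypothetical, and ISO-8859-1 stands in for CP1252, which
is identical to it for byte 0xE9.

```shell
# Read raw bytes on stdin, decode them from charset $1, and print the
# UTF-16BE code units as space-separated hex bytes.
code_units() {
    iconv -f "$1" -t UTF-16BE | od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# Octal escapes are used because not every printf understands \xHH.
printf '\351'     | code_units ISO-8859-1; echo   # single byte 0xE9
printf '\303\251' | code_units UTF-8;      echo   # two bytes 0xC3 0xA9
```

Both lines print 00 e9: the two sequences name the same character, U+00E9,
and differ only in which charset is used to read them.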

$ type test
test is a shell builtin

$ test -f foo$'\xC3\xA9'; echo $?
1

$ test -f bar$'\xE9'; echo $?
1

The builtin test doesn't seem to be doing the right thing here.

$ /usr/bin/test -f foo$'\xC3\xA9'; echo $?
0

$ /usr/bin/test -f bar$'\xE9'; echo $?
0

And then using the wrong encoding:

$ test -f foo$'\xE9'; echo $?
0

$ test -f bar$'\xC3\xA9'; echo $?
1

$ /usr/bin/test -f foo$'\xE9'; echo $?
1

$ /usr/bin/test -f bar$'\xC3\xA9'; echo $?
1

So there's some inconsistency here too in the builtin test.

Changing the subject here a bit, but getting to the thing that's actually
holding me up now:

$ mkshortcut -n shortcut$'\xC3\xA9' plain; echo $?
0

$ ls | od -c
0000000   s   h   o   r   t   c   u   t 303 203 302 251   .   l   n   k
0000020  \n
0000021

This doesn't seem right.

$ mkshortcut -n shortcut$'\xE9' plain; echo $?
0

$ ls | od -c
0000000   s   h   o   r   t   c   u   t 303 251   .   l   n   k  \n
0000017
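
The two od dumps above are consistent with the name bytes being read as a
single-byte charset and then written out as UTF-8. That is only a guess about
what mkshortcut does here, not something confirmed. A sketch that reproduces
the same bytes, assuming iconv and od are installed (the helper name
latin1_to_utf8_octal is hypothetical; ISO-8859-1 stands in for CP1252, which
agrees with it for the bytes involved):

```shell
# Read raw bytes on stdin, treat them as ISO-8859-1, re-encode as UTF-8,
# and print the result as space-separated octal bytes (as od -c shows them).
latin1_to_utf8_octal() {
    iconv -f ISO-8859-1 -t UTF-8 | od -An -to1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# UTF-8 input (303 251) read as single bytes comes out double-encoded.
printf '\303\251' | latin1_to_utf8_octal; echo
# A lone 0xE9 (351) comes out as well-formed UTF-8 for U+00E9.
printf '\351'     | latin1_to_utf8_octal; echo
```

The first line prints 303 203 302 251 and the second prints 303 251, matching
the two shortcut names listed by ls.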

And then

$ readshortcut shortcut$'\xE9'
/home/dbyron/foo/plain

$ readshortcut shortcut$'\xC3\xA9'
readshortcut: Load failed on
C:\utils\cygwin\home\dbyron\foo\shortcut<accented e>.lnk

where <accented e> is octal 303 251 (0xC3A9)

Am I right to expect readshortcut to read the shortcut when given the UTF-8
encoding in this environment?

Thanks for your help.

-DB

