Mail Archives: cygwin/2010/03/16/15:16:00

In-Reply-To: <6C05DF4D85804B3A865E7FE549B0475E@pleaset>
References: <493F5820D3F64434A76F433604C79D4A AT pleaset> <416096c61003160019p24e58433x4a969c0f99068fa6 AT mail DOT gmail DOT com> <6C05DF4D85804B3A865E7FE549B0475E AT pleaset>
Date: Tue, 16 Mar 2010 20:15:47 +0000
Message-ID: <416096c61003161315p504dff5dn7d1e847db01754c8@mail.gmail.com>
Subject: Re: filenames with characters that have the high bit set
From: Andy Koppe <andy DOT koppe AT gmail DOT com>
To: dbyron AT dbyron DOT com, cygwin AT cygwin DOT com

David Byron:
>> > And my ~/.inputrc contains:
>> >
>> > set meta-flag on
>> > set convert-meta off
>> > set input-meta on
>> > set output-meta on
>>
>> Makes plenty of sense. But note that meta-flag is a synonym for
>> input-meta, so you can remove one of them.
>
> I was just following the instructions at
> http://cygwin.com/faq/faq-nochunks.html#faq.using.unicode

I see. FAQ maintainers, can we have the meta-flag removed?

[time passes]

Actually, it appears that bash/readline automatically sets those flags
as shown if the locale is anything but "C". So since the default locale
is "C.UTF-8" and non-ASCII stuff can't be expected to work in the "C"
locale anyway, I think the whole FAQ entry could just be removed.

Similarly, the commented-out settings of those flags in
/etc/skel/.inputrc could go.


>> > $ echo $LC_ALL
>> > en_US
>>
>> Hang on, where did that come from?
>
> When my cygwin.bat has set LANG=en_US.UTF-8, I get LANG=en_US.UTF-8 and
> LC_ALL=en_US in bash.  When my cygwin.bat doesn't set LANG, I get
> LC_ALL=en_US and LANG isn't set.

So where does LC_ALL get set? In the system-wide environment (in
Computer->Properties->Advanced->Environment Variables)? Or in one of
the bash startup files?

> I unset LC_ALL and...

Where? I'm asking because if it's set to 'en_US' at the point bash is
invoked, but unset afterwards, then bash will be using CP1252 while
programs invoked by it will use UTF-8, which of course is bound to
cause trouble ...
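
For reference, the usual POSIX precedence among these variables can be
sketched as below. The function name effective_ctype is made up for
illustration, and the final fallback to "C" is an assumption of the
sketch (the default on Cygwin is "C.UTF-8", as mentioned above). It
shows why LC_ALL=en_US wins over a LANG that names UTF-8.

```shell
# Hypothetical helper: report which setting decides the character set.
# POSIX order: LC_ALL first, then LC_CTYPE, then LANG; empty counts as unset.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

# The situation from the report: LANG names UTF-8, but LC_ALL overrides it.
with_lc_all=$(LC_ALL=en_US LC_CTYPE= LANG=en_US.UTF-8 effective_ctype)

# With LC_ALL out of the way, LANG (and its UTF-8 charset) takes effect.
without_lc_all=$(LC_ALL= LC_CTYPE= LANG=en_US.UTF-8 effective_ctype)

echo "with LC_ALL:    $with_lc_all"
echo "without LC_ALL: $without_lc_all"
```

Since "en_US" carries no explicit charset, the charset then comes from
the locale's default rather than from LANG.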


> Now ls foo<tab> adds the actual accented character to the command line, but
> when I press return I get:
>
> ls: cannot access foo<a gray box>: No such file or directory

... like that ...

> when I pipe the error message to od -c, the gray box is octal 351 or 0xE9.
>
> I still get the right answer from test -f, when using the shell builtin.
> /usr/bin/test tells me the file doesn't exist.

... and that.
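
A minimal sketch of the byte-level mismatch behind both symptoms,
assuming the accented character is é (U+00E9), which matches the octal
351 reported by od. The variable names are just for illustration. The
same character is one byte in CP1252 but two bytes in UTF-8, so a name
produced under one charset does not match under the other.

```shell
# é as a CP1252 (or ISO-8859-1) byte: octal 351, i.e. 0xE9.
cp1252_bytes=$(printf '\351' | od -An -tx1 | tr -d ' \n')

# é as UTF-8: the two bytes 0xC3 0xA9 (octal 303 251).
utf8_bytes=$(printf '\303\251' | od -An -tx1 | tr -d ' \n')

echo "CP1252: $cp1252_bytes"
echo "UTF-8:  $utf8_bytes"
```

A lone 0xE9 is not a valid UTF-8 sequence, which would be consistent
with a program running in a UTF-8 locale showing it as a gray box.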


>> The \x18 scheme is only used for codepoints that cannot be
>> represented in the selected character set, yet U+00E9 can be
>> represented in CP1252. By definition, any Unicode codepoint can be
>> represented in UTF-8, so the \x18 scheme is never used when that is
>> selected.
>>
>> To enable C-style backslash interpretation, you need to use
>> $'...' quoting.
>
> I now see the bash man page explains this.  Must have missed it the first
> time.  The above paragraphs with some examples (where \x18 is needed and
> where it isn't) added to
> http://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-unusual
> would have gotten me farther before posting.

But what I said is explained there already:

"If you don't want or can't use UTF-8 as character set for whatever
reason, you will nevertheless be able to access the file. How does
that work? When Cygwin converts the filename from UTF-16 to your
character set, it recognizes characters which can't be converted. If
that occurs, Cygwin replaces the non-convertible character with a
special character sequence. The sequence starts with an ASCII CAN
character (hex code 0x18, equivalent Control-X), followed by the UTF-8
representation of the character. The result is a filename containing
some ugly looking characters. While it doesn't look nice, it is nice,
because Cygwin knows how to convert this filename back to UTF-16. The
filename will be converted using your usual character set. However,
when Cygwin recognizes an ASCII CAN character, it skips over the ASCII
CAN and handles the following bytes as a UTF-8 character. Thus, the
filename is symmetrically converted back to UTF-16 and you can access
the file."

Best to use UTF-8, though, and forget that you've ever heard about the
^X scheme. You're certainly not expected to have to enter \x18 on the
command line to access non-ASCII filenames.
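
Purely to illustrate the byte layout the quoted paragraph describes
(this is not Cygwin's code), here is what a hypothetical file name
"foo" followed by Greek alpha (U+03B1) would look like under a CP1252
locale. Alpha has no CP1252 encoding, so per the description it would
show up as CAN (0x18) followed by its UTF-8 bytes 0xCE 0xB1.

```shell
# "foo" + CAN (octal 030) + UTF-8 encoding of U+03B1 (octal 316 261).
# The file name is an invented example, not taken from the thread.
escaped_name_hex=$(printf 'foo\030\316\261' | od -An -tx1 | tr -d ' \n')

echo "$escaped_name_hex"   # 666f6f18ceb1
```

By contrast, é (U+00E9) does exist in CP1252, so no CAN prefix would be
involved there; it would simply be the single byte 0xE9.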


>> Have a look in your root directory. There should be a file
>> called x18 there.
>
> I don't see anything in my cygwin root (/) but I do see x18 in the root of
> my C drive.  Thanks.

Ah yes, '\x18' is interpreted as a DOS path, so you get the root of
your system drive rather than the Cygwin root.
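
The difference between the two quoting forms can be checked without
touching the file system. This sketch assumes bash (the $'...' form is
not available in every sh) and only dumps the bytes each form produces;
the variable names are illustrative.

```shell
# Plain single quotes: the four literal characters \ x 1 8.
plain_hex=$(printf '%s' '\x18' | od -An -tx1 | tr -d ' \n')

# ANSI-C quoting: bash decodes \x18 into the single CAN byte.
ansi_hex=$(printf '%s' $'\x18' | od -An -tx1 | tr -d ' \n')

echo "plain: $plain_hex"   # 5c783138
echo "ansi:  $ansi_hex"    # 18
```

The leading backslash in the plain form is what makes the argument look
like a DOS-style path.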


> And finally here are the steps that illustrate what's going on.
>
> $ touch $'\x18'; echo $?
> 0
>
> ls shows a file named up-arrow (0x18):

What do you mean by up-arrow? I'm getting a question mark, because
that's what ls prints for non-printable characters by default. You can
choose various quoting styles using the --quoting-style option.
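
A small sketch of two of those styles, assuming GNU ls from coreutils
(other ls implementations may lack --quoting-style). It works in a
throwaway directory so no stray control-character file is left behind.
The -q option is given explicitly because ls only substitutes question
marks by default when writing to a terminal.

```shell
# Scratch directory holding one file whose name is the CAN byte (octal 030).
tmpdir=$(mktemp -d)
touch "$tmpdir/$(printf '\030')"

# -q replaces non-printable characters with a question mark.
hidden=$(ls -q "$tmpdir")

# The "escape" style prints them as backslash escapes instead.
escaped=$(ls --quoting-style=escape "$tmpdir")

echo "hidden:  $hidden"
echo "escaped: $escaped"

rm -rf "$tmpdir"
```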

> $ ls<tab>
> ^X
>
> which seems inconsistent.

Yep, but that's a bash vs ls issue rather than a Cygwin one. You'd get
the same on Linux. But if you use control characters in filenames, you'd
better know what you're doing anyway. Some argue that they shouldn't be
allowed in the first place, e.g.
http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html


> $ mkshortcut -n shortcut$'\xC3\xA9' plain; echo $?
> $ readshortcut shortcut$'\xE9'

I'm afraid these aren't yet Unicode-ready, i.e. they still use Windows
"ANSI" APIs.

Andy

