Date: Tue, 16 Mar 2010 20:15:47 +0000
Subject: Re: filenames with characters that have the high bit set
From: Andy Koppe
To: dbyron AT dbyron DOT com, cygwin AT cygwin DOT com

David Byron:
>> > And my ~/.inputrc contains:
>> >
>> > set meta-flag on
>> > set convert-meta off
>> > set input-meta on
>> > set output-meta on
>>
>> Makes plenty of sense. But note that meta-flag is a synonym for
>> input-meta, so you can remove one of them.
>
> I was just following the instructions at
> http://cygwin.com/faq/faq-nochunks.html#faq.using.unicode

I see. FAQ maintainers, can we have the meta-flag removed?

[time passes]

Actually, it appears that bash/readline automatically sets those flags
as shown if the locale is anything but "C". So since the default
locale is "C.UTF-8" and non-ASCII stuff can't be expected to work in
the "C" locale anyway, I think the whole FAQ entry could just be
removed. Similarly, the commented out settings of those flags in
/etc/skel/.inputrc could go.

>> > $ echo $LC_ALL
>> > en_US
>>
>> Hang on, where did that come from?
>
> When my cygwin.bat has set LANG=en_US.UTF-8, I get LANG=en_US.UTF-8 and
> LC_ALL=en_US in bash.  When my cygwin.bat doesn't set LANG, I get
> LC_ALL=en_US and LANG isn't set.

So where does LC_ALL get set? In the system-wide environment (in
Computer->Properties->Advanced->Environment Variables)? Or in one of the
bash startup files?

> I unset LC_ALL and...

Where? I'm asking because if it's set to 'en_US' at the point bash is
invoked, but unset afterwards, then bash will be using CP1252 while
programs invoked by it will use UTF-8, which of course is bound to
cause trouble ...

> Now ls foo adds the actual accented character to the command line, but
> when I press return I get:
>
> ls: cannot access foo: No such file or directory

... like that ...

> when I pipe the error message to od -c, the gray box is octal 351 or 0xE9.
>
> I still get the right answer from test -f, when using the shell builtin.
> /usr/bin/test tells me the file doesn't exist.

... and that.

>> The \x18 scheme is only used for codepoints that can not be
>> represented in the selected character set, yet U+00E9 can be
>> represented in CP1252. By definition, any Unicode codepoint can be
>> represented in UTF-8, so the \x18 scheme is never used when that is
>> selected.
>>
>> To enable C-style backslash interpretation, you need to use
>> $'...' quoting.
>
> I now see the bash man page explains this.  Must have missed it the first
> time.  The above paragraphs with some examples (where \x18 is needed and
> where it isn't) added to
> http://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-unusual
> would have gotten me farther before posting.

But what I said is explained there already:

"If you don't want or can't use UTF-8 as character set for whatever
reason, you will nevertheless be able to access the file. How does that
work? When Cygwin converts the filename from UTF-16 to your character
set, it recognizes characters which can't be converted.
If that occurs, Cygwin replaces the non-convertible character with a
special character sequence. The sequence starts with an ASCII CAN
character (hex code 0x18, equivalent Control-X), followed by the UTF-8
representation of the character. The result is a filename containing
some ugly looking characters. While it doesn't look nice, it is nice,
because Cygwin knows how to convert this filename back to UTF-16. The
filename will be converted using your usual character set. However,
when Cygwin recognizes an ASCII CAN character, it skips over the ASCII
CAN and handles the following bytes as a UTF-8 character. Thus, the
filename is symmetrically converted back to UTF-16 and you can access
the file."

Best to use UTF-8, though, and forget that you've ever heard about the
^X scheme. You're certainly not expected to have to enter \x18 on the
command line to access non-ASCII filenames.

>> Have a look in your root directory. There should be a file
>> called x18 there.
>
> I don't see anything in my cygwin root (/) but I do see x18 in the root of
> my C drive.  Thanks.

Ah yes, '\x18' is interpreted as a DOS path, so you get the root of
your system drive rather than the Cygwin root.

> And finally here are the steps that illustrate what's going on.
>
> $ touch $'\x18'; echo $?
> 0
>
> ls shows a file named up-arrow (0x18):

What do you mean by up-arrow? I'm getting a question mark, because
that's what ls prints for non-printable characters by default. You can
choose various quoting styles using the --quoting-style option.

> $ ls
> ^X
>
> which seems inconsistent.

Yep, but that's a bash vs ls issue rather than a Cygwin one. You'd get
the same on Linux. But if you use control characters in filenames, you'd
better know what you're doing anyway. Some argue that it shouldn't be
allowed in the first place, e.g.
http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html

> $ mkshortcut -n shortcut$'\xC3\xA9' plain; echo $?
> $ readshortcut shortcut$'\xE9'

I'm afraid these aren't yet Unicode-ready, i.e. they still use Windows
"ANSI" APIs.

Andy
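
For readers who want to see the ^X scheme from the User's Guide passage
quoted above in concrete terms, here is a rough Python sketch. It is an
illustration only, not Cygwin's actual conversion code: the function
names encode_name, decode_name and utf8_length are made up for this
example, Python's "cp1252" codec stands in for the selected character
set, and only single-byte character sets are handled.

```python
CAN = 0x18  # ASCII CAN (Control-X), the escape marker from the User's Guide


def encode_name(name: str, charset: str) -> bytes:
    """Encode a filename, escaping characters the charset cannot represent."""
    out = bytearray()
    for ch in name:
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            # Not representable: emit CAN followed by the UTF-8 form.
            out.append(CAN)
            out += ch.encode("utf-8")
    return bytes(out)


def utf8_length(lead: int) -> int:
    """Length of a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError(f"not a UTF-8 lead byte: {lead:#x}")


def decode_name(raw: bytes, charset: str) -> str:
    """Reverse encode_name (assumes a single-byte charset such as CP1252).

    A filename that really contains a CAN character followed by other
    bytes is ambiguous here; this sketch does not try to resolve that.
    """
    out = []
    i = 0
    while i < len(raw):
        if raw[i] == CAN and i + 1 < len(raw):
            # Skip the CAN and read the next bytes as one UTF-8 character.
            n = utf8_length(raw[i + 1])
            out.append(raw[i + 1 : i + 1 + n].decode("utf-8"))
            i += 1 + n
        else:
            out.append(raw[i : i + 1].decode(charset))
            i += 1
    return "".join(out)
```

With CP1252 selected, U+00E9 is encoded directly as the single byte
0xE9, so no ^X is involved. A character that CP1252 lacks, such as
U+03BB, becomes 0x18 followed by its UTF-8 bytes. With UTF-8 selected,
the escape never occurs.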