From: "David Byron"
To: "'Andy Koppe'"
Subject: RE: filenames with characters that have the high bit set
Date: Tue, 16 Mar 2010 09:58:02 -0700

> > And my ~/.inputrc contains:
> >
> > set meta-flag on
> > set convert-meta off
> > set input-meta on
> > set output-meta on
>
> Makes plenty of sense. But note that meta-flag is a synonym for
> input-meta, so you can remove one of them.

I was just following the instructions at
http://cygwin.com/faq/faq-nochunks.html#faq.using.unicode

> > $ echo $LC_ALL
> > en_US
>
> Hang on, where did that come from?

When my cygwin.bat has set LANG=en_US.UTF-8, I get LANG=en_US.UTF-8
and LC_ALL=en_US in bash. When my cygwin.bat doesn't set LANG, I get
LC_ALL=en_US and LANG isn't set.

> LC_ALL overrides any other locale variables including
> LANG. Specifying a locale without a charset means that
> Cygwin 1.7.1 looks up your ANSI codepage. Assuming you're
> on a US system, this means you're getting CP1252, not
> UTF-8. (Note besides: Cygwin 1.7.2 changes to a
> Linux-compatible scheme for locales without explicit
> charset instead, where you'd get ISO-8859-1 instead.)

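
To make the precedence described in the quoted paragraph concrete, here is a minimal sketch. The function name effective_ctype is hypothetical (it is not part of bash or Cygwin); it only mirrors the POSIX rule that LC_ALL wins over LC_CTYPE, which in turn wins over LANG. The three values are passed as arguments, so the sketch does not change the locale of the shell that runs it.

```shell
# Hypothetical helper: report which variable decides the character-type
# locale, following the POSIX precedence LC_ALL > LC_CTYPE > LANG.
# Usage: effective_ctype "$LC_ALL" "$LC_CTYPE" "$LANG"
effective_ctype() {
    if [ -n "$1" ]; then
        printf 'LC_ALL=%s\n' "$1"
    elif [ -n "$2" ]; then
        printf 'LC_CTYPE=%s\n' "$2"
    elif [ -n "$3" ]; then
        printf 'LANG=%s\n' "$3"
    else
        printf 'unset\n'
    fi
}

# The situation from the thread: LC_ALL=en_US hides LANG=en_US.UTF-8.
effective_ctype "en_US" "" "en_US.UTF-8"

# After "unset LC_ALL", LANG takes effect.
effective_ctype "" "" "en_US.UTF-8"
```

In a live shell, calling effective_ctype "$LC_ALL" "$LC_CTYPE" "$LANG" shows which variable is in charge, and the standard command "locale charmap" reports the charset that results from it.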
I unset LC_ALL and...

> > $ ls foo<TAB>
> >
> > adds the actual accented character to the command line
> > (whether set show-all-if-ambiguous on is in ~/.inputrc
> > or not). Then I press return and ls prints the
> > filename.

Now ls foo<TAB> adds the actual accented character to the command
line, but when I press return I get:

ls: cannot access foo: No such file or directory

When I pipe the error message to od -c, the gray box is octal 351 or
0xE9.

> > Then if I go through command history and change "ls" to
> > "test -f" and add the "; echo $?" I get the right answer
> > from test.

I still get the right answer from test -f when using the shell
builtin. /usr/bin/test tells me the file doesn't exist. I think I can
make the above clearer with the steps below.

> The \x18 scheme is only used for codepoints that can not be
> represented in the selected character set, yet U+00E9 can be
> represented in CP1252. By definition, any Unicode codepoint can be
> represented in UTF-8, so the \x18 scheme is never used when that is
> selected.
>
> To enable C-style backslash interpretation, you need to use
> $'...' quoting.

I now see the bash man page explains this. Must have missed it the
first time. The above paragraphs with some examples (where \x18 is
needed and where it isn't) added to
http://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-unusual
would have gotten me farther before posting.

> > $ touch "\x18"; echo $?
> > 0
>
> Have a look in your root directory. There should be a file
> called x18 there.

I don't see anything in my cygwin root (/) but I do see x18 in the
root of my C drive. Thanks.

> > Can someone give me a hand coming up with a command line
> > where I can build up filenames that contain characters
> > that have the high bit set (as well as any non-ascii
> > character really)?
>
> Just type them in. The 'US International' keyboard layout might be
> useful here. See
> http://en.wikipedia.org/wiki/Keyboard_layout#US-International.
>
> Otherwise, use $'...', and lose the unnecessary \x18s.

And finally here are the steps that illustrate what's going on.

$ touch $'\x18'; echo $?
0

ls shows a file named up-arrow (0x18):

$ ls $'\x18' | od -c
0000000 030  \n
0000002

but if I type

$ ls ^X

which seems inconsistent.

Now for more interesting tests. In an empty directory:

$ touch foo$'\xC3\xA9'; echo $?
0
$ touch bar$'\xE9'; echo $?
0
$ ls | od -c
0000000   b   a   r 351  \n   f   o   o 303 251  \n
0000013

$ ls foo<TAB> (displays foo with an accented e)
ls: : No such file or directory
$ ls bar<TAB> (displays bar with an accented e)
bar
$ ls bar$'\xE9'
bar
$ ls foo$'\xE9'
ls: : No such file or directory

where the gray box is octal 351 (0xE9)

$ ls foo$'\xC3\xA9'
foo
$ ls bar$'\xC3\xA9'
ls: cannot access bar: No such file or directory

where the gray box is octal 303 251 (0xC3A9)

All of the above sort of makes sense, though it sort of seems like
both \xE9 and \xC3\xA9 could work to find both foo and bar.

$ type test
test is a shell builtin
$ test -f foo$'\xC3\xA9'; echo $?
1
$ test -f bar$'\xE9'; echo $?
1

The builtin test doesn't seem to be doing the right thing here.

$ /usr/bin/test -f foo$'\xC3\xA9'; echo $?
0
$ /usr/bin/test -f bar$'\xE9'; echo $?
0

And then using the wrong encoding:

$ test -f foo$'\xE9'; echo $?
0
$ test -f bar$'\xC3\xA9'; echo $?
1
$ /usr/bin/test -f foo$'\xE9'; echo $?
1
$ /usr/bin/test -f bar$'\xC3\xA9'; echo $?
1

So there's some inconsistency here too in the builtin test.

Changing the subject here a bit, but getting to the thing that's
actually holding me up now:

$ mkshortcut -n shortcut$'\xC3\xA9' plain; echo $?
0
$ ls | od -c
0000000   s   h   o   r   t   c   u   t 303 203 302 251   .   l   n   k
0000020  \n
0000021

This doesn't seem right.

$ mkshortcut -n shortcut$'\xE9' plain; echo $?
0
$ ls | od -c
0000000   s   h   o   r   t   c   u   t 303 251   .   l   n   k  \n
0000017

And then

$ readshortcut shortcut$'\xE9'
/home/dbyron/foo/plain
$ readshortcut shortcut$'\xC3\xA9'
readshortcut: Load failed on C:\utils\cygwin\home\dbyron\foo\shortcut.lnk

where the gray box is octal 303 251 (0xC3A9)

Am I right to expect readshortcut to read the shortcut when given the
UTF-8 encoding in this environment?

Thanks for your help.

-DB
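
The byte sequences in the od listings of this message can be reproduced without touching the filesystem. The sketch below assumes iconv is installed; the helper names octal_bytes and latin1_to_utf8 are hypothetical. It uses ISO-8859-1, which agrees with CP1252 for the three bytes involved (0xE9, 0xC3, 0xA9). The second call yields the same four bytes as the first mkshortcut listing, which is consistent with the UTF-8 name having been read as a single-byte charset and converted to UTF-8 a second time, though the sketch does not prove that this is what mkshortcut does.

```shell
# Hypothetical helper: print stdin as space-separated octal byte values.
octal_bytes() {
    # Word splitting collapses od's padding into single spaces.
    set -- $(od -An -t o1)
    echo "$*"
}

# Hypothetical helper: treat stdin as ISO-8859-1 and re-encode as UTF-8.
latin1_to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8
}

# Byte 351 (0xE9) read as ISO-8859-1 is U+00E9; UTF-8 stores it as 303 251.
printf '\351' | latin1_to_utf8 | octal_bytes

# Bytes 303 251 are already UTF-8 for U+00E9. Reading them as ISO-8859-1
# and converting again gives 303 203 302 251.
printf '\303\251' | latin1_to_utf8 | octal_bytes
```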