To: cygwin@cygwin.com
From: Lapo Luchini
Subject: Re: The C locale
Date: Tue, 22 Sep 2009 10:43:04 +0200
Content-Type: text/plain; charset=UTF-8

Andy Koppe wrote:
> No, it isn't. UTF-16 filename characters that can't be represented in
> the current charset are encoded by a ^N followed by the character's
> UTF-8 representation.

OK, right.

> For example, a Windows filename "bäh" turns into "bŤh" in the C locale,
> while it shows up correctly with explicitly set ISO-8859-1 or CP1252.

Uh?
Doesn't seem so to me: if I create "bäh" in Windows Explorer, then open
up a UTF-8 mintty console, I get consistent output with both LANG=C and
LANG=it_IT.UTF-8 (of course, since right now C is UTF-8):

% LANG=C ls -l|egrep b.h
-rw-r--r-- 1 lapo None 0 Sep 22 09:53 bäh
% LANG=it_IT.UTF-8 ls -l|egrep b.h
-rw-r--r-- 1 lapo None 0 22 Sep 09:53 bäh

So I'm not sure what you mean by 'a Windows filename "bäh" turns into
"bŤh" in the C locale'... do you mean that a script sees it as 62C3A468
as opposed to 62E468? Or that an actual "bŤh" is shown somewhere?
"bŤh" is just a representation, and it depends on the charset the
console expects (and in fact in this UTF-8-encoded message it will
probably be represented as 62C385C2A468)... if the console is UTF-8,
what's currently shown is what I'd expect.

If OTOH we're talking about what it is in raw form and not about what
is shown (i.e. about a "3 bytes" vs. a "4 bytes" string), well, that's
a different issue, and I'm not sure why a program should prefer a
3-byte representation as opposed to a 4-byte one...?
But OTOH, as far as "not caring" goes, it sure can be a nice feature to
be backward compatible in that single case, since the behavior is not
well-defined anyway...

But again, if a script creates a filename that happens to contain
Japanese characters (or even umlauts or r-quotes/l-quotes), I would
expect to see that on the filesystem too, and not some random-looking
escape sequence...

> Btw, are you actually using the C locale?

Not usually, but it happens from time to time (mostly in scripts, or in
cases such as the monotone "make check" unit tests; one of them, which
tries to create UTF-8 filenames and then ISO-8859-1 filenames,
currently fails).

-- 
Lapo Luchini - http://lapo.it/

“Endure. In enduring, grow strong.” (Dak'kon, videogame "Torment", 1999)
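
P.S. To make the "3 bytes" vs. "4 bytes" point concrete, here is a
minimal Python sketch. It only illustrates the two byte sequences
mentioned above (62C3A468 and 62E468) and what the UTF-8 bytes look
like when read as ISO-8859-1. It says nothing about how Cygwin itself
converts filenames, and the helper name hex_bytes is made up for the
example.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the uppercase hex dump of `text` encoded with `encoding`."""
    return text.encode(encoding).hex().upper()


# "bäh", written with an escape so this file's own encoding doesn't matter.
name = "b\u00e4h"

utf8_hex = hex_bytes(name, "utf-8")         # 4 bytes: ä becomes C3 A4
latin1_hex = hex_bytes(name, "iso-8859-1")  # 3 bytes: ä becomes E4

# What a consumer expecting ISO-8859-1 makes of the UTF-8 bytes:
# each of the two bytes C3 and A4 is taken as a separate character.
misread = name.encode("utf-8").decode("iso-8859-1")

# ascii() keeps the output printable whatever the console charset is.
print(utf8_hex, latin1_hex, ascii(misread))
```

Run as-is, this prints: 62C3A468 62E468 'b\xc3\xa4h'. The same name is
4 bytes in UTF-8 and 3 bytes in ISO-8859-1, and what gets displayed
depends only on which charset the reader of those bytes assumes.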