Date: Wed, 2 Sep 2009 07:29:34 +0100
From: Andy Koppe
To: cygwin AT cygwin DOT com
Subject: Re: The C locale
Message-ID: <416096c60909012329l2f25e735yc07145b8d6698cda@mail.gmail.com>
In-Reply-To: <20090831005258.GG2068@ednor.casa.cgf.cx>
References: <416096c60908300959i1e0084b1xc8f6e65e792b035d AT mail DOT gmail DOT com> <20090831005258 DOT GG2068 AT ednor DOT casa DOT cgf DOT cx>

Christopher Faylor:
>Andy Koppe:
>>Trying to reply to [banned]'s post about locale issues, I got
>>rather confused about the C locale. The manual and the POSIX standard
>>say that it supports ASCII only, so in theory anything above 0x7F
>>should be rejected. In practice though, both Cygwin 1.5 and 1.7 do
>>support characters above 0x7F in the C locale, which could be quite
>>useful. Trouble is, they do so rather inconsistently.
>>
>>Both in 1.5 and 1.7, the mb conversion functions treat such characters
>>as ISO-8859-1. In other words, conversions between chars and wchars are
>>simple casts (except that wchars above 0xFF can't be converted). This
>>makes some sense.
>>
>>Filename handling is different though. Cygwin 1.5 translates filenames
>>according to the system's ANSI codepage. I guess the inconsistency
>>with the mb functions didn't really matter, as the mb functions were
>>pretty much useless anyway, and supporting the system codepage was
>>more important.
>>
>>So, with Cygwin 1.7, I'd have expected filename handling in the C
>>locale to either use ISO-8859-1 for consistency with the mb functions,
>>or the ANSI codepage for compatibility with 1.5. In actual fact
>>though, it uses UTF-8.
>>
>>Is this on purpose? If so, shouldn't the multibyte conversion
>>functions in the C locale use UTF-8 as well?
>
>Since Cygwin has a clear system that it is supposed to be emulating,
>the real question is "What does Linux do?"

I tried it on Debian and SUSE: the multibyte conversion functions are
strict ASCII, i.e. anything beyond 0x7F is considered an encoding
error. POSIX requires that ASCII is supported in the C locale, but does
not actually outlaw ASCII-compatible extensions beyond that.

Locales don't affect filenames on Linux, i.e. any sequence of bytes
passed to open() goes straight to disk (except for the path separator).
This effectively means that filenames are encoded in whatever charset
happened to be active at the time the file was created. Hence anyone
accessing it with a different charset setting will get gibberish.

POSIX is impressively unhelpful on the topic of filenames. All it
guarantees for filenames is the "portable filename character set":
ASCII letters and digits, plus the hyphen, dot, and underscore.

So altogether we've got no fewer than four choices here:
- strict ASCII (as with Linux mb functions)
- ISO-8859-1 (as with newlib mb functions)
- Default Windows ANSI/OEM codepage (as with Cygwin 1.5 filenames)
- UTF-8 (as with Cygwin 1.7 filenames)

In Cygwin 1.5, both file operations and the console use the default
Windows codepage, which often contains all the characters a user cares
about. If you set up readline for 8-bit I/O and change the console font
to something useful, this works reasonably well, including
Cygwin-created filenames showing up correctly in Explorer.
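As a rough illustration of how the four candidate charsets listed above
differ, the following Python sketch decodes the single byte 0xE4 under
each of them. It says nothing about how Cygwin or newlib implement the
conversions. The codec names cp1252 and cp850 are assumptions standing
in for one possible Windows ANSI/OEM codepage pair (Western European);
the actual default depends on the system. The helper decode_byte is
hypothetical.

```python
# Decode one non-ASCII byte under each candidate charset for the C locale.
# Assumption: cp1252 and cp850 stand in for a Western European Windows
# ANSI/OEM codepage pair; other systems default to different codepages.
CANDIDATES = ["ascii", "iso-8859-1", "cp1252", "cp850", "utf-8"]
ERROR = "<encoding error>"


def decode_byte(raw: bytes, charset: str) -> str:
    """Return the decoded text, or ERROR if raw is invalid in charset."""
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        return ERROR


if __name__ == "__main__":
    raw = b"\xe4"  # a-umlaut in ISO-8859-1
    for charset in CANDIDATES:
        print(f"{charset:>10}: {decode_byte(raw, charset)}")
```

Strict ASCII and UTF-8 both reject the lone byte, ISO-8859-1 and cp1252
both yield a-umlaut, and the OEM codepage maps the byte to a different
character altogether.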
A rather important exception is 'ls', which seems to have its own
hardcoded limitation to 7 bits for the C locale: anything non-ASCII is
shown as '?' there. Things do work correctly elsewhere though, e.g. in
bash tab completion or Midnight Commander.

A user with such a setup who upgrades to 1.7 will find that things no
longer work as before, since filenames are translated to UTF-8 whereas
the console now seems to use ISO-8859-1 (presumably via the mb
functions) by default. Hence a file called 'bäh' in Explorer (with
a-umlaut in the middle) will show as 'bÃ¤h' instead. And if you try to
create 'bäh' in Cygwin 1.7, you actually get a file called 'b', because
the 'ä' (0xE4) in ISO-8859-1 turns into an encoding error when
interpreted as UTF-8, and the name simply seems to be truncated at that
point.

I see two good solutions:
- Use the default Windows codepage for filenames, console, and
  multibyte functions. This is what happens already if you specify a
  locale with a language but no charset, e.g. "en". Maximum 1.5
  compatibility.
- Use UTF-8 throughout. Full Unicode support out of the box.

And a cheap'n'nasty one:
- Restrict the multibyte functions and console to 7-bit ASCII. Still
  means it's inconsistent with the filename conversions, but at least
  non-ASCII characters wouldn't show up wrongly. Instead, they wouldn't
  show at all.

Andy
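
The 'bäh' mismatch described above can be reproduced at the byte level
outside Cygwin. The Python sketch below only illustrates the effect of
mixing UTF-8 filenames with an ISO-8859-1 console; it does not reflect
Cygwin's actual code paths. Both helper names are hypothetical, and the
truncation step merely imitates the observed behaviour of cutting the
name at the first invalid byte.

```python
def shown_by_console(name: str) -> str:
    """Filename stored as UTF-8, then displayed by an ISO-8859-1 console."""
    return name.encode("utf-8").decode("iso-8859-1")


def created_name(typed: str) -> str:
    """Name typed as ISO-8859-1, then interpreted as UTF-8.

    Imitates the observed truncation: keep only the bytes before the
    first one that is not valid UTF-8.
    """
    raw = typed.encode("iso-8859-1")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first undecodable byte.
        return raw[:err.start].decode("utf-8")


if __name__ == "__main__":
    print(shown_by_console("b\u00e4h"))  # the existing file as listed
    print(created_name("b\u00e4h"))      # the name that ends up on disk
```

The first call turns the two UTF-8 bytes of a-umlaut (0xC3 0xA4) into
two separate ISO-8859-1 characters; the second stops at the lone 0xE4,
leaving only the leading 'b'.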