X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Thu, 19 Mar 2009 19:13:23 +0100
From: Corinna Vinschen
To: cygwin AT cygwin DOT com
Subject: Re: Q: Is anybody here using the CYGWIN=codepage:oem setting?
Message-ID: <20090319181323.GB1868@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
References: <20090319130909 DOT GZ9322 AT calimero DOT vinschen DOT de> <49C281F7 DOT 6080602 AT acm DOT org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <49C281F7.6080602@acm.org>
User-Agent: Mutt/1.5.19 (2009-02-20)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
Precedence: bulk
Sender: cygwin-owner AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On Mar 19 10:33, David Rothenberger wrote:
> On 3/19/2009 6:09 AM, Corinna Vinschen wrote:
> > If you've set $LANG to, say, "en_US.UTF-8", Cygwin would use the UTF-8
> > charset *iff* the application switched the codepage by calling something
> > along the lines of `setlocale(LC_ALL, "");'.
> >
> > An application which does not call setlocale (which means, it's not
> > native language aware anyway) would still use the default ANSI codepage.
>
> First, please forgive my ignorance about LC_ALL, LANG, etc.
>
> I ran into an issue yesterday where I was trying to "du -sh" a directory
> that contained files whose names included UTF characters, I think.
> Without CYGWIN=codepage:utf8, this failed. It worked fine when I added
> CYGWIN=codepage:utf8.

Yes, sure.  As described in the User's Guide.  That's exactly what bugs
me right now.  To get UTF-8 support you have to set LANG or LC_ALL or
whatever, *and* CYGWIN=codepage:utf8.
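
To illustrate the setlocale half of that requirement, here's a minimal C
sketch.  The two helper names are made up for illustration, they are not
Cygwin functions.  It relies only on standard C behaviour: a program
starts out in the "C" locale, and only a call along the lines of
`setlocale(LC_ALL, "")` makes it pick up $LANG/$LC_foo from the
environment.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: returns 1 while the process is still in the
 * "C" locale that every C program starts in, 0 otherwise. */
static int in_default_c_locale(void)
{
    /* Passing NULL only queries the current setting, it changes nothing. */
    const char *cur = setlocale(LC_CTYPE, NULL);

    assert(cur != NULL);
    return strcmp(cur, "C") == 0;
}

/* Hypothetical helper: what a "native language aware" application does
 * at startup.  Returns the name of the locale now in effect, or NULL if
 * the environment names a locale the system does not provide. */
static const char *adopt_environment_locale(void)
{
    return setlocale(LC_ALL, "");
}
```

An application that never calls something like
adopt_environment_locale() stays in the "C" locale no matter what $LANG
says, which is the case described in the quoted text above.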

I *think* we can get rid of the codepage setting in favor of the
$LANG/$LC_foo setting, but in that case we couldn't support both ANSI
and OEM codepages anymore.  In the long run I'm looking into not using
the ANSI/OEM codepages at all, though, but instead having real, full
locale support.  But that's a dream of the future.

> So my question is, will this work if codepage is dropped and I set LANG
> to en_US.UTF-8? Is there anything in the Cygwin DLL itself that uses
> codepage that might be valuable to enable even for applications that
> aren't native language aware and don't call setlocale()?

Not exactly.  However, assuming you have a file whose name uses
characters which are not in your current ANSI codeset, you could only
manipulate that file when setting LANG="xx_YY.UTF-8", and only in
applications which call setlocale().

In contrast to UNIX systems, we have the problem that the underlying
filesystems use the UTF-16 charset for filenames.  So we must convert
from whichever singlebyte or multibyte charset is in use to wide
characters.  Other systems don't care, the filename is just a byte
stream.  On Windows, you always have a conversion step which requires
knowing the multibyte character set.  There's no way to convert a wide
character string into a multibyte string without knowing that charset.

Of course, what we could do is to call setlocale from within Cygwin so
we always have a base for the conversion, whether or not the application
calls it again.  In theory this should not affect applications which
don't call setlocale, since these applications behave as they would on
other OSes: they handle the filename as a simple bytestream.

The problem: I'm not really sure calling setlocale in Cygwin is a good
idea.  Maybe there's some downside I just don't see right now.

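
The conversion step can be shown with plain standard C.  This is only a
sketch of the general problem, not how the Cygwin DLL does it
internally, and wide_name_to_multibyte is a made-up helper name.
wcstombs() converts using the charset of the current LC_CTYPE locale,
so the same wide character filename either converts or fails depending
on which charset is in effect.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper, not part of Cygwin: convert a wide character
 * filename to a multibyte string, using whatever charset the current
 * LC_CTYPE locale selects.  Returns the number of bytes written (not
 * counting the terminating NUL), or -1 if a character cannot be
 * represented in that charset or the result does not fit into out. */
static long wide_name_to_multibyte(const wchar_t *wide, char *out,
                                   size_t out_size)
{
    size_t n;

    assert(wide != NULL && out != NULL);
    if (out_size == 0)
        return -1;

    n = wcstombs(out, wide, out_size);
    if (n == (size_t)-1)
        return -1;      /* character not representable in this charset */
    if (n == out_size)
        return -1;      /* buffer filled up, no room left for the NUL */
    return (long)n;
}
```

In the "C" locale a program starts in, a plain ASCII name such as
L"abc" converts fine, while a name containing, say, U+20AC (the euro
sign) is rejected.  After a successful setlocale(LC_ALL, "") with LANG
set to an installed UTF-8 locale, the same name should convert, because
now the target charset is known and can represent the character.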
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
Problem reports:       http://cygwin.com/problems.html
Documentation:         http://cygwin.com/docs.html
FAQ:                   http://cygwin.com/faq/