Date: Wed, 13 May 2009 17:17:54 +0200
From: Corinna Vinschen
To: cygwin AT cygwin DOT com
Subject: Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8

On May 13 15:54, Andy Koppe wrote:
> > - why do you need to touch the filename at all? I haven't read all of it. Is
> > the UTF-16 on disk and we need to work around UTF-16 being intractable as C
> > string?
>
> Yes. If you simply treated each UTF-16 symbol as two chars, you'd get
> unintended NULs and slashes. For starters, the upper halves of all
> ISO-8859-1 characters are NUL in UTF-16. And even without that, the
> resulting filenames would be completely unusable.

Right. That's the crux of having UTF-16 filenames on disk while
supporting many different multibyte codepages. In contrast to a system
in which the filename is just a byte stream, we have to perform a
widechar to multibyte conversion, and every conversion other than the
one to UTF-8 is lossy.
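
The two points above (stray NULs and slashes in raw UTF-16, and lossy
conversion outside UTF-8) can be checked with a short Python sketch.
This is only an illustration under stated assumptions, not Cygwin's
code: the helper names are hypothetical, ISO-8859-1 stands in for "some
non-UTF-8 codepage", and Python's strict codecs raise an error where a
Windows codepage conversion would typically substitute a replacement
character instead.

```python
def raw_utf16_bytes(name: str) -> bytes:
    """Return the little-endian UTF-16 bytes of a filename (hypothetical helper)."""
    return name.encode("utf-16-le")


def has_c_string_hazards(raw: bytes) -> bool:
    """True if the bytes contain a NUL terminator or a '/' path separator."""
    return b"\x00" in raw or b"/" in raw


def round_trips(name: str, codec: str) -> bool:
    """True if the name survives conversion to `codec` and back unchanged."""
    try:
        return name.encode(codec).decode(codec) == name
    except UnicodeEncodeError:
        # The target charset cannot represent the name at all.
        return False


if __name__ == "__main__":
    # U+00E9 is in ISO-8859-1, so its upper UTF-16 byte is NUL: b'\xe9\x00'
    print(raw_utf16_bytes("\u00e9"))
    # U+012F has 0x2F as its lower byte, which reads as '/': b'/\x01'
    print(raw_utf16_bytes("\u012f"))
    # UTF-8 keeps NUL and '/' out of multibyte sequences and round-trips.
    print(round_trips("日本語.txt", "utf-8"), round_trips("日本語.txt", "iso-8859-1"))
```

In UTF-8, every byte of a multibyte sequence has its high bit set, so a
non-ASCII filename can never introduce a NUL or a slash by accident.
That is why it is usable as a plain C string where raw UTF-16 is not.
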

For the time being, I applied a patch to Cygwin which should ease the
pain. I followed the suggestion to use UTF-8 for internal conversions
when the locale is set to "C". UTF-8 is also used as the default when
converting the Windows environment from UTF-16 to multibyte, unless the
environment contains a valid LC_ALL/LC_CTYPE/LANG setting.

The current working directory could also become unusable if an
application switched the locale. Now the CWD is re-evaluated after a
setlocale call.

I'm sure this change doesn't fix all problems, but it worked much
better in my environment when using Japanese and Chinese characters in
filenames.

There are a few other changes to the Cygwin DLL in the pipeline, but I
will update Cygwin 1.7 at the end of the week.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat