Date: Wed, 13 May 2009 16:29:53 +0200
From: Corinna Vinschen
To: cygwin AT cygwin DOT com
Subject: Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8
Message-ID: <20090513142953.GI21324@calimero.vinschen.de>
References: <3f0ad08d0905121029j119c8a7ep41d3a261d8bea338 AT mail DOT gmail DOT com> <20090512173741 DOT GZ21324 AT calimero DOT vinschen DOT de>
In-Reply-To: <20090512173741.GZ21324@calimero.vinschen.de>

On May 12 19:37, Corinna Vinschen wrote:
> On May 13 02:29, IWAMURO Motonori wrote:
> > I propose that the filename encoding in C locale uses UTF-8 instead of SO/UTF-8.
> >
> > There are three reasons:
>
> That's an interesting thought.  Do you have a patch and, if so, did you
> try it?  Does it, for instance, help for the issue reported in the
> thread starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html?

After examining the issue Lenik reported in the above thread, I'm at a
loss how to solve this problem in a generic way.

The problem is that the filename changes depending on the character set
used in $LANG.  The reason is that every time a multibyte filename has to
be generated, it has to be converted from UTF-16 to multibyte.

Take, for instance, one of the filenames from Lenik's example.  It's
stored on the filesystem as the UTF-16 sequence \u684c \u9762.
If I set LANG to en_US.UTF-8, a readdir(2) call returns the multibyte
sequence

  0xe6 0xa1 0x8c 0xe9 0x9d 0xa2

If I set LANG to en_US.GBK, `ls' returns the filename

  0xd7 0xc0 0xc3 0xe6

And in case LANG=C, `ls' returns

  0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2

So, depending on the character set setting in the application, the idea
of the filename differs.  That's not exactly helpful for interoperability
between different applications.

I can think of two potential solutions to fix this problem:

(1) Always return filenames in UTF-8 encoding and pretend that UTF-8 is
    the way files are stored on disk.  That results in unchangeable
    filenames which are always valid.

    But what if an application sets LANG="xxxx.SJIS" and tries to create
    a file using SJIS character encoding?  Should the file be created
    using the SJIS->UTF-16 conversion or should open fail with EILSEQ?
    That's not good.

(2) If none of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, then
    Cygwin uses the LC_CTYPE setting which corresponds to the current
    codepage.

    If one of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, Cygwin
    uses that to convert pathnames.

    If the application uses setlocale, Cygwin uses that setting to
    convert pathnames.

    One problem can't be solved this way:  If an application fetches and
    stores a filename, then switches the locale, and then tries to use
    the filename in another system call, the filename is potentially
    broken.

Any better ideas?


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
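
As an illustration only, the three byte sequences quoted above for
\u684c \u9762 can be reproduced with a short Python sketch.  It is not
Cygwin code.  The helper names (hexbytes, so_utf8, roundtrip) are
hypothetical, and so_utf8 merely models the LANG=C output shown in this
mail (0x0e followed by the UTF-8 bytes of each non-ASCII character); how
Cygwin actually implements SO/UTF-8 is not described here.  The roundtrip
helper mimics the remaining problem of option (2): a name fetched under
one character set and reused under another.

```python
# Filename from the thread, given as the UTF-16 code units \u684c \u9762.
NAME = "\u684c\u9762"


def hexbytes(data: bytes) -> str:
    """Format bytes the way they are written in the mail: 0xe6 0xa1 ..."""
    return " ".join(f"0x{b:02x}" for b in data)


def so_utf8(name: str) -> bytes:
    """Hypothetical model of the LANG=C output quoted in the mail.

    ASCII characters pass through unchanged; every other character is
    emitted as 0x0e followed by its UTF-8 bytes.  This is an assumption
    derived from the single example in the mail.
    """
    out = bytearray()
    for ch in name:
        if ord(ch) < 0x80:
            out.append(ord(ch))
        else:
            out.append(0x0E)
            out += ch.encode("utf-8")
    return bytes(out)


def roundtrip(name: str, fetch_charset: str, reuse_charset: str) -> str:
    """Fetch a name under one charset, then interpret it under another."""
    raw = name.encode(fetch_charset)
    return raw.decode(reuse_charset, errors="replace")


if __name__ == "__main__":
    print("UTF-8  :", hexbytes(NAME.encode("utf-8")))
    print("GBK    :", hexbytes(NAME.encode("gbk")))
    print("LANG=C :", hexbytes(so_utf8(NAME)))
    # A name fetched under GBK no longer matches once read as UTF-8.
    print("intact after switch:", roundtrip(NAME, "gbk", "utf-8") == NAME)
```

The same stored name yields three different byte strings, and bytes
fetched under one character set do not decode back to the original name
under another, which is the broken-filename case described under (2).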