Mail Archives: cygwin/2009/05/13/11:39:55

From: "Jason Pyeron" <jpyeron AT pdinc DOT us>
To: <cygwin AT cygwin DOT com>
References: <3f0ad08d0905121029j119c8a7ep41d3a261d8bea338 AT mail DOT gmail DOT com> <20090512173741 DOT GZ21324 AT calimero DOT vinschen DOT de> <20090513142953 DOT GI21324 AT calimero DOT vinschen DOT de>
Subject: RE: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8
Date: Wed, 13 May 2009 11:41:14 -0400
Message-ID: <FF12C2AA0B1E4F1489F94CD16C0B0728@phoenix>
MIME-Version: 1.0
In-Reply-To: <20090513142953.GI21324@calimero.vinschen.de>

Corinna Vinschen wrote on Wednesday, May 13, 2009 10:30:
> On May 12 19:37, Corinna Vinschen wrote:
>> On May 13 02:29, IWAMURO Motonori wrote:
>>> I propose that the filename encoding in C locale uses UTF-8 instead
>>> of SO/UTF-8. 
>>> 
>>> There are three reasons:
>> 
>> That's an interesting thought.  Do you have a patch and, if so, did
>> you try it?  Does it, for instance, help for the issue reported in
>> the thread starting at
>> http://cygwin.com/ml/cygwin/2009-05/msg00245.html?
> 
> After examining the issue Lenik reported in the above thread,
> I'm at a loss how to solve this problem in a generic way.
> 

I may be dense, as all of my internationalization experience is from the late
'90s. But in my experience the only solution for this is a conscious effort on
the part of the user (or admin).

> The problem is that the filename changes dependent on the
> character set used in $LANG.  The reason is that every time a
> multibyte filename has to be generated, it has to be
> converted from UTF-16 to multibyte.
> 
> For instance, taking one of the filename from Lenik's
> example.  It's stored on the filesystem as the UTF-16
> sequence \u684c \u9762.  If I set LANG to en_US.UTF-8, a
> readdir(2) call returns the multibyte sequence
> 
>  0xe6 0xa1 0x8c 0xe9 0x9d 0xa2
> 
> If I set LANG to en_US.GBK, `ls' returns the filename
> 
>  0xd7 0xc0 0xc3 0xe6
> 
> And in case LANG=C, `ls' returns
> 
>  0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2
> 
> So, dependent on the character set setting in the
> application, the idea of the filename differs.  That's not
> exactly helpful for interoperability between different applications.
> 
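
For anyone who wants to reproduce the three byte sequences quoted above, here
is a minimal Python sketch. The UTF-8 and GBK results come from Python's
standard codecs. The so_utf8() helper is hypothetical: it only mimics the
quoted LANG=C output by putting SO (0x0e) in front of each non-ASCII
character's UTF-8 bytes, and is not taken from Cygwin's source.

```python
# The filename from the quoted example: U+684C U+9762.
name = "\u684c\u9762"


def hexbytes(data: bytes) -> str:
    """Format bytes the way they are written in the quoted message."""
    return " ".join(f"0x{b:02x}" for b in data)


def so_utf8(text: str) -> bytes:
    """Hypothetical helper: prefix each non-ASCII character's UTF-8
    bytes with SO (0x0e), mirroring the quoted LANG=C output only."""
    out = bytearray()
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:
            out.append(0x0E)
        out += encoded
    return bytes(out)


utf8_name = name.encode("utf-8")
gbk_name = name.encode("gbk")
so_name = so_utf8(name)

print("UTF-8:   ", hexbytes(utf8_name))
print("GBK:     ", hexbytes(gbk_name))
print("SO/UTF-8:", hexbytes(so_name))
```

The same two characters come out as three unrelated byte strings, which is
why applications running under different locales disagree about the name.
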
> I can think of two potential solutions to fix this problem:
> 
> (1) Always return filenames in UTF-8 encoding and pretend that UTF-8
>     is the way files are stored on disk.  That results in unchangeable
>     filenames which are always valid.
> 
>     But what if an application sets LANG="xxxx.SJIS" and tries to
>     create a file using SJIS character encoding?  Should the file be
>     created using the SJIS->UTF-16 conversion or should open fail
>     with EILSEQ?  That's not good.
> 
> (2) If none of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, then
>     Cygwin uses the LC_CTYPE setting which corresponds to the current
>     codepage.  If one of $LC_ALL/$LC_CTYPE/$LANG is set in the
>     environment,

If nothing is set, use UTF-8, as it will work with existing code.

>     Cygwin uses that to convert pathnames.  If the application uses
>     setlocale, Cygwin uses that setting to convert pathnames.
> 
>     One problem can't be solved this way:  If an application fetches
>     and stores a filename, then switches the locale, and then tries
>     to use the filename in another system call, the filename is
>     potentially broken.

This is the user's problem to resolve.
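
To make the breakage concrete, here is a small Python sketch of what happens
when filename bytes produced under one character set are reinterpreted under
another. It uses only Python's standard codecs and does not call into Cygwin.
The variable names are made up for the illustration.

```python
# The filename from the quoted example: U+684C U+9762.
name = "\u684c\u9762"

# Bytes an application would have stored while its locale was GBK.
stored_gbk = name.encode("gbk")

# After switching to a UTF-8 locale, the stored bytes are no longer
# a valid multibyte sequence.
try:
    stored_gbk.decode("utf-8")
    gbk_valid_as_utf8 = True
except UnicodeDecodeError:
    gbk_valid_as_utf8 = False

# The opposite direction: UTF-8 bytes read under a GBK locale do not
# give the original name back either.
stored_utf8 = name.encode("utf-8")
misread = stored_utf8.decode("gbk", errors="replace")

print("GBK bytes valid as UTF-8:", gbk_valid_as_utf8)
print("UTF-8 bytes read as GBK match the name:", misread == name)
```

Only an application that re-fetches the name after the switch, or keeps track
of which encoding each stored name used, avoids this.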

> 
> Any better ideas?
> 

Not necessarily better, but here is a chart:

Sys     App     Function expects/returns
NULL    NULL    UTF-8
C/UA    NULL    UTF-8
NULL    C/UA    UTF-8
C/UA    C/UA    UTF-8
SPEC    NULL    System Locale
SPEC    C/UA    UTF-8
NULL    SPEC    Application Locale
C/UA    SPEC    Application Locale
SPEC    SPEC    Application Locale


Key:

Sys  = System's current locale
App  = Application's current locale
NULL = No setting
C/UA = C or any Unicode-aware locale
SPEC = Some other locale (e.g. SJIS)
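
The chart reduces to two rules, sketched below in Python. The function names
and the locale parsing are assumptions made for the illustration: "POSIX" is
treated like "C", only a UTF-8 charset counts as Unicode aware, and a locale
with no charset part is treated as SPEC. None of this is Cygwin code.

```python
from typing import Optional


def classify(locale: Optional[str]) -> str:
    """Map a locale string to the chart's NULL, C/UA or SPEC class."""
    if not locale:
        return "NULL"
    if locale in ("C", "POSIX"):  # assumption: POSIX behaves like C
        return "C/UA"
    # Take the charset between "." and an optional "@modifier".
    charset = locale.partition(".")[2].partition("@")[0].upper()
    if charset in ("UTF-8", "UTF8"):
        return "C/UA"
    return "SPEC"  # assumption: anything else, including no charset


def filename_encoding(sys_locale: Optional[str],
                      app_locale: Optional[str]) -> str:
    """Return which encoding filename functions expect and return."""
    sys_cls = classify(sys_locale)
    app_cls = classify(app_locale)
    if app_cls == "SPEC":
        # An explicit non-Unicode application locale always wins.
        return "application locale"
    if app_cls == "NULL" and sys_cls == "SPEC":
        # The application said nothing, so follow the system.
        return "system locale"
    return "UTF-8"


for sys_loc, app_loc in [(None, None), ("ja_JP.SJIS", None),
                         ("ja_JP.SJIS", "en_US.UTF-8"), ("C", "ja_JP.SJIS")]:
    print(sys_loc, app_loc, "->", filename_encoding(sys_loc, app_loc))
```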

-jason

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-                                                               -
- Jason Pyeron                      PD Inc. http://www.pdinc.us -
- Principal Consultant              10 West 24th Street #100    -
- +1 (443) 269-1555 x333            Baltimore, Maryland 21218   -
-                                                               -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This message is copyright PD Inc, subject to license 20080407P00.


