Mail Archives: cygwin/2009/05/13/11:18:36

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Wed, 13 May 2009 17:17:54 +0200
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8
Message-ID: <20090513151753.GJ21324@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
References: <3f0ad08d0905121029j119c8a7ep41d3a261d8bea338 AT mail DOT gmail DOT com> <20090512173741 DOT GZ21324 AT calimero DOT vinschen DOT de> <20090513142953 DOT GI21324 AT calimero DOT vinschen DOT de> <op DOT utvhnyxl1e62zd AT balu> <416096c60905130754s3ffaae9dl8d6df4c4184b95e6 AT mail DOT gmail DOT com>
MIME-Version: 1.0
In-Reply-To: <416096c60905130754s3ffaae9dl8d6df4c4184b95e6@mail.gmail.com>
User-Agent: Mutt/1.5.19 (2009-02-20)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On May 13 15:54, Andy Koppe wrote:
> > - why do you need to touch the filename at all? I haven't read all of it. Is
> > the UTF-16 on disk and we need to work around UTF-16 being intractable as C
> > string?
> 
> Yes. If you simply treated each UTF-16 symbol as two chars, you'd get
> unintended NULs and slashes. For starters, the upper halves of all
> ISO-8859-1 characters are NUL in UTF-16. And even without that, the
> resulting filenames would be completely unusable.

Right.  That's the crux of storing filenames as UTF-16 while supporting
many different multibyte codepages.  Unlike a system in which a filename
is just a byte stream, we have to convert between widechar and multibyte
strings, and outside of the UTF-8 domain every such conversion is lossy.
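
To make both points concrete, here is a small sketch in Python (chosen
only for illustration, it is not Cygwin code).  The helper names
utf16_bytes and roundtrip are made up for this example.  It shows the NUL
and slash bytes that appear when UTF-16 is read bytewise, and that a
non-UTF-8 charset cannot round-trip a Japanese filename.

```python
def utf16_bytes(name):
    """Return the raw bytes of a filename in UTF-16LE, the layout Windows uses."""
    return name.encode("utf-16-le")


def roundtrip(name, codec):
    """Convert widechar -> multibyte -> widechar, replacing unmappable chars."""
    return name.encode(codec, errors="replace").decode(codec)


# Every ASCII or ISO-8859-1 character has a zero upper byte in UTF-16,
# so a bytewise C string would end right after the first character.
print(utf16_bytes("a.txt"))      # b'a\x00.\x00t\x00x\x00t\x00'

# U+012F has 0x2F as its low byte, which a byte-oriented path parser
# would read as the '/' separator.
print(utf16_bytes("\u012f"))     # b'/\x01'

# UTF-8 preserves the name, a single-byte codepage does not.
print(roundtrip("日本語", "utf-8"))        # 日本語
print(roundtrip("日本語", "iso-8859-1"))   # ???
```

The errors="replace" policy is an assumption of this sketch.  It only
stands in for whatever a real conversion does with characters it cannot
map.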

For the time being, I applied a patch to Cygwin which should ease the
pain.

I followed the suggestion to use UTF-8 for internal conversions when the
locale is set to "C".  This will also be used as default conversion when
converting the Windows environment from UTF-16 to multibyte, unless the
environment contains a valid LC_ALL/LC_CTYPE/LANG setting.  The current
working directory was also potentially unusable if an application
switched the locale.  Now the CWD is re-evaluated after a setlocale call.
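
As a rough model of the charset selection described above, the following
Python sketch picks a conversion charset from an environment mapping.  It
is not the Cygwin implementation.  The function name conversion_charset is
hypothetical, the precedence LC_ALL, LC_CTYPE, LANG is the usual POSIX
order, and the check for a "valid" locale is reduced to "set and
non-empty", which is an assumption.

```python
def conversion_charset(env):
    """Pick the multibyte charset for UTF-16 conversions (illustrative only).

    Returns "UTF-8" when no locale variable is set or the locale is "C",
    the explicit codeset when the locale name carries one, and None when
    the locale's own default codeset would apply (not modelled here).
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if not value:
            continue  # unset or empty: fall through to the next variable
        if value in ("C", "POSIX"):
            return "UTF-8"
        # Locale names look like language[_territory][.codeset][@modifier].
        codeset = value.partition("@")[0].partition(".")[2]
        return codeset or None
    return "UTF-8"  # nothing set in the environment


print(conversion_charset({}))                        # UTF-8
print(conversion_charset({"LANG": "C"}))             # UTF-8
print(conversion_charset({"LANG": "ja_JP.eucJP"}))   # eucJP
```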

I'm sure this change doesn't fix all problems, but it worked much better
in my environment when using Japanese and Chinese characters in filenames.

There are a few other changes to the Cygwin DLL in the pipeline, but I
will update Cygwin 1.7 at the end of the week.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
Problem reports:       http://cygwin.com/problems.html
Documentation:         http://cygwin.com/docs.html
FAQ:                   http://cygwin.com/faq/
