Mail Archives: cygwin/2009/04/02/03:41:01

Date: Thu, 2 Apr 2009 10:40:38 +0200
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Cc: Jason Tishler <jason AT tishler DOT net>
Subject: Re: [1.7] codepage:utf removal and python
Message-ID: <20090402084037.GA15006@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com, Jason Tishler <jason AT tishler DOT net>
References: <49D3EB8D DOT 3040802 AT acm DOT org>
MIME-Version: 1.0
In-Reply-To: <49D3EB8D.3040802@acm.org>
User-Agent: Mutt/1.5.19 (2009-02-20)

On Apr  1 15:32, David Rothenberger wrote:
> When codepage:utf was supported, this worked fine. Now, it fails, even  
> when I have LANG=en_US.UTF-8 in my environment. It all boils down to  
> this python code:
>
>   import os
>   os.listdir('.')
>
> (That's an example I run from within the directory.) This fails with an  
> error
>
>   OSError: [Errno 138] Invalid or incomplete multibyte or wide  
> character: '.'
>
> unless one does this first:
>
>   import locale
>   locale.setlocale(locale.LC_ALL, '')

That's always the better approach, otherwise the application works
in the C locale.
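
As a minimal sketch of that approach (the helper name
list_dir_with_env_locale is made up for illustration, and what the
"before" value shows depends on the Python version and environment),
an application can adopt the locale from the environment before it
touches filenames:

```python
import locale
import os


def list_dir_with_env_locale(path="."):
    """Adopt the environment's locale, then list a directory.

    Hypothetical helper for illustration; not taken from rdiff-backup.
    Returns (LC_CTYPE before, LC_CTYPE after, directory entries).
    """
    # Calling setlocale with only a category queries the current setting.
    before = locale.setlocale(locale.LC_CTYPE)
    try:
        # An empty string means "take the locale from LANG / LC_*".
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The environment names a locale this system does not support;
        # keep whatever locale was already active.
        pass
    after = locale.setlocale(locale.LC_CTYPE)
    return before, after, os.listdir(path)


if __name__ == "__main__":
    before, after, entries = list_dir_with_env_locale(".")
    print("LC_CTYPE before:", before)
    print("LC_CTYPE after: ", after)
    print(len(entries), "entries")
```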

> I've patched rdiff-backup to do this, but I'm still wondering if this is  
> the correct thing to do. I know that on my Linux machine, I don't have  
> to do this, but I'm not sure if that's because there's some default  
> locale that's being picked up by Python from somewhere other than the  
> environment.

The basic problem is that Windows stores filenames in UTF-16, while Linux
and other OSes store a filename as a simple, zero-terminated bytestream.
A simple bytestream is always valid.  OTOH, when converting from UTF-16
to a single-byte charset, there are always characters which can't be
converted.

To work around that I created the filename conversion method explained in
http://cygwin.com/1.7/cygwin-ug-net/using-specialnames.html#pathnames-unusual

I'm not sure why this doesn't work in your simple case.  The locale is C
because the application didn't use setlocale.  The resulting charset is
ASCII.  The filename should have been converted to use the ASCII SO/UTF-8
sequence for the non-readable characters.

[...time passes...]

And it works as designed in your above testcase.

I tested with a filename containing a Euro sign (Unicode 0x20ac), in
HTML speak "qq&euro;".  Cygwin converted it to "qq\016\342\202\254"
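
The conversion in this example can be sketched in Python.  This is only
an illustration of the SO/UTF-8 scheme as described here, not Cygwin's
actual code; the function name encode_filename_sketch and its charset
parameter are assumptions made for the example:

```python
SO = b"\x0e"  # ASCII "Shift Out" control character, octal 016


def encode_filename_sketch(name, charset="ascii"):
    """Encode a filename for a given charset, escaping what doesn't fit.

    Hypothetical illustration only: characters the charset cannot
    represent are emitted as SO followed by their UTF-8 bytes.
    """
    out = bytearray()
    for ch in name:
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            # Not representable in the target charset: SO + UTF-8 bytes.
            out += SO + ch.encode("utf-8")
    return bytes(out)


if __name__ == "__main__":
    # "qq" followed by the Euro sign (U+20AC)
    print(encode_filename_sketch("qq\u20ac"))  # b'qq\x0e\xe2\x82\xac'
```

The bytes 0x0e 0xe2 0x82 0xac are the same as the octal sequence
\016\342\202\254 quoted above.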

The strace looks perfectly normal.  I have no idea what python complains
about!

Jason, can you shed some light on this problem?


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

