Date: Thu, 2 Apr 2009 10:40:38 +0200
From: Corinna Vinschen
To: cygwin AT cygwin DOT com
Cc: Jason Tishler
Subject: Re: [1.7] codepage:utf removal and python

On Apr  1 15:32, David Rothenberger wrote:
> When codepage:utf was supported, this worked fine. Now, it fails, even
> when I have LANG=en_US.UTF-8 in my environment. It all boils down to
> this python code:
>
>   import os
>   os.listdir('.')
>
> (That's an example I run from within the directory.) This fails with an
> error
>
>   OSError: [Errno 138] Invalid or incomplete multibyte or wide
>   character: '.'
>
> unless one does this first:
>
>   import locale
>   locale.setlocale(locale.LC_ALL, '')

That's always the better approach, otherwise the application works in
the C locale.

> I've patched rdiff-backup to do this, but I'm still wondering if this is
> the correct thing to do. I know that on my Linux machine, I don't have
> to do this, but I'm not sure if that's because there's some default
> locale that's being picked up by Python from somewhere other than the
> environment.

The basic problem is that Windows stores filenames in UTF-16 while Linux
and other OSes store the filename as a simple, zero-terminated
bytestream. A simple bytestream is always valid.
OTOH, a UTF-16 to single-byte conversion always has characters which
can't be converted. To work around that, I created the filename
conversion method explained in

  http://cygwin.com/1.7/cygwin-ug-net/using-specialnames.html#pathnames-unusual

I'm not sure why this doesn't work in your simple case. The locale is C
because the application didn't use setlocale. The resulting charset is
ASCII. The filename should have been converted to use the ASCII
SO/UTF-8 sequence for the non-readable characters.

[...time passes...]

And it works as designed in your above testcase. I tested with a
filename containing a Euro sign (Unicode 0x20ac), in HTML speak "qq€".
Cygwin converted it to

  "qq\016\342\202\254"

The strace looks perfectly normal. I have no idea what python complains
about!

Jason, can you shed some light on this problem?


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
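
For reference, the SO/UTF-8 escaping described above can be undone
mechanically. The following is only a minimal Python sketch, not
Cygwin's actual implementation. It assumes every unconvertible
character is emitted as SO (octal 016) followed by that character's
UTF-8 bytes, as in the "qq\016\342\202\254" example; the helper name
decode_so_escaped is made up for illustration.

```python
SO = 0x0E  # ASCII Shift Out (octal 016), the escape marker in the example


def decode_so_escaped(raw: bytes) -> str:
    """Hypothetical helper: turn an SO-escaped C-locale name back into text.

    Assumption: each escaped character is SO followed by exactly one
    UTF-8 sequence; every other byte is plain ASCII.
    """
    out = []
    i = 0
    while i < len(raw):
        if raw[i] == SO and i + 1 < len(raw):
            lead = raw[i + 1]
            # The UTF-8 lead byte tells us how long the sequence is.
            if lead >= 0xF0:
                n = 4
            elif lead >= 0xE0:
                n = 3
            elif lead >= 0xC0:
                n = 2
            else:
                n = 1
            out.append(raw[i + 1:i + 1 + n].decode("utf-8"))
            i += 1 + n
        else:
            # Unescaped byte: taken as a single character.
            out.append(chr(raw[i]))
            i += 1
    return "".join(out)


# The byte string reported above for the Euro-sign test file.
name = b"qq\016\342\202\254"
print(ascii(decode_so_escaped(name)))  # prints 'qq\u20ac'
```

The three bytes after SO (0xe2 0x82 0xac) are the UTF-8 encoding of
U+20AC, so the name round-trips to the original "qq" plus Euro sign.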