Mail Archives: cygwin/2009/09/22/13:07:29

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Tue, 22 Sep 2009 19:07:09 +0200
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
Message-ID: <20090922170709.GV20981@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
References: <h8bk5a$big$1 AT ger DOT gmane DOT org> <416096c60909101512l6e42ab72l4ba5fd792363eefd AT mail DOT gmail DOT com> <h8p50e$im8$1 AT ger DOT gmane DOT org> <20090921161014 DOT GI20981 AT calimero DOT vinschen DOT de> <416096c60909211154u5ddd5869v986011aa4ee13d57 AT mail DOT gmail DOT com> <20090922094523 DOT GR20981 AT calimero DOT vinschen DOT de> <416096c60909220912s5dd749bh5cfeb670b0e78c7a AT mail DOT gmail DOT com>
MIME-Version: 1.0
In-Reply-To: <416096c60909220912s5dd749bh5cfeb670b0e78c7a@mail.gmail.com>
User-Agent: Mutt/1.5.19 (2009-02-20)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On Sep 22 17:12, Andy Koppe wrote:
> 2009/9/22 Corinna Vinschen:
> >> Therefore, when converting a UTF-16 Windows filename to the current
> >> charset, 0xDC?? words should be treated like any other UTF-16 word
> >> that can't be represented in the current charset: it should be encoded
> >> as a ^N sequence.
> >
> > How?  Just like the incoming multibyte character didn't represent a valid
> > UTF-8 char, a single U+DCxx value does not represent a valid UTF-16 char.
> > Therefore, the ^N conversion will fail since U+DCxx can't be converted
> > to valid UTF-8.
> 
> True, but that's an implementation issue rather than a design issue,
> i.e. the ^N conversion needs to do the UTF-8 conversion itself rather
> than invoke the __utf8 functions. Shall I look into creating a patch?

Well, sure, I'm interested in seeing that patch (lazy me), but please note
that we need a snail-mailed copyright assignment per
http://cygwin.com/assign.txt from you before we can apply any significant
patches.  Sorry for the hassle.

Hmm... maybe it's not that complicated.  The ^N case checks for a valid
UTF-8 lead byte right now.  The U+DCxx case could be handled by
generating (in sys_cp_wcstombs) and recognizing (in sys_cp_mbstowcs) an
invalid lead byte, like 0xff.
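
A minimal Python sketch of that idea, not Cygwin's actual code.  The
names escape_unit and unescape_at are hypothetical stand-ins for the
relevant parts of sys_cp_wcstombs and sys_cp_mbstowcs, and the byte
layout (^N, 0xff, low byte of U+DCxx) is an assumption based on the
paragraph above.  It only covers single UTF-16 code units in the BMP.

```python
SO = 0x0E      # the ^N (shift-out) byte that introduces an escape
MARKER = 0xFF  # never a valid UTF-8 lead byte; assumed marker for U+DCxx


def escape_unit(unit):
    """Escape one UTF-16 code unit the current charset cannot represent."""
    if 0xDC00 <= unit <= 0xDCFF:
        # Lone low surrogate standing in for an invalid input byte:
        # emit the marker plus the low byte instead of UTF-8.
        return bytes([SO, MARKER, unit & 0xFF])
    # Ordinary case: ^N followed by the UTF-8 encoding of the character.
    # (Other surrogates are not handled here and raise UnicodeEncodeError.)
    return bytes([SO]) + chr(unit).encode("utf-8")


def unescape_at(data, i):
    """Decode the escape starting at data[i]; return (code unit, next index)."""
    if data[i] != SO:
        raise ValueError("not an escape sequence")
    lead = data[i + 1]
    if lead == MARKER:
        # The invalid lead byte signals a U+DCxx value.
        return 0xDC00 | data[i + 2], i + 3
    # Sequence length from the UTF-8 lead byte (1 to 3 bytes for the BMP).
    length = 1 if lead < 0x80 else 2 if lead >> 5 == 0b110 else 3
    ch = data[i + 1:i + 1 + length].decode("utf-8")
    return ord(ch), i + 1 + length
```

Because 0xff can never start a UTF-8 sequence, the two cases cannot be
confused when reading the escape back.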

> >> This won't work correctly, because different POSIX filenames will map
> >> to the same Windows filename. For example, the filenames "\xC3\xA4"
> >> (valid UTF-8 for a-umlaut) and "\xC4" (invalid UTF-8 sequence that
> >> represents a-umlaut in 8859-1), will both map to Windows filename
> >> "U+00C4", i.e a-umlaut in UTF-16. Furthermore, after creating a file
> >> called "\xC4", a readdir() would show that file as "\xC3\xA4".
> >
> > Right, but using your above suggestion will also lead to another filename
> > in readdir, it would just be \x0e\xsome\xthing.
> 
> I don't think the suggestion above is directly relevant to the problem
> I tried to highlight here.
> 
> Currently, with UTF-8 filename encodings, "\xC3\xA4" turns into U+00C4
> on disk, while "\xC4" turns into U+DCC4, and converting back yields
> the original separate filenames.

Well, right now it doesn't exactly.

> If I understand your proposal
> correctly, both "\xC3\xA4" and "\xC4" would turn into U+00C4, hence
> converting back would yield "\xC3\xA4" for both. This is wrong. Those
> filenames shouldn't be clobbering each other, and a filename shouldn't
> change between open() and readdir(), certainly not without switching
> charset in between.

I see your point.  I was more thinking along the lines of how likely
that clobbering is, apart from pathological testcases.
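
The clobbering question can be illustrated with Python, whose
"surrogateescape" error handler happens to use the same U+DCxx
convention for undecodable bytes.  This is only a sketch of the two
mappings under discussion, not of Cygwin's conversion code; the function
names are hypothetical.  The example uses b"\xc3\x84", which is the
UTF-8 encoding of U+00C4, so that both names target the same code point.

```python
def to_disk_distinct(name):
    """Map each invalid byte 0xXX to the lone surrogate U+DCXX."""
    return name.decode("utf-8", errors="surrogateescape")


def to_disk_lossy(name):
    """Fall back to reading the bytes as ISO-8859-1 if they are not UTF-8."""
    try:
        return name.decode("utf-8")
    except UnicodeDecodeError:
        return name.decode("latin-1")


valid = b"\xc3\x84"   # valid UTF-8 for U+00C4
invalid = b"\xc4"     # not valid UTF-8; U+00C4 in ISO-8859-1

# Distinct mapping: two different on-disk names, and the invalid one
# converts back to the original byte.
distinct_names = (to_disk_distinct(valid), to_disk_distinct(invalid))
round_trip = to_disk_distinct(invalid).encode("utf-8", errors="surrogateescape")

# Lossy mapping: both POSIX names collapse onto the same on-disk name.
lossy_names = (to_disk_lossy(valid), to_disk_lossy(invalid))
```

With the lossy mapping, a readdir() after creating the second file could
only ever report one of the two original byte strings.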

> Having said that, if you did switch charset from UTF-8 e.g. to
> ISO-8859-1, the on-disk U+DCC4 would indeed turn into
> "\x0E\xsome\xthing". However, that issue applies to any UTF-16

You don't have to switch the charset.  Assume you're using any
non-singlebyte charset in which \xC4 is the start of a double- or
multibyte sequence.  Then the sequence open ("\xC4"); close ();
readdir (); will return "\x0E\xsome\xthing" from readdir.

Only singlebyte charsets are off the hook.  So, your proposal to switch
to the default ANSI codepage for the C locale would be good for most
western languages, but it would still leave the eastern language users
with double-byte charsets behind.
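
To make the singlebyte versus multibyte distinction concrete, here is a
small Python check of whether a lone \xC4 is a complete character in a
given charset.  It relies on Python's own codecs, which are only assumed
to approximate the Windows codepages in question, and the helper name is
hypothetical.

```python
def decodes_alone(data, charset):
    """Return True if `data` is a complete, valid string in `charset`."""
    try:
        data.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


# In singlebyte charsets every byte is a character by itself, while in
# these multibyte charsets 0xC4 only starts a longer sequence.
results = {
    charset: decodes_alone(b"\xc4", charset)
    for charset in ("latin-1", "cp1252", "utf-8", "gbk", "euc_jp")
}
```

For the charsets where the check fails, the byte cannot be mapped to a
regular character on the way in, which is the situation described above.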

Note that I'm no longer as opposed to your proposal to use the ANSI
codepage as I was before this discussion.  But I would like the solution
to work for most eastern language users as well.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

