Date: Tue, 22 Sep 2009 19:07:09 +0200
From: Corinna Vinschen
To: cygwin AT cygwin DOT com
Subject: Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
Message-ID: <20090922170709.GV20981@calimero.vinschen.de>
In-Reply-To: <416096c60909220912s5dd749bh5cfeb670b0e78c7a@mail.gmail.com>

On Sep 22 17:12, Andy Koppe wrote:
> 2009/9/22 Corinna Vinschen:
> >> Therefore, when converting a UTF-16 Windows filename to the current
> >> charset, 0xDC?? words should be treated like any other UTF-16 word
> >> that can't be represented in the current charset: it should be encoded
> >> as a ^N sequence.
> >
> > How?  Just like the incoming multibyte character didn't represent a valid
> > UTF-8 char, a single U+DCxx value does not represent a valid UTF-16 char.
> > Therefore, the ^N conversion will fail since U+DCxx can't be converted
> > to valid UTF-8.
>
> True, but that's an implementation issue rather than a design issue,
> i.e.
> the ^N conversion needs to do the UTF-8 conversion itself rather
> than invoke the __utf8 functions. Shall I look into creating a patch?

Well, sure I'm interested to see that patch (lazy me), but please note
that we need a snail mailed copyright assignment per
http://cygwin.com/assign.txt from you before we can apply any significant
patches.  Sorry for the hassle.

Hmm... maybe it's not that complicated.  The ^N case checks for a valid
UTF-8 lead byte right now.  The U+DCxx case could be handled by
generating (in sys_cp_wcstombs) and recognizing (in sys_cp_mbstowcs) a
non-valid lead byte, like 0xff.

> >> This won't work correctly, because different POSIX filenames will map
> >> to the same Windows filename. For example, the filenames "\xC3\xA4"
> >> (valid UTF-8 for a-umlaut) and "\xC4" (invalid UTF-8 sequence that
> >> represents a-umlaut in 8859-1), will both map to Windows filename
> >> "U+00C4", i.e. a-umlaut in UTF-16. Furthermore, after creating a file
> >> called "\xC4", a readdir() would show that file as "\xC3\xA4".
> >
> > Right, but using your above suggestion will also lead to another filename
> > in readdir, it would just be \x0e\xsome\xthing.
>
> I don't think the suggestion above is directly relevant to the problem
> I tried to highlight here.
>
> Currently, with UTF-8 filename encodings, "\xC3\xA4" turns into U+00C4
> on disk, while "\xC4" turns into U+DCC4, and converting back yields
> the original separate filenames.

Well, right now it doesn't exactly.

> If I understand your proposal
> correctly, both "\xC3\xA4" and "\xC4" would turn into U+00C4, hence
> converting back would yield "\xC3\xA4" for both. This is wrong. Those
> filenames shouldn't be clobbering each other, and a filename shouldn't
> change between open() and readdir(), certainly not without switching
> charset in between.

I see your point.  I was more thinking along the lines of how likely
that clobbering is, apart from pathological testcases.
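
To make the clobbering argument concrete, here is a minimal sketch in
Python.  It is not Cygwin code: Python's "surrogateescape" error handler
merely stands in for the U+DCxx mapping, and the function names
(to_wide_escape, to_wide_latin1_fallback, to_posix) are made up for the
illustration.  The valid name used here is "\xC3\x84", the UTF-8 encoding
of U+00C4, so that the collision with the 8859-1 reading of "\xC4" is exact.

```python
# Sketch under stated assumptions; none of these helpers exist in Cygwin.

def to_wide_escape(name: bytes) -> str:
    """Decode UTF-8, mapping each undecodable byte 0xNN to U+DCNN."""
    return name.decode("utf-8", errors="surrogateescape")

def to_wide_latin1_fallback(name: bytes) -> str:
    """Hypothetical alternative: reinterpret invalid UTF-8 as ISO-8859-1."""
    try:
        return name.decode("utf-8")
    except UnicodeDecodeError:
        return name.decode("iso-8859-1")

def to_posix(name: str) -> bytes:
    """Convert back to bytes, turning U+DCNN into the raw byte 0xNN."""
    return name.encode("utf-8", errors="surrogateescape")

valid = b"\xc3\x84"   # valid UTF-8 for U+00C4
invalid = b"\xc4"     # lone byte, not valid UTF-8

# With the U+DCxx escape the two names stay distinct and round-trip.
escaped = [to_wide_escape(n) for n in (valid, invalid)]
round_trip = [to_posix(w) for w in escaped]

# With the 8859-1 fallback both names land on the same wide string,
# and converting back can only ever produce one of them.
collided = [to_wide_latin1_fallback(n) for n in (valid, invalid)]
collided_back = [to_posix(w) for w in collided]

print(escaped, round_trip)
print(collided, collided_back)
```

In this sketch the escape variant keeps the names apart, while the
fallback variant returns the valid UTF-8 name for both inputs, which is
the open()/readdir() mismatch described above.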
> Having said that, if you did switch charset from UTF-8 e.g. to
> ISO-8859-1, the on-disk U+DCC4 would indeed turn into
> "\x0E\xsome\xthing". However, that issue applies to any UTF-16

You don't have to switch the charset.  Assume you're using any
non-singlebyte charset in which \xC4 is the start of a double- or
multibyte sequence.

  open ("\xC4"); close; readdir();

will return "\x0E\xsome\xthing" on readdir.  Only singlebyte charsets
are off the hook.  So, your proposal to switch to the default ANSI
codepage for the C locale would be good for most western languages, but
it would still leave the eastern language users with double-byte
charsets behind.

Note that I'm not as opposed to your proposal to use the ANSI codepage
as before this discussion.  But I would like to see that the solution
works for most eastern language users as well.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat