Date: Tue, 22 Sep 2009 11:45:23 +0200
From: Corinna Vinschen
To: cygwin AT cygwin DOT com
Subject: Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
Message-ID: <20090922094523.GR20981@calimero.vinschen.de>
In-Reply-To: <416096c60909211154u5ddd5869v986011aa4ee13d57@mail.gmail.com>

On Sep 21 19:54, Andy Koppe wrote:
> 2009/9/21 Corinna Vinschen:
> > As you might know, invalid bytes >= 0x80 are translated to UTF-16 by
> > transposing them into the 0xdc00 - 0xdcff range by just or'ing 0xdc00.
> > The problem now is that readdir() will return the transposed characters
> > as if they are the original characters.
>
> Yep, that's where the bug is. Those 0xDC?? words represent invalid
> UTF-8 bytes. They do not represent CP1252 or ISO-8859-1 characters.
>
> Therefore, when converting a UTF-16 Windows filename to the current
> charset, 0xDC?? words should be treated like any other UTF-16 word
> that can't be represented in the current charset: it should be encoded
> as a ^N sequence.

How?  Just like the incoming multibyte character didn't represent a valid
UTF-8 char, a single U+DCxx value does not represent a valid UTF-16 char.
Therefore, the ^N conversion will fail since U+DCxx can't be converted to
valid UTF-8.

> > So it looks like the current mechanism to handle invalid multibyte
> > sequences is too complicated for us.  As far as I can see, it would be
> > much simpler and less error prone to translate the invalid bytes simply
> > to the equivalent UTF-16 value.  That creates filenames with UTF-16
> > values from the ISO-8859-1 range.
>
> This won't work correctly, because different POSIX filenames will map
> to the same Windows filename. For example, the filenames "\xC3\xA4"
> (valid UTF-8 for a-umlaut) and "\xC4" (invalid UTF-8 sequence that
> represents a-umlaut in 8859-1), will both map to Windows filename
> "U+00C4", i.e a-umlaut in UTF-16. Furthermore, after creating a file
> called "\xC4", a readdir() would show that file as "\xC3\xA4".

Right, but using your above suggestion will also lead to another filename
in readdir, it would just be \x0e\xsome\xthing.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
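
For illustration, the two mappings discussed in this thread can be sketched
in Python.  This is not Cygwin's code, only a small model of it, and the
function names are made up for this example.  It assumes Python's
"surrogateescape" error handler, which happens to use the same rule
(invalid byte b becomes 0xDC00 | b).  The collision case uses "\xC3\x84",
since that is the UTF-8 encoding of U+00C4 ("\xC3\xA4" encodes U+00E4).

```python
def transpose_invalid(name: bytes) -> str:
    """Decode as UTF-8; each invalid byte b >= 0x80 becomes U+(0xDC00 | b)."""
    # Assumption: Python's "surrogateescape" handler stands in for the
    # transposition described in the thread; it applies the same mapping.
    return name.decode("utf-8", errors="surrogateescape")


def latin1_fallback(name: bytes) -> str:
    """Alternative from the thread: invalid byte b becomes code point U+00bb."""
    text = name.decode("utf-8", errors="surrogateescape")
    return "".join(
        chr(ord(ch) - 0xDC00) if 0xDC80 <= ord(ch) <= 0xDCFF else ch
        for ch in text
    )


def encodes_as_utf8(text: str) -> bool:
    """True if text can be written out as strict UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


escaped = transpose_invalid(b"\xc4")
print(hex(ord(escaped)))          # 0xdcc4
print(encodes_as_utf8(escaped))   # False: a lone U+DCxx is not valid UTF-8

# With the fallback, two different POSIX names end up as the same name,
# and reading it back yields the valid UTF-8 form, not the original byte.
print(latin1_fallback(b"\xc4") == latin1_fallback(b"\xc3\x84"))  # True
print(latin1_fallback(b"\xc4").encode("utf-8"))                  # b'\xc3\x84'
```

The first two prints correspond to the point that a single U+DCxx value
can't be converted to valid UTF-8; the last two show the name collision
and the changed readdir() result raised against the ISO-8859-1 mapping.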