To: cygwin AT cygwin DOT com
From: Lapo Luchini
Subject: [1.7] Invalid UTF8 while creating a file -> cannot delete?
Date: Thu, 10 Sep 2009 21:30:46 +0200

After a few problems with monotone's unit tests on Cygwin-1.7, I began searching and experimenting a bit with the new 1.7 support for wide chars.

I also read the full thread about its last change:
http://www.cygwin.com/ml/cygwin/2009-05/msg00344.html
which really makes some sense to me (when I create a file from the console I want "ls" to show that file back to me with the same encoding).

The problem is that the unit test assumes filenames are "raw data" and tries to create three types of filenames: ISO-8859-1, EUC-JP and UTF-8. Except on OSX, where it only tries UTF-8 as that's the disk format.

Now we have a UTF-16 disk format, except the library is using the LANG value from process start to initialize some LANG-to-UTF16 conversion, as far as I understood, so there's not really one "correct" format: it depends on the LANG env value when the unit test is launched.

OK, that's a side issue, since I can probably modify the tests to always be launched with LANG=C instead of using the current value, so that at least it is consistent.
And then maybe remove the creation of the ISO-8859-1 and EUC-JP tests, just like on OSX. Which could be correct... but a bit less so than on OSX itself, where that is really "the format" and not "the DEFAULT format which could be overridden with a correct setlocale".

But the real problem with that test is not really what shows and how; the biggest problem is that files created with a "wrong" filename seem to be quite limited in usage and seemingly can't be deleted.

% export LANG=en_EN.UTF-8
% cat t.c
#include <stdio.h>
int main() {
    fopen("a-\xF6\xE4\xFC\xDF", "w"); //ISO-8859-1
    fopen("b-\xC3\xB6\xC3\xA4\xC3\xBc\xC3\x9F", "w"); //UTF-8
    return 0;
}
% gcc -o t t.c
% mkdir test ; cd test ; ../t ; cd ..
% ls -l test
ls: cannot access test/a-▒▒▒: No such file or directory
total 0
-????????? ? ? ? ? ? a-▒▒▒
-rw-r--r-- 1 lapo None 0 2009-09-10 21:19 b-öäüß
% find test
test
test/a-???
test/b-öäüß
% find test -delete
find: cannot delete `test/a-\366\344\374': No such file or directory
find: cannot delete `test': Directory not empty
% find test
test
test/a-???

Now... I don't know exactly how `find` works, but it seems strange to me that it isn't capable of deleting something it is capable of listing. It also seems strange that `ls` is not capable of stat-ing something it's capable of listing.

Yep, I do know that filename is "broken" in the first place, but since in the Unix world such stuff can happen, as filenames are really raw data, I think an error on file creation would probably be better than creating a file that can't subsequently be stat-ed or even unlinked.

% cat u.c
#include <stdio.h>
int main() {
    remove("a-\xF6\xE4\xFC\xDF");
    remove("b-\xC3\xB6\xC3\xA4\xC3\xBc\xC3\x9F");
    return 0;
}
% gcc -o u u.c

OK, a program using a similarly-broken filename can delete it, but the fact that it can't be deleted with "normal" tools is a bit of an inconvenience...

--
Lapo Luchini - http://lapo.it/

“Premature optimisation is the root of all evil in programming.” (C. A. R.
Hoare)

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple