Mail Archives: cygwin/2009/09/10/15:31:29

To: cygwin AT cygwin DOT com
From: Lapo Luchini <lapo AT lapo DOT it>
Subject: [1.7] Invalid UTF8 while creating a file -> cannot delete?
Date: Thu, 10 Sep 2009 21:30:46 +0200
Message-ID: <h8bk5a$big$1@ger.gmane.org>

After a few problems with monotone's unit tests on Cygwin-1.7, I began
searching and experimenting a bit with the new 1.7 support for wide chars.

I also read the full thread about its last change:
http://www.cygwin.com/ml/cygwin/2009-05/msg00344.html
which really makes some sense to me (when I create a file from the
console I want "ls" to show that file back to me with the same encoding).

The problem is that the unit test assumes filenames are "raw data" and
tries to create filenames in three encodings: ISO-8859-1, EUC-JP and
UTF-8. The exception is OSX, where it only tries UTF-8, as that's the
disk format.
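
To illustrate why the encodings give different "raw" filenames, here is
a minimal sketch that re-encodes ISO-8859-1 bytes as UTF-8. The helper
name latin1_to_utf8 is made up for this example and is not part of
monotone or Cygwin. It relies only on the fact that every ISO-8859-1
byte maps to the Unicode code point with the same value.

```c
#include <stddef.h>

/* Hypothetical helper (illustration only): re-encode a NUL-terminated
 * ISO-8859-1 string as UTF-8. `out` must have room for
 * 2 * strlen(in) + 1 bytes. Returns the number of bytes written,
 * not counting the terminating NUL. */
size_t latin1_to_utf8(const char *in, char *out)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    for (; *p; p++) {
        if (*p < 0x80) {
            /* ASCII is identical in both encodings. */
            out[n++] = (char)*p;
        } else {
            /* U+0080..U+00FF become a two-byte UTF-8 sequence. */
            out[n++] = (char)(0xC0 | (*p >> 6));
            out[n++] = (char)(0x80 | (*p & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

Fed the four ISO-8859-1 bytes F6 E4 FC DF ("öäüß"), it produces the
eight bytes C3 B6 C3 A4 C3 BC C3 9F, so the same visible name is two
different byte strings depending on the encoding.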

Now we have a UTF-16 disk format, but as far as I understood the library
uses the LANG value from process start to initialize a LANG-to-UTF-16
conversion, so there's not really one "correct" format: it depends on
the value of LANG when the unit test is launched.

OK, that's a side issue, since I can probably modify the tests to always
be launched with LANG=C instead of the current value, so that at least
it is consistent. And then maybe remove the ISO-8859-1 and EUC-JP tests,
just like on OSX. Which could be correct... but a bit less so than on
OSX itself, where UTF-8 really is "the format" and not "the DEFAULT
format which could be overridden with a correct setlocale".

But the real problem with that test is not what is shown and how: the
biggest problem is that files created with a "wrong" filename seem to
be quite limited in usage and seemingly can't be deleted.

% export LANG=en_EN.UTF-8
% cat t.c
#include <stdio.h>
int main() {
    fopen("a-\xF6\xE4\xFC\xDF", "w"); //ISO-8859-1
    fopen("b-\xC3\xB6\xC3\xA4\xC3\xBc\xC3\x9F", "w"); //UTF-8
    return 0;
}
% gcc -o t t.c
% mkdir test ; cd test ; ../t ; cd ..
% ls -l test
ls: cannot access test/a-▒▒▒: No such file or directory
total 0
-????????? ? ?    ?    ?                ? a-▒▒▒
-rw-r--r-- 1 lapo None 0 2009-09-10 21:19 b-öäüß
% find test
test
test/a-???
test/b-öäüß
% find test -delete
find: cannot delete `test/a-\366\344\374': No such file or directory
find: cannot delete `test': Directory not empty
% find test
test
test/a-???

Now... I don't know exactly how `find` works, but it seems strange to me
that it isn't capable of deleting something it is capable of listing.
It also seems strange that `ls` is not capable of stat-ing something
it's capable of listing.

Yep, I do know that the filename is "broken" in the first place, but
since in the Unix world such stuff can happen, as filenames are really
raw data, I think an error on file creation would probably be better
than creating a file that can't subsequently be stat-ed or even unlinked.
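
To make "an error on file creation" concrete, here is a minimal sketch
of the kind of check that would reject the first name up front when the
charset is UTF-8. This is NOT what Cygwin does; is_valid_utf8 is a
made-up helper that only tests whether a byte string is well-formed
UTF-8.

```c
/* Hypothetical helper (illustration only): returns 1 if the
 * NUL-terminated byte string `s` is well-formed UTF-8, 0 otherwise.
 * Rejects overlong forms, UTF-16 surrogates and values above U+10FFFF. */
int is_valid_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p) {
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest value allowed for this length */
        int extra;          /* continuation bytes still expected */

        if (*p < 0x80) {
            cp = *p;        extra = 0; min = 0;
        } else if ((*p & 0xE0) == 0xC0) {
            cp = *p & 0x1F; extra = 1; min = 0x80;
        } else if ((*p & 0xF0) == 0xE0) {
            cp = *p & 0x0F; extra = 2; min = 0x800;
        } else if ((*p & 0xF8) == 0xF0) {
            cp = *p & 0x07; extra = 3; min = 0x10000;
        } else {
            return 0;       /* stray continuation or invalid lead byte */
        }
        p++;

        while (extra-- > 0) {
            /* A NUL here also fails the test, so a truncated
             * sequence never reads past the end of the string. */
            if ((*p & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (*p & 0x3F);
            p++;
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
    }
    return 1;
}
```

With the two names from t.c, "a-\xF6\xE4\xFC\xDF" is rejected (0xF6
announces a multi-byte sequence, but 0xE4 is not a continuation byte),
while "b-\xC3\xB6\xC3\xA4\xC3\xBc\xC3\x9F" is accepted. A caller could
fail with something like EILSEQ instead of creating the file, though
which errno would be appropriate is my assumption.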

% cat u.c
#include <stdio.h>
int main() {
    remove("a-\xF6\xE4\xFC\xDF");
    remove("b-\xC3\xB6\xC3\xA4\xC3\xBc\xC3\x9F");
    return 0;
}
% gcc -o u u.c

OK, a program using the same broken filename can delete it, but the
fact that it can't be deleted with "normal" tools is a bit of an
inconvenience...

-- 
Lapo Luchini - http://lapo.it/

“Premature optimisation is the root of all evil in programming.” (C. A.
R. Hoare)


