Mail Archives: cygwin/2009/09/22/12:12:35

In-Reply-To: <20090922094523.GR20981@calimero.vinschen.de>
References: <h8bk5a$big$1 AT ger DOT gmane DOT org> <416096c60909101512l6e42ab72l4ba5fd792363eefd AT mail DOT gmail DOT com> <h8p50e$im8$1 AT ger DOT gmane DOT org> <20090921161014 DOT GI20981 AT calimero DOT vinschen DOT de> <416096c60909211154u5ddd5869v986011aa4ee13d57 AT mail DOT gmail DOT com> <20090922094523 DOT GR20981 AT calimero DOT vinschen DOT de>
Date: Tue, 22 Sep 2009 17:12:21 +0100
Message-ID: <416096c60909220912s5dd749bh5cfeb670b0e78c7a@mail.gmail.com>
Subject: Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
From: Andy Koppe <andy DOT koppe AT gmail DOT com>
To: cygwin AT cygwin DOT com

2009/9/22 Corinna Vinschen:
>> > As you might know, invalid bytes >= 0x80 are translated to UTF-16 by
>> > transposing them into the 0xdc00 - 0xdcff range by just or'ing 0xdc00.
>> > The problem now is that readdir() will return the transposed characters
>> > as if they are the original characters.
>>
>> Yep, that's where the bug is. Those 0xDC?? words represent invalid
>> UTF-8 bytes. They do not represent CP1252 or ISO-8859-1 characters.
>>
>> Therefore, when converting a UTF-16 Windows filename to the current
>> charset, 0xDC?? words should be treated like any other UTF-16 word
>> that can't be represented in the current charset: it should be encoded
>> as a ^N sequence.
>
> How?  Just like the incoming multibyte character didn't represent a valid
> UTF-8 char, a single U+DCxx value does not represent a valid UTF-16 char.
> Therefore, the ^N conversion will fail since U+DCxx can't be converted
> to valid UTF-8.

True, but that's an implementation issue rather than a design issue,
i.e. the ^N conversion needs to do the UTF-8 conversion itself rather
than invoke the __utf8 functions. Shall I look into creating a patch?
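
To illustrate that a lone U+DCxx word can still be written out
mechanically, here is a minimal Python sketch. It is not the Cygwin
code, and the helper name encode_bmp_word_utf8 is hypothetical. It only
applies the generic UTF-8 bit layout to a 16-bit word without rejecting
surrogates, which is the step a strict converter refuses to take.

```python
def encode_bmp_word_utf8(word: int) -> bytes:
    """Encode one 16-bit word with the plain UTF-8 bit layout.

    Hypothetical helper: unlike a strict UTF-8 encoder it does not reject
    surrogates (0xD800-0xDFFF), so a lone U+DCxx word still gets bytes.
    """
    if not 0 <= word <= 0xFFFF:
        raise ValueError("expected a single 16-bit word")
    if word < 0x80:
        return bytes([word])
    if word < 0x800:
        return bytes([0xC0 | (word >> 6), 0x80 | (word & 0x3F)])
    return bytes([0xE0 | (word >> 12),
                  0x80 | ((word >> 6) & 0x3F),
                  0x80 | (word & 0x3F)])


# The invalid byte 0xC4 is stored on disk as 0xDC00 | 0xC4 = U+DCC4.
escaped = 0xDC00 | 0xC4

# A strict encoder refuses the lone surrogate ...
try:
    chr(escaped).encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ... while the hand-rolled layout yields three bytes that could then
# follow the ^N (0x0E) marker.
raw = encode_bmp_word_utf8(escaped)
```

Python's strict "utf-8" codec stands in here for the __utf8 functions
only as an analogy: both reject an unpaired surrogate.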


>> > So it looks like the current mechanism to handle invalid multibyte
>> > sequences is too complicated for us. =C2=A0As far as I can see, it wou=
ld be
>> > much simpler and less error prone to translate the invalid bytes simply
>> > to the equivalent UTF-16 value. =C2=A0That creates filenames with UTF-=
16
>> > values from the ISO-8859-1 range.
>>
>> This won't work correctly, because different POSIX filenames will map
>> to the same Windows filename. For example, the filenames "\xC3\x84"
>> (valid UTF-8 for A-umlaut) and "\xC4" (invalid UTF-8 sequence that
>> represents A-umlaut in 8859-1), will both map to Windows filename
>> "U+00C4", i.e. A-umlaut in UTF-16. Furthermore, after creating a file
>> called "\xC4", a readdir() would show that file as "\xC3\x84".
>
> Right, but using your above suggestion will also lead to another filename
> in readdir, it would just be \x0e\xsome\xthing.

I don't think the suggestion above is directly relevant to the problem
I tried to highlight here.

Currently, with UTF-8 filename encodings, "\xC3\x84" turns into U+00C4
on disk, while "\xC4" turns into U+DCC4, and converting back yields
the original separate filenames. If I understand your proposal
correctly, both "\xC3\x84" and "\xC4" would turn into U+00C4, hence
converting back would yield "\xC3\x84" for both. This is wrong. Those
filenames shouldn't be clobbering each other, and a filename shouldn't
change between open() and readdir(), certainly not without switching
charset in between.
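
The difference between the two schemes can be sketched in Python, whose
"surrogateescape" error handler happens to use the same 0xDC00 or'ing
for undecodable bytes. The function names are hypothetical, and the
Latin-1 fallback is only my reading of the proposal, not actual Cygwin
code. The valid name used here is "\xC3\x84", the UTF-8 encoding of
U+00C4.

```python
def to_disk_escaped(name: bytes) -> str:
    # Current scheme: invalid bytes >= 0x80 become 0xDC00 | byte.
    return name.decode("utf-8", errors="surrogateescape")


def from_disk_escaped(name: str) -> bytes:
    # Inverse of the above: U+DCxx words turn back into the raw byte.
    return name.encode("utf-8", errors="surrogateescape")


def to_disk_latin1_fallback(name: bytes) -> str:
    # Proposed scheme as understood here: each invalid byte maps to the
    # ISO-8859-1 code point with the same value (U+0080-U+00FF).
    escaped = name.decode("utf-8", errors="surrogateescape")
    return "".join(
        chr(ord(c) - 0xDC00) if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in escaped
    )


valid = b"\xC3\x84"   # valid UTF-8 for U+00C4
invalid = b"\xC4"     # lone lead byte, invalid as UTF-8

# Escaping keeps the two names distinct and round-trips both.
distinct = to_disk_escaped(valid) != to_disk_escaped(invalid)
round_trip = [from_disk_escaped(to_disk_escaped(n)) for n in (valid, invalid)]

# The fallback maps both names to the same on-disk name.
clobbered = to_disk_latin1_fallback(valid) == to_disk_latin1_fallback(invalid)
read_back = to_disk_latin1_fallback(invalid).encode("utf-8")
```

With the fallback, read_back no longer equals the name that was passed
in when the file was created, which is the open()/readdir() mismatch
described above.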

Having said that, if you did switch charset from UTF-8 e.g. to
ISO-8859-1, the on-disk U+DCC4 would indeed turn into
"\x0E\xsome\xthing". However, that issue applies to any UTF-16
character not in the target charset, not just those funny U+DC?? codes
for representing invalid UTF-8 bytes.
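
A rough Python sketch of that conversion follows. It assumes, as the
discussion above implies but does not spell out, that a ^N sequence is
the byte 0x0E followed by the UTF-8 bytes of the character. The name
from_disk is hypothetical, and the sketch only covers a non-UTF-8
target charset.

```python
SO = b"\x0e"  # ^N (Shift Out), used as the escape marker


def from_disk(name: str, charset: str) -> bytes:
    """Convert an on-disk name to `charset`, escaping what doesn't fit."""
    out = bytearray()
    for ch in name:
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            # Assumed escape format: ^N + UTF-8 bytes of the character.
            # "surrogatepass" lets a lone U+DCxx word through instead of
            # failing the way a strict UTF-8 encoder would.
            out += SO + ch.encode("utf-8", errors="surrogatepass")
    return bytes(out)


# U+00C4 exists in ISO-8859-1 and is emitted as a single byte.
plain = from_disk("a\u00c4", "iso-8859-1")

# U+DCC4 (an escaped invalid byte) and U+20AC (euro sign) are both
# absent from ISO-8859-1, so both take the same ^N escape path.
escaped_invalid = from_disk("\udcc4", "iso-8859-1")
escaped_euro = from_disk("\u20ac", "iso-8859-1")
```

Both escaped results start with the 0x0E marker, so the U+DC?? codes
get no special treatment compared with any other unrepresentable
character.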

The only way to avoid the POSIX filenames changing depending on locale
would be to assume UTF-8 for filenames no matter the locale charset.
That's an entirely different can of worms though, extending the
compatibility problems discussed on the "The C locale" thread to all
non-UTF-8 locales, and putting the onus for converting filenames on
applications.

Andy
