Subject: Re: readdir() returns inaccessible name if file was created with
 invalid UTF-8
To: cygwin@cygwin.com
References: <96f2253b-791b-b8a0-97dd-8d257eefb9b1@t-online.de>
 <bc8bd61c-818e-424f-bb42-52f4fecd4849@towo.net>
 <b6ab074b-919e-4514-8276-72a30c36ab58@towo.net>
 <de4767e2-85b7-ead2-df9a-64e1f24f4e8f@t-online.de>
 <6451a249-adcd-9c56-b76e-1b00886cea80@t-online.de>
 <CAN0SSYx+g4JE6AA6krNAzG6QXrve52TBv0d3VM0SODV-tzZQSQ@mail.gmail.com>
 <66051d82-e2c3-684f-d13f-d1301170b0d4@t-online.de>
 <984103a4-ab2d-4337-9964-cc1e3208155d@SystematicSW.ab.ca>
Message-ID: <11036733-c4f3-c2e9-37c9-959c9e99edab@t-online.de>
Date: Thu, 19 Sep 2024 19:30:05 +0200
From: Christian Franke via Cygwin <cygwin@cygwin.com>
Reply-To: cygwin@cygwin.com
Cc: Christian Franke <Christian.Franke@t-online.de>

Brian Inglis via Cygwin wrote:
> On 2024-09-19 07:27, Christian Franke via Cygwin wrote:
>> Mark Liam Brown via Cygwin wrote:
>>> On Mon, Sep 16, 2024 at 11:51 AM Christian Franke via Cygwin
>>> <cygwin@cygwin.com> wrote:
>>>> Christian Franke via Cygwin wrote:
>>>>> Thomas Wolff via Cygwin wrote:
>>>>>> Am 15.09.2024 um 20:15 schrieb Thomas Wolff via Cygwin:
>>>>>>> Am 15.09.2024 um 19:47 schrieb Christian Franke via Cygwin:
>>>>>>>> If a file name contains an invalid (truncated) UTF-8 sequence, open()
>>>>>>>> does not refuse to create the file. Later readdir() returns a
>>>>>>>> different name which could not be used to access the file.
>>>>>>>>
>>>>>>>> Testcase with U+1F321 (Thermometer):
>>>>>>>>
>>>>>>>> $ uname -r
>>>>>>>> 3.5.4-1.x86_64
>>>>>>>>
>>>>>>>> $ printf $'\U0001F321' | od -A none -t x1
>>>>>>>>   f0 9f 8c a1
>>>>>>>>
>>>>>>>> $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
>>>>>>>>
>>>>>>>> $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
>>>>>>>>
>>>>>>>> $ touch 'file3-'$'\xf0\x9f\x8c'
>>>>>>>>
>>>>>>>> $ ls -1
>>>>>>>> ls: cannot access 'file2-.?ext': No such file or directory
>>>>>>>> ls: cannot access 'file3-': No such file or directory
>>>>>>>> 'file1-'$'\360\237\214\241''.ext'
>>>>>>>> file2-.?ext
>>>>>>>> file3-
>>>>>>> I don't reproduce this.
>>>>> Yes, sorry, the above 'ls' was actually aliased to 'ls --color=auto'
>>>>> which needs to call stat(). Plain 'ls' does not, so the errors do not
>>>>> occur then.
>>>>>
>>>>>
>>>>>>> While the file name gets mangled, all resulting file names are valid
>>>>>>> and listed:
>>>>>>> In file2 the sequence is turned into U+17B3 but exchanged with the dot.
>>>>>>> In file3 the same sequence is just dropped.
>>>>>>> $ ls -1|cat
>>>>>>> file1-🌡.ext
>>>>>>> file2-.ឳext
>>>>>>> file3-
>>>>>>>
>>>>>>> However, ls file2* fails, as does ls *.
>>>>>> On the other hand, ls file3- fails too, so some mapping error occurs
>>>>>> internally.
>>>>>> Also, the files cannot be deleted from cygwin (need to use cmd).
>>>>> 'rm' using the original names works for file2-..., but not for file3-...
>>>>>
>>>>> $ rm -v 'file2-'$'\xf0\x9f\x8c''.ext'
>>>>> removed 'file2-'$'\360\237\214''.ext'
>>>>>
>>>>> $ rm -v 'file3-'$'\xf0\x9f\x8c'
>>>>> rm: cannot remove 'file3-'$'\360\237\214': No such file or directory
>>>>>
>>>> Further tests suggest that the problem only occurs with:
>>>> - incomplete 4 byte UTF-8 sequences (Unicode above 16 bit)
>>>> - complete but invalid 3 byte UTF-8 sequences which encode the UTF-16
>>>> 'high surrogate' range (0xD800..0xDBFF).
>>> Makes perfect sense, the Windows kernel uses UTF16 internally.
>>
>>
>> Yes, but Cygwin does not provide consistent forward/reverse UTF-8 <->
>> UTF-16 mappings. This makes no sense:
>>
>> $ touch 'file-'$'\xed\xa0\x80''.ext'  # creates L"file-\xD800.ext" on NTFS
>>
>> $ strace ls -F
>> ...
>> ... fhandler_disk_file::readdir: 0 = readdir(...) (L"file-\xD800.ext" > "file-\xE2\x9E\xB3.ext")
>> ...
>>   ... stat_worker: -1 = (\??\C:\cygwin64\tmp\file-?.ext,...)
>> ...
>> ls: cannot access 'file-?.ext': No such file or directory
>> file-?.ext
>>
>> $ rm -v 'file-'$'\xed\xa0\x80''.ext'
>> removed 'file-'$'\355\240\200''.ext'
>>
>> The UTF-8 sequence returned by readdir() decodes to U+27B3
>> (White-Feathered Rightwards Arrow).
>>
>>
>> This could be fixed by handling UTF-8 of the surrogate range similar to
>> other invalid sequences: Map each invalid byte to the Unicode range
>> U+F080 to U+F0FF. This works as expected if the above UTF-8 sequence is
>> truncated:
>>
>> $ touch 'file-'$'\xed\xa0''.ext' # creates L"file-\xF0ED\xF0A0.ext" on NTFS
>>
>> $ ls -F
>> 'file-'$'\355\240''.ext'
>
> Surrogate halves are invalid for UTF-8 encoding; they should first be
> encoded as a valid UTF-16 code point.
> The encoder should just fail if it encounters any invalid sequence!
> Handling surrogates or other invalid values as anything other than invalid
> turns the encoding into what has been called WTF-8 where W may be for
> Windows! ;^>

:-)

I guess the idea behind Cygwin's filename mapping was to emulate Linux 
behavior as far as possible. AFAICS, Linux accepts any nonempty byte 
string without slash as a plain filename and leaves the interpretation 
(UTF-8?) to the userland.
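
To make "interpretation in userland" concrete, here is a minimal Python
sketch that checks the test names from this thread against a strict UTF-8
decoder. It is only an illustration: the helper name utf8_problem() is made
up, and nothing here is taken from Cygwin or the Linux kernel.

```python
def utf8_problem(name: bytes) -> str:
    """Return 'ok', or a short note on why name is not strict UTF-8."""
    try:
        name.decode("utf-8")  # strict decoder: rejects truncation and surrogates
        return "ok"
    except UnicodeDecodeError as e:
        bad = name[e.start:e.start + 3]
        # The 3-byte pattern ED A0..BF 80..BF encodes U+D800..U+DFFF,
        # which is the UTF-16 surrogate range and not valid in UTF-8.
        if (len(bad) == 3 and bad[0] == 0xED
                and 0xA0 <= bad[1] <= 0xBF and 0x80 <= bad[2] <= 0xBF):
            cp = ((bad[0] & 0x0F) << 12) | ((bad[1] & 0x3F) << 6) | (bad[2] & 0x3F)
            return "surrogate U+%04X" % cp
        return "invalid sequence at byte %d" % e.start


for n in (b"file1-\xf0\x9f\x8c\xa1.ext",   # complete U+1F321
          b"file2-\xf0\x9f\x8c.ext",       # truncated 4 byte sequence
          b"file-\xed\xa0\x80.ext"):       # encodes high surrogate U+D800
    print(n, "->", utf8_problem(n))
```

On Linux all three byte strings are acceptable file names; only the
userland decoder objects to the last two.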

Cygwin maps 0x20..0x7f and valid UTF-8 sequences to UTF-16. Control 
chars and bytes from invalid UTF-8 sequences are mapped to the U+F0xx 
range. It should handle UTF-8 sequences which lead to the surrogate 
range the same way but currently does not.
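
The mapping described above, with the surrogate range handled like any
other invalid sequence, can be sketched in Python as follows. This is an
illustration under stated assumptions, not Cygwin's actual conversion code:
the names to_utf16() and from_utf16() are made up, anything not mentioned
above is ignored, and a valid UTF-8 encoding of a character that already
lies in U+F000..U+F0FF would collide with the escape range.

```python
from typing import List

ESCAPE_BASE = 0xF000  # assumption from the text: byte 0xNN is stored as U+F0NN


def to_utf16(name: bytes) -> List[int]:
    """Map a file name byte string to UTF-16 code units (sketch)."""
    units: List[int] = []
    i = 0
    while i < len(name):
        b = name[i]
        if 0x20 <= b <= 0x7F:            # passed through unchanged
            units.append(b)
            i += 1
            continue
        # Expected sequence length from the lead byte, 0 if b cannot lead.
        n = 2 if 0xC2 <= b <= 0xDF else 3 if 0xE0 <= b <= 0xEF \
            else 4 if 0xF0 <= b <= 0xF4 else 0
        try:
            # Strict decode: fails for truncated, overlong and surrogate forms.
            cp = ord(name[i:i + n].decode("utf-8")) if n else -1
        except UnicodeDecodeError:
            cp = -1
        if cp < 0:                       # control char or invalid byte
            units.append(ESCAPE_BASE | b)
            i += 1
        elif cp < 0x10000:
            units.append(cp)
            i += n
        else:                            # encode as a surrogate pair
            cp -= 0x10000
            units += [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]
            i += n
    return units


def from_utf16(units: List[int]) -> bytes:
    """Reverse mapping: UTF-16 code units back to the original bytes."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        if 0xF000 <= u <= 0xF0FF:        # escaped byte
            out.append(u & 0xFF)
            i += 1
        elif (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
              and 0xDC00 <= units[i + 1] <= 0xDFFF):
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out += chr(cp).encode("utf-8")
            i += 2
        else:
            out += chr(u).encode("utf-8", "surrogatepass")
            i += 1
    return bytes(out)
```

With this sketch, b'file-\xed\xa0\x80.ext' maps to
L"file-\xF0ED\xF0A0\xF080.ext" instead of L"file-\xD800.ext", and
from_utf16(to_utf16(name)) returns the original bytes for all test names
used in this thread.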


-- 
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple

