Subject: Re: readdir() returns inaccessible name if file was created with invalid UTF-8
To: cygwin AT cygwin DOT com
References: <96f2253b-791b-b8a0-97dd-8d257eefb9b1 AT t-online DOT de> <6451a249-adcd-9c56-b76e-1b00886cea80 AT t-online DOT de>
Message-ID: <66051d82-e2c3-684f-d13f-d1301170b0d4@t-online.de>
Date: Thu, 19 Sep 2024 15:27:29 +0200
From: Christian Franke via Cygwin
Reply-To: cygwin AT cygwin DOT com

Mark Liam Brown via Cygwin wrote:
> On Mon, Sep 16, 2024 at 11:51 AM Christian Franke via Cygwin wrote:
>> Christian Franke via Cygwin wrote:
>>> Thomas Wolff via Cygwin wrote:
>>>> On 15.09.2024 at 20:15, Thomas Wolff via Cygwin wrote:
>>>>> On 15.09.2024 at 19:47, Christian Franke via Cygwin wrote:
>>>>>> If a file name contains an invalid (truncated) UTF-8 sequence, open()
>>>>>> does not refuse to create the file. Later readdir() returns a
>>>>>> different name which could not be used to access the file.
>>>>>>
>>>>>> Testcase with U+1F321 (Thermometer):
>>>>>>
>>>>>> $ uname -r
>>>>>> 3.5.4-1.x86_64
>>>>>>
>>>>>> $ printf $'\U0001F321' | od -A none -t x1
>>>>>> f0 9f 8c a1
>>>>>>
>>>>>> $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
>>>>>>
>>>>>> $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
>>>>>>
>>>>>> $ touch 'file3-'$'\xf0\x9f\x8c'
>>>>>>
>>>>>> $ ls -1
>>>>>> ls: cannot access 'file2-.?ext': No such file or directory
>>>>>> ls: cannot access 'file3-': No such file or directory
>>>>>> 'file1-'$'\360\237\214\241''.ext'
>>>>>> file2-.?ext
>>>>>> file3-
>>>>> I don't reproduce this.
>>> Yes, sorry, the above 'ls' was actually aliased to 'ls --color=auto'
>>> which needs to call stat(). Plain 'ls' does not, so the errors do not
>>> occur then.
>>>
>>>>> While the file name gets mangled, all resulting file names are valid
>>>>> and listed:
>>>>> In file2 the sequence is turned into U+17B3 but exchanged with the dot.
>>>>> In file3 the same sequence is just dropped.
>>>>> $ ls -1|cat
>>>>> file1-🌡.ext
>>>>> file2-.ឳext
>>>>> file3-
>>>>>
>>>>> However, ls file2* fails, as does ls *.
>>>> On the other hand, ls file3- fails too, so some mapping error occurs
>>>> internally.
>>>> Also, the files cannot be deleted from cygwin (need to use cmd).
>>> 'rm' using the original names works for file2-..., but not for file3-...
>>>
>>> $ rm -v 'file2-'$'\xf0\x9f\x8c''.ext'
>>> removed 'file2-'$'\360\237\214''.ext'
>>>
>>> $ rm -v 'file3-'$'\xf0\x9f\x8c'
>>> rm: cannot remove 'file3-'$'\360\237\214': No such file or directory
>>>
>> Further tests suggest that the problem only occurs with:
>> - incomplete 4 byte UTF-8 sequences (Unicode above 16 bit)
>> - complete but invalid 3 byte UTF-8 sequences which encode the UTF-16
>>   'high surrogate' range (0xD800..0xDBFF).
> Makes perfect sense, the Windows kernel uses UTF16 internally.

Yes, but Cygwin does not provide consistent forward/reverse UTF-8 <-> UTF-16 mappings.
This makes no sense:

$ touch 'file-'$'\xed\xa0\x80''.ext'  # creates L"file-\xD800.ext" on NTFS

$ strace ls -F
...
... fhandler_disk_file::readdir: 0 = readdir(...) (L"file-\xD800.ext" > "file-\xE2\x9E\xB3.ext")
...
... stat_worker: -1 = (\??\C:\cygwin64\tmp\file-?.ext,...)
...
ls: cannot access 'file-?.ext': No such file or directory
file-?.ext

$ rm -v 'file-'$'\xed\xa0\x80''.ext'
removed 'file-'$'\355\240\200''.ext'

The UTF-8 sequence returned by readdir() decodes to U+27B3 (White-Feathered Rightwards Arrow).

This could be fixed by handling UTF-8 of the surrogate range similar to other invalid sequences: map each invalid byte to the Unicode range U+F080 to U+F0FF. This works as expected if the above UTF-8 sequence is truncated:

$ touch 'file-'$'\xed\xa0''.ext'  # creates L"file-\xF0ED\xF0A0.ext" on NTFS

$ ls -F
'file-'$'\355\240''.ext'

-- 
Regards,
Christian
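
To illustrate the proposal, here is a minimal Python sketch of a per-byte mapping that round-trips. It is not Cygwin's conversion code. It assumes that an invalid byte 0xNN is stored as the code unit 0xF0NN, which is what the truncated-sequence example shows (L"file-\xF0ED\xF0A0.ext"). The names to_wide(), to_bytes() and "private_use_sketch" are made up for this example. Python's strict UTF-8 decoder already rejects the surrogate encoding ED A0 80, so the surrogate case takes the same path as any other invalid sequence.

```python
import codecs


def _invalid_bytes_to_private_use(exc):
    """Decode error handler: replace each undecodable byte 0xNN with U+F0NN.

    Assumption for this sketch: the private-use mapping 0xF000 | byte.
    """
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(0xF000 | b) for b in bad), exc.end


# Hypothetical handler name, registered only for this example.
codecs.register_error("private_use_sketch", _invalid_bytes_to_private_use)


def to_wide(name: bytes) -> str:
    """Forward mapping: POSIX file name bytes -> wide (Unicode) name."""
    return name.decode("utf-8", errors="private_use_sketch")


def to_bytes(wide: str) -> bytes:
    """Reverse mapping: wide name -> the original bytes."""
    out = bytearray()
    for ch in wide:
        cp = ord(ch)
        if 0xF080 <= cp <= 0xF0FF:
            out.append(cp & 0xFF)      # undo the per-byte mapping
        else:
            out += ch.encode("utf-8")  # ordinary character
    return bytes(out)


if __name__ == "__main__":
    names = [
        b"file1-\xf0\x9f\x8c\xa1.ext",  # valid U+1F321
        b"file2-\xf0\x9f\x8c.ext",      # truncated 4 byte sequence
        b"file3-\xf0\x9f\x8c",          # truncated at end of name
        b"file-\xed\xa0\x80.ext",       # encodes high surrogate U+D800
        b"file-\xed\xa0.ext",           # truncated surrogate encoding
    ]
    for name in names:
        wide = to_wide(name)
        # The name returned by the reverse mapping must open the same file.
        assert to_bytes(wide) == name
        print(name, "->", [hex(ord(c)) for c in wide if ord(c) > 0x7F])
```

One limitation of such a scheme: a file name that really contains characters in U+F080..U+F0FF cannot be told apart from a mapped invalid byte, so the reverse mapping is only exact for names created through the forward mapping.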