Subject: Re: readdir() returns inaccessible name if file was created with invalid UTF-8
From: Christian Franke via Cygwin
To: cygwin AT cygwin DOT com
Date: Thu, 19 Sep 2024 19:30:05 +0200
Message-ID: <11036733-c4f3-c2e9-37c9-959c9e99edab@t-online.de>
In-Reply-To: <984103a4-ab2d-4337-9964-cc1e3208155d@SystematicSW.ab.ca>

Brian Inglis via Cygwin wrote:
> On 2024-09-19 07:27, Christian Franke via Cygwin wrote:
>> Mark Liam Brown via Cygwin wrote:
>>> On Mon, Sep 16, 2024 at 11:51 AM Christian Franke via Cygwin wrote:
>>>> Christian Franke via Cygwin wrote:
>>>>> Thomas Wolff via Cygwin wrote:
>>>>>> On 15.09.2024 at 20:15, Thomas Wolff via Cygwin wrote:
>>>>>>> On 15.09.2024 at 19:47, Christian Franke via Cygwin wrote:
>>>>>>>> If a file name contains an invalid (truncated) UTF-8 sequence,
>>>>>>>> open() does not refuse to create the file. Later readdir()
>>>>>>>> returns a different name which could not be used to access
>>>>>>>> the file.
>>>>>>>>
>>>>>>>> Testcase with U+1F321 (Thermometer):
>>>>>>>>
>>>>>>>> $ uname -r
>>>>>>>> 3.5.4-1.x86_64
>>>>>>>>
>>>>>>>> $ printf $'\U0001F321' | od -A none -t x1
>>>>>>>>  f0 9f 8c a1
>>>>>>>>
>>>>>>>> $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
>>>>>>>>
>>>>>>>> $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
>>>>>>>>
>>>>>>>> $ touch 'file3-'$'\xf0\x9f\x8c'
>>>>>>>>
>>>>>>>> $ ls -1
>>>>>>>> ls: cannot access 'file2-.?ext': No such file or directory
>>>>>>>> ls: cannot access 'file3-': No such file or directory
>>>>>>>> 'file1-'$'\360\237\214\241''.ext'
>>>>>>>> file2-.?ext
>>>>>>>> file3-
>>>>>>> I don't reproduce this.
>>>>> Yes, sorry, the above 'ls' was actually aliased to 'ls --color=auto'
>>>>> which needs to call stat(). Plain 'ls' does not, so the errors do not
>>>>> occur then.
>>>>>
>>>>>
>>>>>>> While the file name gets mangled, all resulting file names are
>>>>>>> valid and listed:
>>>>>>> In file2 the sequence is turned into U+17B3 but exchanged with
>>>>>>> the dot.
>>>>>>> In file3 the same sequence is just dropped.
>>>>>>> $ ls -1|cat
>>>>>>> file1-🌡.ext
>>>>>>> file2-.ឳext
>>>>>>> file3-
>>>>>>>
>>>>>>> However, ls file2* fails, as does ls *.
>>>>>> On the other hand, ls file3- fails too, so some mapping error occurs
>>>>>> internally.
>>>>>> Also, the files cannot be deleted from cygwin (need to use cmd).
>>>>> 'rm' using the original names works for file2-..., but not for
>>>>> file3-...
>>>>>
>>>>> $ rm -v 'file2-'$'\xf0\x9f\x8c''.ext'
>>>>> removed 'file2-'$'\360\237\214''.ext'
>>>>>
>>>>> $ rm -v 'file3-'$'\xf0\x9f\x8c'
>>>>> rm: cannot remove 'file3-'$'\360\237\214': No such file or directory
>>>>>
>>>> Further tests suggest that the problem only occurs with:
>>>> - incomplete 4 byte UTF-8 sequences (Unicode above 16 bit)
>>>> - complete but invalid 3 byte UTF-8 sequences which encode the UTF-16
>>>>   'high surrogate' range (0xD800..0xDBFF).
>>> Makes perfect sense, the Windows kernel uses UTF16 internally.
>>
>>
>> Yes, but Cygwin does not provide consistent forward/reverse UTF-8 <->
>> UTF-16 mappings. This makes no sense:
>>
>> $ touch 'file-'$'\xed\xa0\x80''.ext'  # creates L"file-\xD800.ext" on NTFS
>>
>> $ strace ls -F
>> ...
>> ... fhandler_disk_file::readdir: 0 = readdir(...) (L"file-\xD800.ext" > "file-\xE2\x9E\xB3.ext")
>> ...
>>   ... stat_worker: -1 = (\??\C:\cygwin64\tmp\file-?.ext,...)
>> ...
>> ls: cannot access 'file-?.ext': No such file or directory
>> file-?.ext
>>
>> $ rm -v 'file-'$'\xed\xa0\x80''.ext'
>> removed 'file-'$'\355\240\200''.ext'
>>
>> The UTF-8 sequence returned by readdir() decodes to U+27B3
>> (White-Feathered Rightwards Arrow).
>>
>>
>> This could be fixed by handling UTF-8 of the surrogate range similar
>> to other invalid sequences: Map each invalid byte to Unicode range
>> U+F080 to U+F0FF. This works as expected if the above UTF-8 sequence
>> is truncated:
>>
>> $ touch 'file-'$'\xed\xa0''.ext'  # creates L"file-\xF0ED\xF0A0.ext" on NTFS
>>
>> $ ls -F
>> 'file-'$'\355\240''.ext'
>
> Surrogate halves are invalid for UTF-8 encoding; they should first
> be encoded as a valid UTF-16 code point.
> The encoder should just fail if it encounters any invalid sequence!
> Handling surrogates or other invalid values as anything other than
> invalid turns the encoding into what has been called WTF-8 where W may
> be for Windows! ;^> :-)

I guess the idea behind Cygwin's filename mapping was to emulate Linux
behavior as far as possible. AFAICS, Linux accepts any nonempty byte
string without a slash as a plain filename and leaves the interpretation
(UTF-8?) to userland.

Cygwin maps 0x20..0x7f and valid UTF-8 sequences to UTF-16. Control
chars and bytes from invalid UTF-8 sequences are mapped to the U+F0xx
range. UTF-8 sequences that decode into the surrogate range should be
handled the same way, but currently are not.
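
To make the proposed behavior concrete, here is a minimal sketch of the
mapping described above. It is written in Python for brevity and is not
Cygwin's actual code; the names name_to_utf16, PRIVATE_BASE and the
"f0xx-sketch" error handler are made up for illustration. It assumes
that "U+F0xx" means U+F000 plus the byte value, which matches the
L"file-\xF0ED\xF0A0.ext" example above. Python's strict UTF-8 decoder
already rejects sequences that decode into the surrogate range, so those
bytes take the same path as any other invalid byte, which is the
handling suggested in this thread.

```python
import codecs
import struct

PRIVATE_BASE = 0xF000  # assumption: "U+F0xx" means U+F000 + byte value


def _map_invalid_bytes(exc):
    """Decode error handler: map every undecodable byte to U+F000 + byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PRIVATE_BASE + b) for b in bad), exc.end


codecs.register_error("f0xx-sketch", _map_invalid_bytes)


def name_to_utf16(name: bytes) -> list:
    """Return the UTF-16 code units for a POSIX file name (sketch only)."""
    # Valid UTF-8 decodes normally; truncated sequences and sequences in
    # the surrogate range are rejected by the strict decoder and end up
    # in the error handler above.
    text = name.decode("utf-8", errors="f0xx-sketch")
    # Control characters below 0x20 are moved to the U+F0xx range as well.
    text = "".join(
        chr(PRIVATE_BASE + ord(c)) if ord(c) < 0x20 else c for c in text
    )
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


if __name__ == "__main__":
    for sample in (b"file-\xed\xa0.ext", b"file-\xed\xa0\x80.ext"):
        print(sample, [hex(u) for u in name_to_utf16(sample)])
```

With this sketch the truncated name b"file-\xed\xa0.ext" yields 0xF0ED
0xF0A0 between "file-" and ".ext", as in the NTFS name shown above, and
the complete surrogate sequence \xed\xa0\x80 yields 0xF0ED 0xF0A0 0xF080
instead of a lone 0xD800. Since every input byte maps to a distinct
code unit sequence, the reverse mapping in readdir() could recover the
original bytes.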