From: Brian Inglis via Cygwin
Reply-To: cygwin AT cygwin DOT com
To: cygwin AT cygwin DOT com
Cc: Brian Inglis
Date: Thu, 19 Sep 2024 13:57:50 -0600
Subject: Re: readdir() returns inaccessible name if file was created with invalid UTF-8
Message-ID: <099234aa-e500-4814-a0ae-7b3bb7fbd19f@SystematicSW.ab.ca>
In-Reply-To: <11036733-c4f3-c2e9-37c9-959c9e99edab@t-online.de>
Organization: Systematic Software

On 2024-09-19 11:30, Christian Franke via Cygwin wrote:
> Brian Inglis via Cygwin wrote:
>> On 2024-09-19 07:27, Christian Franke via Cygwin wrote:
>>> Mark Liam Brown via Cygwin wrote:
>>>> On Mon, Sep 16, 2024 at 11:51 AM Christian Franke via Cygwin wrote:
>>>>> Christian Franke via Cygwin wrote:
>>>>>> Thomas Wolff via Cygwin wrote:
>>>>>>> On 15.09.2024 at 20:15, Thomas Wolff via Cygwin wrote:
>>>>>>>> On 15.09.2024 at 19:47, Christian Franke via Cygwin wrote:
>>>>>>>>> If a file name contains an invalid (truncated) UTF-8 sequence, open()
>>>>>>>>> does not refuse to create the file. Later readdir() returns a
>>>>>>>>> different name which could not be used to access the file.
>>>>>>>>>
>>>>>>>>> Testcase with U+1F321 (Thermometer):
>>>>>>>>>
>>>>>>>>> $ uname -r
>>>>>>>>> 3.5.4-1.x86_64
>>>>>>>>>
>>>>>>>>> $ printf $'\U0001F321' | od -A none -t x1
>>>>>>>>>   f0 9f 8c a1
>>>>>>>>>
>>>>>>>>> $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
>>>>>>>>>
>>>>>>>>> $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
>>>>>>>>>
>>>>>>>>> $ touch 'file3-'$'\xf0\x9f\x8c'
>>>>>>>>>
>>>>>>>>> $ ls -1
>>>>>>>>> ls: cannot access 'file2-.?ext': No such file or directory
>>>>>>>>> ls: cannot access 'file3-': No such file or directory
>>>>>>>>> 'file1-'$'\360\237\214\241''.ext'
>>>>>>>>> file2-.?ext
>>>>>>>>> file3-
>>>>>>>> I don't reproduce this.
>>>>>> Yes, sorry, the above 'ls' was actually aliased to 'ls --color=auto'
>>>>>> which needs to call stat(). Plain 'ls' does not, so the errors do not
>>>>>> occur then.
>>>>>>
>>>>>>
>>>>>>>> While the file name gets mangled, all resulting file names are valid
>>>>>>>> and listed:
>>>>>>>> In file2 the sequence is turned into U+17B3 but exchanged with the dot.
>>>>>>>> In file3 the same sequence is just dropped.
>>>>>>>> $ ls -1|cat
>>>>>>>> file1-🌡.ext
>>>>>>>> file2-.ឳext
>>>>>>>> file3-
>>>>>>>>
>>>>>>>> However, ls file2* fails, as does ls *.
>>>>>>> On the other hand, ls file3- fails too, so some mapping error occurs
>>>>>>> internally.
>>>>>>> Also, the files cannot be deleted from cygwin (need to use cmd).
>>>>>> 'rm' using the original names works for file2-..., but not for file3-...
>>>>>>
>>>>>> $ rm -v 'file2-'$'\xf0\x9f\x8c''.ext'
>>>>>> removed 'file2-'$'\360\237\214''.ext'
>>>>>>
>>>>>> $ rm -v 'file3-'$'\xf0\x9f\x8c'
>>>>>> rm: cannot remove 'file3-'$'\360\237\214': No such file or directory
>>>>>>
>>>>> Further tests suggest that the problem only occurs with:
>>>>> - incomplete 4 byte UTF-8 sequences (Unicode above 16 bit)
>>>>> - complete but invalid 3 byte UTF-8 sequences which encode the UTF-16
>>>>> 'high surrogate' range (0xD800..0xDBFF).
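The two failing cases listed above can be told apart with a strict decoder. What follows is only a minimal sketch using Python's built-in UTF-8 codec, not Cygwin's conversion code; the helper names `is_strict_utf8` and `encodes_high_surrogate` are made up for illustration, and the byte strings are the test names from the thread.

```python
def is_strict_utf8(raw: bytes) -> bool:
    """True if raw is well-formed UTF-8 (no truncation, no surrogates)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def encodes_high_surrogate(raw: bytes) -> bool:
    """True if raw decodes only when surrogates are let through and the
    result contains a code unit in the high surrogate range."""
    try:
        # 'surrogatepass' accepts 3-byte encodings of U+D800..U+DFFF but
        # still rejects truncated or otherwise malformed sequences.
        text = raw.decode("utf-8", "surrogatepass")
    except UnicodeDecodeError:
        return False
    return any(0xD800 <= ord(ch) <= 0xDBFF for ch in text)


if __name__ == "__main__":
    samples = [
        b"file1-\xf0\x9f\x8c\xa1.ext",  # complete 4-byte sequence
        b"file2-\xf0\x9f\x8c.ext",      # truncated 4-byte sequence
        b"file3-\xf0\x9f\x8c",          # truncated at end of name
        b"file-\xed\xa0\x80.ext",       # 3-byte encoding of U+D800
    ]
    for raw in samples:
        print(raw, is_strict_utf8(raw), encodes_high_surrogate(raw))
```

Only the first name passes the strict check. The last one is the case that a lenient decoder turns into a lone UTF-16 surrogate.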
>>>> Makes perfect sense, the Windows kernel uses UTF16 internally.
>>>
>>>
>>> Yes, but Cygwin does not provide consistent forward/reverse UTF-8 <-> UTF-16
>>> mappings. This makes no sense:
>>>
>>> $ touch 'file-'$'\xed\xa0\x80''.ext'  # creates L"file-\xD800.ext" on NTFS
>>>
>>> $ strace ls -F
>>> ...
>>> ... fhandler_disk_file::readdir: 0 = readdir(...) (L"file-\xD800.ext" >
>>> "file-\xE2\x9E\xB3.ext")
>>> ...
>>>   ... stat_worker: -1 = (\??\C:\cygwin64\tmp\file-?.ext,...)
>>> ...
>>> ls: cannot access 'file-?.ext': No such file or directory
>>> file-?.ext
>>>
>>> $ rm -v 'file-'$'\xed\xa0\x80''.ext'
>>> removed 'file-'$'\355\240\200''.ext'
>>>
>>> The UTF-8 sequence returned by readdir() decodes to U+27B3 (White-Feathered
>>> Rightwards Arrow).
>>>
>>>
>>> This could be fixed by handling UTF-8 of the surrogate range similar to other
>>> invalid sequences: Map each invalid byte to unicode range U+FF80 to U+FFFF.
>>> This works as expected if the above UTF-8 sequence is truncated:
>>>
>>> $ touch 'file-'$'\xed\xa0''.ext' # creates L"file-\xF0ED\xF0A0.ext" on NTFS
>>>
>>> $ ls -F
>>> 'file-'$'\355\240''.ext'
>>
>> Surrogate halves are invalid for UTF-8 encoding; they should first be
>> encoded as a valid UTF-16 code point.
>> The encoder should just fail if it encounters any invalid sequence!
>> Handling surrogates or other invalid values as anything other than invalid
>> turns the encoding into what has been called WTF-8 where W may be for Windows!
>> ;^>
>
> :-)
>
> I guess the idea behind Cygwin's filename mapping was to emulate Linux behavior
> as far as possible. AFAICS, Linux accepts any nonempty byte string without slash
> as a plain filename and leaves the interpretation (UTF-8?) to the userland.
>
> Cygwin maps 0x20..0x7f and valid UTF-8 sequences to UTF-16. Control chars and
> bytes from invalid UTF-8 sequences are mapped to the U+F0xx range.
> It should handle UTF-8 sequences which lead to the surrogate range the same way
> but currently does not.

Windows allowing random legacy UCS-2 code points in what are meant to be UTF-16
character strings is a security issue, and should be prevented and discouraged
by all possible means.

This could be exploited much like the homograph attacks that IDN DNS names
allow with, for example, Cyrillic letters like "а с е һ і ј ӏ о р ѕ ѵ ѡ х у":
try https://суgѡіn.соm/ for example.

I had a similar issue with man pages: hyphens in URLs, file paths, and long
options were rendered as U+2010 HYPHEN rather than ASCII U+002D HYPHEN-MINUS
when not escaped as "\-", so some URLs, file paths, and all long options were
not usable when copied and pasted: check with
`cat -A <<< '--long-option-name'`; better to try `command --help` for those.

--
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer     but when there is no more to cut
                                -- Antoine de Saint-Exupéry

--
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
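The copy-and-paste check mentioned in the reply can also be done programmatically. This is a rough sketch only, assuming Python's standard `unicodedata` module; the function name `non_ascii_report` is hypothetical and not part of any Cygwin or man-page tool. It lists every non-ASCII character in a string with its code point and Unicode name, which exposes both a U+2010 hyphen in a pasted option and Cyrillic look-alikes in a host name.

```python
import unicodedata


def non_ascii_report(text: str) -> list:
    """Return (index, 'U+XXXX', Unicode name) for each non-ASCII character."""
    report = []
    for index, char in enumerate(text):
        if ord(char) > 0x7F:
            # Fall back to a placeholder for characters without a name.
            name = unicodedata.name(char, "<unnamed>")
            report.append((index, f"U+{ord(char):04X}", name))
    return report


if __name__ == "__main__":
    # A long option as it might be pasted from a rendered man page:
    # two U+2010 HYPHEN characters instead of ASCII hyphen-minus.
    for entry in non_ascii_report("\u2010\u2010long-option-name"):
        print(entry)
    # A host name whose first letter is Cyrillic U+0441, not Latin 'c'.
    for entry in non_ascii_report("\u0441ygwin.com"):
        print(entry)
```

An empty result means the string is pure ASCII. Any entry is worth a second look before the text is pasted into a shell or a browser.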