Subject: Re: readdir() returns inaccessible name if file was created with invalid UTF-8
To: cygwin AT cygwin DOT com
References: <96f2253b-791b-b8a0-97dd-8d257eefb9b1 AT t-online DOT de>
 <bc8bd61c-818e-424f-bb42-52f4fecd4849 AT towo DOT net>
 <b6ab074b-919e-4514-8276-72a30c36ab58 AT towo DOT net>
 <de4767e2-85b7-ead2-df9a-64e1f24f4e8f AT t-online DOT de>
 <6451a249-adcd-9c56-b76e-1b00886cea80 AT t-online DOT de>
 <CAN0SSYx+g4JE6AA6krNAzG6QXrve52TBv0d3VM0SODV-tzZQSQ AT mail DOT gmail DOT com>
 <66051d82-e2c3-684f-d13f-d1301170b0d4 AT t-online DOT de>
 <984103a4-ab2d-4337-9964-cc1e3208155d AT SystematicSW DOT ab DOT ca>
Message-ID: <11036733-c4f3-c2e9-37c9-959c9e99edab@t-online.de>
Date: Thu, 19 Sep 2024 19:30:05 +0200
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
 SeaMonkey/2.53.18.2
MIME-Version: 1.0
In-Reply-To: <984103a4-ab2d-4337-9964-cc1e3208155d@SystematicSW.ab.ca>
List-Id: General Cygwin discussions and problem reports <cygwin.cygwin.com>
List-Unsubscribe: <https://cygwin.com/mailman/options/cygwin>,
 <mailto:cygwin-request AT cygwin DOT com?subject=unsubscribe>
List-Archive: <https://cygwin.com/pipermail/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-request AT cygwin DOT com?subject=help>
List-Subscribe: <https://cygwin.com/mailman/listinfo/cygwin>,
 <mailto:cygwin-request AT cygwin DOT com?subject=subscribe>
From: Christian Franke via Cygwin <cygwin AT cygwin DOT com>
Reply-To: cygwin AT cygwin DOT com
Cc: Christian Franke <Christian DOT Franke AT t-online DOT de>

Brian Inglis via Cygwin wrote:
> On 2024-09-19 07:27, Christian Franke via Cygwin wrote:
>> Mark Liam Brown via Cygwin wrote:
>>> On Mon, Sep 16, 2024 at 11:51 AM Christian Franke via Cygwin
>>> <cygwin AT cygwin DOT com> wrote:
>>>> Christian Franke via Cygwin wrote:
>>>>> Thomas Wolff via Cygwin wrote:
>>>>>> On 15.09.2024 at 20:15, Thomas Wolff via Cygwin wrote:
>>>>>>> On 15.09.2024 at 19:47, Christian Franke via Cygwin wrote:
>>>>>>>> If a file name contains an invalid (truncated) UTF-8 sequence,
>>>>>>>> open() does not refuse to create the file. Later readdir() returns
>>>>>>>> a different name which cannot be used to access the file.
>>>>>>>>
>>>>>>>> Testcase with U+1F321 (Thermometer):
>>>>>>>>
>>>>>>>> $ uname -r
>>>>>>>> 3.5.4-1.x86_64
>>>>>>>>
>>>>>>>> $ printf $'\U0001F321' | od -A none -t x1
>>>>>>>>  f0 9f 8c a1
>>>>>>>>
>>>>>>>> $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
>>>>>>>>
>>>>>>>> $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
>>>>>>>>
>>>>>>>> $ touch 'file3-'$'\xf0\x9f\x8c'
>>>>>>>>
>>>>>>>> $ ls -1
>>>>>>>> ls: cannot access 'file2-.?ext': No such file or directory
>>>>>>>> ls: cannot access 'file3-': No such file or directory
>>>>>>>> 'file1-'$'\360\237\214\241''.ext'
>>>>>>>> file2-.?ext
>>>>>>>> file3-
>>>>>>> I don't reproduce this.
>>>>> Yes, sorry, the above 'ls' was actually aliased to 'ls --color=auto'
>>>>> which needs to call stat(). Plain 'ls' does not, so the errors do not
>>>>> occur then.
>>>>>
>>>>>>> While the file name gets mangled, all resulting file names are
>>>>>>> valid and listed:
>>>>>>> In file2 the sequence is turned into U+17B3 but exchanged with
>>>>>>> the dot.
>>>>>>> In file3 the same sequence is just dropped.
>>>>>>> $ ls -1|cat
>>>>>>> file1-🌡.ext
>>>>>>> file2-.ឳext
>>>>>>> file3-
>>>>>>>
>>>>>>> However, ls file2* fails, as does ls *.
>>>>>> On the other hand, ls file3- fails too, so some mapping error occurs
>>>>>> internally.
>>>>>> Also, the files cannot be deleted from cygwin (need to use cmd).
>>>>> 'rm' using the original names works for file2-..., but not for
>>>>> file3-...
>>>>>
>>>>> $ rm -v 'file2-'$'\xf0\x9f\x8c''.ext'
>>>>> removed 'file2-'$'\360\237\214''.ext'
>>>>>
>>>>> $ rm -v 'file3-'$'\xf0\x9f\x8c'
>>>>> rm: cannot remove 'file3-'$'\360\237\214': No such file or directory
>>>>>
>>>> Further tests suggest that the problem only occurs with:
>>>> - incomplete 4 byte UTF-8 sequences (Unicode above 16 bit)
>>>> - complete but invalid 3 byte UTF-8 sequences which encode the UTF-16
>>>>   'high surrogate' range (0xD800..0xDBFF).
>>> Makes perfect sense, the Windows kernel uses UTF16 internally.
>>
>> Yes, but Cygwin does not provide consistent forward/reverse UTF-8 <->
>> UTF-16 mappings. This makes no sense:
>>
>> $ touch 'file-'$'\xed\xa0\x80''.ext' # creates L"file-\xD800.ext" on NTFS
>>
>> $ strace ls -F
>> ...
>> ... fhandler_disk_file::readdir: 0 = readdir(...) (L"file-\xD800.ext" > "file-\xE2\x9E\xB3.ext")
>> ...
>>   ... stat_worker: -1 = (\??\C:\cygwin64\tmp\file-?.ext,...)
>> ...
>> ls: cannot access 'file-?.ext': No such file or directory
>> file-?.ext
>>
>> $ rm -v 'file-'$'\xed\xa0\x80''.ext'
>> removed 'file-'$'\355\240\200''.ext'
>>
>> The UTF-8 sequence returned by readdir() decodes to U+27B3
>> (White-Feathered Rightwards Arrow).
>>
>> This could be fixed by handling UTF-8 of the surrogate range similar
>> to other invalid sequences: Map each invalid byte to Unicode range
>> U+F080 to U+F0FF. This works as expected if the above UTF-8 sequence
>> is truncated:
>>
>> $ touch 'file-'$'\xed\xa0''.ext' # creates L"file-\xF0ED\xF0A0.ext" on NTFS
>>
>> $ ls -F
>> 'file-'$'\355\240''.ext'
>
> Surrogate halves are invalid for UTF-8 encoding; they should first be
> encoded as a valid UTF-16 code point.
> The encoder should just fail if it encounters any invalid sequence!
> Handling surrogates or other invalid values as anything other than
> invalid turns the encoding into what has been called WTF-8 where W may
> be for Windows! ;^>

:-)

I guess the idea behind Cygwin's filename mapping was to emulate Linux
behavior as far as possible. AFAICS, Linux accepts any nonempty byte
string without slash as a plain filename and leaves the interpretation
(UTF-8?) to the userland.

Cygwin maps 0x20..0x7f and valid UTF-8 sequences to UTF-16. Control
chars and bytes from invalid UTF-8 sequences are mapped to the U+F0xx
range. It should handle UTF-8 sequences which lead to the surrogate
range the same way but currently does not.
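
To make the mapping described above concrete, here is a minimal Python sketch of the intended behavior. It is only an illustration under stated assumptions, not Cygwin's actual code: the names map_name_to_utf16 and _decode_one are hypothetical, and the sketch treats every byte that is not part of a strictly valid UTF-8 sequence (including sequences that would decode into the surrogate range) as invalid and maps it to 0xF000 | byte. For the truncated 4 byte sequences reported earlier in this thread, Cygwin 3.5.4 evidently behaves differently from this sketch.

```python
from typing import Optional


def _decode_one(name: bytes, i: int) -> Optional[tuple[int, int]]:
    """Return (code point, length) of a strictly valid multi-byte UTF-8
    sequence starting at name[i], or None if there is none.

    Hypothetical helper. Python's strict decoder rejects truncated and
    overlong sequences as well as encoded surrogates such as ED A0 80.
    """
    for n in (2, 3, 4):
        chunk = name[i:i + n]
        if len(chunk) < n:
            return None  # not enough bytes left for a longer sequence
        try:
            return ord(chunk.decode("utf-8")), n
        except UnicodeDecodeError:
            continue
    return None


def map_name_to_utf16(name: bytes) -> list[int]:
    """Map a raw POSIX file name to UTF-16 code units (illustrative only)."""
    units: list[int] = []
    i = 0
    while i < len(name):
        b = name[i]
        if b < 0x80:
            # 0x20..0x7f pass through; control chars go to U+F0xx.
            units.append(b if b >= 0x20 else 0xF000 | b)
            i += 1
            continue
        decoded = _decode_one(name, i)
        if decoded is None:
            # Byte from an invalid sequence: map to U+F080..U+F0FF.
            units.append(0xF000 | b)
            i += 1
            continue
        cp, n = decoded
        if cp >= 0x10000:
            # Above 16 bit: emit a UTF-16 surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
        else:
            units.append(cp)
        i += n
    return units


if __name__ == "__main__":
    for raw in (b"file-\xed\xa0.ext", b"file-\xed\xa0\x80.ext"):
        print(raw, [f"{u:04X}" for u in map_name_to_utf16(raw)])
```

Because every input byte ends up either in a valid code point or in its own U+F0xx code unit, the reverse mapping in readdir() could reproduce the original byte string exactly, which is the consistency the test cases above are missing.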