Date: | Fri, 27 Jun 2025 12:30:47 +0200 |
To: | cygwin AT cygwin DOT com |
Subject: | Re: readdir() returns inaccessible name if file was created with invalid UTF-8 |
Message-ID: | <aF5y15iQ840LxLYJ@calimero.vinschen.de> |
Mail-Followup-To: | cygwin AT cygwin DOT com |
References: | <96f2253b-791b-b8a0-97dd-8d257eefb9b1 AT t-online DOT de> |
<03c4fae7-7322-572c-ae72-52e300f0b438 AT t-online DOT de> | |
<aFxRfI4NdZ8y5IlK AT calimero DOT vinschen DOT de> | |
<f78c615c-aefe-b3d0-aada-5f9d0cf73a0a AT t-online DOT de> | |
MIME-Version: | 1.0 |
In-Reply-To: | <f78c615c-aefe-b3d0-aada-5f9d0cf73a0a@t-online.de> |
X-BeenThere: | cygwin AT cygwin DOT com |
X-Mailman-Version: | 2.1.30 |
List-Id: | General Cygwin discussions and problem reports <cygwin.cygwin.com> |
List-Unsubscribe: | <https://cygwin.com/mailman/options/cygwin>, |
<mailto:cygwin-request AT cygwin DOT com?subject=unsubscribe> | |
List-Archive: | <https://cygwin.com/pipermail/cygwin/> |
List-Post: | <mailto:cygwin AT cygwin DOT com> |
List-Help: | <mailto:cygwin-request AT cygwin DOT com?subject=help> |
List-Subscribe: | <https://cygwin.com/mailman/listinfo/cygwin>, |
<mailto:cygwin-request AT cygwin DOT com?subject=subscribe> | |
From: | Corinna Vinschen via Cygwin <cygwin AT cygwin DOT com> |
Reply-To: | cygwin AT cygwin DOT com |
Cc: | Corinna Vinschen <corinna-cygwin AT cygwin DOT com> |
Errors-To: | cygwin-bounces~archive-cygwin=delorie DOT com AT cygwin DOT com |
Sender: | "Cygwin" <cygwin-bounces~archive-cygwin=delorie DOT com AT cygwin DOT com> |
X-MIME-Autoconverted: | from base64 to 8bit by delorie.com id 55RAVDld1395510 |
Hi Christian,

On Jun 26 19:07, Christian Franke via Cygwin wrote:
> Corinna Vinschen via Cygwin wrote:
> > On Jun 25 16:59, Christian Franke via Cygwin wrote:
> > > On Sun, 15 Sep 2024 19:47:11 +0200, Christian Franke wrote:
> > > > If a file name contains an invalid (truncated) UTF-8 sequence, open()
> > > > does not refuse to create the file. Later readdir() returns a different
> > > > name which could not be used to access the file.
> > > >
> > > > Testcase with U+1F321 (Thermometer):
> > > >
> > > > $ uname -r
> > > > 3.5.4-1.x86_64
> > > >
> > > > $ printf $'\U0001F321' | od -A none -t x1
> > > >  f0 9f 8c a1
> > > >
> > > > $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
> > > >
> > > > $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
> > > >
> > > > $ touch 'file3-'$'\xf0\x9f\x8c'
> > > >
> > > > $ ls -1
> > > > ls: cannot access 'file2-.?ext': No such file or directory
> > > > ls: cannot access 'file3-': No such file or directory
> > > > 'file1-'$'\360\237\214\241''.ext'
> > > > file2-.?ext
> > > > file3-
> > > > [...]
> > I don't know exactly where this happens, but the input of the
> > conversion is invalid UTF-8 because it's missing the 4th byte.
> > There's no way to represent these filenames on Windows
> > filesystems storing filenames as UTF-16 values.
> >
> > So the problem here is that the conversion somehow misses that
> > the 4th byte is invalid and just plods forward and converts the
> > leading three bytes into the matching high surrogate value and
> > then stumbles over the conversion for the low surrogate.
> >
> > It would be really helpful to have an STC for this problem.
>
> With some trial and error I found a testcase for this more serious problem
> reported yesterday but not quoted above:
>
> > > In cases like file3-... above, the converted Windows path ends with
> > > 0xF000. This suggests that this is an accidental conversion of the
> > > terminating null to the 0xF0xx range.
> > >
> > > In some cases, the created Windows file name has random garbage
> > > behind the 0xF000. Then even Cygwin is not able to access or unlink
> > > the file after creation.
>
> Testcase (attached):

Thanks for the testcase!

I found the problem in the newlib core function creating wchar_t from
UTF-8 input. In case of 4 byte UTF-8 sequences, the code created the
low surrogate already after reading byte 3, without checking if byte 4
of the UTF-8 sequence is a valid byte. Hilarity ensues.

Fortunately this bug has only been introduced very recently, to wit,
on 2009-03-24, a mere 16 years ago. And it is my bug and mine alone :}

I'm just prep'ing a fix which I'll push in a minute or two.


Thanks,
Corinna

-- 
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
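
For readers following the thread, here is a minimal Python sketch of the principle described in the message: when converting UTF-8 to UTF-16, check every byte of a 4 byte sequence before emitting either surrogate. This is not the newlib code and not the actual fix. The function name utf8_to_utf16_units is hypothetical, and raising an error on bad input is a simplification chosen for this sketch; the message does not say how the fix treats the invalid bytes.

```python
def utf8_to_utf16_units(data: bytes) -> list:
    """Convert UTF-8 bytes to a list of UTF-16 code units.

    Hypothetical helper for illustration only. Raises ValueError on
    malformed input instead of emitting a partial surrogate pair.
    """
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Determine sequence length, payload bits of the lead byte, and the
        # smallest code point allowed for that length (rejects overlongs).
        if lead < 0x80:
            length, cp, min_cp = 1, lead, 0x0
        elif 0xC2 <= lead <= 0xDF:
            length, cp, min_cp = 2, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:
            length, cp, min_cp = 3, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF4:
            length, cp, min_cp = 4, lead & 0x07, 0x10000
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02x} at offset {i}")

        tail = data[i + 1:i + length]
        # Key step: validate ALL continuation bytes (including byte 4)
        # before any code unit for this sequence is produced.
        if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError(f"truncated or invalid sequence at offset {i}")

        for b in tail:
            cp = (cp << 6) | (b & 0x3F)
        if cp < min_cp or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"invalid code point at offset {i}")

        if cp >= 0x10000:
            # Split into a surrogate pair only after full validation.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))    # high surrogate
            units.append(0xDC00 | (cp & 0x3FF))  # low surrogate
        else:
            units.append(cp)
        i += length
    return units


if __name__ == "__main__":
    # Byte sequences taken from the testcase quoted in the thread.
    for name in (b"file1-\xf0\x9f\x8c\xa1.ext",
                 b"file2-\xf0\x9f\x8c.ext",
                 b"file3-\xf0\x9f\x8c"):
        try:
            print([hex(u) for u in utf8_to_utf16_units(name)])
        except ValueError as err:
            print("rejected:", err)
```

With the complete sequence f0 9f 8c a1 the sketch yields the pair 0xD83C 0xDF21 for U+1F321. The file2 and file3 names from the testcase are rejected as a whole, so no lone high surrogate is ever produced.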