Mail Archives: cygwin/2025/06/27/06:31:13

Date: Fri, 27 Jun 2025 12:30:47 +0200
To: cygwin AT cygwin DOT com
Subject: Re: readdir() returns inaccessible name if file was created with invalid UTF-8
Message-ID: <aF5y15iQ840LxLYJ@calimero.vinschen.de>
Mail-Followup-To: cygwin AT cygwin DOT com
References: <96f2253b-791b-b8a0-97dd-8d257eefb9b1 AT t-online DOT de>
<03c4fae7-7322-572c-ae72-52e300f0b438 AT t-online DOT de>
<aFxRfI4NdZ8y5IlK AT calimero DOT vinschen DOT de>
<f78c615c-aefe-b3d0-aada-5f9d0cf73a0a AT t-online DOT de>
MIME-Version: 1.0
In-Reply-To: <f78c615c-aefe-b3d0-aada-5f9d0cf73a0a@t-online.de>
From: Corinna Vinschen via Cygwin <cygwin AT cygwin DOT com>
Reply-To: cygwin AT cygwin DOT com
Cc: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>

Hi Christian,

On Jun 26 19:07, Christian Franke via Cygwin wrote:
> Corinna Vinschen via Cygwin wrote:
> > On Jun 25 16:59, Christian Franke via Cygwin wrote:
> > > On Sun, 15 Sep 2024 19:47:11 +0200, Christian Franke wrote:
> > > > If a file name contains an invalid (truncated) UTF-8 sequence, open()
> > > > does not refuse to create the file. Later readdir() returns a different
> > > > name which could not be used to access the file.
> > > > 
> > > > Testcase with U+1F321 (Thermometer):
> > > > 
> > > > $ uname -r
> > > > 3.5.4-1.x86_64
> > > > 
> > > > $ printf $'\U0001F321' | od -A none -t x1
> > > >   f0 9f 8c a1
> > > > 
> > > > $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
> > > > 
> > > > $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
> > > > 
> > > > $ touch 'file3-'$'\xf0\x9f\x8c'
> > > > 
> > > > $ ls -1
> > > > ls: cannot access 'file2-.?ext': No such file or directory
> > > > ls: cannot access 'file3-': No such file or directory
> > > > 'file1-'$'\360\237\214\241''.ext'
> > > > file2-.?ext
> > > > file3-
> > > > [...]
> > I don't know exactly where this happens, but the input of the
> > conversion is invalid UTF-8 because it's missing the 4th byte.
> > There's no way to represent these filenames on Windows
> > filesystems storing filenames as UTF-16 values.
> > 
> > So the problem here is that the conversion somehow misses that
> > the 4th byte is invalid and just plods forward and converts the
> > leading three bytes into the matching high surrogate value and
> > then stumbles over the conversion for the low surrogate.
> > 
> > It would be really helpful to have an STC for this problem.
> 
> With some trial and error I found a testcase for this more serious problem
> reported yesterday but not quoted above:
> 
> > > In cases like file3-... above, the converted Windows path ends with
> > > 0xF000. This suggests that this is an accidental conversion of the
> > > terminating null to the 0xF0xx range.
> > > 
> > > In some cases, the created Windows file name has random garbage
> > > behind the 0xF000. Then even Cygwin is not able to access or unlink
> > > the file after creation.
> 
> Testcase (attached):

Thanks for the testcase!

I found the problem in the newlib core function creating wchar_t from
UTF-8 input.  In case of 4-byte UTF-8 sequences, the code created the
high surrogate already after reading byte 3, without checking whether
byte 4 of the UTF-8 sequence is a valid byte.  Hilarity ensues.
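
To illustrate the kind of check that was missing, here is a minimal
sketch in C.  It is not the newlib code and not the actual fix; the
function name utf8_4byte_to_utf16() and the helper is_cont() are
hypothetical, made up for this example.  The point is only that no
surrogate is written until all four bytes have been validated.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if c is a UTF-8 continuation byte (10xxxxxx). */
static int is_cont(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Hypothetical sketch, NOT newlib's implementation: decode one 4-byte
   UTF-8 sequence starting at s (n bytes available) into a UTF-16
   surrogate pair.  Returns 4 on success, -1 if the sequence is
   truncated or invalid.  out[] is only written on success. */
int utf8_4byte_to_utf16(const unsigned char *s, size_t n, uint16_t out[2])
{
    uint32_t cp;

    /* The lead byte of a 4-byte sequence has the form 11110xxx. */
    if (n < 4 || (s[0] & 0xF8) != 0xF0)
        return -1;

    /* Validate ALL continuation bytes, including byte 4, before
       producing any output. */
    if (!is_cont(s[1]) || !is_cont(s[2]) || !is_cont(s[3]))
        return -1;

    cp = ((uint32_t)(s[0] & 0x07) << 18)
       | ((uint32_t)(s[1] & 0x3F) << 12)
       | ((uint32_t)(s[2] & 0x3F) << 6)
       |  (uint32_t)(s[3] & 0x3F);

    /* Reject overlong encodings and values beyond U+10FFFF. */
    if (cp < 0x10000 || cp > 0x10FFFF)
        return -1;

    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 4;
}
```

With the bytes f0 9f 8c a1 from the testcase this yields the pair
0xD83C 0xDF21.  With f0 9f 8c followed by '.' or by the terminating
null it returns -1 and leaves out[] untouched.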

Fortunately this bug has only been introduced very recently, to wit, on
2009-03-24, a mere 16 years ago.  And it is my bug and mine alone :}

I'm just prep'ing a fix which I'll push in a minute or two.


Thanks,
Corinna

