Date: Fri, 27 Jun 2025 12:30:47 +0200
To: cygwin@cygwin.com
Subject: Re: readdir() returns inaccessible name if file was created with
 invalid UTF-8
Message-ID: <aF5y15iQ840LxLYJ@calimero.vinschen.de>
From: Corinna Vinschen via Cygwin <cygwin@cygwin.com>

Hi Christian,

On Jun 26 19:07, Christian Franke via Cygwin wrote:
> Corinna Vinschen via Cygwin wrote:
> > On Jun 25 16:59, Christian Franke via Cygwin wrote:
> > > On Sun, 15 Sep 2024 19:47:11 +0200, Christian Franke wrote:
> > > > If a file name contains an invalid (truncated) UTF-8 sequence, open()
> > > > does not refuse to create the file. Later readdir() returns a different
> > > > name which could not be used to access the file.
> > > > 
> > > > Testcase with U+1F321 (Thermometer):
> > > > 
> > > > $ uname -r
> > > > 3.5.4-1.x86_64
> > > > 
> > > > $ printf $'\U0001F321' | od -A none -t x1
> > > >   f0 9f 8c a1
> > > > 
> > > > $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
> > > > 
> > > > $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
> > > > 
> > > > $ touch 'file3-'$'\xf0\x9f\x8c'
> > > > 
> > > > $ ls -1
> > > > ls: cannot access 'file2-.?ext': No such file or directory
> > > > ls: cannot access 'file3-': No such file or directory
> > > > 'file1-'$'\360\237\214\241''.ext'
> > > > file2-.?ext
> > > > file3-
> > > > [...]
> > I don't know exactly where this happens, but the input of the
> > conversion is invalid UTF-8 because it's missing the 4th byte.
> > There's no way to represent these filenames on Windows
> > filesystems storing filenames as UTF-16 values.
> > 
> > So the problem here is that the conversion somehow misses that
> > the 4th byte is invalid and just plods forward and converts the
> > leading three bytes into the matching high surrogate value and
> > then stumbles over the conversion for the low surrogate.
> > 
> > It would be really helpful to have an STC for this problem.
> 
> With some trial and error I found a testcase for this more serious problem
> reported yesterday but not quoted above:
> 
> > > In cases like file3-... above, the converted Windows path ends with
> > > 0xF000. This suggests that this is an accidental conversion of the
> > > terminating null to the 0xF0xx range.
> > > 
> > > In some cases, the created Windows file name has random garbage
> > > behind the 0xF000. Then even Cygwin is not able to access or unlink
> > > the file after creation.
> 
> Testcase (attached):

Thanks for the testcase!

I found the problem in the newlib core function creating wchar_t from
UTF-8 input.  In case of 4-byte UTF-8 sequences, the code created the
high surrogate already after reading byte 3, without checking if byte 4
of the UTF-8 sequence is a valid byte. Hilarity ensues.
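
To illustrate the idea only (this is a minimal sketch, not the newlib
code and not the actual fix): the hypothetical helper below,
utf8_4byte_to_utf16, converts one 4-byte UTF-8 sequence into a UTF-16
surrogate pair and validates every continuation byte, including
byte 4, before it produces any surrogate.  The function name, its
signature and the all-at-once interface are assumptions made for this
example; the real converter works incrementally and keeps state
between calls.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT newlib's implementation: decode one 4-byte
   UTF-8 sequence at s (n bytes available) into a UTF-16 surrogate pair.
   Returns 4 on success, -1 if the sequence is truncated or invalid.
   On failure, out[] is left untouched. */
static int
utf8_4byte_to_utf16 (const unsigned char *s, size_t n, uint16_t out[2])
{
  /* A 4-byte sequence starts with a lead byte of the form 11110xxx. */
  if (n < 4 || (s[0] & 0xf8) != 0xf0)
    return -1;

  /* Check ALL continuation bytes (10xxxxxx), byte 4 included, before
     any surrogate is generated. */
  for (int i = 1; i < 4; i++)
    if ((s[i] & 0xc0) != 0x80)
      return -1;

  uint32_t cp = ((uint32_t) (s[0] & 0x07) << 18)
		| ((uint32_t) (s[1] & 0x3f) << 12)
		| ((uint32_t) (s[2] & 0x3f) << 6)
		| (uint32_t) (s[3] & 0x3f);

  /* Reject overlong encodings and values beyond U+10FFFF. */
  if (cp < 0x10000 || cp > 0x10ffff)
    return -1;

  cp -= 0x10000;
  out[0] = (uint16_t) (0xd800 | (cp >> 10));	/* high surrogate */
  out[1] = (uint16_t) (0xdc00 | (cp & 0x3ff));	/* low surrogate */
  return 4;
}
```

Fed with f0 9f 8c a1 (U+1F321 from the testcase above), this yields
the pair 0xD83C 0xDF21.  Fed with f0 9f 8c followed by '.', as in the
file2 name, it returns -1 and emits nothing, so no lone high surrogate
ends up in the converted name.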

Fortunately this bug has only been introduced very recently, to wit, on
2009-03-24, a mere 16 years ago.  And it is my bug and mine alone :}

I'm just prep'ing a fix which I'll push in a minute or two.


Thanks,
Corinna
