Date: Tue, 22 Jul 2025 15:05:59 +0200
To: Thomas Wolff <towo@towo.net>
Subject: Re: readdir() returns inaccessible name if file was created with
 invalid UTF-8
Message-ID: <aH-MtwqARmDmLwoo@calimero.vinschen.de>
Mail-Followup-To: Thomas Wolff <towo@towo.net>, cygwin@cygwin.com
References: <96f2253b-791b-b8a0-97dd-8d257eefb9b1@t-online.de>
 <03c4fae7-7322-572c-ae72-52e300f0b438@t-online.de>
 <aFxRfI4NdZ8y5IlK@calimero.vinschen.de>
 <f78c615c-aefe-b3d0-aada-5f9d0cf73a0a@t-online.de>
 <aF5y15iQ840LxLYJ@calimero.vinschen.de>
 <ca205dbd-907f-4552-9e5c-2cb0050f83a3@towo.net>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <ca205dbd-907f-4552-9e5c-2cb0050f83a3@towo.net>
X-BeenThere: cygwin@cygwin.com
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: General Cygwin discussions and problem reports <cygwin.cygwin.com>
List-Unsubscribe: <https://cygwin.com/mailman/options/cygwin>,
 <mailto:cygwin-request@cygwin.com?subject=unsubscribe>
List-Archive: <https://cygwin.com/pipermail/cygwin/>
List-Post: <mailto:cygwin@cygwin.com>
List-Help: <mailto:cygwin-request@cygwin.com?subject=help>
List-Subscribe: <https://cygwin.com/mailman/listinfo/cygwin>,
 <mailto:cygwin-request@cygwin.com?subject=subscribe>
From: Corinna Vinschen via Cygwin <cygwin@cygwin.com>
Reply-To: cygwin@cygwin.com
Cc: Corinna Vinschen <corinna-cygwin@cygwin.com>, cygwin@cygwin.com
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: cygwin-bounces~archive-cygwin=delorie.com@cygwin.com
Sender: "Cygwin" <cygwin-bounces~archive-cygwin=delorie.com@cygwin.com>

On Jul 22 05:38, Thomas Wolff via Cygwin wrote:
> Am 27.06.2025 um 12:30 schrieb Corinna Vinschen via Cygwin:
> > On Jun 26 19:07, Christian Franke via Cygwin wrote:
> > > With some trial and error I found a testcase for this more serious problem
> > > reported yesterday but not quoted above:
> > > 
> > > > > In cases like file3-... above, the converted Windows path ends with
> > > > > 0xF000. This suggests that this is an accidental conversion of the
> > > > > terminating null to the 0xF0xx range.
> > > > > 
> > > > > In some cases, the created Windows file name has random garbage
> > > > > behind the 0xF000. Then even Cygwin is not able to access or unlink
> > > > > the file after creation.
> > > Testcase (attached):
> > Thanks for the testcase!
> > 
> > I found the problem in the newlib core function creating wchar_t from
> > UTF-8 input.  In case of 4 byte UTF-8 sequences, the code created the
> > low surrogate already after reading byte 3, without checking if byte 4
> > of the UTF-8 sequence is a valid byte. Hilarity ensues.
> I'm afraid the fix may have broken mbrtowc as I just reported to the list,
> with a test case, thus also breaking mintty.
> The low surrogate MUST be created after byte 3 because otherwise the high
> surrogate cannot be delivered after byte 4 as it needs to.
> I think it's a drawback of UTF-16 that must be swallowed, even if some
> incorrect sequences slip through somehow.

Bummer.  What bugs me most is that you might be right here.  It's a bit
late now, but we should have defined wchar_t as a 4-byte type back when
we worked on Cygwin 1.7.0... sigh.

mbrtowc() is inherently a bad idea when it comes to UTF-16.  It's a
function which only really works correctly for the Unicode Basic
Multilingual Plane, or if wchar_t is big enough to hold any code point.
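
To illustrate why a one-wchar_t-per-call interface is awkward here, this is
a minimal sketch of converting one 4-byte UTF-8 sequence into a UTF-16
surrogate pair.  It is NOT the newlib code; the helper name
utf8_4byte_to_utf16 is made up for this mail.  It validates all four bytes
before producing either surrogate, which is exactly what a function that
has to hand back one wchar_t before seeing byte 4 cannot do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not newlib's code: decode one 4-byte UTF-8
   sequence into a UTF-16 surrogate pair.  Returns 0 on success and
   -1 if any byte is invalid; nothing is written to out on failure. */
int utf8_4byte_to_utf16(const unsigned char s[4], uint16_t out[2])
{
    assert(s != NULL && out != NULL);

    /* Lead byte must be 11110xxx; F5..F7 would exceed U+10FFFF. */
    if (s[0] < 0xF0 || s[0] > 0xF4)
        return -1;
    /* All three continuation bytes, including byte 4, must be 10xxxxxx. */
    for (int i = 1; i < 4; i++)
        if ((s[i] & 0xC0) != 0x80)
            return -1;

    uint32_t cp = ((uint32_t)(s[0] & 0x07) << 18)
                | ((uint32_t)(s[1] & 0x3F) << 12)
                | ((uint32_t)(s[2] & 0x3F) << 6)
                |  (uint32_t)(s[3] & 0x3F);
    /* Reject overlong encodings and values beyond U+10FFFF. */
    if (cp < 0x10000 || cp > 0x10FFFF)
        return -1;

    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 0;
}
```

For example, the bytes F0 9F 98 80 (U+1F600) give the pair D83D DE00,
while F0 9F 98 41 is rejected without writing anything to the output.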

That's the reason we avoid mbrtowc() where possible.  It's better to call
mbstowcs() or friends and allow at least 3 wchar_t's in the output buffer.
Any chance you could change that in mintty?
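
Roughly what I have in mind, as a sketch only.  The wrapper name
convert_whole is invented for this mail, and I'm assuming the 3 slots are
used as a surrogate pair plus the terminating null.  How mintty would
actually feed its input into this is a separate question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical wrapper: convert a complete multibyte string in one go
   instead of pulling single wchar_t values out of mbrtowc().
   buflen is the size of buf in wchar_t units, e.g. 3 for one character
   that may need a surrogate pair plus the terminator.
   Returns the number of wchar_t stored (terminator not counted), or
   (size_t)-1 if the input is not valid in the current locale. */
size_t convert_whole(const char *mbs, wchar_t *buf, size_t buflen)
{
    assert(mbs != NULL && buf != NULL && buflen > 0);

    /* Leave one slot free: mbstowcs() does not terminate the output
       when it fills all the slots it was given. */
    size_t n = mbstowcs(buf, mbs, buflen - 1);
    if (n == (size_t)-1)
        return n;
    buf[n] = L'\0';
    return n;
}
```

The caller is expected to have selected a UTF-8 locale with setlocale()
beforehand.  The conversion result depends on the active locale and on
the width of wchar_t on the platform.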


Corinna

-- 
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
