DMARC-Filter: OpenDMARC Filter v1.4.2 delorie.com 56OEA2EY1459146
Authentication-Results: delorie.com; dmarc=pass (p=none dis=none) header.from=cygwin.com
Authentication-Results: delorie.com; spf=pass smtp.mailfrom=cygwin.com
DKIM-Filter: OpenDKIM Filter v2.11.0 delorie.com 56OEA2EY1459146
Authentication-Results: delorie.com;
	dkim=pass (1024-bit key, unprotected) header.d=cygwin.com header.i=@cygwin.com header.a=rsa-sha256 header.s=default header.b=TM/b4jn4
X-Recipient: archive-cygwin@delorie.com
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 3186E3857B98
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cygwin.com;
	s=default; t=1753366201;
	bh=NvGLD2YxtOxP9Wu8S+uX77WTgGu3SsxYduEsoa3uUes=;
	h=Date:To:Subject:References:In-Reply-To:List-Id:List-Unsubscribe:
	 List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc:
	 From;
	b=TM/b4jn4BHjGA+fSdsMMAwLlU73b4q/5XVa+/2s1vWtzfxj5manh83e29Bfhm9cIc
	 7vnRpqhBNP2xNYQTk/oyKtDd2z6+ypJr3zuSgQ4ZQsZVUe2iQqY11DNVm8H99XQTKO
	 BmWVGPVVYhmVBuSaN9h5EwAnmhvRtVTiA7TOEpXY=
X-Original-To: cygwin@cygwin.com
Delivered-To: cygwin@cygwin.com
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 385283857B9B
Date: Thu, 24 Jul 2025 16:08:49 +0200
To: Thomas Wolff <towo@towo.net>,
        Christian Franke <Christian.Franke@t-online.de>
Subject: Re: readdir() returns inaccessible name if file was created with
 invalid UTF-8
Message-ID: <aII-cQ0BCgfk3PQm@calimero.vinschen.de>
Mail-Followup-To: Thomas Wolff <towo@towo.net>,
 Christian Franke <Christian.Franke@t-online.de>, cygwin@cygwin.com
References: <aF5y15iQ840LxLYJ@calimero.vinschen.de>
 <ca205dbd-907f-4552-9e5c-2cb0050f83a3@towo.net>
 <aH-MtwqARmDmLwoo@calimero.vinschen.de>
 <91f26856-72b0-483b-8d04-bd90a27b6be0@towo.net>
 <4ab2c1b7-3164-4556-ba36-29814ecf5766@towo.net>
 <68f65634-8f4e-436b-ba6a-d30bdf882aaa@towo.net>
 <aICVBQzWUiCYwnL2@calimero.vinschen.de>
 <11282182-60d1-4841-bf78-5ef78cf30060@towo.net>
 <aIILWiKsr99DOaI8@calimero.vinschen.de>
 <b0a32549-77da-4c0f-b118-79617800faea@towo.net>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <b0a32549-77da-4c0f-b118-79617800faea@towo.net>
X-BeenThere: cygwin@cygwin.com
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: General Cygwin discussions and problem reports <cygwin.cygwin.com>
List-Unsubscribe: <https://cygwin.com/mailman/options/cygwin>,
 <mailto:cygwin-request@cygwin.com?subject=unsubscribe>
List-Archive: <https://cygwin.com/pipermail/cygwin/>
List-Post: <mailto:cygwin@cygwin.com>
List-Help: <mailto:cygwin-request@cygwin.com?subject=help>
List-Subscribe: <https://cygwin.com/mailman/listinfo/cygwin>,
 <mailto:cygwin-request@cygwin.com?subject=subscribe>
From: Corinna Vinschen via Cygwin <cygwin@cygwin.com>
Reply-To: cygwin@cygwin.com
Cc: Corinna Vinschen <corinna-cygwin@cygwin.com>, cygwin@cygwin.com
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: cygwin-bounces~archive-cygwin=delorie.com@cygwin.com
Sender: "Cygwin" <cygwin-bounces~archive-cygwin=delorie.com@cygwin.com>

On Jul 24 15:41, Thomas Wolff via Cygwin wrote:
> Am 24.07.2025 um 12:30 schrieb Corinna Vinschen:
> > What does that mean?  Consider this UTF-8 input string:
> > 
> >    0xf0 0x90 0x80 0x2e
> > 
> >    mbstowcs:     returns -1
> >    sys_mbstowcs: f0f0 f090 f080 002e
> > 
> > Let's convert it back to multibyte:
> > 
> >    sys_wcstombs: 0xf0 0x90 0x80 0x2e
> >    wcstombs:     0xef 0x83 0xb0 0xef 0x82 0x90 0xef 0x82 0x80 0x2e
> > 
> > So while sys_wcstombs has special code converting the string back to its
> > original MB string, wcstombs converts to the CESU-8 representation.
> > 
> > This is transparent.  If we convert this CESU-8 string back to
> > wide-char, the resulting wide-char strings are the same:
> > 
> >    mbstowcs:     f0f0 f090 f080 002e
> >    sys_mbstowcs: f0f0 f090 f080 002e
> > 
> > So the question here is, shall we keep the special case converting
> > private use area bytes back to their original byte encoding?
> > 
> > Or shall we simply go along with CESU-8 when converting back to
> > multibyte, to keep the string the same as with wcstombs?
> > 
> > Exempt from this are the characters not valid in a DOS filename.
> > These will always be converted if we create wide-char filenames.
> Sounds like a fair solution with only minor glitches. Poor 4th byte but
> thanks a lot anyway.
> About the latter decision, if there's no strong bias otherwise, I'd prefer
> to drop special handling (but don't take my vote, I don't care so much about
> that).

Thanks for your input.

As another data point, we have to consider how sys_wcstombs is used.

wcstombs on a filename will be used by the application only, and only if
the filename is incoming application-level data or has been converted to
a wide-char string by the application itself.

sys_wcstombs will be used to generate a readable multibyte filename from
UTF-16 filenames read from the filesystem.  So its major use in terms of
filenames is by readdir().
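
To make the two candidate behaviours concrete, here is a minimal sketch of
the wide-char to multibyte step.  It is not the actual sys_wcstombs code.
The names wide_to_mb, utf8_encode_bmp and is_escaped_byte are made up for
illustration, and the assumption that an undecodable byte b is stored as
0xf000 | b, for bytes 0x80 and above, is only inferred from the
f0f0 f090 f080 values in the example quoted above.  Surrogate pairs are
ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: an undecodable input byte b is kept as the private use
   area code unit 0xf000 | b, as in the f0f0 f090 f080 example. */
#define PUA_BASE 0xF000u

/* Hypothetical helper: is this code unit a stand-in for a raw byte? */
static int is_escaped_byte(uint16_t wc)
{
    return wc >= PUA_BASE + 0x80 && wc <= PUA_BASE + 0xFF;
}

/* Encode one non-surrogate 16-bit code unit as UTF-8.
   Returns the number of bytes written (1 to 3). */
static size_t utf8_encode_bmp(uint16_t wc, unsigned char *out)
{
    if (wc < 0x80) {
        out[0] = (unsigned char)wc;
        return 1;
    }
    if (wc < 0x800) {
        out[0] = (unsigned char)(0xC0 | (wc >> 6));
        out[1] = (unsigned char)(0x80 | (wc & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (wc >> 12));
    out[1] = (unsigned char)(0x80 | ((wc >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (wc & 0x3F));
    return 3;
}

/* Convert n code units to multibyte.
   restore_bytes != 0: stand-in units become the original raw byte
                       (the special case under discussion).
   restore_bytes == 0: every unit is encoded as plain UTF-8, which is
                       what the wcstombs line in the example shows.
   Returns the number of bytes written to out. */
static size_t wide_to_mb(const uint16_t *in, size_t n,
                         unsigned char *out, size_t cap, int restore_bytes)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        assert(cap - len >= 3);   /* room for the longest encoding */
        if (restore_bytes && is_escaped_byte(in[i]))
            out[len++] = (unsigned char)(in[i] & 0xFF);
        else
            len += utf8_encode_bmp(in[i], out + len);
    }
    return len;
}
```

Feeding f0f0 f090 f080 002e through wide_to_mb with restore_bytes set
gives back the four original bytes.  With restore_bytes cleared it gives
the ten-byte sequence shown for wcstombs.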

Knowing that, the question boils down to this:

Do we want readdir() to return the same name as given to open(), or is
CESU-8 sufficient?


Corinna

-- 
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
