Mail Archives: cygwin/2025/07/24/10:10:02

Date: Thu, 24 Jul 2025 16:08:49 +0200
To: Thomas Wolff <towo AT towo DOT net>,
Christian Franke <Christian DOT Franke AT t-online DOT de>
Subject: Re: readdir() returns inaccessible name if file was created with
invalid UTF-8
Message-ID: <aII-cQ0BCgfk3PQm@calimero.vinschen.de>
Mail-Followup-To: Thomas Wolff <towo AT towo DOT net>,
Christian Franke <Christian DOT Franke AT t-online DOT de>, cygwin AT cygwin DOT com
References: <aF5y15iQ840LxLYJ AT calimero DOT vinschen DOT de>
<ca205dbd-907f-4552-9e5c-2cb0050f83a3 AT towo DOT net>
<aH-MtwqARmDmLwoo AT calimero DOT vinschen DOT de>
<91f26856-72b0-483b-8d04-bd90a27b6be0 AT towo DOT net>
<4ab2c1b7-3164-4556-ba36-29814ecf5766 AT towo DOT net>
<68f65634-8f4e-436b-ba6a-d30bdf882aaa AT towo DOT net>
<aICVBQzWUiCYwnL2 AT calimero DOT vinschen DOT de>
<11282182-60d1-4841-bf78-5ef78cf30060 AT towo DOT net>
<aIILWiKsr99DOaI8 AT calimero DOT vinschen DOT de>
<b0a32549-77da-4c0f-b118-79617800faea AT towo DOT net>
MIME-Version: 1.0
In-Reply-To: <b0a32549-77da-4c0f-b118-79617800faea@towo.net>
From: Corinna Vinschen via Cygwin <cygwin AT cygwin DOT com>
Reply-To: cygwin AT cygwin DOT com
Cc: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>, cygwin AT cygwin DOT com

On Jul 24 15:41, Thomas Wolff via Cygwin wrote:
> Am 24.07.2025 um 12:30 schrieb Corinna Vinschen:
> > What does that mean?  Consider this UTF8 input string:
> > 
> >    0xf0 0x90 0x80 0x2e
> > 
> >    mbstowcs:     returns -1
> >    sys_mbstowcs: f0f0 f090 f080 002e
> > 
> > Let's convert it back to multibyte:
> > 
> >    sys_wcstombs: 0xf0 0x90 0x80 0x2e
> >    wcstombs:     0xef 0x83 0xb0 0xef 0x82 0x90 0xef 0x82 0x80 0x2e
> > 
> > So while sys_wcstombs has special code converting the string back to its
> > original MB string, wcstombs converts to the CESU-8 representation.
> > 
> > This is transparent.  If we convert this CESU-8 string back to
> > wide-char, the resulting wide-char strings are the same:
> > 
> >    mbstowcs:     f0f0 f090 f080 002e
> >    sys_mbstowcs: f0f0 f090 f080 002e
> > 
> > So the question here is, shall we keep the special case converting
> > private use area bytes back to their original byte encoding?
> > 
> > Or shall we simply go along with CESU-8 when converting back to multibyte
> > to keep the string the same as with wcstombs?
> > 
> > Exempt from this are the characters not valid in a DOS filename.
> > These will always be converted if we create wide-char filenames.
> Sounds like a fair solution with only minor glitches. Poor 4th byte but
> thanks a lot anyway.
> About the latter decision, if there's no strong bias otherwise, I'd prefer
> to drop special handling (but don't take my vote, I don't care so much about
> that).

Thanks for your input.

As another datapoint we have to consider how sys_wcstombs is used.

wcstombs on a filename will be used by the application only, and only if
the filename is incoming application level data or has been converted to a
wide char by the application itself.

sys_wcstombs will be used to generate a readable multi-byte filename from
UTF-16 filenames read from the filesystem.  So its major use in terms of
filenames is by readdir().

Knowing that, the question boils down to this:

Do we want readdir() to return the same name as given to open(), or is
CESU-8 sufficient?
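
To make the two options concrete, here is a minimal Python sketch of the
round trip from the quoted example.  It is an illustration only, not
Cygwin's code: the helper names (to_wide, to_mb_plain, to_mb_roundtrip) are
hypothetical, and the mapping "invalid byte b becomes U+F000 + b" is an
assumption read off the example values above.

```python
import codecs

# Assumption taken from the example in this thread: each byte of an invalid
# UTF-8 sequence is stored as the private use character U+F000 + byte.
PUA_BASE = 0xF000


def _bytes_to_pua(exc):
    """Decode error handler: map every offending byte into the private use area."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("pua_escape", _bytes_to_pua)


def to_wide(raw: bytes) -> str:
    """Hypothetical stand-in for the multibyte-to-wide-char direction."""
    return raw.decode("utf-8", errors="pua_escape")


def to_mb_plain(wide: str) -> bytes:
    """Plain encoding: private use characters are encoded like any other."""
    return wide.encode("utf-8")


def to_mb_roundtrip(wide: str) -> bytes:
    """Special case: turn U+F000..U+F0FF back into the original single byte."""
    out = bytearray()
    for ch in wide:
        cp = ord(ch)
        if PUA_BASE <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


name = bytes([0xF0, 0x90, 0x80, 0x2E])       # the invalid input from the example
wide = to_wide(name)
print([f"{ord(c):04x}" for c in wide])       # f0f0 f090 f080 002e
print(to_mb_roundtrip(wide).hex(" "))        # f0 90 80 2e (same as the input)
print(to_mb_plain(wide).hex(" "))            # ef 83 b0 ef 82 90 ef 82 80 2e
```

With the special case, the name handed back equals the bytes originally
passed in.  Without it, the name is a different but valid byte string
that converts to the same wide-char string again.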


Corinna

