Date: Thu, 24 Jul 2025 16:08:49 +0200
To: Thomas Wolff <towo AT towo DOT net>,
    Christian Franke <Christian DOT Franke AT t-online DOT de>
Subject: Re: readdir() returns inaccessible name if file was created with invalid UTF-8
Message-ID: <aII-cQ0BCgfk3PQm@calimero.vinschen.de>
Mail-Followup-To: Thomas Wolff <towo AT towo DOT net>,
    Christian Franke <Christian DOT Franke AT t-online DOT de>, cygwin AT cygwin DOT com
References: <aF5y15iQ840LxLYJ AT calimero DOT vinschen DOT de>
    <ca205dbd-907f-4552-9e5c-2cb0050f83a3 AT towo DOT net>
    <aH-MtwqARmDmLwoo AT calimero DOT vinschen DOT de>
    <91f26856-72b0-483b-8d04-bd90a27b6be0 AT towo DOT net>
    <4ab2c1b7-3164-4556-ba36-29814ecf5766 AT towo DOT net>
    <68f65634-8f4e-436b-ba6a-d30bdf882aaa AT towo DOT net>
    <aICVBQzWUiCYwnL2 AT calimero DOT vinschen DOT de>
    <11282182-60d1-4841-bf78-5ef78cf30060 AT towo DOT net>
    <aIILWiKsr99DOaI8 AT calimero DOT vinschen DOT de>
    <b0a32549-77da-4c0f-b118-79617800faea AT towo DOT net>
MIME-Version: 1.0
In-Reply-To: <b0a32549-77da-4c0f-b118-79617800faea@towo.net>
From: Corinna Vinschen via Cygwin <cygwin AT cygwin DOT com>
Reply-To: cygwin AT cygwin DOT com
Cc: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>, cygwin AT cygwin DOT com

On Jul 24 15:41, Thomas Wolff via Cygwin wrote:
> On 24.07.2025 at 12:30, Corinna Vinschen wrote:
> > What does that mean?  Consider this UTF8 input string:
> >
> >   0xf0 0x90 0x80 0x2e
> >
> >   mbstowcs:     returns -1
> >   sys_mbstowcs: f0f0 f090 f080 002e
> >
> > Let's convert it back to multibyte:
> >
> >   sys_wcstombs: 0xf0 0x90 0x80 0x2e
> >   wcstombs:     0xef 0x83 0xb0 0xef 0x82 0x90 0xef 0x82 0x80 0x2e
> >
> > So while sys_wcstombs has special code converting the string back to its
> > original MB string, wcstombs converts to the CESU-8 representation.
> >
> > This is transparent.  If we convert this CESU-8 string back to
> > wide-char, the resulting wide-char strings are the same:
> >
> >   mbstowcs:     f0f0 f090 f080 002e
> >   sys_mbstowcs: f0f0 f090 f080 002e
> >
> > So the question here is, shall we keep the special case converting
> > private use area bytes back to their original byte encoding?
> >
> > Or shall we simply go along with CESU-8 when converting back to multibyte
> > to keep the string the same as with wcstombs?
> >
> > Exempt from this are the characters not valid in a DOS filename.
> > These will always be converted if we create wide-char filenames.
> Sounds like a fair solution with only minor glitches. Poor 4th byte but
> thanks a lot anyway.
> About the latter decision, if there's no strong bias otherwise, I'd prefer
> to drop special handling (but don't take my vote, I don't care so much about
> that).

Thanks for your input.  As another datapoint we have to consider how
sys_wcstombs is used.

wcstombs on a filename will be used by the application only, and only if
the filename is incoming application level data or has been converted to
a wide char by the application itself.

sys_wcstombs will be used to generate a readable multi-byte filename
from UTF-16 filenames read from the filesystem.  So its major use in
terms of filenames is by readdir().
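
For readers who want to see the two behaviours side by side, here is a
minimal C sketch of the wide-char to multibyte step from the quoted example.
It is not Cygwin's implementation: utf8_encode_bmp and units_to_bytes are
hypothetical names, and treating every code unit in 0xf000..0xf0ff as "an
original byte parked in the private use area" is an assumption made purely
for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; this is NOT Cygwin's code. */

/* Encode one BMP code unit as UTF-8 (surrogates are not handled).
   Returns the number of bytes written (1 to 3). */
size_t utf8_encode_bmp(uint16_t wc, unsigned char *out)
{
    if (wc < 0x80) {
        out[0] = (unsigned char)wc;
        return 1;
    }
    if (wc < 0x800) {
        out[0] = (unsigned char)(0xc0 | (wc >> 6));
        out[1] = (unsigned char)(0x80 | (wc & 0x3f));
        return 2;
    }
    out[0] = (unsigned char)(0xe0 | (wc >> 12));
    out[1] = (unsigned char)(0x80 | ((wc >> 6) & 0x3f));
    out[2] = (unsigned char)(0x80 | (wc & 0x3f));
    return 3;
}

/* Convert n UTF-16 code units to bytes and return the byte count.
   restore_pua != 0: units in 0xf000..0xf0ff (assumed range) are written
                     back as the single byte in their low 8 bits, like the
                     sys_wcstombs special case described in the thread.
   restore_pua == 0: every unit is plainly UTF-8 encoded, like wcstombs. */
size_t units_to_bytes(const uint16_t *in, size_t n,
                      unsigned char *out, size_t outcap, int restore_pua)
{
    size_t len = 0;

    assert(outcap >= 3 * n);   /* worst case: 3 bytes per code unit */
    for (size_t i = 0; i < n; i++) {
        if (restore_pua && (in[i] & 0xff00) == 0xf000)
            out[len++] = (unsigned char)(in[i] & 0xff);
        else
            len += utf8_encode_bmp(in[i], out + len);
    }
    return len;
}
```

Called on the units f0f0 f090 f080 002e from the example, the sketch gives
back the 4 original bytes f0 90 80 2e when restore_pua is set, and the 10
bytes ef 83 b0 ef 82 90 ef 82 80 2e quoted for wcstombs when it is cleared.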

Knowing that, the question boils down to this: Do we want readdir()
returning the same name as given to open(), or is CESU-8 sufficient?

Corinna

--
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple