| Date: | Thu, 24 Jul 2025 16:08:49 +0200 |
| To: | Thomas Wolff <towo AT towo DOT net>, |
| Christian Franke <Christian DOT Franke AT t-online DOT de> | |
| Subject: | Re: readdir() returns inaccessible name if file was created with invalid UTF-8 |
| Message-ID: | <aII-cQ0BCgfk3PQm@calimero.vinschen.de> |
| Mail-Followup-To: | Thomas Wolff <towo AT towo DOT net>, |
| Christian Franke <Christian DOT Franke AT t-online DOT de>, cygwin AT cygwin DOT com | |
| References: | <aF5y15iQ840LxLYJ AT calimero DOT vinschen DOT de> |
| <ca205dbd-907f-4552-9e5c-2cb0050f83a3 AT towo DOT net> | |
| <aH-MtwqARmDmLwoo AT calimero DOT vinschen DOT de> | |
| <91f26856-72b0-483b-8d04-bd90a27b6be0 AT towo DOT net> | |
| <4ab2c1b7-3164-4556-ba36-29814ecf5766 AT towo DOT net> | |
| <68f65634-8f4e-436b-ba6a-d30bdf882aaa AT towo DOT net> | |
| <aICVBQzWUiCYwnL2 AT calimero DOT vinschen DOT de> | |
| <11282182-60d1-4841-bf78-5ef78cf30060 AT towo DOT net> | |
| <aIILWiKsr99DOaI8 AT calimero DOT vinschen DOT de> | |
| <b0a32549-77da-4c0f-b118-79617800faea AT towo DOT net> | |
| MIME-Version: | 1.0 |
| In-Reply-To: | <b0a32549-77da-4c0f-b118-79617800faea@towo.net> |
| From: | Corinna Vinschen via Cygwin <cygwin AT cygwin DOT com> |
| Reply-To: | cygwin AT cygwin DOT com |
| Cc: | Corinna Vinschen <corinna-cygwin AT cygwin DOT com>, cygwin AT cygwin DOT com |
| Errors-To: | cygwin-bounces~archive-cygwin=delorie DOT com AT cygwin DOT com |
| Sender: | "Cygwin" <cygwin-bounces~archive-cygwin=delorie DOT com AT cygwin DOT com> |
On Jul 24 15:41, Thomas Wolff via Cygwin wrote:
> On 24.07.2025 at 12:30, Corinna Vinschen wrote:
> > What does that mean?  Consider this UTF-8 input string:
> >
> >   0xf0 0x90 0x80 0x2e
> >
> >   mbstowcs:     returns -1
> >   sys_mbstowcs: f0f0 f090 f080 002e
> >
> > Let's convert it back to multibyte:
> >
> >   sys_wcstombs: 0xf0 0x90 0x80 0x2e
> >   wcstombs:     0xef 0x83 0xb0 0xef 0x82 0x90 0xef 0x82 0x80 0x2e
> >
> > So while sys_wcstombs has special code converting the string back to its
> > original MB string, wcstombs converts to the CESU-8 representation.
> >
> > This is transparent.  If we convert this CESU-8 string back to
> > wide-char, the resulting wide-char strings are the same:
> >
> >   mbstowcs:     f0f0 f090 f080 002e
> >   sys_mbstowcs: f0f0 f090 f080 002e
> >
> > So the question here is, shall we keep the special case converting
> > private use area bytes back to their original byte encoding?
> >
> > Or shall we simply go along with CESU-8 when converting back to
> > multibyte, to keep the string the same as with wcstombs?
> >
> > Exempt from this are the characters not valid in a DOS filename.
> > These will always be converted if we create wide-char filenames.
> Sounds like a fair solution with only minor glitches. Poor 4th byte but
> thanks a lot anyway.
> About the latter decision, if there's no strong bias otherwise, I'd prefer
> to drop special handling (but don't take my vote, I don't care so much about
> that).

Thanks for your input.  As another datapoint we have to consider how
sys_wcstombs is used.

wcstombs on a filename is used by the application only, and only if the
filename is incoming application-level data or has been converted to
wide char by the application itself.

sys_wcstombs is used to generate a readable multibyte filename from
the UTF-16 filenames read from the filesystem.  So its major use in
terms of filenames is by readdir().
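
For illustration, the round trip quoted above can be modelled roughly like this. It is only a sketch in Python, not the Cygwin code: the helper names to_wide, wide_to_utf8 and wide_to_original are made up, and the mapping (every byte that is not part of a valid UTF-8 sequence becomes 0xf000 | byte, and only 0xf080..0xf0ff is mapped back) is an assumption read off the hex values in the example.

```python
def to_wide(raw: bytes) -> list[int]:
    """Model of the sys_mbstowcs behaviour shown above (assumed, not Cygwin code):
    valid UTF-8 becomes UTF-16 units, any other byte becomes 0xF000 | byte."""
    units, i = [], 0
    while i < len(raw):
        for n in (1, 2, 3, 4):              # shortest valid sequence at offset i
            try:
                ch = raw[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                continue
            be = ch.encode("utf-16-be")     # one or two UTF-16 code units
            units += [int.from_bytes(be[k:k + 2], "big") for k in range(0, len(be), 2)]
            i += n
            break
        else:                               # no valid sequence starts here
            units.append(0xF000 | raw[i])
            i += 1
    return units


def wide_to_utf8(units: list[int]) -> bytes:
    """Plain conversion of well-formed UTF-16 units, matching the wcstombs line above."""
    text = b"".join(u.to_bytes(2, "big") for u in units).decode("utf-16-be")
    return text.encode("utf-8")


def wide_to_original(units: list[int]) -> bytes:
    """Special-cased conversion: 0xF080..0xF0FF turn back into the original byte."""
    out, pending = bytearray(), []
    for u in units:
        if 0xF080 <= u <= 0xF0FF:
            out += wide_to_utf8(pending)    # flush ordinary units first
            pending = []
            out.append(u & 0xFF)
        else:
            pending.append(u)
    out += wide_to_utf8(pending)
    return bytes(out)


raw = bytes([0xF0, 0x90, 0x80, 0x2E])       # truncated 4-byte sequence, then '.'
wide = to_wide(raw)
print(" ".join(f"{u:04x}" for u in wide))
print(wide_to_original(wide).hex(" "))      # the name as given to open()
print(wide_to_utf8(wide).hex(" "))          # the longer private-use-area form
```

In this model wide_to_original() gives back the bytes that were passed in, while wide_to_utf8() gives the ten-byte form; feeding either result through to_wide() again yields the same wide-char string.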

Knowing that, the question boils down to this:  Do we want readdir()
to return the same name as given to open(), or is CESU-8 sufficient?


Corinna