Date: Thu, 24 Jul 2025 16:08:49 +0200
From: Corinna Vinschen via Cygwin
Reply-To: cygwin AT cygwin DOT com
To: Thomas Wolff, Christian Franke
Subject: Re: readdir() returns inaccessible name if file was created with invalid UTF-8
List-Id: General Cygwin discussions and problem reports

On Jul 24 15:41, Thomas Wolff via Cygwin wrote:
> On 24.07.2025 at 12:30, Corinna Vinschen wrote:
> > What does that mean?  Consider this UTF-8 input string:
> >
> >   0xf0 0x90 0x80 0x2e
> >
> >   mbstowcs:     returns -1
> >   sys_mbstowcs: f0f0 f090 f080 002e
> >
> > Let's convert it back to multibyte:
> >
> >   sys_wcstombs: 0xf0 0x90 0x80 0x2e
> >   wcstombs:     0xef 0x83 0xb0 0xef 0x82 0x90 0xef 0x82 0x80 0x2e
> >
> > So while sys_wcstombs has special code converting the string back to its
> > original MB string, wcstombs converts to the CESU-8 representation.
> >
> > This is transparent.  If we convert this CESU-8 string back to
> > wide-char, the resulting wide-char strings are the same:
> >
> >   mbstowcs:     f0f0 f090 f080 002e
> >   sys_mbstowcs: f0f0 f090 f080 002e
> >
> > So the question here is, shall we keep the special case converting
> > private use area bytes back to their original byte encoding?
> >
> > Or shall we simply go along with CESU-8 when converting back to
> > multibyte to keep the string the same as with wcstombs?
> >
> > Exempt from this are the characters not valid in a DOS filename.
> > These will always be converted if we create wide-char filenames.
> Sounds like a fair solution with only minor glitches. Poor 4th byte but
> thanks a lot anyway.
> About the latter decision, if there's no strong bias otherwise, I'd prefer
> to drop special handling (but don't take my vote, I don't care so much about
> that).

Thanks for your input.

As another datapoint we have to consider how sys_wcstombs is used.

wcstombs on a filename will be used by the application only, and only
if the filename is incoming application level data or has been converted
to a wide-char string by the application itself.

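To make the round trip quoted above concrete, here is a minimal sketch in
portable C.  It is not the actual sys_mbstowcs/sys_wcstombs code: the names
decode_one, sketch_to_wide and sketch_to_multibyte are hypothetical, and
restoring only the range 0xf080..0xf0ff is an assumption of this sketch.
It ignores locales and the DOS-invalid characters mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one well-formed UTF-8 sequence at s (n bytes
   available).  Returns its length and stores the code point in *cp, or
   returns 0 if the bytes at s do not form a well-formed sequence. */
static size_t decode_one(const unsigned char *s, size_t n, uint32_t *cp)
{
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;

    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xe0) == 0xc0) { len = 2; *cp = s[0] & 0x1f; }
    else if ((s[0] & 0xf0) == 0xe0) { len = 3; *cp = s[0] & 0x0f; }
    else if ((s[0] & 0xf8) == 0xf0) { len = 4; *cp = s[0] & 0x07; }
    else return 0;                  /* stray continuation or invalid lead */
    if (len > n) return 0;          /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80) return 0;
        *cp = (*cp << 6) | (s[i] & 0x3f);
    }
    /* Reject overlong forms, out-of-range values and encoded surrogates. */
    if (*cp < min_cp[len] || *cp > 0x10ffff || (*cp >= 0xd800 && *cp <= 0xdfff))
        return 0;
    return len;
}

/* Multibyte -> UTF-16.  Every byte that is not part of a well-formed
   sequence becomes the private use area value 0xf000 | byte.
   dst must have room for n units.  Returns the number of units written. */
size_t sketch_to_wide(const unsigned char *src, size_t n, uint16_t *dst)
{
    size_t i = 0, out = 0;

    assert(src != NULL && dst != NULL);
    while (i < n) {
        uint32_t cp;
        size_t len = decode_one(src + i, n - i, &cp);

        if (len == 0) {             /* invalid byte: map to 0xf0xx */
            dst[out++] = (uint16_t)(0xf000 | src[i]);
            i += 1;
        } else if (cp >= 0x10000) { /* needs a surrogate pair */
            cp -= 0x10000;
            dst[out++] = (uint16_t)(0xd800 | (cp >> 10));
            dst[out++] = (uint16_t)(0xdc00 | (cp & 0x3ff));
            i += len;
        } else {
            dst[out++] = (uint16_t)cp;
            i += len;
        }
    }
    return out;
}

/* UTF-16 -> multibyte.  With restore != 0, units in 0xf080..0xf0ff are
   written back as the single original byte; with restore == 0 every unit
   is encoded as ordinary UTF-8.  Unpaired surrogates are encoded as three
   bytes without validation.  dst must have room for 3 * n bytes. */
size_t sketch_to_multibyte(const uint16_t *src, size_t n,
                           unsigned char *dst, int restore)
{
    size_t i = 0, out = 0;

    assert(src != NULL && dst != NULL);
    while (i < n) {
        uint32_t cp = src[i++];

        if (restore && cp >= 0xf080 && cp <= 0xf0ff) {
            dst[out++] = (unsigned char)(cp & 0xff);
            continue;
        }
        if (cp >= 0xd800 && cp <= 0xdbff && i < n
            && src[i] >= 0xdc00 && src[i] <= 0xdfff)
            cp = 0x10000 + ((cp - 0xd800) << 10) + (src[i++] - 0xdc00);
        if (cp < 0x80) {
            dst[out++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            dst[out++] = (unsigned char)(0xc0 | (cp >> 6));
            dst[out++] = (unsigned char)(0x80 | (cp & 0x3f));
        } else if (cp < 0x10000) {
            dst[out++] = (unsigned char)(0xe0 | (cp >> 12));
            dst[out++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
            dst[out++] = (unsigned char)(0x80 | (cp & 0x3f));
        } else {
            dst[out++] = (unsigned char)(0xf0 | (cp >> 18));
            dst[out++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
            dst[out++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
            dst[out++] = (unsigned char)(0x80 | (cp & 0x3f));
        }
    }
    return out;
}
```

Feeding the quoted input 0xf0 0x90 0x80 0x2e through sketch_to_wide and
then through sketch_to_multibyte with restore set gives the original four
bytes back, while restore cleared gives the ten-byte form, which in turn
converts to the same four wide-char units.
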
sys_wcstombs will be used to generate a readable multi-byte filename
from UTF-16 filenames read from the filesystem.  So its major use in
terms of filenames is by readdir().

Knowing that, the question boils down to this:

Do we want readdir() returning the same name as given to open(), or
is CESU-8 sufficient?


Corinna