From: Corinna Vinschen via Cygwin
To: Thomas Wolff
Cc: Corinna Vinschen, cygwin AT cygwin DOT com
Reply-To: cygwin AT cygwin DOT com
Date: Thu, 24 Jul 2025 17:35:06 +0200
Subject: Re: readdir() returns inaccessible name if file was created with invalid UTF-8
List-Id: General Cygwin discussions and problem reports

Thomas,

On Jul 23 05:44, Thomas Wolff via Cygwin wrote:
> > > On 22.07.2025 at 15:05, Corinna Vinschen wrote:
> > > > mbrtowc() is inherently a bad idea when it comes to UTF-16.  It's a
> > > > function which only works really correctly for the unicode base plane,
> > > > or if wchar_t is big enough.
> > > >
> > > > It's the reason we don't use mbrtowc() if possible.  It's better
> > > > to call mbstowcs() or friends and allow at least 3 chars in the
> > > > wchar_t buffer.
> > > > You can't change that in mintty by any chance?
> > [...]
> OK, suppose I'd consider to switch to mbs[[n]r]towcs, collecting bytes until
> the function gives me a result.
> This would work fine as long as I receive only valid sequences. But look at
> input string test case
> char nonbmp[] = {0xF8, 0x88, 0x8A, 0xAF, 0x2D, 0}; // an invalid sequence
> followed by a valid char
> The functions only return -1 and (in the case of mbsnrtowcs) do not advance
> the input pointer.
> So how am I supposed to recognize that the invalid sequence has ended and a
> valid character has arrived?

Apart from that, you probably still have a problem in mintty: GB18030.

The problem with GB18030 is that you need all four bytes to generate
the high surrogate.  Consider the following GB18030 string:

  0x90 0x30 0x81 0x30

This string translates into a UTF-16 surrogate pair: 0xd800 0xdc00.

If you run a tweaked version of your test application from
https://cygwin.com/pipermail/cygwin/2025-July/258513.html:

  setlocale (LC_CTYPE, "zh_CN.gb18030");
  mb (0x90);
  mb (0x30);
  mb (0x81);
  mb (0x30);

Then the output is:

  90 -> 0000 : -2
  30 -> 0000 : -2
  81 -> 0000 : -2
  30 -> D800 : 0

However, if you notice this situation...
  if (ret_from_mbrtowc == 0 && codeset == gb18030 && (wc & 0xfc00) == 0xd800)

...you can just add a fake NUL byte:

  mbrtowc (&wc, "", 1, &mbstate);

If you do that, the above sequence becomes:

  90 -> 0000 : -2
  30 -> 0000 : -2
  81 -> 0000 : -2
  30 -> D800 : 0
  00 -> DC00 : 1

I hope this helps, if you didn't already handle GB18030 differently
in mintty.


Corinna
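
For illustration only, the fake-NUL workaround described above could be wrapped into a byte-at-a-time helper along the following lines. This is a rough sketch, not code from mintty or Cygwin: the names is_high_surrogate() and feed_byte() and the is_gb18030 flag are made up for this example, and how the caller detects the GB18030 codeset is left open. It assumes mbrtowc() behaves as shown in the output above, i.e. returns 0 together with the high surrogate on the fourth byte.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Equivalent to (wc & 0xfc00) == 0xd800 when wchar_t is 16 bits wide;
   the range test also stays correct where wchar_t is wider. */
static int is_high_surrogate(wchar_t wc)
{
    return wc >= 0xD800 && wc <= 0xDBFF;
}

/* Hypothetical helper: feed one input byte to mbrtowc().
   Returns the number of wchar_t units stored in out[] (0, 1 or 2),
   or -1 if the byte made the sequence invalid. */
static int feed_byte(unsigned char byte, int is_gb18030,
                     mbstate_t *st, wchar_t out[2])
{
    char c = (char)byte;
    wchar_t wc = 0;
    size_t ret;

    assert(st != NULL && out != NULL);
    ret = mbrtowc(&wc, &c, 1, st);

    if (ret == (size_t)-1) {
        /* Invalid sequence: reset the conversion state so that the
           next byte starts a fresh character. */
        memset(st, 0, sizeof *st);
        return -1;
    }
    if (ret == (size_t)-2)
        return 0;               /* incomplete, more bytes are needed */

    out[0] = wc;
    if (ret == 0 && is_gb18030 && is_high_surrogate(wc)) {
        /* Assumed behavior as described in this message: the low
           surrogate is still pending in the state, so feed a fake
           NUL byte to fetch it. */
        wchar_t lo = 0;
        mbrtowc(&lo, "", 1, st);
        out[1] = lo;
        return 2;
    }
    return 1;
}
```

Going by the output quoted above, feeding 0x90 0x30 0x81 0x30 with is_gb18030 set under zh_CN.gb18030 would be expected to return 0 three times and then 2, with the surrogate pair in out[]. On systems where wchar_t holds a full code point the extra branch should never trigger, so the helper degrades to a plain mbrtowc() wrapper.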