Date: Thu, 24 Jul 2025 20:38:18 +0200
From: Corinna Vinschen via Cygwin
To: cygwin AT cygwin DOT com
Subject: Re: readdir() returns inaccessible name if file was created with invalid UTF-8

On Jul 24 19:45, Thomas Wolff via Cygwin wrote:
> On 24.07.2025 at 17:35, Corinna Vinschen wrote:
> > Consider the following GB18030 string: 0x90 0x30 0x81 0x30
> >
> > This string translates into a UTF-16 surrogate pair: 0xd800 0xdc00.
> >
> > If you run a tweaked version of your test application from
> > https://cygwin.com/pipermail/cygwin/2025-July/258513.html:
> >
> >   setlocale (LC_CTYPE, "zh_CN.gb18030");
> >   mb (0x90);
> >   mb (0x30);
> >   mb (0x81);
> >   mb (0x30);
> >
> > Then the output is:
> >
> >   90 -> 0000 : -2
> >   30 -> 0000 : -2
> >   81 -> 0000 : -2
> >   30 -> D800 : 0
> >
> > However, if you notice this situation...
> >
> >   if (ret_from_mbrtowc == 0 && codeset == gb18030
> >       && (wc & 0xfc00) == 0xd800)
> >
> > ...you can just add a fake NUL byte:
> >
> >   mbrtowc (&wc, "", 1, &mbstate);
> >
> > If you do that, the above sequence becomes:
> >
> >   90 -> 0000 : -2
> >   30 -> 0000 : -2
> >   81 -> 0000 : -2
> >   30 -> D800 : 0
> >   00 -> DC00 : 1
> >
> > I hope this helps, if you didn't already handle GB18030 differently
> > in mintty.
> Oooff. No, I didn't. So that is already before 3.6.4 (and again 3.6.5),
> right?

Starting with 3.5.0 in fact.

> Thanks for the notice, I'll check and test your workaround.

No worries. While I was testing the UTF-8 problem, I realized that we
have another strange encoding which we have only been supporting for a
short while.

GB18030 is tricky, because there's no such thing as a simple
mathematical conversion, as there is for UTF-8. The 2nd and 4th bytes
may have position-dependent meaning and could just as well represent an
ASCII char. You can't simply search backwards in a string either. As I
wrote, you need all 4 bytes to allow conversion into UTF-16, so a
workaround as above is, unfortunately, necessary.

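For illustration, the workaround could be wrapped into a small helper along the following lines. This is only a sketch under assumptions: feed_byte, the is_gb18030 flag and the surrogate helpers are hypothetical names, not taken from the test application or from mintty, and the "return value 0 plus high surrogate" case is the Cygwin behaviour described above. On systems where wchar_t is wider than 16 bits, that branch is presumably never taken.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* True if `unit` is a UTF-16 high (leading) surrogate, 0xD800..0xDBFF. */
int is_high_surrogate(unsigned long unit)
{
    return (unit & 0xfc00UL) == 0xd800UL;
}

/* True if `unit` is a UTF-16 low (trailing) surrogate, 0xDC00..0xDFFF. */
int is_low_surrogate(unsigned long unit)
{
    return (unit & 0xfc00UL) == 0xdc00UL;
}

/* Combine a surrogate pair into the code point it encodes. */
unsigned long combine_surrogates(unsigned long hi, unsigned long lo)
{
    assert(is_high_surrogate(hi) && is_low_surrogate(lo));
    return 0x10000UL + ((hi - 0xd800UL) << 10) + (lo - 0xdc00UL);
}

/*
 * Feed one input byte to mbrtowc().
 *
 * Returns the number of wide chars stored in out[] (0, 1 or 2),
 * or -1 if the byte sequence is invalid in the current locale.
 * `is_gb18030` is a caller-supplied flag (hypothetical); how the
 * caller detects the codeset is outside the scope of this sketch.
 */
int feed_byte(unsigned char byte, int is_gb18030,
              mbstate_t *state, wchar_t out[2])
{
    char c = (char) byte;
    wchar_t wc = 0;
    size_t ret = mbrtowc(&wc, &c, 1, state);

    if (ret == (size_t) -1) {
        /* Invalid sequence: reset the state so decoding can continue. */
        memset(state, 0, sizeof *state);
        return -1;
    }
    if (ret == (size_t) -2)
        return 0;               /* incomplete, more bytes are needed */

    out[0] = wc;

    /* The situation described above: mbrtowc returned 0, but wc holds
       a high surrogate.  Feed a fake NUL byte to fetch the low one. */
    if (ret == 0 && is_gb18030 && is_high_surrogate((unsigned long) wc)) {
        wchar_t lo = 0;
        mbrtowc(&lo, "", 1, state);
        out[1] = lo;
        return 2;
    }
    return 1;
}
```

The caller zeroes an mbstate_t once, then passes each input byte to feed_byte and consumes however many wide chars come back. For the pair from the example, combine_surrogates(0xd800, 0xdc00) yields U+10000.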
Corinna