Date: Thu, 24 Jul 2025 17:35:06 +0200
To: Thomas Wolff <towo@towo.net>
Subject: Re: readdir() returns inaccessible name if file was created with
 invalid UTF-8
Message-ID: <aIJSqk4abV6QdeVS@calimero.vinschen.de>
Mail-Followup-To: Thomas Wolff <towo@towo.net>, cygwin@cygwin.com
References: <96f2253b-791b-b8a0-97dd-8d257eefb9b1@t-online.de>
 <03c4fae7-7322-572c-ae72-52e300f0b438@t-online.de>
 <aFxRfI4NdZ8y5IlK@calimero.vinschen.de>
 <f78c615c-aefe-b3d0-aada-5f9d0cf73a0a@t-online.de>
 <aF5y15iQ840LxLYJ@calimero.vinschen.de>
 <ca205dbd-907f-4552-9e5c-2cb0050f83a3@towo.net>
 <aH-MtwqARmDmLwoo@calimero.vinschen.de>
 <91f26856-72b0-483b-8d04-bd90a27b6be0@towo.net>
 <4ab2c1b7-3164-4556-ba36-29814ecf5766@towo.net>
 <68f65634-8f4e-436b-ba6a-d30bdf882aaa@towo.net>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <68f65634-8f4e-436b-ba6a-d30bdf882aaa@towo.net>
From: Corinna Vinschen via Cygwin <cygwin@cygwin.com>
Reply-To: cygwin@cygwin.com
Cc: Corinna Vinschen <corinna-cygwin@cygwin.com>, cygwin@cygwin.com
Content-Type: text/plain; charset="utf-8"

Thomas,

On Jul 23 05:44, Thomas Wolff via Cygwin wrote:
> > > Am 22.07.2025 um 15:05 schrieb Corinna Vinschen:
> > > > mbrtowc() is inherently a bad idea when it comes to UTF-16. It's a
> > > > function which only works really correctly for the unicode base plane,
> > > > or if wchar_t is big enough.
> > > > 
> > > > It's the reason we don't use mbrtowc() if possible.  It's better
> > > > to call mbstowcs() or friends and allow at least 3 chars in the
> > > > wchar_t buffer.
> > > > You can't change that in mintty by any chance?
> > [...]
> OK, suppose I'd consider to switch to mbs[[n]r]towcs, collecting bytes until
> the function gives me a result.
> This would work fine as long as I receive only valid sequences. But look at
> input string test case
> char nonbmp[] = {0xF8, 0x88, 0x8A, 0xAF, 0x2D, 0}; // an invalid sequence
> followed by a valid char
> The functions only return -1 and (in the case of mbsnrtowcs) do not advance
> the input pointer.
> So how am I supposed to recognize that the invalid sequence has ended and a
> valid character has arrived?

Apart from that, you probably still have a problem in mintty: GB18030.

The problem with GB18030 is that you need all four bytes to generate
the high surrogate, so by the time mbrtowc() can return it, there's no
input byte left to fetch the low surrogate with.

Consider the following GB18030 string: 0x90 0x30 0x81 0x30

This string translates into a UTF-16 surrogate pair: 0xd800 0xdc00.
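
To see where these two values come from: the four-byte GB18030 range
starting at 0x90 0x30 0x81 0x30 maps linearly onto U+10000 and up, and
the code point is then split into two 10-bit halves.  What follows is
only a sketch of that arithmetic.  The function names are made up for
illustration, this is not Cygwin's actual conversion code, and the
individual bytes are not validated:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: map a four-byte GB18030 sequence from the
   supplementary range (0x90 0x30 0x81 0x30 and up) to its code point.
   The individual bytes are NOT range-checked here. */
static uint32_t
gb18030_4byte_to_cp (uint8_t b1, uint8_t b2, uint8_t b3, uint8_t b4)
{
  /* Bytes 1 and 3 have 126 possible values, bytes 2 and 4 have 10. */
  uint32_t idx = (((uint32_t) (b1 - 0x90) * 10 + (b2 - 0x30)) * 126
                  + (b3 - 0x81)) * 10 + (b4 - 0x30);
  return 0x10000 + idx;
}

/* Split a supplementary-plane code point (U+10000..U+10FFFF) into a
   UTF-16 surrogate pair. */
static void
to_surrogates (uint32_t cp, uint16_t *hi, uint16_t *lo)
{
  assert (cp >= 0x10000 && cp <= 0x10FFFF);
  cp -= 0x10000;                           /* 20 bits remain */
  *hi = (uint16_t) (0xD800 | (cp >> 10));  /* upper 10 bits */
  *lo = (uint16_t) (0xDC00 | (cp & 0x3FF)); /* lower 10 bits */
}
```

For 0x90 0x30 0x81 0x30 the index is 0, so the code point is U+10000
and the pair is 0xd800 0xdc00.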

If you run a tweaked version of your test application from
https://cygwin.com/pipermail/cygwin/2025-July/258513.html:

  setlocale (LC_CTYPE, "zh_CN.gb18030");
  mb (0x90);
  mb (0x30);
  mb (0x81);
  mb (0x30);

Then the output is:

  90 -> 0000 : -2
  30 -> 0000 : -2
  81 -> 0000 : -2
  30 -> D800 : 0
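
For anybody who doesn't have the test application at hand, a stand-in
for mb() could look roughly like the following.  This is a guess at its
shape based on the output format above, not the code from the linked
message.  The names mb_line and mbstate are assumptions, and the
conversion state is simply left alone after an error:

```c
#include <stdio.h>
#include <wchar.h>

/* Conversion state shared across calls, as with a byte stream that
   arrives one byte at a time. */
static mbstate_t mbstate;

/* Feed a single byte to mbrtowc() and format one result line of the
   form "90 -> 0000 : -2" into buf. */
static void
mb_line (char *buf, size_t len, unsigned char byte)
{
  char c = (char) byte;
  wchar_t wc = 0;        /* stays 0 if mbrtowc() stores nothing */
  size_t ret = mbrtowc (&wc, &c, 1, &mbstate);

  snprintf (buf, len, "%02X -> %04X : %d",
            (unsigned) byte, (unsigned) wc, (int) ret);
}

/* Print the result line for one input byte. */
static void
mb (unsigned char byte)
{
  char buf[64];

  mb_line (buf, sizeof buf, byte);
  puts (buf);
}
```

The caller is expected to call setlocale() first, as in the snippet
above.  Without that, the program runs in the "C" locale and only
plain ASCII bytes give meaningful results.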

However, if you notice this situation...

  if (ret_from_mbrtowc == 0 && codeset == gb18030
      && (wc & 0xfc00) == 0xd800)

...you can just add a fake NUL byte:

    mbrtowc (&wc, "", 1, &mbstate);

If you do that, the above sequence becomes:

  90 -> 0000 : -2
  30 -> 0000 : -2
  81 -> 0000 : -2
  30 -> D800 : 0
  00 -> DC00 : 1
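
Put together, a byte-at-a-time feeder with this workaround might look
like the sketch below.  It assumes the behaviour shown above (16-bit
wchar_t, mbrtowc() returning 0 together with a high surrogate), so on
systems with a 32-bit wchar_t the extra branch simply never triggers.
The names feed_byte and is_high_surrogate and the is_gb18030 flag are
made up here; the caller would have to determine the flag itself, for
instance from nl_langinfo (CODESET):

```c
#include <stdbool.h>
#include <string.h>
#include <wchar.h>

/* True if wc is a UTF-16 high surrogate (0xd800..0xdbff). */
static bool
is_high_surrogate (wchar_t wc)
{
  return ((unsigned long) wc & ~0x3ffUL) == 0xd800UL;
}

/* Feed one input byte to mbrtowc().  Returns the number of wide chars
   stored in out[] (0, 1 or 2), or -1 on an invalid sequence. */
static int
feed_byte (unsigned char byte, bool is_gb18030, mbstate_t *st,
           wchar_t out[2])
{
  char c = (char) byte;
  wchar_t wc = 0;
  size_t ret = mbrtowc (&wc, &c, 1, st);

  if (ret == (size_t) -1)
    {
      memset (st, 0, sizeof *st);  /* state is undefined after an error */
      return -1;
    }
  if (ret == (size_t) -2)
    return 0;                      /* incomplete, wait for more bytes */

  out[0] = wc;
  if (ret == 0 && is_gb18030 && is_high_surrogate (wc))
    {
      /* Workaround: feed a fake NUL byte to fetch the low surrogate. */
      if (mbrtowc (&wc, "", 1, st) == (size_t) -1)
        {
          memset (st, 0, sizeof *st);
          return -1;
        }
      out[1] = wc;
      return 2;
    }
  return 1;
}
```

A real NUL byte in the input still comes back as a single L'\0',
because the extra call is only made when the returned wide char is a
high surrogate.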

I hope this helps, if you didn't already handle GB18030 differently
in mintty.


Corinna


