Mail Archives: cygwin/2025/07/24/14:39:17

Date: Thu, 24 Jul 2025 20:38:18 +0200
To: cygwin AT cygwin DOT com
Subject: Re: readdir() returns inaccessible name if file was created with
invalid UTF-8
Message-ID: <aIJ9mlDotG7qw1-n@calimero.vinschen.de>
Mail-Followup-To: cygwin AT cygwin DOT com
References: <aFxRfI4NdZ8y5IlK AT calimero DOT vinschen DOT de>
<f78c615c-aefe-b3d0-aada-5f9d0cf73a0a AT t-online DOT de>
<aF5y15iQ840LxLYJ AT calimero DOT vinschen DOT de>
<ca205dbd-907f-4552-9e5c-2cb0050f83a3 AT towo DOT net>
<aH-MtwqARmDmLwoo AT calimero DOT vinschen DOT de>
<91f26856-72b0-483b-8d04-bd90a27b6be0 AT towo DOT net>
<4ab2c1b7-3164-4556-ba36-29814ecf5766 AT towo DOT net>
<68f65634-8f4e-436b-ba6a-d30bdf882aaa AT towo DOT net>
<aIJSqk4abV6QdeVS AT calimero DOT vinschen DOT de>
<a41d289a-c440-4616-967c-850d7b7679d6 AT towo DOT net>
MIME-Version: 1.0
In-Reply-To: <a41d289a-c440-4616-967c-850d7b7679d6@towo.net>
From: Corinna Vinschen via Cygwin <cygwin AT cygwin DOT com>
Reply-To: cygwin AT cygwin DOT com
Cc: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>

On Jul 24 19:45, Thomas Wolff via Cygwin wrote:
> Am 24.07.2025 um 17:35 schrieb Corinna Vinschen:
> > Consider the following GB18030 string: 0x90 0x30 0x81 0x30
> > 
> > This string translates into a UTF-16 surrogate pair: 0xd800 0xdc00.
> > 
> > If you run a tweaked version of your test application from
> > https://cygwin.com/pipermail/cygwin/2025-July/258513.html:
> > 
> >    setlocale (LC_CTYPE, "zh_CN.gb18030");
> >    mb (0x90);
> >    mb (0x30);
> >    mb (0x81);
> >    mb (0x30);
> > 
> > Then the output is:
> > 
> >    90 -> 0000 : -2
> >    30 -> 0000 : -2
> >    81 -> 0000 : -2
> >    30 -> D800 : 0
> > 
> > However, if you notice this situation...
> > 
> >    if (ret_from_mbrtowc == 0 && codeset == gb18030
> >        && (pwc & 0xfc00) == 0xd800)
> > 
> > ...you can just add a fake NUL byte:
> > 
> >      mbrtowc (&wc, "", 1, &mbstate);
> > 
> > If you do that, the above sequence becomes:
> > 
> >    90 -> 0000 : -2
> >    30 -> 0000 : -2
> >    81 -> 0000 : -2
> >    30 -> D800 : 0
> >    00 -> DC00 : 1
> > 
> > I hope this helps, if you didn't already handle GB18030 differently
> > in mintty.
> Oooff. No, I didn't. So that is already before 3.6.4 (and again 3.6.5),
> right?

Starting with 3.5.0 in fact.

> Thanks for the notice, I'll check and test your workaround.

No worries.  While I was testing the UTF-8 problem, I realized that
there is another strange encoding which we have only been supporting
for a short while.

GB18030 is tricky because, unlike UTF-8, it has no simple mathematical
conversion to Unicode.  The 2nd and 4th bytes may have position-dependent
meaning and could just as well represent an ASCII char.  You can't
simply search backwards in a string either.
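
To illustrate the position-dependent bytes, here is a minimal sketch in C.
All function names are hypothetical, and the length rule is a simplification
that assumes well-formed input (lead byte 0x81..0xFE followed by a digit byte
0x30..0x39 means a four-byte character, otherwise a two-byte one).  A
byte-wise search for ASCII '0' (0x30) hits the middle of the four-byte
sequence, while a scan that walks character boundaries from the start does
not:

```c
#include <stddef.h>

/* Naive search: index of the first byte equal to ch, ignoring
   character boundaries entirely. */
static ptrdiff_t naive_find(const unsigned char *s, size_t n, unsigned char ch)
{
    for (size_t i = 0; i < n; i++)
        if (s[i] == ch)
            return (ptrdiff_t)i;
    return -1;
}

/* Simplified GB18030 character length at s[0], with n bytes left.
   Assumption: input is well-formed, so no validation is done. */
static size_t gb18030_len(const unsigned char *s, size_t n)
{
    if (s[0] < 0x81 || s[0] > 0xFE || n < 2)
        return 1;                       /* single byte (or truncated) */
    if (s[1] >= 0x30 && s[1] <= 0x39)
        return n >= 4 ? 4 : n;          /* four-byte form */
    return 2;                           /* two-byte form */
}

/* Boundary-aware search: only single-byte characters can match ch. */
static ptrdiff_t forward_find(const unsigned char *s, size_t n, unsigned char ch)
{
    size_t i = 0;
    while (i < n) {
        size_t len = gb18030_len(s + i, n - i);
        if (len == 1 && s[i] == ch)
            return (ptrdiff_t)i;
        i += len;
    }
    return -1;
}
```

For the bytes 0x90 0x30 0x81 0x30 followed by a real ASCII '0', naive_find
reports index 1, whereas forward_find reports index 4.  The scan has to start
from a known character boundary, which is why searching backwards doesn't
work.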

As I wrote, you need all 4 bytes to allow conversion into UTF-16, so
a workaround as above is, unfortunately, necessary.
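
A minimal sketch of how the workaround could look in a byte-at-a-time
conversion loop.  The helper names low_surrogate_pending and convert_bytes are
hypothetical, the locale name is the one from the quoted test, and the
surrogate check assumes a 16-bit wchar_t as on Cygwin.  This is not how mintty
or Cygwin implement it, just an illustration of the condition and the extra
call:

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* True if mbrtowc returned 0 while delivering a high surrogate, i.e. the
   low surrogate is still held back in the conversion state. */
static int low_surrogate_pending(size_t ret, wchar_t wc, int is_gb18030)
{
    return ret == 0 && is_gb18030
           && ((unsigned long)wc & 0xfc00) == 0xd800;
}

/* Feed bytes one at a time and print each step in the format of the
   quoted test output.  Returns the number of wide units produced, or -1
   if the GB18030 locale is not available on this system. */
static int convert_bytes(const unsigned char *bytes, size_t n)
{
    mbstate_t st;
    int units = 0;

    if (!setlocale(LC_CTYPE, "zh_CN.gb18030"))
        return -1;
    memset(&st, 0, sizeof st);

    for (size_t i = 0; i < n; i++) {
        wchar_t wc = 0;
        char c = (char)bytes[i];
        size_t ret = mbrtowc(&wc, &c, 1, &st);

        printf("%02X -> %04X : %ld\n", (unsigned)bytes[i], (unsigned)wc, (long)ret);
        if (ret == (size_t)-1)
            break;                      /* invalid sequence, give up */
        if (ret == (size_t)-2)
            continue;                   /* character still incomplete */
        units++;
        if (low_surrogate_pending(ret, wc, 1)) {
            /* Fake NUL byte: pass a pointer to a NUL, not the char '\0'. */
            ret = mbrtowc(&wc, "", 1, &st);
            printf("00 -> %04X : %ld\n", (unsigned)wc, (long)ret);
            units++;
        }
    }
    return units;
}
```

Calling convert_bytes with the four bytes 0x90 0x30 0x81 0x30 on Cygwin 3.5.0
or later should reproduce the five-line output quoted above.  On systems with
a 32-bit wchar_t the surrogate branch is not expected to trigger at all.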

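To see why all four bytes are needed before anything can be emitted, here is
a small sketch of the conversion for the example sequence.  It covers only
the four-byte range that maps to the supplementary planes (0x90308130 up to
0xE3329A35), which happens to be linear; four-byte sequences for BMP
characters need a lookup table and are not handled.  The function names are
hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Decode a four-byte GB18030 sequence from the supplementary-plane range
   to a Unicode code point.  Returns 0 for anything outside that range. */
static uint32_t gb18030_supp_to_cp(const unsigned char b[4])
{
    if (b[0] < 0x90 || b[0] > 0xE3) return 0;
    if (b[1] < 0x30 || b[1] > 0x39) return 0;
    if (b[2] < 0x81 || b[2] > 0xFE) return 0;
    if (b[3] < 0x30 || b[3] > 0x39) return 0;

    /* Every byte contributes to the linear index, including the last. */
    uint32_t linear = (((uint32_t)(b[0] - 0x90) * 10 + (uint32_t)(b[1] - 0x30)) * 126
                       + (uint32_t)(b[2] - 0x81)) * 10 + (uint32_t)(b[3] - 0x30);
    uint32_t cp = 0x10000 + linear;

    return cp <= 0x10FFFF ? cp : 0;
}

/* Split a supplementary-plane code point into a UTF-16 surrogate pair. */
static void cp_to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;
    *hi = (uint16_t)(0xD800 + (cp >> 10));
    *lo = (uint16_t)(0xDC00 + (cp & 0x3FF));
}
```

For 0x90 0x30 0x81 0x30 this yields U+10000, i.e. the pair 0xd800 0xdc00
mentioned in the quoted mail.  Since even the high surrogate depends on the
complete code point, no wide character can be returned before the 4th byte
has been seen.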

Corinna
