Mail Archives: cygwin/2025/07/24/06:31:27

Date: Thu, 24 Jul 2025 12:30:50 +0200
To: Thomas Wolff <towo AT towo DOT net>,
Christian Franke <Christian DOT Franke AT t-online DOT de>
Subject: Re: readdir() returns inaccessible name if file was created with
invalid UTF-8
Message-ID: <aIILWiKsr99DOaI8@calimero.vinschen.de>
Mail-Followup-To: Thomas Wolff <towo AT towo DOT net>,
Christian Franke <Christian DOT Franke AT t-online DOT de>, cygwin AT cygwin DOT com
References: <aFxRfI4NdZ8y5IlK AT calimero DOT vinschen DOT de>
<f78c615c-aefe-b3d0-aada-5f9d0cf73a0a AT t-online DOT de>
<aF5y15iQ840LxLYJ AT calimero DOT vinschen DOT de>
<ca205dbd-907f-4552-9e5c-2cb0050f83a3 AT towo DOT net>
<aH-MtwqARmDmLwoo AT calimero DOT vinschen DOT de>
<91f26856-72b0-483b-8d04-bd90a27b6be0 AT towo DOT net>
<4ab2c1b7-3164-4556-ba36-29814ecf5766 AT towo DOT net>
<68f65634-8f4e-436b-ba6a-d30bdf882aaa AT towo DOT net>
<aICVBQzWUiCYwnL2 AT calimero DOT vinschen DOT de>
<11282182-60d1-4841-bf78-5ef78cf30060 AT towo DOT net>
MIME-Version: 1.0
In-Reply-To: <11282182-60d1-4841-bf78-5ef78cf30060@towo.net>
X-BeenThere: cygwin AT cygwin DOT com
X-Mailman-Version: 2.1.30
List-Id: General Cygwin discussions and problem reports <cygwin.cygwin.com>
List-Unsubscribe: <https://cygwin.com/mailman/options/cygwin>,
<mailto:cygwin-request AT cygwin DOT com?subject=unsubscribe>
List-Archive: <https://cygwin.com/pipermail/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-request AT cygwin DOT com?subject=help>
List-Subscribe: <https://cygwin.com/mailman/listinfo/cygwin>,
<mailto:cygwin-request AT cygwin DOT com?subject=subscribe>
From: Corinna Vinschen via Cygwin <cygwin AT cygwin DOT com>
Reply-To: cygwin AT cygwin DOT com
Cc: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>, cygwin AT cygwin DOT com
Errors-To: cygwin-bounces~archive-cygwin=delorie DOT com AT cygwin DOT com
Sender: "Cygwin" <cygwin-bounces~archive-cygwin=delorie DOT com AT cygwin DOT com>

Hi Thomas, hi Christian,

On Jul 23 17:50, Thomas Wolff via Cygwin wrote:
> Am 23.07.2025 um 09:53 schrieb Corinna Vinschen via Cygwin:
> > On Jul 23 05:44, Thomas Wolff via Cygwin wrote:
> > What bugs me is that we have the choice between a broken mbrtowc on
> > one side and a chance to generate broken filenames on the other side.
> I did not look into those details, but while characters to be handled by a
> terminal come sequentially as a stream, filenames can be handled as a
> compound string, isn't that easier to check?
> 
> > I think we should actually revert fa272e05bbd0 ("wcstombs: also call
> > __WCTOMB on terminating NUL if output buffer is NULL") and see if we can
> > fix the filename issue in the Cygwin functions for filename conversion
> > alone.
> > 
> > Any ideas appreciated.

I think I have a fix.  I reverted fa272e05bbd0 so mbrtowc is operating
as before.  This should fix mintty.

As for the filename problem, I had another look into the _sys_wcstombs
and _sys_mbstowcs functions.

It occurred to me that the algorithm for handling an invalid MB sequence
is upside down when it comes to invalid 4-byte UTF-8 sequences.

Consider a simple broken 2-byte UTF-8 sequence like 0xc2 0x7f.  This
sequence is converted to wide chars, with the invalid byte mapped into
the private use area, like this:

  0xc2 0x7f -> 0xf0c2 0x007f

The first byte of the sequence is wrong, so it's converted to 0xf0xx.
At this point, we reset the mbstate and try the mbtowc conversion again
with byte 2.  Byte 2 is now a valid single byte.  Hence 0xf0c2 0x007f.
Also

  0xc2 0xff -> 0xf0c2 0xf0ff

because 0xc2 0xff is not valid and 0xff is not a valid lead byte.

Now consider a broken 3-byte sequence.  Same as above:

  0xe0 0xa0 0x7f -> 0xf0e0 0xf0a0 0x007f
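
As a rough illustration of the per-byte fallback for 2- and 3-byte
sequences, here is a minimal Python sketch.  It is not the Cygwin C code:
the name pua_decode is made up, Python's strict UTF-8 decoder stands in for
mbtowc, and 4-byte sequences are deliberately left out.

```python
def pua_decode(data: bytes) -> list[int]:
    """Convert bytes to 16-bit units; an invalid byte xx becomes 0xF0xx.

    Hypothetical helper for illustration only. It handles sequences of
    up to three bytes (the BMP); 4-byte sequences are not modelled here.
    """
    units = []
    i = 0
    while i < len(data):
        for length in (1, 2, 3):
            try:
                ch = data[i:i + length].decode("utf-8")
            except UnicodeDecodeError:
                continue              # try a longer candidate sequence
            units.append(ord(ch))
            i += length
            break
        else:
            # No valid sequence starts at this byte: map it into the
            # private use area and retry with the following byte.
            units.append(0xF000 | data[i])
            i += 1
    return units


print([hex(u) for u in pua_decode(b"\xc2\x7f")])      # ['0xf0c2', '0x7f']
print([hex(u) for u in pua_decode(b"\xc2\xff")])      # ['0xf0c2', '0xf0ff']
print([hex(u) for u in pua_decode(b"\xe0\xa0\x7f")])  # ['0xf0e0', '0xf0a0', '0x7f']
```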

Now the 4-byte sequence with a broken 4th byte:

  0xf0 0x90 0x80 0x7f -> 0xd800 0xf07f

What's wrong here is the fact that the broken sequence results in
a valid high surrogate and the trailing 4th byte is treated as the
broken sequence.

But in fact the leading three bytes are the broken sequence.  The
current algorithm doesn't catch that, because it's already done
and handled.  So the innocent 4th byte has to take the punch.

I added a patch to _sys_mbstowcs:
- note the fact we already got a high surrogate
- if the next underlying mbtowc call returns an error, backtrack
  to the high surrogate in the output string and overwrite it with
  a per-byte sequence in the private use area
- reset mbstate
- retry the next byte after the broken sequence
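
The steps above can be sketched in Python as follows.  This is only a toy
model under stated assumptions, not the actual patch: toy_mbtowc and
sys_mbstowcs_sketch are made-up names, toy_mbtowc merely imitates a
stateful mbtowc that hands out the high surrogate after three bytes, and
the handling of input that ends right after a high surrogate is my own
assumption.

```python
PUA = 0xF000  # an invalid byte xx is stored as 0xF0xx


def toy_mbtowc(data, i, state):
    """Toy stand-in for a stateful mbtowc with 16-bit wide chars.

    Returns (code_unit, bytes_consumed), or None for an invalid sequence.
    A 4-byte sequence yields its high surrogate after three bytes; the
    partial code point is kept in state["top"] for the next call.
    """
    b = data[i]
    if "top" in state:                        # expecting the 4th byte
        top = state.pop("top")
        if 0x80 <= b <= 0xBF:
            cp = (top << 6) | (b & 0x3F)
            return 0xDC00 | (cp & 0x3FF), 1   # low surrogate
        return None
    if 0xF0 <= b <= 0xF4:                     # lead byte of a 4-byte sequence
        tail = data[i + 1:i + 3]
        lo = 0x90 if b == 0xF0 else 0x80
        hi = 0x8F if b == 0xF4 else 0xBF
        if len(tail) == 2 and lo <= tail[0] <= hi and 0x80 <= tail[1] <= 0xBF:
            top = ((b & 0x07) << 12) | ((tail[0] & 0x3F) << 6) | (tail[1] & 0x3F)
            state["top"] = top
            return 0xD800 | ((top >> 4) - 0x40), 3   # high surrogate
        return None
    for length in (1, 2, 3):                  # ASCII and 2-/3-byte sequences
        try:
            return ord(data[i:i + length].decode("utf-8")), length
        except UnicodeDecodeError:
            pass
    return None


def sys_mbstowcs_sketch(data: bytes) -> list[int]:
    out, state, i = [], {}, 0
    surrogate_start = None   # input offset of a pending high surrogate
    while i < len(data):
        result = toy_mbtowc(data, i, state)
        if result is not None:
            unit, consumed = result
            # Note the fact that we just got a high surrogate.
            surrogate_start = i if "top" in state else None
            out.append(unit)
            i += consumed
            continue
        state.clear()                         # reset the conversion state
        if surrogate_start is not None:
            # Backtrack: overwrite the high surrogate with per-byte PUA
            # values, then retry data[i] with a clean state.
            out.pop()
            out.extend(PUA | b for b in data[surrogate_start:i])
            surrogate_start = None
        else:
            out.append(PUA | data[i])
            i += 1
    if surrogate_start is not None:
        # Assumption: input that ends after the 3rd byte is treated alike.
        out.pop()
        out.extend(PUA | b for b in data[surrogate_start:])
    return out


print([hex(u) for u in sys_mbstowcs_sketch(b"\xf0\x90\x80\x7f")])
# ['0xf0f0', '0xf090', '0xf080', '0x7f']
```

Without the backtracking branch, the same input would come out as
0xd800 0xf07f, which is the behaviour described above.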

As far as my testing goes, all cases with broken filenames should
work now.  The upcoming test release 3.7.0-0.261.gf21fbcaf583e
will contain the patch.

However, there's one problem left.  I added a FIXME comment to
_sys_wcstombs:

   FIXME? The conversion of invalid bytes from the private use area
   like we do here is not actually necessary.  If we skip it, the
   generated multibyte string is not identical to the original multibyte
   string, but it's equivalent in the sense, that another mbstowcs will
   generate the same wide-char string.  It would also be identical to
   the same string converted by wcstombs.  And while the original
   multibyte string can't be converted by mbstowcs, this string can.

What does that mean?  Consider this UTF8 input string:

  0xf0 0x90 0x80 0x2e

  mbstowcs:     returns -1
  sys_mbstowcs: f0f0 f090 f080 002e

Let's convert it back to multibyte:

  sys_wcstombs: 0xf0 0x90 0x80 0x2e
  wcstombs:     0xef 0x83 0xb0 0xef 0x82 0x90 0xef 0x82 0x80 0x2e

So while sys_wcstombs has special code converting the string back to its
original MB string, wcstombs converts to the CESU-8 representation.

This is transparent.  If we convert this CESU-8 string back to
wide-char, the resulting wide-char strings are the same:

  mbstowcs:     f0f0 f090 f080 002e
  sys_mbstowcs: f0f0 f090 f080 002e
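
The two ways back to multibyte can be compared with a small Python
sketch.  Again this is only an illustration: sys_wcstombs_sketch and its
restore_bytes flag are made-up names, Python's UTF-8 codec stands in for
wcstombs/mbstowcs, the whole 0xf000-0xf0ff range is assumed to be mapped
back, and surrogates are not handled.

```python
def sys_wcstombs_sketch(units, restore_bytes=True):
    """Convert 16-bit units (BMP, no surrogates) back to multibyte.

    With restore_bytes=True, a unit 0xF0xx is turned back into the single
    byte xx (the special case in question).  With restore_bytes=False,
    every unit is plainly encoded as UTF-8, the way wcstombs would do it.
    """
    out = bytearray()
    for u in units:
        if restore_bytes and 0xF000 <= u <= 0xF0FF:
            out.append(u & 0xFF)
        else:
            out += chr(u).encode("utf-8")
    return bytes(out)


wide = [0xF0F0, 0xF090, 0xF080, 0x002E]

original = sys_wcstombs_sketch(wide, restore_bytes=True)
plain = sys_wcstombs_sketch(wide, restore_bytes=False)

print(original.hex(" "))  # f0 90 80 2e
print(plain.hex(" "))     # ef 83 b0 ef 82 90 ef 82 80 2e

# The plain encoding is valid UTF-8 and converts back to the same
# wide-char string, so nothing is lost by skipping the special case.
print([ord(c) for c in plain.decode("utf-8")] == wide)  # True
```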

So the question here is, shall we keep the special case converting
private use area bytes back to their original byte encoding?

Or shall we simply go along with CESU-8 when converting back to multibyte
to keep the string the same as with wcstombs?

Exempt from this are the characters not valid in a DOS filename.
These will always be converted if we create wide-char filenames.


Thanks,
Corinna

-- 
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
