Mail Archives: cygwin/2025/07/24/11:29:14

Message-ID: <aec69850-227c-4c37-8aa9-6ea97dbec25b@systematicsw.ab.ca>
Date: Thu, 24 Jul 2025 09:28:48 -0600
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: readdir() returns inaccessible name if file was created with
invalid UTF-8
To: cygwin AT cygwin DOT com
References: <aFxRfI4NdZ8y5IlK AT calimero DOT vinschen DOT de>
<f78c615c-aefe-b3d0-aada-5f9d0cf73a0a AT t-online DOT de>
<aF5y15iQ840LxLYJ AT calimero DOT vinschen DOT de>
<ca205dbd-907f-4552-9e5c-2cb0050f83a3 AT towo DOT net>
<aH-MtwqARmDmLwoo AT calimero DOT vinschen DOT de>
<91f26856-72b0-483b-8d04-bd90a27b6be0 AT towo DOT net>
<4ab2c1b7-3164-4556-ba36-29814ecf5766 AT towo DOT net>
<68f65634-8f4e-436b-ba6a-d30bdf882aaa AT towo DOT net>
<aICVBQzWUiCYwnL2 AT calimero DOT vinschen DOT de>
<11282182-60d1-4841-bf78-5ef78cf30060 AT towo DOT net>
<aIILWiKsr99DOaI8 AT calimero DOT vinschen DOT de>
Organization: Systematic Software
In-Reply-To: <aIILWiKsr99DOaI8@calimero.vinschen.de>
From: Brian Inglis via Cygwin <cygwin AT cygwin DOT com>
Reply-To: cygwin AT cygwin DOT com
Cc: brian DOT inglis AT systematicsw DOT ab DOT ca

On 2025-07-24 04:30, Corinna Vinschen via Cygwin wrote:
> Hi Thomas, hi Christian,
> 
> On Jul 23 17:50, Thomas Wolff via Cygwin wrote:
>> Am 23.07.2025 um 09:53 schrieb Corinna Vinschen via Cygwin:
>>> On Jul 23 05:44, Thomas Wolff via Cygwin wrote:
>>> What bugs me is that we have the choice between a broken mbrtowc on
>>> one side and a chance to generate broken filenames on the other side.
>> I did not look into those details, but while characters to be handled by a
>> terminal come sequentially as a stream, filenames can be handled as a
>> compound string, isn't that easier to check?
>>
>>> I think we should actually revert fa272e05bbd0 ("wcstombs: also call
>>> __WCTOMB on terminating NUL if output buffer is NULL") and see if we can
>>> fix the filename issue in the Cygwin functions for filename conversion
>>> alone.
>>>
>>> Any ideas appreciated.
> 
> I think I have a fix.  I reverted fa272e05bbd0 so mbrtowc is operating
> as before.  This should fix mintty.
> 
> As for the filename problem, I had another look into the _sys_wcstombs
> and _sys_mbstowcs functions.
> 
> It occurred to me that the algorithm for handling an invalid MB sequence
> is upside down when it comes to invalid 4-byte UTF-8 sequences.
> 
> Consider a simple broken 2-byte UTF8 sequence like 0xc2 0x7f.  This
> sequence is converted to a byte sequence in the private use area like this:
> 
>    0xc2 0x7f -> 0xf0c2 0x007f
> 
> So the first byte of the sequence is wrong, so it's converted to 0xf0xx.
> At this point, we reset the mbstate and try the mbtowc conversion again
> with byte 2.  Byte 2 is now a valid single byte.  Hence 0xf0c2 0x007f.
> Also
> 
>    0xc2 0xff -> 0xf0c2 0xf0ff
> 
> because 0xc2 0xff is not valid and 0xff is not a valid lead byte.
> 
> Now consider a broken 3 byte sequence.  Same as above:
> 
>    0xe0 0xa0 0x7f -> 0xf0e0 0xf0a0 0x7f
> 
> Now the 4 byte sequence with a broken 4th byte:
> 
>    0xf0 0x90 0x80 0x7f -> 0xd800 0xf07f
> 
> What's wrong here is the fact that the broken sequence results in
> a valid high surrogate and the trailing 4th byte is treated as the
> broken sequence.
> 
> But in fact the leading three bytes are the broken sequence.  The
> current algorithm doesn't catch that, because it's already done
> and handled.  So the innocent 4th byte has to take the punch.
> 
> I added a patch to _sys_mbstowcs:
> - note the fact we already got a high surrogate
> - if the next underlying mbtowc call returns an error, backtrack
>    to the high surrogate in the output string and overwrite it with
>    a per-byte sequence in the private use area
> - reset mbstate
> - retry the next byte after the broken sequence
> 
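For anyone following along, the backtracking steps listed above can be modelled roughly as below. This is only a Python sketch of the idea, not the actual _sys_mbstowcs code: the names mbstowcs_sketch, seq_len and tail_ok are made up for illustration, the mbstate reset is implicit, and treating a sequence truncated at the end of the string like a broken one is my assumption.

```python
PUA = 0xF000  # an invalid byte b is stored as the wide char 0xF000 | b


def seq_len(lead):
    """Length of the UTF-8 sequence introduced by `lead`, 0 if not a lead byte."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def tail_ok(lead, pos, b):
    """Is `b` acceptable at offset `pos` (>= 1) after lead byte `lead`?"""
    lo, hi = 0x80, 0xBF
    if pos == 1:  # second byte: reject overlongs, surrogates, > U+10FFFF
        lo = {0xE0: 0xA0, 0xF0: 0x90}.get(lead, lo)
        hi = {0xED: 0x9F, 0xF4: 0x8F}.get(lead, hi)
    return lo <= b <= hi


def mbstowcs_sketch(data):
    """Convert bytes to a list of UTF-16 code units (hypothetical model)."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        need = seq_len(lead)
        if need == 1:  # plain ASCII
            out.append(lead)
            i += 1
            continue
        # Count how many leading bytes of the sequence are valid.
        good = 1 if need else 0
        while (0 < good < need and i + good < len(data)
               and tail_ok(lead, good, data[i + good])):
            good += 1
        seq = data[i:i + good]
        if need == 4 and good >= 3:
            # Three good bytes already yield a high surrogate; note where.
            partial = (((seq[0] & 0x07) << 18) | ((seq[1] & 0x3F) << 12)
                       | ((seq[2] & 0x3F) << 6))
            hi_at = len(out)
            out.append(0xD800 + ((partial - 0x10000) >> 10))
            if good == 4:
                out.append(0xDC00 | ((partial | (seq[3] & 0x3F)) & 0x3FF))
                i += 4
            else:
                # 4th byte is bad: backtrack and overwrite the surrogate
                # with one private-use wide char per consumed byte ...
                out[hi_at:] = [PUA | b for b in seq]
                # ... then retry at the byte after the broken sequence.
                i += 3
        elif need and good == need:  # valid 2- or 3-byte sequence
            out.append(ord(bytes(seq).decode("utf-8")))
            i += need
        else:  # bad lead byte or broken 2/3-byte sequence
            out.append(PUA | lead)
            i += 1
    return out
```

With this model the inputs 0xc2 0x7f, 0xe0 0xa0 0x7f and 0xf0 0x90 0x80 0x7f come out as one 0xf0xx wide char per broken byte followed by a plain 0x007f, and no stray high surrogate is left behind.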
> As far as my testing goes, all cases with broken filenames should
> work now.  The upcoming test release 3.7.0-0.261.gf21fbcaf583e
> will contain the patch.
> 
> However, there's one problem left.  I added a FIXME comment to
> _sys_wcstombs:
> 
>     FIXME? The conversion of invalid bytes from the private use area
>     like we do here is not actually necessary.  If we skip it, the
>     generated multibyte string is not identical to the original multibyte
>     string, but it's equivalent in the sense, that another mbstowcs will
>     generate the same wide-char string.  It would also be identical to
>     the same string converted by wcstombs.  And while the original
>     multibyte string can't be converted by mbstowcs, this string can.
> 
> What does that mean?  Consider this UTF8 input string:
> 
>    0xf0 0x90 0x80 0x2e
> 
>    mbstowcs:     returns -1
>    sys_mbstowcs: f0f0 f090 f080 002e
> 
> Let's convert it back to multibyte:
> 
>    sys_wcstombs: 0xf0 0x90 0x80 0x2e
>    wcstombs:     0xef 0x83 0xb0 0xef 0x82 0x90 0xef 0x82 0x80 0x2e
> 
> So while sys_wcstombs has special code converting the string back to its
> original MB string, wcstombs converts to the CESU-8 representation.
> 
> This is transparent.  If we convert this CESU-8 string back to
> wide-char, the resulting wide-char strings are the same:
> 
>    mbstowcs:     f0f0 f090 f080 002e
>    sys_mbstowcs: f0f0 f090 f080 002e
> 
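The two ways back to multibyte shown above can be reproduced with a small Python sketch. The function names are hypothetical, and the rule "any wide char 0xf0xx maps back to the byte xx" is my simplified reading of the special case, not a copy of what _sys_wcstombs does.

```python
def units_to_utf8(units):
    """Plain conversion: encode UTF-16 code units as ordinary UTF-8."""
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    return raw.decode("utf-16-le").encode("utf-8")


def wcstombs_restoring(units):
    """Like units_to_utf8, but map 0xf0xx wide chars back to the byte xx."""
    out = bytearray()
    run = []  # pending wide chars that get the ordinary UTF-8 encoding
    for u in units:
        if (u & 0xFF00) == 0xF000:
            out += units_to_utf8(run)
            run = []
            out.append(u & 0xFF)  # original (invalid) byte restored
        else:
            run.append(u)
    out += units_to_utf8(run)
    return bytes(out)


wide = [0xF0F0, 0xF090, 0xF080, 0x002E]
print(units_to_utf8(wide).hex(" "))       # the longer, always-decodable form
print(wcstombs_restoring(wide).hex(" "))  # the original byte string
```

The first form can be fed to a standard mbstowcs again and yields the same wide-char string; the second form is byte-identical to the original input but is not valid UTF-8.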
> So the question here is, shall we keep the special case converting
> private use area bytes back to their original byte encoding?
> 
> Or shall we simply go along with CESU-8 when converting back to multibyte
> to keep the string the same as with wcstombs?

Supplementary-plane code points outnumber BMP code points 16 to 1, so many
non-Western and emoji characters would be expanded from 4 UTF-8 bytes to 6
CESU-8 bytes. CESU-8 is also not supported anywhere as a string representation:
per the TR (UTR #26) it is designed for internal use only.
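To make the 4 versus 6 bytes concrete, here is a small Python sketch. The cesu8 helper is hypothetical (Python has no built-in CESU-8 codec that I know of); it encodes each UTF-16 code unit separately as a 3-byte sequence, which is how I read the TR.

```python
def cesu8(s):
    """Encode `s` so that each UTF-16 code unit becomes its own UTF-8-style
    sequence; supplementary characters turn into two 3-byte surrogates."""
    raw = s.encode("utf-16-be")
    units = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


emoji = "\U0001F600"  # a supplementary-plane character
print(len(emoji.encode("utf-8")), emoji.encode("utf-8").hex(" "))
print(len(cesu8(emoji)), cesu8(emoji).hex(" "))
```

BMP characters come out byte-identical in both encodings; only supplementary-plane characters grow.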

> Exempt from this are the characters not valid in a DOS filename.
> These will always be converted if we create wide-char filenames.
-- 
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retrancher  but when there is no more to cut
                                 -- Antoine de Saint-Exupéry

-- 
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
