Date: Thu, 24 Jul 2025 09:28:48 -0600
Subject: Re: readdir() returns inaccessible name if file was created with invalid UTF-8
To: cygwin AT cygwin DOT com
From: Brian Inglis via Cygwin
Reply-To: cygwin AT cygwin DOT com

On 2025-07-24 04:30, Corinna Vinschen via Cygwin wrote:
> Hi Thomas, hi Christian,
>
> On Jul 23 17:50, Thomas Wolff via Cygwin wrote:
>> On 23.07.2025 at 09:53, Corinna Vinschen via Cygwin wrote:
>>> On Jul 23 05:44, Thomas Wolff via Cygwin wrote:
>>> What bugs me is that we have the choice between a broken mbrtowc
>>> on one side and a chance to generate broken filenames on the
>>> other side.
>> I did not look into those details, but while characters to be
>> handled by a terminal come sequentially as a stream, filenames can
>> be handled as a compound string; isn't that easier to check?
>>
>>> I think we should actually revert fa272e05bbd0 ("wcstombs: also
>>> call __WCTOMB on terminating NUL if output buffer is NULL") and
>>> see if we can fix the filename issue in the Cygwin functions for
>>> filename conversion alone.
>>>
>>> Any ideas appreciated.
>
> I think I have a fix.
> I reverted fa272e05bbd0 so mbrtowc is operating as before. This
> should fix mintty.
>
> As for the filename problem, I had another look into the
> _sys_wcstombs and _sys_mbstowcs functions.
>
> It occurred to me that the algorithm for handling an invalid MB
> sequence is upside down when it comes to invalid 4-byte UTF-8
> sequences.
>
> Consider a simple broken 2-byte UTF-8 sequence like 0xc2 0x7f. This
> sequence is converted to wide chars, using the private use area,
> like this:
>
>   0xc2 0x7f -> 0xf0c2 0x007f
>
> The first byte of the sequence is wrong, so it's converted to
> 0xf0xx. At this point, we reset the mbstate and try the mbtowc
> conversion again with byte 2. Byte 2 is now a valid single byte.
> Hence 0xf0c2 0x007f. Also
>
>   0xc2 0xff -> 0xf0c2 0xf0ff
>
> because 0xc2 0xff is not valid and 0xff is not a valid lead byte.
>
> Now consider a broken 3-byte sequence. Same as above:
>
>   0xe0 0xa0 0x7f -> 0xf0e0 0xf0a0 0x007f
>
> Now the 4-byte sequence with a broken 4th byte:
>
>   0xf0 0x90 0x80 0x7f -> 0xd800 0xf07f
>
> What's wrong here is the fact that the broken sequence results in a
> valid high surrogate and the trailing 4th byte is treated as the
> broken sequence.
>
> But in fact the leading three bytes are the broken sequence. The
> current algorithm doesn't catch that, because it's already done and
> handled. So the innocent 4th byte has to take the punch.
>
> I added a patch to _sys_mbstowcs:
> - note the fact we already got a high surrogate
> - if the next underlying mbtowc call returns an error, backtrack
>   to the high surrogate in the output string and overwrite it with
>   a per-byte sequence in the private use area
> - reset mbstate
> - retry the next byte after the broken sequence
>
> As far as my testing goes, all cases with broken filenames should
> work now. The upcoming test release 3.7.0-0.261.gf21fbcaf583e
> will contain the patch.
>
> However, there's one problem left. I added a FIXME comment to
> _sys_wcstombs:
>
>   FIXME?
>   The conversion of invalid bytes from the private use area like we
>   do here is not actually necessary. If we skip it, the generated
>   multibyte string is not identical to the original multibyte
>   string, but it's equivalent in the sense that another mbstowcs
>   will generate the same wide-char string. It would also be
>   identical to the same string converted by wcstombs. And while the
>   original multibyte string can't be converted by mbstowcs, this
>   string can.
>
> What does that mean? Consider this UTF-8 input string:
>
>   0xf0 0x90 0x80 0x2e
>
>   mbstowcs:     returns -1
>   sys_mbstowcs: f0f0 f090 f080 002e
>
> Let's convert it back to multibyte:
>
>   sys_wcstombs: 0xf0 0x90 0x80 0x2e
>   wcstombs:     0xef 0x83 0xb0 0xef 0x82 0x90 0xef 0x82 0x80 0x2e
>
> So while sys_wcstombs has special code converting the string back
> to its original MB string, wcstombs converts to the CESU-8
> representation.
>
> This is transparent. If we convert this CESU-8 string back to
> wide-char, the resulting wide-char strings are the same:
>
>   mbstowcs:     f0f0 f090 f080 002e
>   sys_mbstowcs: f0f0 f090 f080 002e
>
> So the question here is, shall we keep the special case converting
> private use area bytes back to their original byte encoding?
>
> Or shall we simply go along with CESU-8 when converting back to
> multibyte to keep the string the same as with wcstombs?

There are far more supplementary-plane code points than BMP code
points (16 planes versus 1), so many non-Western and emoji characters
would be expanded from 4 UTF-8 bytes to 6 CESU-8 bytes. CESU-8 is not
supported anywhere as a string representation; per the TR it is
designed for internal use only.

> Exempt from this are the characters not valid in a DOS filename.
> These will always be converted if we create wide-char filenames.

--
Take care.
Thanks, Brian Inglis
Calgary, Alberta, Canada

La perfection est atteinte
non pas lorsqu'il n'y a plus rien à ajouter
mais lorsqu'il n'y a plus rien à retrancher
Perfection is achieved
not when there is no more to add
but when there is no more to cut
                                -- Antoine de Saint-Exupéry
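
For readers following the thread, here is a rough Python model of the
conversions discussed in the quoted message. It is a simplified sketch,
not the Cygwin code: the function names (sys_mbstowcs_model,
sys_wcstombs_model, wcstombs_model) are made up for illustration. It
assumes that the patched behaviour amounts to "a sequence is either
fully valid or every byte up to the retry point is mapped to 0xf0xx",
and it models plain wcstombs as ordinary UTF-8 encoding of the wide
chars. It ignores the special handling of characters that are invalid
in DOS filenames.

```python
def _seq_len(lead):
    """Expected UTF-8 sequence length for a lead byte, 0 if not a lead."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def sys_mbstowcs_model(data):
    """Bytes -> list of 16-bit units; invalid bytes become 0xf000 | byte."""
    units = []
    i = 0
    while i < len(data):
        n = _seq_len(data[i])
        try:
            if n == 0:
                raise ValueError("not a valid lead byte")
            # Strict decoding fails on truncated or malformed sequences.
            cp = ord(data[i:i + n].decode("utf-8"))
        except ValueError:
            # Map only this byte, then retry from the next byte.
            units.append(0xF000 | data[i])
            i += 1
            continue
        if cp >= 0x10000:
            # Emit a surrogate pair only for a complete, valid sequence.
            cp -= 0x10000
            units += [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]
        else:
            units.append(cp)
        i += n
    return units


def wcstombs_model(units):
    """Plain conversion: private-use units are encoded like any other."""
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    return raw.decode("utf-16-le").encode("utf-8")


def sys_wcstombs_model(units):
    """Special case: 0xf0xx units are turned back into the raw byte xx."""
    out = bytearray()
    pending = []

    def flush():
        if pending:
            out.extend(wcstombs_model(pending))
            pending.clear()

    for u in units:
        if 0xF000 <= u <= 0xF0FF:
            flush()
            out.append(u & 0xFF)
        else:
            pending.append(u)
    flush()
    return bytes(out)


if __name__ == "__main__":
    broken = bytes([0xF0, 0x90, 0x80, 0x2E])
    wide = sys_mbstowcs_model(broken)
    print([format(u, "04x") for u in wide])
    print(sys_wcstombs_model(wide).hex(" "))
    print(wcstombs_model(wide).hex(" "))
```

Under these assumptions the model reproduces the byte values quoted
above for the 2-byte, 3-byte and 4-byte examples, and both multibyte
forms convert back to the same wide-char string.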