Message-ID: | <aec69850-227c-4c37-8aa9-6ea97dbec25b@systematicsw.ab.ca> |
Date: | Thu, 24 Jul 2025 09:28:48 -0600 |
MIME-Version: | 1.0 |
User-Agent: | Mozilla Thunderbird |
Subject: | Re: readdir() returns inaccessible name if file was created with invalid UTF-8 |
To: | cygwin AT cygwin DOT com |
References: | <aFxRfI4NdZ8y5IlK AT calimero DOT vinschen DOT de> |
<f78c615c-aefe-b3d0-aada-5f9d0cf73a0a AT t-online DOT de> | |
<aF5y15iQ840LxLYJ AT calimero DOT vinschen DOT de> | |
<ca205dbd-907f-4552-9e5c-2cb0050f83a3 AT towo DOT net> | |
<aH-MtwqARmDmLwoo AT calimero DOT vinschen DOT de> | |
<91f26856-72b0-483b-8d04-bd90a27b6be0 AT towo DOT net> | |
<4ab2c1b7-3164-4556-ba36-29814ecf5766 AT towo DOT net> | |
<68f65634-8f4e-436b-ba6a-d30bdf882aaa AT towo DOT net> | |
<aICVBQzWUiCYwnL2 AT calimero DOT vinschen DOT de> | |
<11282182-60d1-4841-bf78-5ef78cf30060 AT towo DOT net> | |
<aIILWiKsr99DOaI8 AT calimero DOT vinschen DOT de> | |
Organization: | Systematic Software |
In-Reply-To: | <aIILWiKsr99DOaI8@calimero.vinschen.de> |
List-Id: | General Cygwin discussions and problem reports <cygwin.cygwin.com> |
From: | Brian Inglis via Cygwin <cygwin AT cygwin DOT com> |
Reply-To: | cygwin AT cygwin DOT com |
Cc: | brian DOT inglis AT systematicsw DOT ab DOT ca |
On 2025-07-24 04:30, Corinna Vinschen via Cygwin wrote:
> Hi Thomas, hi Christian,
>
> On Jul 23 17:50, Thomas Wolff via Cygwin wrote:
>> On 23.07.2025 at 09:53, Corinna Vinschen via Cygwin wrote:
>>> On Jul 23 05:44, Thomas Wolff via Cygwin wrote:
>>> What bugs me is that we have the choice between a broken mbrtowc on
>>> one side and a chance to generate broken filenames on the other side.
>> I did not look into those details, but while characters to be handled by a
>> terminal come sequentially as a stream, filenames can be handled as a
>> compound string, isn't that easier to check?
>>
>>> I think we should actually revert fa272e05bbd0 ("wcstombs: also call
>>> __WCTOMB on terminating NUL if output buffer is NULL") and see if we can
>>> fix the filename issue in the Cygwin functions for filename conversion
>>> alone.
>>>
>>> Any ideas appreciated.
>
> I think I have a fix.  I reverted fa272e05bbd0 so mbrtowc is operating
> as before.  This should fix mintty.
>
> As for the filename problem, I had another look into the _sys_wcstombs
> and _sys_mbstowcs functions.
>
> It occurred to me that the algorithm for handling an invalid MB sequence
> is upside down when it comes to invalid UTF8 4 byte sequences.
>
> Consider a simple broken 2-byte UTF8 sequence like 0xc2 0x7f.  This
> sequence is converted to a byte sequence in the private use area like this:
>
>   0xc2 0x7f -> 0xf0c2 0x007f
>
> So the first byte of the sequence is wrong, so it's converted to 0xf0xx.
> At this point, we reset the mbstate and try the mbtowc conversion again
> with byte 2.  Byte 2 is now a valid single byte.  Hence 0xf0c2 0x007f.
> Also
>
>   0xc2 0xff -> 0xf0c2 0xf0ff
>
> because 0xc2 0xff is not valid and 0xff is not a valid lead byte.
>
> Now consider a broken 3 byte sequence.
> Same as above:
>
>   0xe0 0xa0 0x7f -> 0xf0e0 0xf0a0 0x7f
>
> Now the 4 byte sequence with a broken 4th byte:
>
>   0xf0 0x90 0x80 0x7f -> 0xd800 0xf07f
>
> What's wrong here is the fact that the broken sequence results in
> a valid high surrogate and the trailing 4th byte is treated as the
> broken sequence.
>
> But in fact the leading three bytes are the broken sequence.  The
> current algorithm doesn't catch that, because it's already done
> and handled.  So the innocent 4th byte has to take the punch.
>
> I added a patch to _sys_mbstowcs:
> - note the fact we already got a high surrogate
> - if the next underlying mbtowc call returns an error, backtrack
>   to the high surrogate in the output string and overwrite it with
>   a per-byte sequence in the private use area
> - reset mbstate
> - retry the next byte after the broken sequence
>
> As far as my testing goes, all cases with broken filenames should
> work now.  The upcoming test release 3.7.0-0.261.gf21fbcaf583e
> will contain the patch.
>
> However, there's one problem left.  I added a FIXME comment to
> _sys_wcstombs:
>
>   FIXME? The conversion of invalid bytes from the private use area
>   like we do here is not actually necessary.  If we skip it, the
>   generated multibyte string is not identical to the original multibyte
>   string, but it's equivalent in the sense, that another mbstowcs will
>   generate the same wide-char string.  It would also be identical to
>   the same string converted by wcstombs.  And while the original
>   multibyte string can't be converted by mbstowcs, this string can.
>
> What does that mean?  Consider this UTF8 input string:
>
>   0xf0 0x90 0x80 0x2e
>
>   mbstowcs:     returns -1
>   sys_mbstowcs: f0f0 f090 f080 002e
>
> Let's convert it back to multibyte:
>
>   sys_wcstombs: 0xf0 0x90 0x80 0x2e
>   wcstombs:     0xef 0x83 0xb0 0xef 0x82 0x90 0xef 0x82 0x80 0x2e
>
> So while sys_wcstombs has special code converting the string back to its
> original MB string, wcstombs converts to the CESU-8 representation.
>
> This is transparent.  If we convert this CESU-8 string back to
> wide-char, the resulting wide-char strings are the same:
>
>   mbstowcs:     f0f0 f090 f080 002e
>   sys_mbstowcs: f0f0 f090 f080 002e
>
> So the question here is, shall we keep the special case converting
> private use area bytes back to their original byte encoding?
>
> Or shall we simply go along with CESU-8 when converting back to multibyte
> to keep the string the same as with wcstombs?

There are 15 times as many SMP as BMP characters, so many non-Western and
emoji characters will be expanded from 4 UTF-8 bytes to 6 CESU-8 bytes.
CESU-8 is also not supported anywhere as a string representation; per the
TR, it is designed for internal use only.

> Exempt from this are the characters not valid in a DOS filename.
> These will always be converted if we create wide-char filenames.

--
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retrancher  but when there is no more to cut
                                -- Antoine de Saint-Exupéry
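
The byte-level examples in the quoted message can be reproduced with a small Python sketch. This is not the Cygwin code: it only mimics the end result described in the thread (each undecodable byte 0xNN becomes the private use character U+F0NN) and does not model the mbstate handling or the high-surrogate backtracking. The names `to_wide`, `to_bytes_restoring`, `to_bytes_plain` and the error handler name `pua_escape` are made up for this illustration.

```python
import codecs

PUA_BASE = 0xF000  # assumption for this sketch: invalid byte 0xNN -> U+F0NN


def _pua_escape(err):
    """Decode error handler: map one undecodable byte to U+F0NN, then resume."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start]
    # Resume at the very next byte, so every byte of a broken sequence is
    # escaped on its own (continuation bytes are never valid lead bytes).
    return chr(PUA_BASE + bad), err.start + 1


codecs.register_error("pua_escape", _pua_escape)


def to_wide(raw: bytes) -> str:
    """Stand-in for the sys_mbstowcs behaviour described in the thread."""
    return raw.decode("utf-8", errors="pua_escape")


def to_bytes_restoring(text: str) -> bytes:
    """Stand-in for sys_wcstombs: turn U+F0NN back into the raw byte 0xNN."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if PUA_BASE <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def to_bytes_plain(text: str) -> bytes:
    """Stand-in for plain wcstombs: encode U+F0NN as ordinary 3-byte UTF-8."""
    return text.encode("utf-8")


if __name__ == "__main__":
    raw = bytes([0xF0, 0x90, 0x80, 0x2E])
    wide = to_wide(raw)
    print(" ".join(f"{ord(c):04x}" for c in wide))
    print(to_bytes_restoring(wide).hex(" "))
    print(to_bytes_plain(wide).hex(" "))
```

One consequence of the restoring variant is visible in this sketch: a filename that genuinely contains U+F0F0 is written back as the single byte 0xf0, so the restoring conversion cannot tell an escaped byte from a real private use character.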