| Message-ID: | <aec69850-227c-4c37-8aa9-6ea97dbec25b@systematicsw.ab.ca> |
| Date: | Thu, 24 Jul 2025 09:28:48 -0600 |
| Subject: | Re: readdir() returns inaccessible name if file was created with invalid UTF-8 |
| To: | cygwin AT cygwin DOT com |
| Organization: | Systematic Software |
| In-Reply-To: | <aIILWiKsr99DOaI8@calimero.vinschen.de> |
| From: | Brian Inglis via Cygwin <cygwin AT cygwin DOT com> |
On 2025-07-24 04:30, Corinna Vinschen via Cygwin wrote:
> Hi Thomas, hi Christian,
>
> On Jul 23 17:50, Thomas Wolff via Cygwin wrote:
>> Am 23.07.2025 um 09:53 schrieb Corinna Vinschen via Cygwin:
>>> On Jul 23 05:44, Thomas Wolff via Cygwin wrote:
>>> What bugs me is that we have the choice between a broken mbrtowc on
>>> one side and a chance to generate broken filenames on the other side.
>> I did not look into those details, but while characters to be handled by a
>> terminal come sequentially as a stream, filenames can be handled as a
>> compound string, isn't that easier to check?
>>
>>> I think we should actually revert fa272e05bbd0 ("wcstombs: also call
>>> __WCTOMB on terminating NUL if output buffer is NULL") and see if we can
>>> fix the filename issue in the Cygwin functions for filename conversion
>>> alone.
>>>
>>> Any ideas appreciated.
>
> I think I have a fix. I reverted fa272e05bbd0 so mbrtowc is operating
> as before. This should fix mintty.
>
> As for the filename problem, I had another look into the _sys_wcstombs
> and _sys_mbstowcs functions.
>
> It occurred to me that the algorithm for handling an invalid MB sequence
> is upside down when it comes to invalid 4-byte UTF-8 sequences.
>
> Consider a simple broken 2-byte UTF8 sequence like 0xc2 0x7f. This
> sequence is converted to a byte sequence in the private use area like this:
>
> 0xc2 0x7f -> 0xf0c2 0x007f
>
> The first byte of the sequence is wrong, so it's converted to 0xf0xx.
> At this point, we reset the mbstate and try the mbtowc conversion again
> with byte 2. Byte 2 is now a valid single byte. Hence 0xf0c2 0x007f.
> Also
>
> 0xc2 0xff -> 0xf0c2 0xf0ff
>
> because 0xc2 0xff is not valid and 0xff is not a valid lead byte.
>
> Now consider a broken 3 byte sequence. Same as above:
>
> 0xe0 0xa0 0x7f -> 0xf0e0 0xf0a0 0x007f
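
The escape-and-retry rule for the 2-byte and 3-byte cases can be sketched in a few lines of Python. This is only a model of the behaviour described above, not the Cygwin code: the error-handler name "pua-escape" and the helper to_wide() are made up for illustration, and Python decodes to code points rather than UTF-16 units, so it does not reproduce the early high surrogate discussed for 4-byte sequences.

```python
import codecs

def _escape_one_byte(exc):
    """Escape only the first offending byte, then retry decoding at the next byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start]
    # 0xf0xx: the private-use value used for an undecodable byte xx
    return chr(0xF000 | bad), exc.start + 1

# "pua-escape" is a hypothetical name chosen for this sketch.
codecs.register_error("pua-escape", _escape_one_byte)

def to_wide(data):
    """Return the wide-char values for a byte string, escaping invalid bytes."""
    return [ord(ch) for ch in data.decode("utf-8", errors="pua-escape")]

for raw in (b"\xc2\x7f", b"\xc2\xff", b"\xe0\xa0\x7f"):
    print(raw.hex(" "), "->", " ".join(f"0x{w:04x}" for w in to_wide(raw)))
```

Resuming at exc.start + 1 rather than at the end of the reported error span mirrors the "reset the mbstate and try again with the next byte" step.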
>
> Now the 4 byte sequence with a broken 4th byte:
>
> 0xf0 0x90 0x80 0x7f -> 0xd800 0xf07f
>
> What's wrong here is the fact that the broken sequence results in
> a valid high surrogate and the trailing 4th byte is treated as the
> broken sequence.
>
> But in fact the leading three bytes are the broken sequence. The
> current algorithm doesn't catch that, because it's already done
> and handled. So the innocent 4th byte has to take the punch.
>
> I added a patch to _sys_mbstowcs:
> - note the fact we already got a high surrogate
> - if the next underlying mbtowc call returns an error, backtrack
>   to the high surrogate in the output string and overwrite it with
>   a per-byte sequence in the private use area
> - reset mbstate
> - retry the next byte after the broken sequence
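
The steps above can be modelled roughly as follows. This is a toy Python sketch, not the actual _sys_mbstowcs patch: to_utf16(), its backtrack flag, and the helper names are invented here, and the "old" branch only imitates the d800 f07f result quoted earlier. The model assumes the converter emits the high surrogate as soon as three bytes of a 4-byte sequence have been accepted.

```python
def escape(byte):
    """Map one undecodable byte to the 0xf0xx private-use range."""
    return 0xF000 | byte

def is_cont(byte):
    return 0x80 <= byte <= 0xBF

def valid_prefix3(seq):
    """True if seq is the first three bytes of a well-formed 4-byte sequence."""
    if len(seq) != 3 or not 0xF0 <= seq[0] <= 0xF4:
        return False
    lo, hi = 0x80, 0xBF
    if seq[0] == 0xF0:
        lo = 0x90              # reject overlong forms
    elif seq[0] == 0xF4:
        hi = 0x8F              # reject code points above U+10FFFF
    return lo <= seq[1] <= hi and is_cont(seq[2])

def decode_short(chunk):
    """Code point if chunk is exactly one valid 2- or 3-byte sequence, else None."""
    try:
        text = chunk.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return ord(text) if len(text) == 1 else None

def to_utf16(data, backtrack=True):
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                                   # plain ASCII
            out.append(b)
            i += 1
        elif valid_prefix3(data[i:i + 3]):
            top = ((b & 0x07) << 8) | ((data[i + 1] & 0x3F) << 2) | ((data[i + 2] & 0x30) >> 4)
            out.append(0xD7C0 + top)                   # high surrogate, emitted early
            if i + 3 < len(data) and is_cont(data[i + 3]):
                out.append(0xDC00 | ((data[i + 2] & 0x0F) << 6) | (data[i + 3] & 0x3F))
                i += 4
            elif backtrack:
                out.pop()                              # overwrite the high surrogate
                out.extend(escape(x) for x in data[i:i + 3])
                i += 3                                 # retry the byte after the broken part
            elif i + 3 < len(data):
                out.append(escape(data[i + 3]))        # old model: 4th byte takes the punch
                i += 4
            else:
                i += 3
        else:
            for k in (2, 3):
                cp = decode_short(data[i:i + k])
                if cp is not None:
                    out.append(cp)
                    i += k
                    break
            else:
                out.append(escape(b))                  # invalid here: escape one byte, retry
                i += 1
    return out
```

With backtrack=False the model yields d800 f07f for f0 90 80 7f. With backtrack=True the three leading bytes are escaped and the 0x7f survives as an ordinary character.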
>
> As far as my testing goes, all cases with broken filenames should
> work now. The upcoming test release 3.7.0-0.261.gf21fbcaf583e
> will contain the patch.
>
> However, there's one problem left. I added a FIXME comment to
> _sys_wcstombs:
>
> FIXME? The conversion of invalid bytes from the private use area
> like we do here is not actually necessary. If we skip it, the
> generated multibyte string is not identical to the original multibyte
> string, but it's equivalent in the sense, that another mbstowcs will
> generate the same wide-char string. It would also be identical to
> the same string converted by wcstombs. And while the original
> multibyte string can't be converted by mbstowcs, this string can.
>
> What does that mean? Consider this UTF8 input string:
>
> 0xf0 0x90 0x80 0x2e
>
> mbstowcs: returns -1
> sys_mbstowcs: f0f0 f090 f080 002e
>
> Let's convert it back to multibyte:
>
> sys_wcstombs: 0xf0 0x90 0x80 0x2e
> wcstombs: 0xef 0x83 0xb0 0xef 0x82 0x90 0xef 0x82 0x80 0x2e
>
> So while sys_wcstombs has special code converting the string back to its
> original MB string, wcstombs converts to the CESU-8 representation.
>
> This is transparent. If we convert this CESU-8 string back to
> wide-char, the resulting wide-char strings are the same:
>
> mbstowcs: f0f0 f090 f080 002e
> sys_mbstowcs: f0f0 f090 f080 002e
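
The round trip can be checked with a short Python sketch. The function sys_wcstombs_model() is a hypothetical stand-in for the special case in _sys_wcstombs, and it assumes escaped bytes occupy exactly the range 0xf000 to 0xf0ff. Python's own UTF-8 encoder plays the role of plain wcstombs.

```python
def sys_wcstombs_model(wide):
    """Model of the special case: turn 0xf0xx back into the raw byte xx."""
    out = bytearray()
    for ch in wide:
        cp = ord(ch)
        if 0xF000 <= cp <= 0xF0FF:        # assumed range for escaped bytes
            out.append(cp & 0xFF)
        else:
            out += ch.encode("utf-8")
    return bytes(out)

wide = "\uf0f0\uf090\uf080."

original = sys_wcstombs_model(wide)       # the original, invalid byte string
plain = wide.encode("utf-8")              # what an ordinary conversion produces

print(original.hex(" "))
print(plain.hex(" "))
print(plain.decode("utf-8") == wide)      # the longer form decodes to the same wide string
```

The longer string is valid UTF-8, so any conforming decoder accepts it, whereas the original four bytes can only be decoded by a converter that knows the escape convention.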
>
> So the question here is, shall we keep the special case converting
> private use area bytes back to their original byte encoding?
>
> Or shall we simply go along with CESU-8 when converting back to multibyte
> to keep the string the same as with wcstombs?
There are 16 times as many code points in the supplementary planes as in the
BMP, so many non-Western and emoji characters would be expanded from 4 UTF-8
bytes to 6 CESU-8 bytes. CESU-8 is not supported anywhere as a string
representation; per the TR, it is designed for internal use only.
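
To illustrate the size difference, here is a small Python sketch. The helper cesu8() is written for this example only, since Python has no built-in CESU-8 codec. It encodes each UTF-16 code unit as if it were a BMP code point, which is how CESU-8 is defined.

```python
def cesu8(text):
    """Encode text as CESU-8: each UTF-16 code unit becomes its own 1-3 byte sequence."""
    units = text.encode("utf-16-be")      # big-endian UTF-16, no BOM
    out = bytearray()
    for k in range(0, len(units), 2):
        unit = int.from_bytes(units[k:k + 2], "big")
        # surrogatepass lets a lone surrogate be encoded like any other BMP value
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)

emoji = "\U0001F600"                      # a supplementary-plane character
print(len(emoji.encode("utf-8")), emoji.encode("utf-8").hex(" "))
print(len(cesu8(emoji)), cesu8(emoji).hex(" "))
```

BMP characters come out identical in both encodings. Only supplementary-plane characters grow, from four bytes to six.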
> Exempt from this are the characters not valid in a DOS filename.
> These will always be converted if we create wide-char filenames.
--
Take care. Thanks, Brian Inglis Calgary, Alberta, Canada
La perfection est atteinte Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter not when there is no more to add
mais lorsqu'il n'y a plus rien à retrancher but when there is no more to cut
-- Antoine de Saint-Exupéry