Date: Thu, 24 Jul 2025 15:41:10 +0200
Subject: Re: readdir() returns inaccessible name if file was created with invalid UTF-8
From: Thomas Wolff via Cygwin
Reply-To: Thomas Wolff
To: cygwin AT cygwin DOT com
Cc: Christian Franke, Corinna Vinschen
List-Id: General Cygwin discussions and problem reports

On 24.07.2025 at 12:30, Corinna Vinschen wrote:
> Hi Thomas, hi Christian,
>
> On Jul 23 17:50, Thomas Wolff via Cygwin wrote:
>> On 23.07.2025 at 09:53, Corinna Vinschen via Cygwin wrote:
>>> On Jul 23 05:44, Thomas Wolff via Cygwin wrote:
>>> What bugs me is that we have the choice between a broken mbrtowc on
>>> one side and a chance to generate broken filenames on the other side.
>> I did not look into those details, but while characters to be handled by a
>> terminal come sequentially as a stream, filenames can be handled as a
>> compound string, isn't that easier to check?
>>
>>> I think we should actually revert fa272e05bbd0 ("wcstombs: also call
>>> __WCTOMB on terminating NUL if output buffer is NULL") and see if we can
>>> fix the filename issue in the Cygwin functions for filename conversion
>>> alone.
>>>
>>> Any ideas appreciated.
> I think I have a fix.  I reverted fa272e05bbd0 so mbrtowc is operating
> as before.  This should fix mintty.
>
> As for the filename problem, I had another look into the _sys_wcstombs
> and _sys_mbstowcs functions.
>
> It occurred to me that the algorithm for handling an invalid MB sequence
> is upside down when it comes to invalid UTF8 4 byte sequences.
>
> Consider a simple broken 2-byte UTF8 sequence like 0xc2 0x7f.  This
> sequence is converted to a byte sequence in the private use area like this:
>
>   0xc2 0x7f -> 0xf0c2 0x007f
>
> So the first byte of the sequence is wrong, so it's converted to 0xf0xx.
> At this point, we reset the mbstate and try the mbtowc conversion again
> with byte 2.  Byte 2 is now a valid single byte.  Hence 0xf0c2 0x007f.
> Also
>
>   0xc2 0xff -> 0xf0c2 0xf0ff
>
> because 0xc2 0xff is not valid and 0xff is not a valid lead byte.
>
> Now consider a broken 3 byte sequence.  Same as above:
>
>   0xe0 0xa0 0x7f -> 0xf0e0 0xf0a0 0x7f
>
> Now the 4 byte sequence with a broken 4th byte:
>
>   0xf0 0x90 0x80 0x7f -> 0xd800 0xf07f
>
> What's wrong here is the fact that the broken sequence results in
> a valid high surrogate and the trailing 4th byte is treated as the
> broken sequence.
>
> But in fact the leading three bytes are the broken sequence.  The
> current algorithm doesn't catch that, because it's already done
> and handled.  So the innocent 4th byte has to take the punch.
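
Editorial illustration (not part of the original message): the quoted 0xd800 result can be checked with a little arithmetic. The sketch below assumes a converter that produces UTF-16 surrogate pairs for 4-byte UTF-8 sequences, as the 0xd800 in the example suggests. The helper name high_surrogate_from_prefix is hypothetical and is not a Cygwin or newlib function. It shows that the high surrogate depends only on the first three bytes, which is why it can already be written out before the 4th byte has been validated.

```python
def high_surrogate_from_prefix(b0: int, b1: int, b2: int) -> int:
    """UTF-16 high surrogate implied by the first three bytes of a
    4-byte UTF-8 sequence (hypothetical helper, not Cygwin code)."""
    # Payload bits of bytes 1..3; this equals the code point shifted right by 6.
    partial = ((b0 & 0x07) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
    # high = 0xd800 + ((cp - 0x10000) >> 10).  Byte 4 only supplies the low
    # 6 bits of cp, which are shifted out, so the result is already fixed.
    return 0xD800 + ((partial - 0x400) >> 4)


# The prefix from the example above: 0xf0 0x90 0x80
print(hex(high_surrogate_from_prefix(0xF0, 0x90, 0x80)))  # 0xd800
```

Whatever the 4th byte turns out to be, the value computed here does not change, so a converter working this way has committed to the surrogate before it can know that the sequence is broken.
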
>
> I added a patch to _sys_mbstowcs:
> - note the fact we already got a high surrogate
> - if the next underlying mbtowc call returns an error, backtrack
>   to the high surrogate in the output string and overwrite it with
>   a per-byte sequence in the private use area
> - reset mbstate
> - retry the next byte after the broken sequence
>
> As far as my testing goes, all cases with broken filenames should
> work now.  The upcoming test release 3.7.0-0.261.gf21fbcaf583e
> will contain the patch.
>
> However, there's one problem left.  I added a FIXME comment to
> _sys_wcstombs:
>
>   FIXME? The conversion of invalid bytes from the private use area
>   like we do here is not actually necessary.  If we skip it, the
>   generated multibyte string is not identical to the original multibyte
>   string, but it's equivalent in the sense that another mbstowcs will
>   generate the same wide-char string.  It would also be identical to
>   the same string converted by wcstombs.  And while the original
>   multibyte string can't be converted by mbstowcs, this string can.
>
> What does that mean?  Consider this UTF8 input string:
>
>   0xf0 0x90 0x80 0x2e
>
>   mbstowcs:     returns -1
>   sys_mbstowcs: f0f0 f090 f080 002e
>
> Let's convert it back to multibyte:
>
>   sys_wcstombs: 0xf0 0x90 0x80 0x2e
>   wcstombs:     0xef 0x83 0xb0 0xef 0x82 0x90 0xef 0x82 0x80 0x2e
>
> So while sys_wcstombs has special code converting the string back to its
> original MB string, wcstombs converts to the CESU-8 representation.
>
> This is transparent.  If we convert this CESU-8 string back to
> wide-char, the resulting wide-char strings are the same:
>
>   mbstowcs:     f0f0 f090 f080 002e
>   sys_mbstowcs: f0f0 f090 f080 002e
>
> So the question here is, shall we keep the special case converting
> private use area bytes back to their original byte encoding?
>
> Or shall we simply go along with CESU-8 when converting back to multibyte
> to keep the string the same as with wcstombs?
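
Editorial illustration (not part of the original message): the round trip discussed above can be modelled in a few lines of Python. This is a minimal sketch under stated assumptions, not Cygwin's implementation. Python's strict UTF-8 decoder stands in for mbtowc, a custom error handler maps each byte of an ill-formed span to 0xf000 + byte, and the names sys_mbstowcs_like, sys_wcstombs_like and wcstombs_like are hypothetical. The sketch only reverses U+F080..U+F0FF; that range is an assumption made here, and the separate handling of characters invalid in DOS filenames is left out.

```python
import codecs

PUA_BASE = 0xF000  # the message shows an invalid byte 0xNN becoming 0xf0NN


def _bytes_to_pua(exc):
    """Error handler: map every byte of the ill-formed span to U+F0xx."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Resume decoding right after the broken span, so the next byte is
    # judged on its own merits.
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("pua-bytes", _bytes_to_pua)


def sys_mbstowcs_like(raw: bytes) -> str:
    """Decode UTF-8, keeping invalid bytes as private use code points."""
    return raw.decode("utf-8", errors="pua-bytes")


def wcstombs_like(text: str) -> bytes:
    """Plain conversion: private use code points are encoded like any other."""
    return text.encode("utf-8")


def sys_wcstombs_like(text: str) -> bytes:
    """Special case: turn U+F080..U+F0FF back into the original raw byte."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xF080 <= cp <= 0xF0FF:  # assumed range for this sketch
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


raw = bytes([0xF0, 0x90, 0x80, 0x2E])
wide = sys_mbstowcs_like(raw)
print([format(ord(c), "04x") for c in wide])
print(sys_wcstombs_like(wide).hex(" "))
print(wcstombs_like(wide).hex(" "))
```

With the special case, the original bytes come back unchanged. Without it, the result is a longer but decodable string that converts to the same wide-char string again, which is the trade-off the question above is about.
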
>
> Exempt from this are the characters not valid in a DOS filename.
> These will always be converted if we create wide-char filenames.
Sounds like a fair solution with only minor glitches. Poor 4th byte but
thanks a lot anyway.
About the latter decision, if there's no strong bias otherwise, I'd
prefer to drop special handling (but don't take my vote, I don't care so
much about that).
Thomas

> Thanks,
> Corinna

-- 
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple