Subject: Re: readdir() returns inaccessible name if file was created with invalid UTF-8
To: cygwin AT cygwin DOT com
References: <96f2253b-791b-b8a0-97dd-8d257eefb9b1 AT t-online DOT de>
 <03c4fae7-7322-572c-ae72-52e300f0b438 AT t-online DOT de>
 <aFxRfI4NdZ8y5IlK AT calimero DOT vinschen DOT de>
 <f78c615c-aefe-b3d0-aada-5f9d0cf73a0a AT t-online DOT de>
 <aF5y15iQ840LxLYJ AT calimero DOT vinschen DOT de>
Message-ID: <3295c8bd-2c09-76c7-8b5f-0106dc39dd96@t-online.de>
Date: Fri, 27 Jun 2025 15:32:53 +0200
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 SeaMonkey/2.53.20
MIME-Version: 1.0
In-Reply-To: <aF5y15iQ840LxLYJ@calimero.vinschen.de>
X-BeenThere: cygwin AT cygwin DOT com
X-Mailman-Version: 2.1.30
List-Id: General Cygwin discussions and problem reports <cygwin.cygwin.com>
List-Unsubscribe: <https://cygwin.com/mailman/options/cygwin>,
 <mailto:cygwin-request AT cygwin DOT com?subject=unsubscribe>
List-Archive: <https://cygwin.com/pipermail/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-request AT cygwin DOT com?subject=help>
List-Subscribe: <https://cygwin.com/mailman/listinfo/cygwin>,
 <mailto:cygwin-request AT cygwin DOT com?subject=subscribe>
From: Christian Franke via Cygwin <cygwin AT cygwin DOT com>
Reply-To: cygwin AT cygwin DOT com
Cc: Christian Franke <Christian DOT Franke AT t-online DOT de>

Hi Corinna,

Corinna Vinschen via Cygwin wrote:
> Hi Christian,
>
> On Jun 26 19:07, Christian Franke via Cygwin wrote:
>> Corinna Vinschen via Cygwin wrote:
>>> On Jun 25 16:59, Christian Franke via Cygwin wrote:
>>>> On Sun, 15 Sep 2024 19:47:11 +0200, Christian Franke wrote:
>>>>> If a file name contains an invalid (truncated) UTF-8 sequence, open()
>>>>> does not refuse to create the file. Later readdir() returns a different
>>>>> name which could not be used to access the file.
>>>>>
>>>>> Testcase with U+1F321 (Thermometer):
>>>>>
>>>>> $ uname -r
>>>>> 3.5.4-1.x86_64
>>>>>
>>>>> $ printf $'\U0001F321' | od -A none -t x1
>>>>>  f0 9f 8c a1
>>>>>
>>>>> $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
>>>>>
>>>>> $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
>>>>>
>>>>> $ touch 'file3-'$'\xf0\x9f\x8c'
>>>>>
>>>>> $ ls -1
>>>>> ls: cannot access 'file2-.?ext': No such file or directory
>>>>> ls: cannot access 'file3-': No such file or directory
>>>>> 'file1-'$'\360\237\214\241''.ext'
>>>>> file2-.?ext
>>>>> file3-
>>>>> [...]
>>> I don't know exactly where this happens, but the input of the
>>> conversion is invalid UTF-8 because it's missing the 4th byte.
>>> There's no way to represent these filenames on Windows
>>> filesystems storing filenames as UTF-16 values.
>>>
>>> So the problem here is that the conversion somehow misses that
>>> the 4th byte is invalid and just plods forward and converts the
>>> leading three bytes into the matching high surrogate value and
>>> then stumbles over the conversion for the low surrogate.
>>>
>>> It would be really helpful to have an STC for this problem.
>> With some trial and error I found a testcase for this more serious problem
>> reported yesterday but not quoted above:
>>
>>>> In cases like file3-... above, the converted Windows path ends with
>>>> 0xF000. This suggests that this is an accidental conversion of the
>>>> terminating null to the 0xF0xx range.
>>>>
>>>> In some cases, the created Windows file name has random garbage
>>>> behind the 0xF000. Then even Cygwin is not able to access or unlink
>>>> the file after creation.
>> Testcase (attached):
> Thanks for the testcase!
>
> I found the problem in the newlib core function creating wchar_t from
> UTF-8 input. In case of 4 byte UTF-8 sequences, the code created the
> low surrogate already after reading byte 3, without checking if byte 4
> of the UTF-8 sequence is a valid byte. Hilarity ensues.
>
> Fortunately this bug has only been introduced very recently, to wit, on
> 2009-03-24, a mere 16 years ago. And it is my bug and mine alone :}
>
> I'm just prep'ing a fix which I'll push in a minute or two.

This fixes the problem demonstrated by the testcase, thanks.

The original problem reported last year in the very first post of this
thread still persists:

Example:

$ uname -r
3.7.0-dev-163-g5c8475417bc3.x86_64

$ mkdir test.tmp

$ cd test.tmp

$ touch $'t-\xef\x80\x80'

$ ls
ls: cannot access 't-': No such file or directory
t-

$ touch t-

$ ls -1
t-
t-

$ rm t-

$ ls
ls: cannot access 't-': No such file or directory
t-

$ cd ..

$ rm -rf test.tmp
rm: cannot remove 'test.tmp': Directory not empty

$ rm test.tmp/$'t-\xef\x80\x80'

$ rmdir test.tmp

The name mapping is:

"t-\xEF\x80\x80" -(open, ...)-> L"t-\xDB59" -(readdir)-> "t-"

Possibly difficult to fix except if creation of such files is rejected.

-- 
Thanks,
Christian

-- 
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
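
Illustration (not part of the original message): the decoder bug quoted above comes down to the fact that the high surrogate of a 4-byte UTF-8 sequence is already fully determined by its first three bytes, so a converter can emit it before it has looked at byte 4. The sketch below is not the newlib code. The function names `surrogates_from_utf8` and `high_surrogate_after_three_bytes` are hypothetical and only show the arithmetic, along with a strict variant that rejects the truncated sequences from the testcase.

```python
def surrogates_from_utf8(seq: bytes) -> tuple[int, int]:
    """Strictly convert one 4-byte UTF-8 sequence to a UTF-16 surrogate pair.

    Hypothetical helper for illustration; raises ValueError instead of
    producing a partial result when any byte is missing or invalid.
    """
    if not seq or not 0xF0 <= seq[0] <= 0xF4:
        raise ValueError("not the lead byte of a 4-byte UTF-8 sequence")
    if len(seq) != 4 or any(not 0x80 <= b <= 0xBF for b in seq[1:]):
        raise ValueError("missing or invalid continuation byte")
    cp = ((seq[0] & 0x07) << 18 | (seq[1] & 0x3F) << 12
          | (seq[2] & 0x3F) << 6 | (seq[3] & 0x3F))
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("overlong encoding or code point out of range")
    cp -= 0x10000
    return 0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)


def high_surrogate_after_three_bytes(seq: bytes) -> int:
    """Compute the high surrogate from the first three bytes only.

    This is the step that becomes unsafe if its result is emitted before
    byte 4 has been validated (assumed shape of the bug, for illustration).
    """
    top = (seq[0] & 0x07) << 8 | (seq[1] & 0x3F) << 2 | (seq[2] & 0x3F) >> 4
    return 0xD800 | (top - 0x40)


if __name__ == "__main__":
    # U+1F321 (Thermometer), complete sequence from the testcase.
    print([hex(u) for u in surrogates_from_utf8(b"\xf0\x9f\x8c\xa1")])
    # Truncated sequence as in file3-: three bytes already yield a surrogate.
    print(hex(high_surrogate_after_three_bytes(b"\xf0\x9f\x8c")))
    # Truncated sequence followed by '.', as in file2-: strict variant refuses.
    try:
        surrogates_from_utf8(b"\xf0\x9f\x8c.")
    except ValueError as exc:
        print("rejected:", exc)
```

For the complete sequence f0 9f 8c a1 the strict function returns the pair 0xD83C, 0xDF21, and the three-byte helper yields the same 0xD83C for the truncated input. A converter that emits at that point has produced half of a pair that can never be completed, which is why rejecting the input before emitting anything is the safer ordering.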