Date: Wed, 25 Jun 2025 21:43:56 +0200
To: cygwin@cygwin.com
Subject: Re: readdir() returns inaccessible name if file was created with
 invalid UTF-8
Message-ID: <aFxRfI4NdZ8y5IlK@calimero.vinschen.de>
Mail-Followup-To: cygwin@cygwin.com
References: <96f2253b-791b-b8a0-97dd-8d257eefb9b1@t-online.de>
 <03c4fae7-7322-572c-ae72-52e300f0b438@t-online.de>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <03c4fae7-7322-572c-ae72-52e300f0b438@t-online.de>
From: Corinna Vinschen via Cygwin <cygwin@cygwin.com>
Reply-To: cygwin@cygwin.com
Cc: Corinna Vinschen <corinna-cygwin@cygwin.com>
Content-Type: text/plain; charset="utf-8"

On Jun 25 16:59, Christian Franke via Cygwin wrote:
> On Sun, 15 Sep 2024 19:47:11 +0200, Christian Franke wrote:
> > If a file name contains an invalid (truncated) UTF-8 sequence, open()
> > does not refuse to create the file. Later readdir() returns a different
> > name which could not be used to access the file.
> > 
> > Testcase with U+1F321 (Thermometer):
> > 
> > $ uname -r
> > 3.5.4-1.x86_64
> > 
> > $ printf $'\U0001F321' | od -A none -t x1
> >  f0 9f 8c a1
> > 
> > $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
> > 
> > $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
> > 
> > $ touch 'file3-'$'\xf0\x9f\x8c'
> > 
> > $ ls -1
> > ls: cannot access 'file2-.?ext': No such file or directory
> > ls: cannot access 'file3-': No such file or directory
> > 'file1-'$'\360\237\214\241''.ext'
> > file2-.?ext
> > file3-
> > 
> > 
> > Name mapping according to "fhandler_disk_file::readdir" strace lines:
> > 
> > "file1-\xF0\x9F\x8C\xA1.ext" -(open)-> L"file1-\xD83C\xDF21.ext"
> > -(readdir)->
> > "file1-\xF0\x9F\x8C\xA1.ext"
> > 
> > "file2-\xF0\x9f\x8C.ext" -(open)-> L"file2-\xD83C\xF02Eext" -(readdir)->
> > "file2-.\xE1\x9E\xB3ext"
> > 
> > "file3-\xF0\x9F\x8C" -(open)-> L"file3-\xD83C\xF000" -(readdir)->
> > "file3-"

I don't know exactly where this happens, but the input to the
conversion is invalid UTF-8 because it's missing the 4th byte.
There's no way to represent these filenames on Windows
filesystems, which store filenames as UTF-16 values.
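
As a rough illustration of the point above (a sketch in Python, not
Cygwin code): a strict UTF-8 decoder rejects the truncated names from
the report, and a lone high surrogate cannot be encoded as well-formed
UTF-16. The helper names is_valid_utf8 and is_encodable_utf16 are made
up for this example.

```python
# U+1F321 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
complete = b"\xf0\x9f\x8c\xa1"
utf16_units = complete.decode("utf-8").encode("utf-16-be")  # D83C DF21


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes under Python's strict UTF-8 codec."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def is_encodable_utf16(text: str) -> bool:
    """Return True if text can be written as well-formed UTF-16."""
    try:
        text.encode("utf-16-le")
        return True
    except UnicodeEncodeError:
        # Python refuses lone surrogates such as U+D83C on its own.
        return False
```

With these helpers, is_valid_utf8(b"file2-\xf0\x9f\x8c.ext") is False,
while the complete file1 name passes.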

So the problem here is that the conversion somehow fails to notice
that the 4th byte is missing or invalid. It just plods forward,
converts the leading three bytes into the matching high surrogate
value, and then stumbles over the conversion for the low surrogate.
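
The arithmetic behind this can be sketched as follows. This is an
illustration in Python under the standard UTF-8 and UTF-16 bit layouts,
not the actual Cygwin conversion routine; both function names are
hypothetical. It shows why the first three bytes already determine the
high surrogate 0xD83C seen in the strace output, and what a strict
check of the 4th byte would look like.

```python
def high_surrogate_from_prefix(prefix: bytes) -> int:
    """High surrogate computed from only the first three bytes of a
    4-byte UTF-8 sequence. The 4th byte contributes code point bits
    0-5 only, and those never reach the high surrogate."""
    partial = (((prefix[0] & 0x07) << 18)
               | ((prefix[1] & 0x3F) << 12)
               | ((prefix[2] & 0x3F) << 6))
    return 0xD800 | ((partial - 0x10000) >> 10)


def utf8_4byte_to_surrogates(seq: bytes) -> tuple:
    """Strictly convert one 4-byte UTF-8 sequence to a surrogate pair,
    refusing input whose trailing bytes are missing or invalid."""
    if len(seq) != 4 or (seq[0] & 0xF8) != 0xF0:
        raise ValueError("not a complete 4-byte UTF-8 sequence")
    if any((b & 0xC0) != 0x80 for b in seq[1:]):
        raise ValueError("invalid continuation byte")
    cp = (((seq[0] & 0x07) << 18) | ((seq[1] & 0x3F) << 12)
          | ((seq[2] & 0x3F) << 6) | (seq[3] & 0x3F))
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("overlong or out-of-range code point")
    v = cp - 0x10000
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)
```

For the bytes f0 9f 8c a1 the strict function yields (0xD83C, 0xDF21),
matching the file1 mapping in the report. For f0 9f 8c followed by '.'
it raises instead of emitting a high surrogate with nothing valid to
follow it.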

It would be really helpful to have an STC (simple test case) for
this problem.
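
One possible shape for such a test case, sketched in Python rather
than C and not verified on Cygwin: create the names from the report
using byte strings, list the directory, and collect every listed name
that lstat() cannot find again. The function name find_inaccessible
and the use of a temporary directory are assumptions made for this
example.

```python
import os
import shutil
import tempfile

# Names from the report; the last two contain a truncated UTF-8 sequence.
REPORT_NAMES = [
    b"file1-\xf0\x9f\x8c\xa1.ext",
    b"file2-\xf0\x9f\x8c.ext",
    b"file3-\xf0\x9f\x8c",
]


def find_inaccessible(names):
    """Create each name in a scratch directory, list the directory, and
    return the listed names that lstat() reports as missing."""
    broken = []
    tmp = tempfile.mkdtemp()
    try:
        base = os.fsencode(tmp)
        for name in names:
            fd = os.open(os.path.join(base, name),
                         os.O_CREAT | os.O_WRONLY, 0o644)
            os.close(fd)
        # A bytes argument makes listdir() return raw bytes names.
        for entry in os.listdir(base):
            try:
                os.lstat(os.path.join(base, entry))
            except FileNotFoundError:
                broken.append(entry)
    finally:
        # Cleanup may itself fail for inaccessible names; ignore that.
        shutil.rmtree(tmp, ignore_errors=True)
    return broken
```

Calling find_inaccessible(REPORT_NAMES) should return an empty list on
a system where the names round-trip. Going by the report, the expected
result on an affected Cygwin version would be the altered file2 and
file3 names.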


Thanks,
Corinna

-- 
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple

