References: <96f2253b-791b-b8a0-97dd-8d257eefb9b1@t-online.de>
 <bc8bd61c-818e-424f-bb42-52f4fecd4849@towo.net>
 <b6ab074b-919e-4514-8276-72a30c36ab58@towo.net>
 <de4767e2-85b7-ead2-df9a-64e1f24f4e8f@t-online.de>
 <6451a249-adcd-9c56-b76e-1b00886cea80@t-online.de>
 <CAN0SSYx+g4JE6AA6krNAzG6QXrve52TBv0d3VM0SODV-tzZQSQ@mail.gmail.com>
 <66051d82-e2c3-684f-d13f-d1301170b0d4@t-online.de>
 <984103a4-ab2d-4337-9964-cc1e3208155d@SystematicSW.ab.ca>
In-Reply-To: <984103a4-ab2d-4337-9964-cc1e3208155d@SystematicSW.ab.ca>
Date: Thu, 19 Sep 2024 16:59:42 +0200
Message-ID: <CALXu0UcbKcAVKQ9uopn3rV28OruR4yG=kyh_ati9a_stR2GSrw@mail.gmail.com>
Subject: Re: readdir() returns inaccessible name if file was created with
 invalid UTF-8
To: cygwin@cygwin.com
From: Cedric Blancher via Cygwin <cygwin@cygwin.com>
Reply-To: Cedric Blancher <cedric.blancher@gmail.com>

On Thu, 19 Sept 2024 at 16:46, Brian Inglis via Cygwin
<cygwin@cygwin.com> wrote:
>
> On 2024-09-19 07:27, Christian Franke via Cygwin wrote:
> > Mark Liam Brown via Cygwin wrote:
> >> On Mon, Sep 16, 2024 at 11:51 AM Christian Franke via Cygwin
> >> <cygwin@cygwin.com> wrote:
> >>> Christian Franke via Cygwin wrote:
> >>>> Thomas Wolff via Cygwin wrote:
> >>>>> Am 15.09.2024 um 20:15 schrieb Thomas Wolff via Cygwin:
> >>>>>> Am 15.09.2024 um 19:47 schrieb Christian Franke via Cygwin:
> >>>>>>> If a file name contains an invalid (truncated) UTF-8 sequence, open()
> >>>>>>> does not refuse to create the file. Later readdir() returns a
> >>>>>>> different name which could not be used to access the file.
> >>>>>>>
> >>>>>>> Testcase with U+1F321 (Thermometer):
> >>>>>>>
> >>>>>>> $ uname -r
> >>>>>>> 3.5.4-1.x86_64
> >>>>>>>
> >>>>>>> $ printf $'\U0001F321' | od -A none -t x1
> >>>>>>>   f0 9f 8c a1
> >>>>>>>
> >>>>>>> $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
> >>>>>>>
> >>>>>>> $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
> >>>>>>>
> >>>>>>> $ touch 'file3-'$'\xf0\x9f\x8c'
> >>>>>>>
> >>>>>>> $ ls -1
> >>>>>>> ls: cannot access 'file2-.?ext': No such file or directory
> >>>>>>> ls: cannot access 'file3-': No such file or directory
> >>>>>>> 'file1-'$'\360\237\214\241''.ext'
> >>>>>>> file2-.?ext
> >>>>>>> file3-
> >>>>>> I don't reproduce this.
> >>>> Yes, sorry, the above 'ls' was actually aliased to 'ls --color=auto'
> >>>> which needs to call stat(). Plain 'ls' does not, so the errors do not
> >>>> occur then.
> >>>>
> >>>>
> >>>>>> While the file name gets mangled, all resulting file names are valid
> >>>>>> and
> >>>>>> listed:
> >>>>>> In file2 the sequence is turned into U+17B3 but exchanged with the dot.
> >>>>>> In file3 the same sequence is just dropped.
> >>>>>> $ ls -1|cat
> >>>>>> file1-🌡.ext
> >>>>>> file2-.ឳext
> >>>>>> file3-
> >>>>>>
> >>>>>> However, ls file2* fails, as does ls *.
> >>>>> On the other hand, ls file3- fails too, so some mapping error occurs
> >>>>> internally.
> >>>>> Also, the files cannot be deleted from cygwin (need to use cmd).
> >>>> 'rm' using the original names works for file2-..., but not for file3-...
> >>>>
> >>>> $ rm -v 'file2-'$'\xf0\x9f\x8c''.ext'
> >>>> removed 'file2-'$'\360\237\214''.ext'
> >>>>
> >>>> $ rm -v 'file3-'$'\xf0\x9f\x8c'
> >>>> rm: cannot remove 'file3-'$'\360\237\214': No such file or directory
> >>>>
> >>> Further tests suggest that the problem only occurs with:
> >>> - incomplete 4 byte UTF-8 sequences (Unicode above 16 bit)
> >>> - complete but invalid 3 byte UTF-8 sequences which encode the UTF-16
> >>> 'high surrogate' range (0xD800..0xDBFF).
> >> Makes perfect sense, the Windows kernel uses UTF16 internally.
> >
> >
> > Yes, but Cygwin does not provide consistent forward/reverse UTF-8 <-> UTF-16
> > mappings. This makes no sense:
> >
> > $ touch 'file-'$'\xed\xa0\x80''.ext'  # creates L"file-\xD800.ext" on NTFS
> >
> > $ strace ls -F
> > ...
> > ... fhandler_disk_file::readdir: 0 = readdir(...) (L"file-\xD800.ext" >
> > "file-\xE2\x9E\xB3.ext")
> > ...
> >   ... stat_worker: -1 = (\??\C:\cygwin64\tmp\file-?.ext,...)
> > ...
> > ls: cannot access 'file-?.ext': No such file or directory
> > file-?.ext
> >
> > $ rm -v 'file-'$'\xed\xa0\x80''.ext'
> > removed 'file-'$'\355\240\200''.ext'
> >
> > The UTF-8 sequence returned by readdir() decodes to U+27B3 (White-Feathered
> > Rightwards Arrow).
> >
> >
> > This could be fixed by handling UTF-8 of the surrogate range similar to other
> > invalid sequences: Map each invalid byte to Unicode range U+F080 to U+F0FF. This
> > works as expected if the above UTF-8 sequence is truncated:
> >
> > $ touch 'file-'$'\xed\xa0''.ext' # creates L"file-\xF0ED\xF0A0.ext" on NTFS
> >
> > $ ls -F
> > 'file-'$'\355\240''.ext'
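
To make the proposed mapping concrete, here is a minimal Python sketch of
the idea, assuming that each undecodable byte b becomes the code point
U+F000 + b, which is what the L"file-\xF0ED\xF0A0.ext" example above
suggests. The handler name "pua_bytes" and both helper functions are made
up for illustration; this is not Cygwin's actual conversion code.

```python
import codecs


def _pua_handler(exc):
    # Hypothetical error handler: map each undecodable byte b to U+F000 + b.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(0xF000 + b) for b in bad), exc.end


# "pua_bytes" is an illustrative name, not an existing codec error handler.
codecs.register_error("pua_bytes", _pua_handler)


def bytes_to_name(raw: bytes) -> str:
    """Strict UTF-8 decode, except invalid bytes land in U+F000..U+F0FF."""
    return raw.decode("utf-8", errors="pua_bytes")


def name_to_bytes(name: str) -> bytes:
    """Reverse mapping: U+F0xx turns back into the single byte 0xxx."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if 0xF000 <= cp <= 0xF0FF:
            out.append(cp & 0xFF)
        else:
            out += ch.encode("utf-8")
    return bytes(out)
```

With this sketch the truncated sequence and the surrogate sequence
\xed\xa0\x80 are both treated byte by byte, so the name round-trips to the
original bytes. One caveat of the approach: a name that genuinely contains
a character in U+F000..U+F0FF would be ambiguous on the way back.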
>
> Surrogate halves are invalid for UTF-8 encoding; they should first be
> encoded as a valid UTF-16 code point.
> The encoder should just fail if it encounters any invalid sequence!
> Handling surrogates or other invalid values as anything other than invalid turns
> the encoding into what has been called WTF-8 where W may be for Windows! ;^>
>
Nope, WTF-8 means "What the F*ck-8"!
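
For what it's worth, the difference between a strict encoder and a
WTF-8-style one is easy to see with the byte sequences from this thread.
A rough Python sketch, where the helper name decodes() is made up and
Python's "surrogatepass" error handler stands in for WTF-8-like behaviour
(it is not a full WTF-8 implementation):

```python
def decodes(raw: bytes, errors: str = "strict") -> bool:
    """Return True if raw is accepted as UTF-8 under the given error mode."""
    try:
        raw.decode("utf-8", errors)
        return True
    except UnicodeDecodeError:
        return False


THERMOMETER = b"\xf0\x9f\x8c\xa1"   # complete U+1F321
TRUNCATED = b"\xf0\x9f\x8c"         # same sequence, last byte missing
HIGH_SURROGATE = b"\xed\xa0\x80"    # 3-byte form of U+D800

# A strict decoder accepts only the complete sequence.
strict_results = [decodes(b) for b in (THERMOMETER, TRUNCATED, HIGH_SURROGATE)]

# "surrogatepass" lets the lone surrogate through but still rejects
# the truncated 4-byte sequence.
lenient_results = [
    decodes(b, "surrogatepass")
    for b in (THERMOMETER, TRUNCATED, HIGH_SURROGATE)
]
```

So "just fail on anything invalid" and "pass surrogates through" really
are two different policies, and the truncated 4-byte case is invalid under
both.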

Ced
-- 
Cedric Blancher <cedric.blancher@gmail.com>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur


