From: Cedric Blancher via Cygwin
Reply-To: Cedric Blancher
To: cygwin AT cygwin DOT com
Date: Thu, 19 Sep 2024 16:59:42 +0200
Subject: Re: readdir() returns inaccessible name if file was created with invalid UTF-8
List-Id: General Cygwin discussions and problem reports

On Thu, 19 Sept 2024 at 16:46, Brian Inglis via Cygwin wrote:
>
> On 2024-09-19 07:27, Christian Franke via Cygwin wrote:
> > Mark Liam Brown via Cygwin wrote:
> >> On Mon, Sep 16, 2024 at 11:51 AM Christian Franke via Cygwin wrote:
> >>> Christian Franke via Cygwin wrote:
> >>>> Thomas Wolff via Cygwin wrote:
> >>>>> On 15.09.2024 at 20:15, Thomas Wolff via Cygwin wrote:
> >>>>>> On 15.09.2024 at 19:47, Christian Franke via Cygwin wrote:
> >>>>>>> If a file name contains an invalid (truncated) UTF-8 sequence, open()
> >>>>>>> does not refuse to create the file. Later readdir() returns a
> >>>>>>> different name which could not be used to access the file.
> >>>>>>>
> >>>>>>> Testcase with U+1F321 (Thermometer):
> >>>>>>>
> >>>>>>> $ uname -r
> >>>>>>> 3.5.4-1.x86_64
> >>>>>>>
> >>>>>>> $ printf $'\U0001F321' | od -A none -t x1
> >>>>>>> f0 9f 8c a1
> >>>>>>>
> >>>>>>> $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
> >>>>>>>
> >>>>>>> $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
> >>>>>>>
> >>>>>>> $ touch 'file3-'$'\xf0\x9f\x8c'
> >>>>>>>
> >>>>>>> $ ls -1
> >>>>>>> ls: cannot access 'file2-.?ext': No such file or directory
> >>>>>>> ls: cannot access 'file3-': No such file or directory
> >>>>>>> 'file1-'$'\360\237\214\241''.ext'
> >>>>>>> file2-.?ext
> >>>>>>> file3-
> >>>>>> I don't reproduce this.
> >>>> Yes, sorry, the above 'ls' was actually aliased to 'ls --color=auto'
> >>>> which needs to call stat(). Plain 'ls' does not, so the errors do not
> >>>> occur then.
> >>>>
> >>>>
> >>>>>> While the file name gets mangled, all resulting file names are valid
> >>>>>> and listed:
> >>>>>> In file2 the sequence is turned into U+17B3 but exchanged with the dot.
> >>>>>> In file3 the same sequence is just dropped.
> >>>>>> $ ls -1|cat
> >>>>>> file1-🌡.ext
> >>>>>> file2-.ឳext
> >>>>>> file3-
> >>>>>>
> >>>>>> However, ls file2* fails, as does ls *.
> >>>>> On the other hand, ls file3- fails too, so some mapping error occurs
> >>>>> internally.
> >>>>> Also, the files cannot be deleted from cygwin (need to use cmd).
> >>>> 'rm' using the original names works for file2-..., but not for file3-...
> >>>>
> >>>> $ rm -v 'file2-'$'\xf0\x9f\x8c''.ext'
> >>>> removed 'file2-'$'\360\237\214''.ext'
> >>>>
> >>>> $ rm -v 'file3-'$'\xf0\x9f\x8c'
> >>>> rm: cannot remove 'file3-'$'\360\237\214': No such file or directory
> >>>>
> >>> Further tests suggest that the problem only occurs with:
> >>> - incomplete 4 byte UTF-8 sequences (Unicode above 16 bit)
> >>> - complete but invalid 3 byte UTF-8 sequences which encode the UTF-16
> >>>   'high surrogate' range (0xD800..0xDBFF).
> >> Makes perfect sense, the Windows kernel uses UTF-16 internally.
> >
> >
> > Yes, but Cygwin does not provide consistent forward/reverse UTF-8 <-> UTF-16
> > mappings. This makes no sense:
> >
> > $ touch 'file-'$'\xed\xa0\x80''.ext' # creates L"file-\xD800.ext" on NTFS
> >
> > $ strace ls -F
> > ...
> > ... fhandler_disk_file::readdir: 0 = readdir(...) (L"file-\xD800.ext" > "file-\xE2\x9E\xB3.ext")
> > ...
> > ... stat_worker: -1 = (\??\C:\cygwin64\tmp\file-?.ext,...)
> > ...
> > ls: cannot access 'file-?.ext': No such file or directory
> > file-?.ext
> >
> > $ rm -v 'file-'$'\xed\xa0\x80''.ext'
> > removed 'file-'$'\355\240\200''.ext'
> >
> > The UTF-8 sequence returned by readdir() decodes to U+27B3 (White-Feathered
> > Rightwards Arrow).
> >
> >
> > This could be fixed by handling UTF-8 of the surrogate range similar to other
> > invalid sequences: Map each invalid byte to Unicode range U+F080 to U+F0FF. This
> > works as expected if the above UTF-8 sequence is truncated:
> >
> > $ touch 'file-'$'\xed\xa0''.ext' # creates L"file-\xF0ED\xF0A0.ext" on NTFS
> >
> > $ ls -F
> > 'file-'$'\355\240''.ext'
>
> Surrogate halves are invalid for UTF-8 encoding; they should first be
> encoded as a valid UTF-16 code point.
> The encoder should just fail if it encounters any invalid sequence!
> Handling surrogates or other invalid values as anything other than invalid turns
> the encoding into what has been called WTF-8 where W may be for Windows! ;^>
>

Nope, the WTF-8 means "What the F*ck-8"!

Ced
-- 
Cedric Blancher [https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur
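
For anyone who wants to poke at the byte sequences from this thread without a Cygwin box, here is a minimal Python sketch. It is not Cygwin's conversion code. It only uses Python's own UTF-8 codec to show which sequences a strict decoder rejects, and then models the "map each invalid byte into the private use area" idea on the L"file-\xF0ED\xF0A0.ext" example quoted above. The helper names (strict_decode, decode_with_byte_mapping, encode_with_byte_mapping) and the error handler name "pua_bytes" are made up for illustration, and the U+F000 + byte mapping is an assumption taken from that one example.

```python
import codecs


def strict_decode(raw: bytes):
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Complete 4-byte sequence for U+1F321 (Thermometer): valid.
assert strict_decode(b"\xf0\x9f\x8c\xa1") == "\U0001F321"
# Truncated 4-byte sequence (the file2/file3 test case): rejected.
assert strict_decode(b"\xf0\x9f\x8c") is None
# 3-byte encoding of the high surrogate U+D800: rejected by a strict decoder.
assert strict_decode(b"\xed\xa0\x80") is None
# A lenient decoder (Python's "surrogatepass") lets the lone surrogate through.
assert b"\xed\xa0\x80".decode("utf-8", "surrogatepass") == "\ud800"
# The bytes readdir() returned in the strace output decode to U+27B3.
assert strict_decode(b"\xe2\x9e\xb3") == "\u27b3"


def _pua_bytes(exc):
    """Hypothetical error handler: map each undecodable byte to U+F000 + byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(0xF000 + b) for b in bad), exc.end


codecs.register_error("pua_bytes", _pua_bytes)


def decode_with_byte_mapping(raw: bytes) -> str:
    """Decode UTF-8, mapping invalid bytes to U+F080..U+F0FF (assumed scheme)."""
    return raw.decode("utf-8", "pua_bytes")


def encode_with_byte_mapping(name: str) -> bytes:
    """Reverse mapping: U+F080..U+F0FF become the original raw bytes again."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if 0xF080 <= cp <= 0xF0FF:
            out.append(cp - 0xF000)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


# Treating the encoded surrogate like any other invalid sequence would make
# the name round-trip, which is the consistency the thread asks for.
for raw in (b"file-\xed\xa0.ext", b"file-\xed\xa0\x80.ext", b"file3-\xf0\x9f\x8c"):
    assert encode_with_byte_mapping(decode_with_byte_mapping(raw)) == raw
```

The round trip only holds because every rejected byte gets its own code unit. A name that legitimately contains characters in U+F080..U+F0FF would be ambiguous under this scheme, so treat it as a sketch of the idea rather than a drop-in fix.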