Mail Archives: cygwin/2024/09/19/11:00:49

References: <96f2253b-791b-b8a0-97dd-8d257eefb9b1 AT t-online DOT de>
<bc8bd61c-818e-424f-bb42-52f4fecd4849 AT towo DOT net>
<b6ab074b-919e-4514-8276-72a30c36ab58 AT towo DOT net>
<de4767e2-85b7-ead2-df9a-64e1f24f4e8f AT t-online DOT de>
<6451a249-adcd-9c56-b76e-1b00886cea80 AT t-online DOT de>
<CAN0SSYx+g4JE6AA6krNAzG6QXrve52TBv0d3VM0SODV-tzZQSQ AT mail DOT gmail DOT com>
<66051d82-e2c3-684f-d13f-d1301170b0d4 AT t-online DOT de>
<984103a4-ab2d-4337-9964-cc1e3208155d AT SystematicSW DOT ab DOT ca>
In-Reply-To: <984103a4-ab2d-4337-9964-cc1e3208155d@SystematicSW.ab.ca>
Date: Thu, 19 Sep 2024 16:59:42 +0200
Message-ID: <CALXu0UcbKcAVKQ9uopn3rV28OruR4yG=kyh_ati9a_stR2GSrw@mail.gmail.com>
Subject: Re: readdir() returns inaccessible name if file was created with
invalid UTF-8
To: cygwin AT cygwin DOT com
List-Id: General Cygwin discussions and problem reports <cygwin.cygwin.com>
List-Archive: <https://cygwin.com/pipermail/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-request AT cygwin DOT com?subject=help>
List-Subscribe: <https://cygwin.com/mailman/listinfo/cygwin>,
<mailto:cygwin-request AT cygwin DOT com?subject=subscribe>
From: Cedric Blancher via Cygwin <cygwin AT cygwin DOT com>
Reply-To: Cedric Blancher <cedric DOT blancher AT gmail DOT com>
Sender: "Cygwin" <cygwin-bounces~archive-cygwin=delorie DOT com AT cygwin DOT com>

On Thu, 19 Sept 2024 at 16:46, Brian Inglis via Cygwin
<cygwin AT cygwin DOT com> wrote:
>
> On 2024-09-19 07:27, Christian Franke via Cygwin wrote:
> > Mark Liam Brown via Cygwin wrote:
> >> On Mon, Sep 16, 2024 at 11:51 AM Christian Franke via Cygwin
> >> <cygwin AT cygwin DOT com> wrote:
> >>> Christian Franke via Cygwin wrote:
> >>>> Thomas Wolff via Cygwin wrote:
> >>>>> On 15.09.2024 at 20:15, Thomas Wolff via Cygwin wrote:
> >>>>>> On 15.09.2024 at 19:47, Christian Franke via Cygwin wrote:
> >>>>>>> If a file name contains an invalid (truncated) UTF-8 sequence, open()
> >>>>>>> does not refuse to create the file. Later readdir() returns a
> >>>>>>> different name which could not be used to access the file.
> >>>>>>>
> >>>>>>> Testcase with U+1F321 (Thermometer):
> >>>>>>>
> >>>>>>> $ uname -r
> >>>>>>> 3.5.4-1.x86_64
> >>>>>>>
> >>>>>>> $ printf $'\U0001F321' | od -A none -t x1
> >>>>>>>   f0 9f 8c a1
> >>>>>>>
> >>>>>>> $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
> >>>>>>>
> >>>>>>> $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
> >>>>>>>
> >>>>>>> $ touch 'file3-'$'\xf0\x9f\x8c'
> >>>>>>>
> >>>>>>> $ ls -1
> >>>>>>> ls: cannot access 'file2-.?ext': No such file or directory
> >>>>>>> ls: cannot access 'file3-': No such file or directory
> >>>>>>> 'file1-'$'\360\237\214\241''.ext'
> >>>>>>> file2-.?ext
> >>>>>>> file3-
> >>>>>> I don't reproduce this.
> >>>> Yes, sorry, the above 'ls' was actually aliased to 'ls --color=auto'
> >>>> which needs to call stat(). Plain 'ls' does not, so the errors do not
> >>>> occur then.
> >>>>
> >>>>
> >>>>>> While the file name gets mangled, all resulting file names are valid and
> >>>>>> listed:
> >>>>>> In file2 the sequence is turned into U+17B3 but exchanged with the dot.
> >>>>>> In file3 the same sequence is just dropped.
> >>>>>> $ ls -1|cat
> >>>>>> file1-🌡.ext
> >>>>>> file2-.ឳext
> >>>>>> file3-
> >>>>>>
> >>>>>> However, ls file2* fails, as does ls *.
> >>>>> On the other hand, ls file3- fails too, so some mapping error occurs
> >>>>> internally.
> >>>>> Also, the files cannot be deleted from cygwin (need to use cmd).
> >>>> 'rm' using the original names works for file2-..., but not for file3-...
> >>>>
> >>>> $ rm -v 'file2-'$'\xf0\x9f\x8c''.ext'
> >>>> removed 'file2-'$'\360\237\214''.ext'
> >>>>
> >>>> $ rm -v 'file3-'$'\xf0\x9f\x8c'
> >>>> rm: cannot remove 'file3-'$'\360\237\214': No such file or directory
> >>>>
> >>> Further tests suggest that the problem only occurs with:
> >>> - incomplete 4 byte UTF-8 sequences (Unicode above 16 bit)
> >>> - complete but invalid 3 byte UTF-8 sequences which encode the UTF-16
> >>> 'high surrogate' range (0xD800..0xDBFF).
> >> Makes perfect sense, the Windows kernel uses UTF-16 internally.
> >
> >
> > Yes, but Cygwin does not provide consistent forward/reverse UTF-8 <-> UTF-16
> > mappings. This makes no sense:
> >
> > $ touch 'file-'$'\xed\xa0\x80''.ext'  # creates L"file-\xD800.ext" on NTFS
> >
> > $ strace ls -F
> > ...
> > ... fhandler_disk_file::readdir: 0 = readdir(...) (L"file-\xD800.ext" >
> > "file-\xE2\x9E\xB3.ext")
> > ...
> >   ... stat_worker: -1 = (\??\C:\cygwin64\tmp\file-?.ext,...)
> > ...
> > ls: cannot access 'file-?.ext': No such file or directory
> > file-?.ext
> >
> > $ rm -v 'file-'$'\xed\xa0\x80''.ext'
> > removed 'file-'$'\355\240\200''.ext'
> >
> > The UTF-8 sequence returned by readdir() decodes to U+27B3 (White-Feathered
> > Rightwards Arrow).
> >
> >
> > This could be fixed by handling UTF-8 of the surrogate range similar to other
> > invalid sequences: Map each invalid byte to Unicode range U+F080 to U+F0FF. This
> > works as expected if the above UTF-8 sequence is truncated:
> >
> > $ touch 'file-'$'\xed\xa0''.ext' # creates L"file-\xF0ED\xF0A0.ext" on NTFS
> >
> > $ ls -F
> > 'file-'$'\355\240''.ext'
>
> Surrogate halves are invalid for UTF-8 encoding; they should first be
> encoded as a valid UTF-16 code point.
> The encoder should just fail if it encounters any invalid sequence!
> Handling surrogates or other invalid values as anything other than invalid turns
> the encoding into what has been called WTF-8 where W may be for Windows! ;^>
>
Nope, WTF-8 means "What the F*ck-8"!

Ced
-- 
Cedric Blancher <cedric DOT blancher AT gmail DOT com>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur
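
Editorial sketch, not part of the original message: the thread contrasts
three ways of treating the byte sequences from the test cases, namely strict
UTF-8 (reject), acceptance of encoded surrogates (the WTF-8-like behaviour),
and mapping each invalid byte to a private-use code point. The Python below
illustrates all three. It is not Cygwin's implementation. The function names
and the "pua_bytes" handler name are hypothetical, and the 0xF000 + byte
mapping is an assumption inferred from the L"file-\xF0ED\xF0A0.ext" example
quoted above.

```python
import codecs


def classify(raw):
    """Return 'valid', 'surrogate' or 'invalid' for a byte string."""
    try:
        raw.decode("utf-8")  # strict: rejects truncation and surrogates
        return "valid"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8", "surrogatepass")  # also accepts encoded surrogates
        return "surrogate"
    except UnicodeDecodeError:
        return "invalid"


def _map_invalid_bytes(exc):
    """Decode error handler: map each undecodable byte b to U+F000 + b."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(0xF000 | b) for b in bad), exc.end


# Hypothetical handler name, registered only for this illustration.
codecs.register_error("pua_bytes", _map_invalid_bytes)


def to_name(raw):
    """Bytes -> text, with invalid bytes mapped into U+F080..U+F0FF."""
    return raw.decode("utf-8", "pua_bytes")


def to_bytes(name):
    """Reverse mapping. Ambiguous if the input genuinely contained
    characters in U+F080..U+F0FF; this sketch ignores that case."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if 0xF080 <= cp <= 0xF0FF:
            out.append(cp & 0xFF)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


if __name__ == "__main__":
    for raw in (b"\xf0\x9f\x8c\xa1", b"\xf0\x9f\x8c", b"\xed\xa0\x80"):
        print(raw, classify(raw), ascii(to_name(raw)))
```

With a mapping of this kind the name seen by readdir() converts back to the
bytes that were passed to open(), which is the consistency the forward and
reverse conversions in the thread lack for the surrogate range.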

-- 
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
