Mail Archives: cygwin/2024/09/19/10:46:17

Message-ID: <984103a4-ab2d-4337-9964-cc1e3208155d@SystematicSW.ab.ca>
Date: Thu, 19 Sep 2024 08:45:44 -0600
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: readdir() returns inaccessible name if file was created with
invalid UTF-8
To: cygwin AT cygwin DOT com
References: <96f2253b-791b-b8a0-97dd-8d257eefb9b1 AT t-online DOT de>
<bc8bd61c-818e-424f-bb42-52f4fecd4849 AT towo DOT net>
<b6ab074b-919e-4514-8276-72a30c36ab58 AT towo DOT net>
<de4767e2-85b7-ead2-df9a-64e1f24f4e8f AT t-online DOT de>
<6451a249-adcd-9c56-b76e-1b00886cea80 AT t-online DOT de>
<CAN0SSYx+g4JE6AA6krNAzG6QXrve52TBv0d3VM0SODV-tzZQSQ AT mail DOT gmail DOT com>
<66051d82-e2c3-684f-d13f-d1301170b0d4 AT t-online DOT de>
Organization: Systematic Software
In-Reply-To: <66051d82-e2c3-684f-d13f-d1301170b0d4@t-online.de>
From: Brian Inglis via Cygwin <cygwin AT cygwin DOT com>
Reply-To: cygwin AT cygwin DOT com
Cc: Brian Inglis <Brian DOT Inglis AT SystematicSW DOT ab DOT ca>

On 2024-09-19 07:27, Christian Franke via Cygwin wrote:
> Mark Liam Brown via Cygwin wrote:
>> On Mon, Sep 16, 2024 at 11:51 AM Christian Franke via Cygwin
>> <cygwin AT cygwin DOT com> wrote:
>>> Christian Franke via Cygwin wrote:
>>>> Thomas Wolff via Cygwin wrote:
>>>>> Am 15.09.2024 um 20:15 schrieb Thomas Wolff via Cygwin:
>>>>>> Am 15.09.2024 um 19:47 schrieb Christian Franke via Cygwin:
>>>>>>> If a file name contains an invalid (truncated) UTF-8 sequence, open()
>>>>>>> does not refuse to create the file. Later readdir() returns a
>>>>>>> different name which could not be used to access the file.
>>>>>>>
>>>>>>> Testcase with U+1F321 (Thermometer):
>>>>>>>
>>>>>>> $ uname -r
>>>>>>> 3.5.4-1.x86_64
>>>>>>>
>>>>>>> $ printf $'\U0001F321' | od -A none -t x1
>>>>>>>   f0 9f 8c a1
>>>>>>>
>>>>>>> $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
>>>>>>>
>>>>>>> $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
>>>>>>>
>>>>>>> $ touch 'file3-'$'\xf0\x9f\x8c'
>>>>>>>
>>>>>>> $ ls -1
>>>>>>> ls: cannot access 'file2-.?ext': No such file or directory
>>>>>>> ls: cannot access 'file3-': No such file or directory
>>>>>>> 'file1-'$'\360\237\214\241''.ext'
>>>>>>> file2-.?ext
>>>>>>> file3-
>>>>>> I don't reproduce this.
>>>> Yes, sorry, the above 'ls' was actually aliased to 'ls --color=auto'
>>>> which needs to call stat(). Plain 'ls' does not, so the errors do not
>>>> occur then.
>>>>
>>>>
>>>>>> While the file name gets mangled, all resulting file names are valid and
>>>>>> listed:
>>>>>> In file2 the sequence is turned into U+17B3 but exchanged with the dot.
>>>>>> In file3 the same sequence is just dropped.
>>>>>> $ ls -1|cat
>>>>>> file1-🌡.ext
>>>>>> file2-.ឳext
>>>>>> file3-
>>>>>>
>>>>>> However, ls file2* fails, as does ls *.
>>>>> On the other hand, ls file3- fails too, so some mapping error occurs
>>>>> internally.
>>>>> Also, the files cannot be deleted from cygwin (need to use cmd).
>>>> 'rm' using the original names works for file2-..., but not for file3-...
>>>>
>>>> $ rm -v 'file2-'$'\xf0\x9f\x8c''.ext'
>>>> removed 'file2-'$'\360\237\214''.ext'
>>>>
>>>> $ rm -v 'file3-'$'\xf0\x9f\x8c'
>>>> rm: cannot remove 'file3-'$'\360\237\214': No such file or directory
>>>>
>>> Further tests suggest that the problem only occurs with:
>>> - incomplete 4 byte UTF-8 sequences (Unicode above 16 bit)
>>> - complete but invalid 3 byte UTF-8 sequences which encode the UTF-16
>>> 'high surrogate' range (0xD800..0xDBFF).
>> Makes perfect sense, the Windows kernel uses UTF16 internally.
> 
> 
> Yes, but Cygwin does not provide consistent forward/reverse UTF-8 <-> UTF-16 
> mappings. This makes no sense:
> 
> $ touch 'file-'$'\xed\xa0\x80''.ext'  # creates L"file-\xD800.ext" on NTFS
> 
> $ strace ls -F
> ...
> ... fhandler_disk_file::readdir: 0 = readdir(...) (L"file-\xD800.ext" > "file-\xE2\x9E\xB3.ext")
> ...
>   ... stat_worker: -1 = (\??\C:\cygwin64\tmp\file-?.ext,...)
> ...
> ls: cannot access 'file-?.ext': No such file or directory
> file-?.ext
> 
> $ rm -v 'file-'$'\xed\xa0\x80''.ext'
> removed 'file-'$'\355\240\200''.ext'
> 
> The UTF-8 sequence returned by readdir() decodes to U+27B3 (White-Feathered 
> Rightwards Arrow).
> 
> 
> This could be fixed by handling UTF-8 of the surrogate range similarly to other 
> invalid sequences: Map each invalid byte to Unicode range U+F080 to U+F0FF. This 
> works as expected if the above UTF-8 sequence is truncated:
> 
> $ touch 'file-'$'\xed\xa0''.ext' # creates L"file-\xF0ED\xF0A0.ext" on NTFS
> 
> $ ls -F
> 'file-'$'\355\240''.ext'
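
The byte-to-private-use mapping visible in the NTFS name above (0xED becomes
U+F0ED, 0xA0 becomes U+F0A0) can be sketched roughly as follows. This is only
an illustration in Python, not Cygwin's actual conversion code: the error
handler name `private_use_bytes` and the helpers `to_text()` and `to_bytes()`
are made up for this example, and the sketch relies on Python's strict UTF-8
decoder already treating surrogate encodings such as ED A0 80 as invalid.

```python
import codecs

def _to_private_use(exc):
    """Map each undecodable byte 0xNN to the code point U+F0NN (assumed scheme)."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(0xF000 + b) for b in bad), exc.end

# Hypothetical handler name, registered only for this sketch.
codecs.register_error("private_use_bytes", _to_private_use)

def to_text(raw: bytes) -> str:
    """Forward mapping: valid UTF-8 decodes normally, invalid bytes go to U+F0NN."""
    return raw.decode("utf-8", errors="private_use_bytes")

def to_bytes(name: str) -> bytes:
    """Reverse mapping: U+F080..U+F0FF turn back into the original raw bytes."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if 0xF080 <= cp <= 0xF0FF:
            out.append(cp - 0xF000)
        else:
            out += ch.encode("utf-8")
    return bytes(out)

truncated = b"file-\xed\xa0.ext"        # truncated sequence from the test above
surrogate = b"file-\xed\xa0\x80.ext"    # complete encoding of U+D800

print(ascii(to_text(truncated)))        # 'file-\uf0ed\uf0a0.ext'
print(ascii(to_text(surrogate)))        # 'file-\uf0ed\uf0a0\uf080.ext'
print(to_bytes(to_text(surrogate)) == surrogate)   # True
```

With a mapping like this, the name returned by a directory listing converts
back to exactly the bytes that were used to create the file. One limitation of
the sketch: a name that legitimately contains characters in U+F080..U+F0FF
would be ambiguous on the way back.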

Surrogate halves are invalid in UTF-8: a surrogate pair must first be combined 
into a single valid Unicode code point, and only that code point is encoded.
The encoder should just fail if it encounters any invalid sequence!
Handling surrogates or other invalid values as anything other than invalid turns 
the encoding into what has been called WTF-8, where W may be for Windows! ;^>
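
For illustration only, here is how a strict UTF-8 codec behaves, using Python's
built-in codec as a stand-in (nothing here is Cygwin code, and the helper names
are made up for the example). A lone surrogate is rejected in both directions,
the `surrogatepass` error handler gives the lenient WTF-8-like behaviour, and a
proper surrogate pair is combined into one code point before it is encoded.

```python
HIGH_SURROGATE = "\ud800"            # lone UTF-16 high surrogate from the test case
SURROGATE_BYTES = b"\xed\xa0\x80"    # its 3-byte "encoding", invalid in UTF-8

def strict_decode_fails(data: bytes) -> bool:
    """True if the strict UTF-8 decoder rejects the byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False

def strict_encode_fails(text: str) -> bool:
    """True if the strict UTF-8 encoder rejects the string."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False

print(strict_decode_fails(SURROGATE_BYTES))   # True
print(strict_encode_fails(HIGH_SURROGATE))    # True

# Lenient handling: surrogates pass through unchanged in both directions.
lenient_bytes = HIGH_SURROGATE.encode("utf-8", "surrogatepass")
lenient_text = SURROGATE_BYTES.decode("utf-8", "surrogatepass")
print(lenient_bytes == SURROGATE_BYTES)       # True

# A valid pair (D83C DF21) is first combined into U+1F321, then encoded.
pair_utf16le = b"\x3c\xd8\x21\xdf"
thermometer = pair_utf16le.decode("utf-16-le")
print(thermometer.encode("utf-8").hex(" "))   # f0 9f 8c a1
```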

-- 
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer     but when there is no more to cut
                                 -- Antoine de Saint-Exupéry

