Mail Archives: cygwin/2024/09/19/13:31:10

Subject: Re: readdir() returns inaccessible name if file was created with
invalid UTF-8
To: cygwin AT cygwin DOT com
References: <96f2253b-791b-b8a0-97dd-8d257eefb9b1 AT t-online DOT de>
<bc8bd61c-818e-424f-bb42-52f4fecd4849 AT towo DOT net>
<b6ab074b-919e-4514-8276-72a30c36ab58 AT towo DOT net>
<de4767e2-85b7-ead2-df9a-64e1f24f4e8f AT t-online DOT de>
<6451a249-adcd-9c56-b76e-1b00886cea80 AT t-online DOT de>
<CAN0SSYx+g4JE6AA6krNAzG6QXrve52TBv0d3VM0SODV-tzZQSQ AT mail DOT gmail DOT com>
<66051d82-e2c3-684f-d13f-d1301170b0d4 AT t-online DOT de>
<984103a4-ab2d-4337-9964-cc1e3208155d AT SystematicSW DOT ab DOT ca>
Message-ID: <11036733-c4f3-c2e9-37c9-959c9e99edab@t-online.de>
Date: Thu, 19 Sep 2024 19:30:05 +0200
In-Reply-To: <984103a4-ab2d-4337-9964-cc1e3208155d@SystematicSW.ab.ca>
From: Christian Franke via Cygwin <cygwin AT cygwin DOT com>
Reply-To: cygwin AT cygwin DOT com
Cc: Christian Franke <Christian DOT Franke AT t-online DOT de>

Brian Inglis via Cygwin wrote:
> On 2024-09-19 07:27, Christian Franke via Cygwin wrote:
>> Mark Liam Brown via Cygwin wrote:
>>> On Mon, Sep 16, 2024 at 11:51 AM Christian Franke via Cygwin
>>> <cygwin AT cygwin DOT com> wrote:
>>>> Christian Franke via Cygwin wrote:
>>>>> Thomas Wolff via Cygwin wrote:
>>>>>> Am 15.09.2024 um 20:15 schrieb Thomas Wolff via Cygwin:
>>>>>>> Am 15.09.2024 um 19:47 schrieb Christian Franke via Cygwin:
>>>>>>>> If a file name contains an invalid (truncated) UTF-8 sequence, open()
>>>>>>>> does not refuse to create the file. Later readdir() returns a
>>>>>>>> different name which cannot be used to access the file.
>>>>>>>>
>>>>>>>> Testcase with U+1F321 (Thermometer):
>>>>>>>>
>>>>>>>> $ uname -r
>>>>>>>> 3.5.4-1.x86_64
>>>>>>>>
>>>>>>>> $ printf $'\U0001F321' | od -A none -t x1
>>>>>>>>   f0 9f 8c a1
>>>>>>>>
>>>>>>>> $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
>>>>>>>>
>>>>>>>> $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
>>>>>>>>
>>>>>>>> $ touch 'file3-'$'\xf0\x9f\x8c'
>>>>>>>>
>>>>>>>> $ ls -1
>>>>>>>> ls: cannot access 'file2-.?ext': No such file or directory
>>>>>>>> ls: cannot access 'file3-': No such file or directory
>>>>>>>> 'file1-'$'\360\237\214\241''.ext'
>>>>>>>> file2-.?ext
>>>>>>>> file3-
>>>>>>> I don't reproduce this.
>>>>> Yes, sorry, the above 'ls' was actually aliased to 'ls --color=auto'
>>>>> which needs to call stat(). Plain 'ls' does not, so the errors do not
>>>>> occur then.
>>>>>
>>>>>
>>>>>>> While the file name gets mangled, all resulting file names are valid
>>>>>>> and listed:
>>>>>>> In file2 the sequence is turned into U+17B3 but exchanged with the dot.
>>>>>>> In file3 the same sequence is just dropped.
>>>>>>> $ ls -1|cat
>>>>>>> file1-🌡.ext
>>>>>>> file2-.ឳext
>>>>>>> file3-
>>>>>>>
>>>>>>> However, ls file2* fails, as does ls *.
>>>>>> On the other hand, ls file3- fails too, so some mapping error occurs
>>>>>> internally.
>>>>>> Also, the files cannot be deleted from cygwin (need to use cmd).
>>>>> 'rm' using the original names works for file2-..., but not for file3-...
>>>>>
>>>>> $ rm -v 'file2-'$'\xf0\x9f\x8c''.ext'
>>>>> removed 'file2-'$'\360\237\214''.ext'
>>>>>
>>>>> $ rm -v 'file3-'$'\xf0\x9f\x8c'
>>>>> rm: cannot remove 'file3-'$'\360\237\214': No such file or directory
>>>>>
>>>> Further tests suggest that the problem only occurs with:
>>>> - incomplete 4 byte UTF-8 sequences (Unicode above 16 bit)
>>>> - complete but invalid 3 byte UTF-8 sequences which encode the UTF-16
>>>> 'high surrogate' range (0xD800..0xDBFF).
>>> Makes perfect sense, the Windows kernel uses UTF-16 internally.
>>
>>
>> Yes, but Cygwin does not provide consistent forward/reverse
>> UTF-8 <-> UTF-16 mappings. This makes no sense:
>>
>> $ touch 'file-'$'\xed\xa0\x80''.ext'  # creates L"file-\xD800.ext" on NTFS
>>
>> $ strace ls -F
>> ...
>> ... fhandler_disk_file::readdir: 0 = readdir(...) (L"file-\xD800.ext" > "file-\xE2\x9E\xB3.ext")
>> ...
>>   ... stat_worker: -1 = (\??\C:\cygwin64\tmp\file-?.ext,...)
>> ...
>> ls: cannot access 'file-?.ext': No such file or directory
>> file-?.ext
>>
>> $ rm -v 'file-'$'\xed\xa0\x80''.ext'
>> removed 'file-'$'\355\240\200''.ext'
>>
>> The UTF-8 sequence returned by readdir() decodes to U+27B3 
>> (White-Feathered Rightwards Arrow).
>>
>>
>> This could be fixed by handling UTF-8 of the surrogate range similar to
>> other invalid sequences: map each invalid byte to the Unicode range
>> U+F080 to U+F0FF. This works as expected if the above UTF-8 sequence is
>> truncated:
>>
>> $ touch 'file-'$'\xed\xa0''.ext' # creates L"file-\xF0ED\xF0A0.ext" on NTFS
>>
>> $ ls -F
>> 'file-'$'\355\240''.ext'
>
> Surrogate halves are invalid for UTF-8 encoding; they should first be
> encoded as a valid UTF-16 code point.
> The encoder should just fail if it encounters any invalid sequence!
> Handling surrogates or other invalid values as anything other than
> invalid turns the encoding into what has been called WTF-8, where W may
> be for Windows! ;^>

:-)

I guess the idea behind Cygwin's filename mapping was to emulate Linux 
behavior as far as possible. AFAICS, Linux accepts any nonempty byte 
string without slash (or NUL) as a plain filename and leaves the 
interpretation (UTF-8?) to the userland.
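
To illustrate what "interpretation in userland" means for the byte sequences from this thread, here is a minimal Python sketch. The helper name classify() is hypothetical, and Python's strict UTF-8 decoder only stands in for a strict userland decoder; it is not Cygwin code.

```python
def classify(raw: bytes) -> str:
    """Describe how a strict UTF-8 decoder sees a byte string."""
    try:
        # Strict decoding rejects truncated sequences and encoded surrogates.
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"invalid UTF-8 at byte offset {exc.start}"
    return "valid: " + " ".join(f"U+{ord(ch):04X}" for ch in text)


# Byte sequences taken from the test cases in this thread.
samples = {
    "thermometer": b"\xf0\x9f\x8c\xa1",       # complete 4-byte sequence
    "truncated": b"\xf0\x9f\x8c",             # incomplete 4-byte sequence
    "high surrogate": b"\xed\xa0\x80",        # 3-byte encoding of U+D800
    "readdir() result": b"\xe2\x9e\xb3",      # what readdir() returned
}

for label, raw in samples.items():
    print(f"{label}: {classify(raw)}")
```

The kernel would store all four byte strings unchanged; only a strict decoder in userland distinguishes the first and last (valid) from the two in the middle (invalid).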

Cygwin maps 0x20..0x7f and valid UTF-8 sequences to UTF-16. Control 
characters and bytes from invalid UTF-8 sequences are mapped to the 
private-use U+F0xx range. It should handle UTF-8 sequences which lead to 
the surrogate range the same way but currently does not.
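
The mapping described above, including the proposed handling of the surrogate range, could be modelled roughly as follows. This is a sketch under assumptions, not Cygwin's implementation: the names map_name(), seq_len() and PRIVATE_BASE are hypothetical, and any other special cases Cygwin may apply are ignored.

```python
PRIVATE_BASE = 0xF000  # assumed base of the U+F0xx range used for unmappable bytes


def seq_len(lead: int) -> int:
    """Expected UTF-8 sequence length for a lead byte, 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def map_name(raw: bytes) -> list[int]:
    """Map a filename byte string to UTF-16 code units (model only)."""
    units = []
    i = 0
    while i < len(raw):
        n = seq_len(raw[i])
        try:
            # Strict decoding also rejects sequences that encode surrogates,
            # so they fall through to the per-byte U+F0xx mapping (the proposal).
            ch = raw[i:i + n].decode("utf-8") if n else ""
        except UnicodeDecodeError:
            ch = ""
        if len(ch) == 1 and ord(ch) >= 0x20:
            cp = ord(ch)
            if cp >= 0x10000:
                # Code points above 16 bit become a surrogate pair.
                cp -= 0x10000
                units += [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]
            else:
                units.append(cp)
            i += n
        else:
            # Control character or byte of an invalid sequence.
            units.append(PRIVATE_BASE + raw[i])
            i += 1
    return units


print([f"{u:04X}" for u in map_name(b"file-\xed\xa0.ext")])
print([f"{u:04X}" for u in map_name(b"file-\xed\xa0\x80.ext")])
```

With this model the truncated sequence \xed\xa0 yields F0ED F0A0, matching the NTFS name shown in the quoted test, and the complete surrogate sequence \xed\xa0\x80 yields F0ED F0A0 F080 instead of a bare D800. Each byte maps to a distinct code unit, so a reverse mapping could recover the original name.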


-- 
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
