Mail Archives: cygwin/2021/06/26/01:33:57

Date: Fri, 25 Jun 2021 23:33:12 -0600
Message-ID: <CAJ1FpuNu-fBHdH4NmCumSiCTQ=he5jswUXwnPvCkMP2xkdQPWg@mail.gmail.com>
Subject: Re: Cygwin, Unicode and "long" path names
To: cygwin <cygwin AT cygwin DOT com>
From: Doug Henderson via Cygwin <cygwin AT cygwin DOT com>
Reply-To: Doug Henderson <djndnbvg AT gmail DOT com>

On Fri, 25 Jun 2021 at 19:55, Vadim <vad AT syping DOT de> wrote:
>
> Ah, this beautiful topic. Windows 7 x64.
>
> This is the summary written as post-scriptum, tests and findings below:
>
> 1) Cygwin limits individual names to 255 bytes, Windows seems to follow
> UTF-16 chars and work fine: 256 bytes in 108 characters works.
>
> Basically, this becomes a bytes vs characters story.
>
> 2) Bash file name auto-expansion detects the file of that name, but it
> gets truncated to 255 bytes. find's behaviour is the same ("No such file
> or directory" due to trying to access a non-existing truncated name)
>
> 2.1) If you try to correct the above mistake by adding truncated
> characters, then the program (cat) will complain about "File name too long"
>
> 2.2) If there exists a folder with a 255-byte name, equal to the
> truncated name, then "find ." will do a listing on that folder twice
> (effectively hiding the long-named folder from tools without leaving an
> error message)
>
> 3) UNC Paths get the same treatment: File name too long.
>
> I expected Cygwin to handle these names without problems just like
> Windows, Explorer, cmd etc. do. Is this particular problem new or known?
> All I could find on the mailing list is around the time when Cygwin
> hadn't yet implemented Unicode support (UTF-8?), ~2004-2008.
>
> These names were created by youtube-dl.exe executed from within Cygwin.
>
> - Vadim

I believe this results from the difference between Pascal-style
strings, which have a length byte followed by data bytes, and C-style
strings, which have data bytes followed by a zero byte or, worse, in
the case of two-byte characters, data words followed by a zero word.

For single-byte characters, both the Pascal and C styles use 256 bytes
to hold 255 data bytes. Using the 255 length limit without accounting
for the trailing zero byte could account for some of the observed
problem.
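
To make the two layouts concrete, here is a minimal Python sketch. The
helper names pascal_string and c_string are hypothetical and only
illustrate the storage shapes described above; they are not taken from
Cygwin or any other library.

```python
def pascal_string(data: bytes) -> bytes:
    """Length byte followed by the data bytes (at most 255 of them)."""
    if len(data) > 255:
        raise ValueError("a single length byte can only count up to 255")
    return bytes([len(data)]) + data


def c_string(data: bytes) -> bytes:
    """Data bytes followed by a terminating zero byte."""
    if b"\x00" in data:
        raise ValueError("an embedded zero byte would end the string early")
    return data + b"\x00"


if __name__ == "__main__":
    payload = b"a" * 255
    # Both layouts need 256 bytes of storage for 255 data bytes.
    print(len(pascal_string(payload)), len(c_string(payload)))
```

A buffer sized at exactly 255 bytes therefore holds 255 data bytes in
neither layout.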

More likely, the problem relates to double-byte character sets. 255
bytes of UTF-16 characters, or more likely 255 bytes of MBCS
(multi-byte character set) or DBCS (double-byte character set) text,
can encode to more or fewer than 255 UTF-8 bytes, depending on the
average bytes per character of the UTF-8 encoding. This could account
for the failure to handle all bytes of the NTFS filename when it is
converted to UTF-8. Programs ported from Linux may fail to allocate a
large enough encoding buffer, leading to the observed truncation. The
same applies to 510 bytes containing 255 words of DBCS characters.
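
The byte-versus-character mismatch can be checked with a few lines of
Python. This is only a sketch: encoded_sizes and fits_255_byte_limit
are hypothetical helper names, and the 255-byte check simply mirrors
the per-name limit Vadim reported, not Cygwin's actual code.

```python
def encoded_sizes(name: str) -> dict:
    """Length of a name in code points, UTF-16 code units and UTF-8 bytes."""
    return {
        "code_points": len(name),
        "utf16_units": len(name.encode("utf-16-le")) // 2,
        "utf8_bytes": len(name.encode("utf-8")),
    }


def fits_255_byte_limit(name: str) -> bool:
    """True if the UTF-8 form of the name fits in 255 bytes (assumed limit)."""
    return len(name.encode("utf-8")) <= 255


if __name__ == "__main__":
    samples = {
        "ASCII x 255": "a" * 255,
        "Cyrillic x 128": "\u0439" * 128,  # 2 UTF-8 bytes per character
        "CJK x 86": "\u65e5" * 86,         # 3 UTF-8 bytes per character
    }
    for label, name in samples.items():
        print(label, encoded_sizes(name), fits_255_byte_limit(name))
```

A name of only 128 Cyrillic characters is well under 255 UTF-16 code
units, yet already needs 256 bytes in UTF-8.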

Youtube-dl.exe is basically a Windows Python 3 program with
C-extensions. Python 3 properly handles Unicode and the encoding and
decoding of the aforementioned character encodings.

I would look for the library functions that decode NTFS file names
into UTF-8 names, verify their correctness, and follow the path of
their output through the system. I think this means the Windows
255-byte limit cannot be used at all in any Cygwin program that
handles international file names. Unfortunately, that sounds like a
lot of work. In theory, if all 255 characters in a filename component
required 4-byte UTF-8 encodings, this would require about 1020 bytes
(255 x 4). However, this does not even touch on emojis, where a
one-character emoji can expand to as much as 35 or so bytes! That
basically means the end of static allocation for file and directory
names and name component buffers. That may be a major job in the
cygwin kernel, not to mention all the available packages!
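
For anyone who wants to check buffer sizes, the sketch below measures
a few cases in Python. It assumes the Windows per-component limit is
counted in UTF-16 code units (my assumption, not something established
in this thread); the helper names are hypothetical. Note that a
character needing 4 UTF-8 bytes occupies two UTF-16 code units, so
under that assumption the densest case is 3 bytes per code unit.

```python
NAME_LIMIT_UTF16_UNITS = 255  # assumption: limit counted in UTF-16 code units


def utf8_len(text: str) -> int:
    return len(text.encode("utf-8"))


def utf16_units(text: str) -> int:
    return len(text.encode("utf-16-le")) // 2


def worst_case_utf8_bytes(limit_units: int = NAME_LIMIT_UTF16_UNITS) -> int:
    # One code unit (BMP character) needs at most 3 UTF-8 bytes; a surrogate
    # pair (two code units) needs 4, i.e. only 2 bytes per code unit.
    return 3 * limit_units


# One visible "character": man, woman, girl, boy joined by zero-width joiners.
FAMILY = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

if __name__ == "__main__":
    print("worst case for 255 code units:", worst_case_utf8_bytes())
    print("255 CJK characters:", utf8_len("\u65e5" * 255), "UTF-8 bytes")
    print("family emoji:", len(FAMILY), "code points,",
          utf16_units(FAMILY), "UTF-16 units,",
          utf8_len(FAMILY), "UTF-8 bytes")
```

Either way the UTF-8 form can be several times larger than 255 bytes,
so a fixed 255- or 256-byte name buffer is not enough.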


HTH
Doug

-- 
Doug Henderson, Calgary, Alberta, Canada - from gmail.com
