Mail Archives: cygwin/2009/09/21/12:10:35

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Mon, 21 Sep 2009 18:10:14 +0200
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
Message-ID: <20090921161014.GI20981@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
References: <h8bk5a$big$1 AT ger DOT gmane DOT org> <416096c60909101512l6e42ab72l4ba5fd792363eefd AT mail DOT gmail DOT com> <h8p50e$im8$1 AT ger DOT gmane DOT org>
MIME-Version: 1.0
In-Reply-To: <h8p50e$im8$1@ger.gmane.org>
User-Agent: Mutt/1.5.19 (2009-02-20)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On Sep 16 00:38, Lapo Luchini wrote:
> Andy Koppe wrote:
> > Hmm, we've lost the \xDF somewhere, and I'd guess it was when the
> > filename got translated to UTF-16 in fopen(), which would explain what
> > you're seeing
> 
> More data: it's not simply "the last character", is something more
> complex than that.
> 
> % cat t.c
> #include <stdio.h>
> int main() {
>     fopen("a-\xF6\xE4\xFC\xDF", "w"); //ISO-8859-1
>     fopen("b-\xF6\xE4\xFC\xDFz", "w");
>     fopen("c-\xF6\xE4\xFC\xDFzz", "w");
>     fopen("d-\xF6\xE4\xFC\xDFzzz", "w");
>     fopen("e-\xF6\xE4\xFC\xDF\xF6\xE4\xFC\xDF", "w");
>     return 0;
> }

Ok, I see what happens.  The mechanism which is supposed to handle
invalid multibyte sequences handles the first such byte, but fails to
reset the multibyte shift state after the byte has been handled.
Resetting the shift state after such a sequence has been encountered
fixes that problem.
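
A minimal sketch of the idea, not the actual Cygwin code: a hand-written
UTF-8 decoder whose state is cleared after every invalid byte, so the
bytes that follow are decoded from a clean state.  The names utf8_state,
utf8_feed and bytes_to_utf16 are hypothetical, and the decoder skips
checks (overlong forms, encoded surrogates) a real one would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical shift state: continuation bytes still expected, plus
   the code point accumulated so far. */
typedef struct { int pending; uint32_t cp; } utf8_state;

/* Feed one byte.  Returns 1 if *out is a complete code point, 0 if more
   bytes are needed, -1 if the byte is invalid in the current state.
   Overlong forms and encoded surrogates are not rejected (sketch only). */
static int utf8_feed(utf8_state *st, unsigned char b, uint32_t *out)
{
    if (st->pending > 0) {
        if ((b & 0xC0) != 0x80)
            return -1;                  /* expected a continuation byte */
        st->cp = (st->cp << 6) | (b & 0x3F);
        if (--st->pending > 0)
            return 0;
        if (st->cp > 0x10FFFF)
            return -1;                  /* beyond the Unicode range */
        *out = st->cp;
        return 1;
    }
    if (b < 0x80) { *out = b; return 1; }
    if ((b & 0xE0) == 0xC0) { st->pending = 1; st->cp = b & 0x1F; return 0; }
    if ((b & 0xF0) == 0xE0) { st->pending = 2; st->cp = b & 0x0F; return 0; }
    if ((b & 0xF8) == 0xF0) { st->pending = 3; st->cp = b & 0x07; return 0; }
    return -1;                          /* stray continuation or bad lead */
}

/* Store one UTF-16 unit if there is room; always count it. */
static void put(uint16_t *dst, size_t cap, size_t *len, uint16_t unit)
{
    if (*len < cap)
        dst[*len] = unit;
    (*len)++;
}

/* Convert n bytes to UTF-16.  Returns the number of units needed. */
size_t bytes_to_utf16(const unsigned char *src, size_t n,
                      uint16_t *dst, size_t cap)
{
    utf8_state st = { 0, 0 };
    size_t seq_start = 0, i = 0, len = 0;

    assert(src != NULL && dst != NULL);
    while (i < n || st.pending > 0) {
        uint32_t cp = 0;
        int r;

        if (st.pending == 0)
            seq_start = i;              /* a new sequence begins here */
        /* Running out of input mid-sequence counts as invalid, too. */
        r = (i < n) ? utf8_feed(&st, src[i], &cp) : -1;
        if (r < 0) {
            /* Keep the first byte of the failed sequence, or'ed with 0xdc00. */
            put(dst, cap, &len, (uint16_t)(0xDC00 | src[seq_start]));
            memset(&st, 0, sizeof st);  /* the reset this mail is about */
            i = seq_start + 1;          /* retry from the following byte */
            continue;
        }
        if (r == 1) {
            if (cp >= 0x10000) {        /* needs a surrogate pair */
                cp -= 0x10000;
                put(dst, cap, &len, (uint16_t)(0xD800 | (cp >> 10)));
                put(dst, cap, &len, (uint16_t)(0xDC00 | (cp & 0x3FF)));
            } else {
                put(dst, cap, &len, (uint16_t)cp);
            }
        }
        i++;
    }
    return len;
}
```

Called on the seven bytes of "b-\xF6\xE4\xFC\xDFz", this sketch returns
seven units: every high byte becomes its own 0xdcXX unit and the
trailing 'z' is not swallowed by a stale state.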

Unfortunately this is only the first half of a solution.  This is what
`ls' prints after running t:

  $ ls -l --show-control-chars
  total 21
  -rw-r--r-- 1 corinna vinschen     0 Sep 21 17:35 a-öäüß
  -rw-r--r-- 1 corinna vinschen     0 Sep 21 17:35 c-öäüßzz
  -rw-r--r-- 1 corinna vinschen     0 Sep 21 17:35 d-öäüßzzz
  -rw-r--r-- 1 corinna vinschen     0 Sep 21 17:35 e-öäüßöäüß

But this is what ls prints when setting $LANG to something "non-C":

  $ setenv LANG en	(implies codepage 1252)
  $ ls -l --show-control-chars
  ls: cannot access a-öäüß: No such file or directory
  ls: cannot access c-öäüßzz: No such file or directory
  ls: cannot access d-öäüßzzz: No such file or directory
  ls: cannot access e-öäüßöäüß: No such file or directory
  total 21
  -????????? ? ?       ?            ?                ? a-öäüß
  -????????? ? ?       ?            ?                ? c-öäüßzz
  -????????? ? ?       ?            ?                ? d-öäüßzzz
  -????????? ? ?       ?            ?                ? e-öäüßöäüß

As you might know, invalid bytes >= 0x80 are translated to UTF-16 by
transposing them into the 0xdc00 - 0xdcff range by just or'ing 0xdc00.
The problem now is that readdir() will return the transposed characters
as if they were the original characters.  ls uses some mbtowc function
to create a valid widechar string, and then passes the resulting
widechar string through some wctomb function to build the name for
stat().  However, *that* string uses a valid multibyte sequence to
represent the character, so the resulting filename suddenly differs
from the actual filename on disk and stat() fails with errno set to
ENOENT.  Since the conversions in the two directions are independent of
each other, there's no way to detect whether the incoming string of a
wctomb was originally based on a transposed character or not.
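
The mismatch can be illustrated per byte.  This is only a sketch under
simplifying assumptions: the helper names are made up, ISO-8859-1
stands in for codepage 1252 (the two agree for the bytes used in the
testcase), and the real conversions work on whole strings.

```c
#include <assert.h>
#include <stdint.h>

/* File creation: an invalid byte >= 0x80 is or'ed with 0xdc00. */
static uint16_t transpose_invalid(unsigned char b)
{
    assert(b >= 0x80);
    return (uint16_t)(0xDC00 | b);
}

/* readdir() direction: the transposed unit is handed back as if it
   were the original byte. */
static unsigned char untranspose(uint16_t wc)
{
    return (unsigned char)(wc & 0xFF);
}

/* stat() direction under a single-byte charset in which the byte is
   valid.  ISO-8859-1 is assumed here: byte value == code unit. */
static uint16_t latin1_to_utf16(unsigned char b)
{
    return (uint16_t)b;
}

/* UTF-16 unit stored on disk for a file created with this byte. */
uint16_t unit_on_disk(unsigned char original)
{
    return transpose_invalid(original);
}

/* UTF-16 unit that a later lookup of the listed name asks for. */
uint16_t unit_looked_up(unsigned char original)
{
    unsigned char listed = untranspose(unit_on_disk(original));
    return latin1_to_utf16(listed);
}

/* Returns 1 if the lookup would find the file, 0 otherwise. */
int lookup_matches(unsigned char original)
{
    return unit_looked_up(original) == unit_on_disk(original);
}
```

For the byte 0xF6 the name on disk contains 0xdcf6, while the lookup
asks for 0x00f6, so lookup_matches(0xF6) returns 0.  Nothing in the
second conversion can tell that the byte started out transposed.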

I'm not sure I explained this clearly enough...

So it looks like the current mechanism to handle invalid multibyte
sequences is too complicated for us.  As far as I can see, it would be
much simpler and less error-prone to simply translate each invalid byte
to the UTF-16 value with the same numeric value.  That creates
filenames with UTF-16 values from the ISO-8859-1 range.  I tested this
with the files created by the above testcase.  While the filenames
appeared different depending on the charset in use, ls always handled
the files gracefully.
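
A sketch of why the simpler mapping round-trips, again per byte and with
hypothetical helper names.  It assumes the listing side encodes the
stored unit either as UTF-8 or as ISO-8859-1 (standing in for a
single-byte codepage), and that the lookup side decodes with the same
charset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Proposed handling: an invalid byte becomes the UTF-16 unit with the
   same numeric value, i.e. a character from the ISO-8859-1 range. */
uint16_t map_invalid_byte(unsigned char b)
{
    return (uint16_t)b;
}

/* Encode a unit in U+0000..U+00FF as UTF-8.  Returns bytes written. */
size_t utf8_encode_low(uint16_t wc, unsigned char out[2])
{
    assert(wc <= 0xFF);
    if (wc < 0x80) {
        out[0] = (unsigned char)wc;
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (wc >> 6));
    out[1] = (unsigned char)(0x80 | (wc & 0x3F));
    return 2;
}

/* Decode the one- or two-byte UTF-8 form produced above. */
static uint16_t utf8_decode_low(const unsigned char *in, size_t n)
{
    if (n == 1)
        return in[0];
    return (uint16_t)(((in[0] & 0x1F) << 6) | (in[1] & 0x3F));
}

/* List and look up under UTF-8: 1 if the same unit comes back. */
int roundtrip_utf8(unsigned char original)
{
    uint16_t on_disk = map_invalid_byte(original);
    unsigned char name[2];
    size_t n = utf8_encode_low(on_disk, name);  /* readdir() direction */
    return utf8_decode_low(name, n) == on_disk; /* stat() direction */
}

/* List and look up under ISO-8859-1: 1 if the same unit comes back. */
int roundtrip_latin1(unsigned char original)
{
    uint16_t on_disk = map_invalid_byte(original);
    unsigned char listed = (unsigned char)on_disk; /* readdir() direction */
    return (uint16_t)listed == on_disk;            /* stat() direction */
}
```

With this mapping the byte 0xF6 is stored as 0x00f6.  It is listed as
the two bytes 0xC3 0xB6 under UTF-8 and as the single byte 0xF6 under
ISO-8859-1, so the name looks different per charset, but both
round-trips return 1 in this sketch.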

Any objections?  I can also just check it in and the entire
locale-challenged part of the community can test it...


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
