Mail Archives: cygwin/2009/09/23/07:52:24

In-Reply-To: <20090922170709.GV20981@calimero.vinschen.de>
References: <h8bk5a$big$1 AT ger DOT gmane DOT org> <416096c60909101512l6e42ab72l4ba5fd792363eefd AT mail DOT gmail DOT com> <h8p50e$im8$1 AT ger DOT gmane DOT org> <20090921161014 DOT GI20981 AT calimero DOT vinschen DOT de> <416096c60909211154u5ddd5869v986011aa4ee13d57 AT mail DOT gmail DOT com> <20090922094523 DOT GR20981 AT calimero DOT vinschen DOT de> <416096c60909220912s5dd749bh5cfeb670b0e78c7a AT mail DOT gmail DOT com> <20090922170709 DOT GV20981 AT calimero DOT vinschen DOT de>
Date: Wed, 23 Sep 2009 12:52:06 +0100
Message-ID: <416096c60909230452l42aa2210nf22b07c20cd2e697@mail.gmail.com>
Subject: Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
From: Andy Koppe <andy DOT koppe AT gmail DOT com>
To: cygwin AT cygwin DOT com

2009/9/22 Corinna Vinschen:
>> >> Therefore, when converting a UTF-16 Windows filename to the current
>> >> charset, 0xDC?? words should be treated like any other UTF-16 word
>> >> that can't be represented in the current charset: it should be encoded
>> >> as a ^N sequence.

(I started writing this before seeing your patch to the singlebyte
codepage tables, which makes plenty of sense. Here goes anyway.)

Having actually looked at strfuncs.cc, my diagnosis was too
simplistic, because the U+DC?? codes are used not only for invalid
UTF-8 bytes, but for invalid bytes in any charset. This even includes
CP1252, which has a few holes in the 0x80..0x9F range.

Therefore, the complete solution would be something like this: when
sys_cp_wcstombs comes across a 0xDC?? code, it checks whether the byte
it encodes is indeed an invalid byte in the current charset. If it is,
it translates it into that invalid byte, because on the way back it
would once again be turned into the same 0xDC?? code. If the byte
would represent (part of) a valid character, however, it would need to
be encoded as a ^N sequence to ensure correct roundtripping.
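For the singlebyte case, the check could look roughly like the sketch below, using CP1252 as the example. The helper names are made up for illustration and are not taken from strfuncs.cc, and it assumes the original byte sits in the low 8 bits of the 0xDC?? code.

```cpp
#include <cstdint>

// The five bytes that CP1252 leaves undefined (the holes in 0x80..0x9F).
static bool cp1252_is_invalid_byte(unsigned char b)
{
    return b == 0x81 || b == 0x8D || b == 0x8F || b == 0x90 || b == 0x9D;
}

// Hypothetical helper: is this UTF-16 code unit one of the 0xDC?? codes?
// Assumption: the byte it stands for is stored in the low 8 bits.
static bool is_dc_code(char16_t wc)
{
    return (wc & 0xFF00) == 0xDC00;
}

// What the wide-to-multibyte conversion should do with a 0xDC?? code.
enum class DcAction {
    EmitRawByte,       // byte is invalid in the charset: it roundtrips as-is
    EmitCtrlNSequence  // byte is a valid character: needs a ^N sequence
};

static DcAction classify_dc_code_cp1252(char16_t wc)
{
    unsigned char b = static_cast<unsigned char>(wc & 0xFF);
    return cp1252_is_invalid_byte(b) ? DcAction::EmitRawByte
                                     : DcAction::EmitCtrlNSequence;
}
```

With this, 0xDC81 would come out as the raw byte 0x81, while 0xDCE9 would have to become a ^N sequence, because a plain 0xE9 byte would be read back as U+00E9 rather than as 0xDCE9.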

Now that shouldn't be too difficult to implement for singlebyte
charsets, but it gets somewhat hairy for multibyte charsets, including
UTF-8 itself. Here's how I think it could be done though:

In sys_cp_wcstombs:

* On encountering a DC?? code, extract the encoded byte, and feed it
into f_mbtowc. A private mbstate for this is needed, starting in the
initial state for each filename. Switch on the result of f_mbtowc:
** case -2 (incomplete sequence): add the byte to a buffer for this purpose
** case -1 (invalid sequence): copy anything already in the buffer
plus the current byte into the target filename, as we can be sure that
they'll turn back into U+DCbb again on the way back.
** case >0 (valid sequence): encode buffer contents and current byte
as ^N codes that don't represent valid UTF-8

* When encountering a non-DC?? code, copy any bytes left in the buffer
into the target filename.
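
The steps above can be sketched as follows for UTF-8. This is only an illustration under several assumptions, not the strfuncs.cc code: feed_utf8 is a simplified stand-in for f_mbtowc with a private mbstate (it does not reject overlong forms or encoded surrogates), append_ctrl_n writes a placeholder format rather than the real ^N encoding, the conversion of ordinary characters is reduced to ASCII, and flushing the buffer at the end of the name is my own addition.

```cpp
#include <string>
#include <vector>

// Simplified stand-in for f_mbtowc plus its private mbstate.
struct Utf8State { int remaining = 0; };

// Returns -2 (incomplete), -1 (invalid) or >0 (sequence completed).
static int feed_utf8(Utf8State &st, unsigned char b)
{
    if (st.remaining > 0) {
        if ((b & 0xC0) != 0x80) { st.remaining = 0; return -1; }
        return --st.remaining == 0 ? 1 : -2;
    }
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) { st.remaining = 1; return -2; }
    if (b >= 0xE0 && b <= 0xEF) { st.remaining = 2; return -2; }
    if (b >= 0xF0 && b <= 0xF4) { st.remaining = 3; return -2; }
    return -1;
}

// Placeholder for the real ^N encoding: 0x0E plus two hex digits.
static void append_ctrl_n(std::string &out, unsigned char b)
{
    static const char hex[] = "0123456789ABCDEF";
    out += '\x0E';
    out += hex[b >> 4];
    out += hex[b & 0x0F];
}

static std::string convert_filename(const std::u16string &name)
{
    std::string out;
    std::vector<unsigned char> pending;  // bytes of an incomplete sequence
    Utf8State st;                        // initial state for each filename

    // Copy buffered bytes unchanged and return to the initial state.
    auto flush_raw = [&] {
        out.append(pending.begin(), pending.end());
        pending.clear();
        st = Utf8State{};
    };

    for (char16_t wc : name) {
        if ((wc & 0xFF00) != 0xDC00) {
            flush_raw();  // non-DC?? code: buffered bytes go out as they are
            // Ordinary conversion is out of scope here; ASCII only.
            out += (wc < 0x80) ? static_cast<char>(wc) : '?';
            continue;
        }
        unsigned char b = static_cast<unsigned char>(wc & 0xFF);
        int r = feed_utf8(st, b);
        pending.push_back(b);
        if (r == -1) {
            flush_raw();  // invalid: raw bytes turn back into DC?? codes
        } else if (r > 0) {
            // valid: the raw bytes would decode as a real character
            for (unsigned char p : pending) append_ctrl_n(out, p);
            pending.clear();
            st = Utf8State{};
        }
        // r == -2: incomplete, keep the byte buffered
    }
    flush_raw();  // assumption: leftovers at the end of the name are raw too
    return out;
}
```

For example, a lone 0xDCFF comes out as the raw byte 0xFF, whereas 0xDCC3 followed by 0xDCA9 is written as two ^N placeholders, since the raw bytes C3 A9 would otherwise be read back as U+00E9.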

Unfortunately the latter point still leaves a loophole, in case the
incomplete sequence from the buffer and the subsequent bytes combine
into something valid. Singlebyte charsets aren't affected though,
because they don't have continuation bytes. Nor is UTF-8, because it
was designed such that continuation bytes are distinct from initial
bytes. Which leaves the DBCS charsets.
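
To illustrate why UTF-8 escapes the loophole and a DBCS does not, here is a small sketch comparing lead and trail byte ranges. The Shift-JIS ranges are the ones I understand Windows CP932 to use, so treat them as an assumption, and the function names are made up.

```cpp
// UTF-8: continuation bytes are exactly 0x80..0xBF (top two bits are 10).
static bool utf8_is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }
// UTF-8: lead bytes of multibyte sequences are 0xC2..0xF4.
static bool utf8_is_lead(unsigned char b) { return b >= 0xC2 && b <= 0xF4; }

// Shift-JIS, assuming the CP932 ranges.
static bool sjis_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}
static bool sjis_is_trail(unsigned char b)
{
    return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
}

// Counts the byte values that can serve both as a lead byte and as a
// trail/continuation byte in a given charset.
template <typename Lead, typename Trail>
static int count_overlap(Lead is_lead, Trail is_trail)
{
    int n = 0;
    for (int b = 0; b < 256; ++b) {
        unsigned char c = static_cast<unsigned char>(b);
        if (is_lead(c) && is_trail(c)) ++n;
    }
    return n;
}
```

For UTF-8 the overlap is empty, so a buffered lead byte can never be completed by whatever an ordinary character is converted to. In Shift-JIS the trail range even covers ASCII letters: a buffered lead byte 0x83 followed by a plain 'A' (0x41) would read back as one double-byte character.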

However, it rather looks like DBCSs are an intractable problem here in
any case, because of issues like this:

http://support.microsoft.com/kb/170559: "There are some codes that are
not matched one-to-one between Shift-JIS (Japanese character set
supported by MS) and Unicode. When an application calls
MultiByteToWideChar() and WideCharToMultiByte() to perform code
conversion between Shift-JIS and Unicode, the function returns the
wrong code value in some cases."

Which leaves me scratching my head regarding the C locale. More later ...

Andy

