Date: Wed, 23 Sep 2009 12:52:06 +0100
Subject: Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
From: Andy Koppe
To: cygwin AT cygwin DOT com

2009/9/22 Corinna Vinschen:
>> >> Therefore, when converting a UTF-16 Windows filename to the current
>> >> charset, 0xDC?? words should be treated like any other UTF-16 word
>> >> that can't be represented in the current charset: it should be encoded
>> >> as a ^N sequence.

(I started writing this before seeing your patch to the singlebyte
codepage tables, which makes plenty of sense. Here goes anyway.)

Having actually looked at strfuncs.cc, my diagnosis was too simplistic,
because the U+DC?? codes are used not only for invalid UTF-8 bytes, but
for invalid bytes in any charset. This even includes CP1252, which has a
few holes in the 0x80..0x9F range.
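
To make the point about invalid bytes concrete, here is a small sketch in
Python rather than the C++ of strfuncs.cc. It relies on Python's own
"surrogateescape" error handler, which maps an undecodable byte 0xNN to
U+DCNN and so behaves much like the scheme discussed in this thread. The
helper names to_wide and invalid_bytes are made up for illustration, and
Python's CP1252 table is only assumed to have the same holes as the table
Cygwin uses.

```python
def to_wide(raw: bytes, charset: str) -> str:
    """Decode raw bytes; each undecodable byte 0xNN (>= 0x80) becomes U+DCNN.

    This uses Python's 'surrogateescape' handler as a stand-in for the
    conversion discussed in the thread. It is not Cygwin code.
    """
    return raw.decode(charset, "surrogateescape")


def invalid_bytes(charset: str, lo: int = 0x80, hi: int = 0x9F) -> list:
    """List the single bytes in lo..hi that the charset cannot decode."""
    bad = []
    for b in range(lo, hi + 1):
        try:
            bytes([b]).decode(charset)
        except UnicodeDecodeError:
            bad.append(b)
    return bad


# Holes in the 0x80..0x9F range according to Python's CP1252 table.
holes = invalid_bytes("cp1252")
```

Calling to_wide(b"\x81", "cp1252") then yields the lone code U+DC81, while
to_wide(b"\x80", "cp1252") yields the Euro sign, because 0x80 is a valid
CP1252 byte.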
Therefore, the complete solution would be something like this: when
sys_cp_wcstombs comes across a 0xDC?? code, it checks whether the byte
it encodes is indeed an invalid byte in the current charset. If it is,
it translates it into that invalid byte, because on the way back it
would once again be turned into the same 0xDC?? code. If the byte would
represent (part of) a valid character, however, it would need to be
encoded as a ^N sequence to ensure correct roundtripping.

Now that shouldn't be too difficult to implement for singlebyte
charsets, but it gets somewhat hairy for multibyte charsets, including
UTF-8 itself. Here's how I think it could be done though:

In sys_cp_wcstombs:

* On encountering a DC?? code, extract the encoded byte, and feed it
  into f_mbtowc. A private mbstate for this is needed, starting in the
  initial state for each filename. Switch on the result of f_mbtowc:
** case -2 (incomplete sequence): add the byte to a buffer for this
   purpose
** case -1 (invalid sequence): copy anything already in the buffer plus
   the current byte into the target filename, as we can be sure that
   they'll turn back into U+DCbb again on the way back.
** case >0 (valid sequence): encode the buffer contents and the current
   byte as ^N codes that don't represent valid UTF-8
* When encountering a non-DC?? code, copy any bytes left in the buffer
  into the target filename.

Unfortunately the latter point still leaves a loophole, in case the
incomplete sequence from the buffer and the subsequent bytes combine
into something valid. Singlebyte charsets aren't affected though,
because they don't have continuation bytes. Nor is UTF-8, because it
was designed such that continuation bytes are distinct from initial
bytes. Which leaves the DBCS charsets.
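
The sys_cp_wcstombs steps listed above can be sketched roughly as
follows. This is Python, not the C++ of strfuncs.cc, and it only mirrors
the steps as written; it is not a complete solution. Several things are
assumptions for illustration: the names wcs_to_mbs and escape are
hypothetical, Python's incremental decoder stands in for f_mbtowc with
its private mbstate, a ^N code is assumed to be the byte 0x0E followed
by the UTF-8-style bytes of the code, and any buffered bytes are flushed
at the end of the name as if a non-DC?? code had followed.

```python
import codecs

SO = b"\x0e"  # ^N (Ctrl-N); assumed here to introduce an escaped code


def escape(byte: int) -> bytes:
    """Hypothetical ^N code for U+DCbb: ^N plus the UTF-8-style bytes of the
    lone surrogate, which never form valid UTF-8."""
    return SO + chr(0xDC00 | byte).encode("utf-8", "surrogatepass")


def wcs_to_mbs(name: str, charset: str) -> bytes:
    """Sketch of the proposed conversion; 'name' models the Windows filename."""
    out = bytearray()
    # Stand-in for f_mbtowc plus a private mbstate, fresh for each filename.
    dec = codecs.getincrementaldecoder(charset)()
    pending = bytearray()  # bytes of a so-far incomplete sequence

    def flush_raw():
        # Copy buffered bytes unchanged and return to the initial state.
        out.extend(pending)
        pending.clear()
        dec.reset()

    for ch in name:
        cp = ord(ch)
        if 0xDC00 <= cp <= 0xDCFF:
            byte = cp & 0xFF
            try:
                decoded = dec.decode(bytes([byte]))
            except UnicodeDecodeError:
                # case -1 (invalid): buffer plus current byte go out raw.
                pending.append(byte)
                flush_raw()
                continue
            pending.append(byte)
            if decoded:
                # case >0 (valid): every byte involved becomes a ^N code.
                for b in pending:
                    out.extend(escape(b))
                pending.clear()
            # else case -2 (incomplete): the byte just stays buffered.
        else:
            # Non-DC?? code: copy leftover buffered bytes, then encode it.
            flush_raw()
            try:
                out.extend(ch.encode(charset))
            except UnicodeEncodeError:
                # Not representable in the charset: assumed ^N sequence.
                out.extend(SO + ch.encode("utf-8", "surrogatepass"))
    flush_raw()  # end of name, treated like a non-DC?? code (assumption)
    return bytes(out)
```

For example, wcs_to_mbs("\udc80", "utf-8") passes the invalid byte 0x80
through unchanged, whereas wcs_to_mbs("\udcc3\udca9", "utf-8") escapes
both bytes, since 0xC3 0xA9 written out raw would read back as a valid
character.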
However, it rather looks like DBCSs are an intractable problem here in
any case, because of issues like this:

http://support.microsoft.com/kb/170559: "There are some codes that are
not matched one-to-one between Shift-JIS (Japanese character set
supported by MS) and Unicode. When an application calls
MultiByteToWideChar() and WideCharToMultiByte() to perform code
conversion between Shift-JIS and Unicode, the function returns the
wrong code value in some cases."

Which leaves me scratching my head regarding the C locale. More later ...

Andy