Date: Mon, 27 Jul 2009 22:56:34 +0100
Message-ID: <416096c60907271456x5e8cb3f7y64433d542ec6cdcb@mail.gmail.com>
Subject: bug in mbrtowc?
From: Andy Koppe
To: Cygwin Tech List

I've encountered what looks like a bug in mbrtowc's handling of UTF-8.
Here's an example:

    #include <stdio.h>
    #include <locale.h>
    #include <wchar.h>

    int main(void) {
      wchar_t wc = 0;
      /* Assumes the en_GB.UTF-8 locale is available. */
      const char *loc = setlocale(LC_CTYPE, "en_GB.UTF-8");
      puts(loc ? loc : "(setlocale failed)");
      /* Feed the sequence one byte at a time. The null state pointer
         makes mbrtowc use its internal conversion state. */
      printf("%i\n", (int)mbrtowc(&wc, "\xe2", 1, 0));
      printf("%i\n", (int)mbrtowc(&wc, "\x94", 1, 0));
      printf("%i\n", (int)mbrtowc(&wc, "\x84", 1, 0));
      printf("%x\n", (unsigned)wc);
      return 0;
    }

The sequence E2 94 84 should translate to U+2504. Instead, the second
and third calls to mbrtowc report encoding errors.

It does work correctly if the three bytes are passed to mbrtowc() in
one go:

    printf("%i\n", (int)mbrtowc(&wc, "\xe2\x94\x84", 3, 0));

Andy
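For comparison, here is a minimal sketch of how a restartable UTF-8 decoder can carry a partial character between calls, which is the behaviour the byte-at-a-time test relies on. It is not Cygwin's or newlib's implementation: `utf8_state` and `utf8_feed` are hypothetical names made up for illustration, the return values only loosely mirror mbrtowc's (size_t)-2 and (size_t)-1 conventions, and overlong forms and surrogates are deliberately not rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder state (not the real mbstate_t layout). */
typedef struct {
    uint32_t cp;        /* code point bits accumulated so far */
    int      remaining; /* continuation bytes still expected */
} utf8_state;

/* Feed one byte to the decoder.
 * Returns  1 when a code point is complete (stored in *out),
 *         -2 when the sequence is valid so far but incomplete,
 *         -1 on an invalid byte (the state is reset). */
static int utf8_feed(utf8_state *st, unsigned char b, uint32_t *out)
{
    if (st->remaining == 0) {
        /* Start of a new character: classify the lead byte. */
        if (b < 0x80) {
            *out = b;
            return 1;
        }
        if ((b & 0xE0) == 0xC0) {
            st->cp = b & 0x1F;
            st->remaining = 1;
        } else if ((b & 0xF0) == 0xE0) {
            st->cp = b & 0x0F;
            st->remaining = 2;
        } else if ((b & 0xF8) == 0xF0) {
            st->cp = b & 0x07;
            st->remaining = 3;
        } else {
            return -1; /* stray continuation byte or invalid lead byte */
        }
        return -2;
    }

    /* Inside a character: only continuation bytes (10xxxxxx) are valid. */
    if ((b & 0xC0) != 0x80) {
        st->remaining = 0;
        return -1;
    }
    st->cp = (st->cp << 6) | (b & 0x3F);
    if (--st->remaining > 0)
        return -2;

    *out = st->cp;
    return 1;
}
```

Starting from a zeroed `utf8_state` and feeding 0xE2, 0x94, 0x84 in turn gives -2, -2 and then 1, with the code point 0x2504 stored on the last call. The point of the sketch is that the partial value has to survive between calls; the failure described above is what one would expect to see if that state were lost or ignored after the first byte.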