| Date: | Mon, 27 Jul 2009 22:56:34 +0100 |
| Message-ID: | <416096c60907271456x5e8cb3f7y64433d542ec6cdcb@mail.gmail.com> |
| Subject: | bug in mbrtowc? |
| From: | Andy Koppe <andy DOT koppe AT gmail DOT com> |
| To: | Cygwin Tech List <cygwin AT cygwin DOT com> |
I've encountered what looks like a bug in mbrtowc's handling of UTF-8.
Here's an example:
    #include <stdio.h>
    #include <locale.h>
    #include <wchar.h>

    int main(void) {
      wchar_t wc = 0;
      const char *loc = setlocale(LC_CTYPE, "en_GB.UTF-8");
      puts(loc ? loc : "(setlocale failed)");
      /* Feed the sequence one byte at a time; a null state pointer makes
         mbrtowc use its own internal conversion state. */
      printf("%i\n", (int)mbrtowc(&wc, "\xe2", 1, 0));
      printf("%i\n", (int)mbrtowc(&wc, "\x94", 1, 0));
      printf("%i\n", (int)mbrtowc(&wc, "\x84", 1, 0));
      printf("%x\n", (unsigned)wc);
      return 0;
    }
The sequence E2 94 84 should translate to U+2504, so the three calls
ought to return -2, -2 and 1 (incomplete, incomplete, character
completed). Instead, the second and third calls to mbrtowc report
encoding errors. It does work correctly if the three bytes are passed
to mbrtowc() in one go:
    printf("%i\n", (int)mbrtowc(&wc, "\xe2\x94\x84", 3, 0));
Andy