Mail Archives: cygwin/2009/07/28/01:23:16
2009/7/28 Pedro Izecksohn:
>> #include <stdio.h>
>> #include <locale.h>
>> #include <stdlib.h>
>> #include <wchar.h>
>>
>> int main(void) {
>> wchar_t wc;
>> size_t ret;
>> mbstate_t s = { 0 };
>> puts(setlocale(LC_CTYPE, "en_GB.UTF-8"));
>> printf("%i\n", (int)mbrtowc(&wc, "\xe2", 1, 0));
>> printf("%i\n", (int)mbrtowc(&wc, "\x94", 1, 0));
>> printf("%i\n", (int)mbrtowc(&wc, "\x84", 1, 0));
>> printf("%x\n", (unsigned)wc);
>> return 0;
>> }
>>
>> The sequence E2 94 84 should translate to U+2504. Instead, the second
>> and third calls to mbrtowc report encoding errors. It does work
>> correctly if the three bytes are passed to mbrtowc() in one go:
> From the "Linux Programmer’s Manual" (release 3.15 of the Linux man-pages):
> "If the n bytes starting at s do not contain a complete multibyte
> character, mbrtowc() returns (size_t) -2."
Correct. And the first call to mbrtowc() does just that. The problem
is that the second call returns -1, which signals an encoding error,
even though E2 94 is a valid yet incomplete sequence, i.e. it should
also return -2 and remember what it's seen so far in its internal
state. The third call should return 1 and write 0x2504 to wc.
Andy