From: Pedro Izecksohn
To: cygwin AT cygwin DOT com
Subject: Re: bug in mbrtowc?
Date: Tue, 28 Jul 2009 06:50:58 -0300
Message-ID: <94b5b62d0907280250q3321f62ft6cc542367dbc68d2@mail.gmail.com>
In-Reply-To: <20090728091413.GJ18621@calimero.vinschen.de>

The bug is in the O.P.'s code, as &s is not being passed to mbrtowc.
I'm on Ubuntu. I do not have Cygwin here.
I should consume some calories before trying to debug anything.

On Tue, Jul 28, 2009 at 6:14 AM, Corinna Vinschen wrote:
> On Jul 27 22:56, Andy Koppe wrote:
>> I've encountered what looks like a bug in mbrtowc's handling of UTF-8.
>> Here's an example:
>>
>> #include <locale.h>
>> #include <stdio.h>
>> #include <wchar.h>
>>
>> int main(void) {
>>   wchar_t wc;
>>   size_t ret;
>>   mbstate_t s = { 0 };
>>   puts(setlocale(LC_CTYPE, "en_GB.UTF-8"));
>>   printf("%i\n", mbrtowc(&wc, "\xe2", 1, 0));
>>   printf("%i\n", mbrtowc(&wc, "\x94", 1, 0));
>>   printf("%i\n", mbrtowc(&wc, "\x84", 1, 0));
>>   printf("%x\n", wc);
>>   return 0;
>> }
>>
>> The sequence E2 94 84 should translate to U+2514. Instead, the second
>> and third calls to mbrtowc report encoding errors.
>> It does work correctly if the three bytes are passed to mbrtowc() in
>> one go:
>>
>>   printf("%i\n", mbrtowc(&wc, "\xe2\x94\x84", 3, 0));
>
> That's a bug in the newlib function __utf8_mbtowc.  I'm really surprised
> that this bug has never been reported before since it's in the code for
> years, probably since it has been introduced in 2002.
>
> I'll follow up on the newlib list.
>
>
> Thanks for the report and especially thanks for the testcase,
> Corinna
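
For reference, here is a minimal sketch of the byte-at-a-time call with the state object passed explicitly, as suggested at the top of this message. It is not tested on Cygwin. The helper names decode_bytewise and use_utf8_locale are made up for illustration, and the list of locale names is an assumption about what a given system provides. Passing a null pointer as the last argument is permitted too (mbrtowc then uses an internal state object), so this only makes the state explicit. By the UTF-8 encoding rules the bytes E2 94 84 decode to U+2504 (U+2514 would be E2 94 94), so that is the value to expect.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: try a few UTF-8 locale names. Which of them
 * exist is system-dependent. Returns 1 if one was selected, else 0. */
int use_utf8_locale(void)
{
    static const char *const names[] = { "en_GB.UTF-8", "C.UTF-8", "en_US.UTF-8" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/* Hypothetical helper: feed n bytes to mbrtowc() one at a time, carrying
 * the partial-character state in an explicit mbstate_t. Returns 0 and
 * stores the character in *out if the n bytes form exactly one character;
 * returns -1 on an encoding error or if the bytes end mid-character. */
int decode_bytewise(const char *bytes, size_t n, wchar_t *out)
{
    mbstate_t s;
    memset(&s, 0, sizeof s);            /* initial conversion state */

    for (size_t i = 0; i < n; i++) {
        size_t ret = mbrtowc(out, bytes + i, 1, &s);
        if (ret == (size_t)-1)
            return -1;                  /* invalid sequence */
        if (ret == (size_t)-2)
            continue;                   /* incomplete so far; state kept in s */
        return (i + 1 == n) ? 0 : -1;   /* complete; all bytes must be used */
    }
    return -1;                          /* ran out of bytes mid-character */
}
```

On an implementation that keeps the partial sequence in the state object, decode_bytewise("\xe2\x94\x84", 3, &wc) should return 0 and leave 0x2504 in wc once use_utf8_locale() has succeeded. A -1 from the second byte would match the behaviour described in the quoted report.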