Date: Tue, 28 Jul 2009 12:36:11 +0200
From: Corinna Vinschen
To: cygwin AT cygwin DOT com
Subject: Re: bug in mbrtowc?
Message-ID: <20090728103611.GP18621@calimero.vinschen.de>
In-Reply-To: <416096c60907280324q5555a9e4he636a7504f44ebf7@mail.gmail.com>
References: <416096c60907271456x5e8cb3f7y64433d542ec6cdcb AT mail DOT gmail DOT com>
 <20090728091413 DOT GJ18621 AT calimero DOT vinschen DOT de>
 <416096c60907280324q5555a9e4he636a7504f44ebf7 AT mail DOT gmail DOT com>
Reply-To: cygwin AT cygwin DOT com

On Jul 28 11:24, Andy Koppe wrote:
> 2009/7/28 Corinna Vinschen:
> > On Jul 27 22:56, Andy Koppe wrote:
> >> I've encountered what looks like a bug in mbrtowc's handling of UTF-8.
> >> Here's an example:
> >>
> >> #include <stdio.h>
> >> #include <locale.h>
> >> #include <wchar.h>
> >>
> >> int main(void) {
> >>   wchar_t wc;
> >>   size_t ret;
> >>   mbstate_t s = { 0 };
> >>   puts(setlocale(LC_CTYPE, "en_GB.UTF-8"));
> >>   printf("%i\n", mbrtowc(&wc, "\xe2", 1, 0));
> >>   printf("%i\n", mbrtowc(&wc, "\x94", 1, 0));
> >>   printf("%i\n", mbrtowc(&wc, "\x84", 1, 0));
> >>   printf("%x\n", wc);
> >>   return 0;
> >> }
> >>
> >> The sequence E2 94 84 should translate to U+2504. Instead, the second
> >> and third calls to mbrtowc report encoding errors.
> >> It does work correctly if the three bytes are passed to mbrtowc()
> >> in one go:
> >>
> >>   printf("%i\n", mbrtowc(&wc, "\xe2\x94\x84", 3, 0));
> >
> > That's a bug in the newlib function __utf8_mbtowc.  I'm really surprised
> > that this bug has never been reported before, since it has been in the
> > code for years, probably since it was introduced in 2002.
>
> I guess normally programs just pass whole strings to mbsrtowcs?
>
> I've had a look at the code, but didn't grasp it well enough to suggest
> a fix. I'd also wondered how mbrtowc() deals with non-BMP characters
> given that wchar_t is only 16 bits wide, and was quite pleased to see
> that it does have a special hack for turning them into UTF-16
> surrogates.
>
> Trouble is, the hack will also only work correctly if the whole UTF-8
> sequence for the non-BMP character is passed at once. If you pass the
> bytes one-by-one instead, and assuming the bug above wasn't there,
> you'd get this:

Yes, I know.  The real trouble is, I don't know how that can be fixed in
a still sort-of-POSIXy way.  Typical POSIX code doesn't know about UTF-16
and expects the wchar returned to be complete.

So, for the time being, surrogates work if they are part of a string
given to mbstowcs/mbsrtowcs/mbsnrtowcs, but they need very special care
if the application uses mbrtowc.

If you can come up with a nice patch, send it to the newlib list, please.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
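
The two problems discussed in this thread (state kept across
byte-at-a-time calls, and non-BMP characters with a 16-bit wchar_t) can
be illustrated with a small standalone sketch. This is not newlib's
__utf8_mbtowc and not a proposed patch: the names utf8_state, utf8_feed
and to_utf16 are hypothetical, the state layout is an assumption made
for illustration, and the return values 0 and -1 only mimic what
mbrtowc signals with (size_t)-2 and (size_t)-1.

```c
#include <stdint.h>

/* Hypothetical decoder state; NOT the layout of newlib's mbstate_t. */
typedef struct {
    uint32_t cp;        /* code point bits accumulated so far */
    uint32_t min;       /* smallest value allowed for this sequence length */
    int      remaining; /* continuation bytes still expected */
} utf8_state;

/* Feed one byte.
 * Returns  1 when *out holds a complete code point,
 *          0 when more bytes are needed (like mbrtowc's (size_t)-2),
 *         -1 on an invalid sequence     (like mbrtowc's (size_t)-1). */
static int utf8_feed(utf8_state *st, unsigned char b, uint32_t *out)
{
    if (st->remaining == 0) {
        /* Start of a new character: classify the lead byte. */
        if (b < 0x80) { *out = b; return 1; }
        if (b >= 0xC2 && b <= 0xDF) {
            st->cp = b & 0x1F; st->min = 0x80; st->remaining = 1;
            return 0;
        }
        if ((b & 0xF0) == 0xE0) {
            st->cp = b & 0x0F; st->min = 0x800; st->remaining = 2;
            return 0;
        }
        if (b >= 0xF0 && b <= 0xF4) {
            st->cp = b & 0x07; st->min = 0x10000; st->remaining = 3;
            return 0;
        }
        return -1; /* stray continuation byte or invalid lead byte */
    }

    /* Inside a sequence: only continuation bytes 10xxxxxx are valid. */
    if ((b & 0xC0) != 0x80) { st->remaining = 0; return -1; }
    st->cp = (st->cp << 6) | (b & 0x3F);
    if (--st->remaining > 0)
        return 0;

    /* Sequence complete: reject overlong forms, surrogates, out of range. */
    if (st->cp < st->min || st->cp > 0x10FFFF ||
        (st->cp >= 0xD800 && st->cp <= 0xDFFF))
        return -1;
    *out = st->cp;
    return 1;
}

/* Split a code point into UTF-16 code units; returns the unit count.
 * A non-BMP character needs two units, which is why a single 16-bit
 * wchar_t result cannot carry it. */
static int to_utf16(uint32_t cp, uint16_t units[2])
{
    if (cp < 0x10000) { units[0] = (uint16_t)cp; return 1; }
    cp -= 0x10000;
    units[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    units[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

With a zero-initialised utf8_state, feeding E2, 94, 84 one byte at a
time returns 0, 0, 1 and yields code point 0x2504. Feeding F0 9F 98 80
yields 0x1F600, which to_utf16 splits into D83D DE00. How a restartable
mbrtowc should hand those two units back to the caller is exactly the
open question raised above, and this sketch does not answer it.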