Date: Tue, 28 Jul 2009 13:33:29 +0100
Subject: Re: bug in mbrtowc?
From: Andy Koppe
To: cygwin AT cygwin DOT com

2009/7/28 Corinna Vinschen:
>> >> Trouble is, the hack will also only work correctly if the whole UTF-8
>> >> sequence for the non-BMP character is passed at once. If you pass the
>> >> bytes one-by-one instead, and assuming the bug above wasn't there,
>> >> you'd get this:
>> >
>> > Yes, I know.  The real trouble is, I don't know how that can be fixed
>> > in a still sort-of-POSIXy way.
>>
>> The way I'd suggested is sort-of-POSIXy, but perhaps not enough,
>> because apps that check the mbrtowc() return code (and not the written
>> wc) against zero will interpret a low surrogate as string end. An
>> alternative might be to just return an error when there's no compliant
>> way to return the low surrogate. Do you think either of these are
>> worth pursuing?
>
> I'm thinking of faking a valid return of 1 (or 2, or 3) after the third byte
> has been read.  Three bytes are sufficient to create the first surrogate
> half in wc.

Great idea! I wouldn't even say it's fake, because as you say, you
definitely have a high surrogate after three bytes. So just return the
number of bytes actually used. It's also valid to leave it in a
non-initial state after that; consider it the surrogate shift state or
some such.
And if the first byte in the next call isn't actually a valid fourth byte,
just return an error.

Andy
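
For illustration, the scheme discussed above could be sketched roughly as
follows. This is a minimal, hypothetical decoder for a 16-bit wide
character target, not the actual Cygwin/newlib code: the names
sketch_mbrtowc16, u8state, INCOMPLETE and INVALID are invented here, a
plain struct stands in for mbstate_t, and NUL is not special-cased the way
mbrtowc() does it. It returns the high surrogate as soon as the third byte
of a four-byte sequence has been read, stays in a non-initial state, and
either completes the low surrogate from the fourth byte or reports an
error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state; NOT the layout of any real mbstate_t. */
typedef struct {
    uint32_t acc;   /* payload bits collected so far */
    int total;      /* length of the sequence in progress, 0 = initial state */
    int seen;       /* bytes of that sequence consumed so far */
} u8state;

#define INCOMPLETE ((size_t)-2)   /* more bytes needed, nothing written */
#define INVALID    ((size_t)-1)   /* invalid sequence, state reset */

static size_t fail(u8state *st)
{
    st->acc = 0; st->total = 0; st->seen = 0;
    return INVALID;
}

/* Decode UTF-8 into one UTF-16 unit per call; returns the bytes consumed. */
size_t sketch_mbrtowc16(uint16_t *out, const unsigned char *s, size_t n,
                        u8state *st)
{
    assert(out && s && st);
    size_t used = 0;
    while (used < n) {
        unsigned char b = s[used++];
        if (st->total == 0) {                 /* expecting a lead byte */
            if (b < 0x80) { *out = b; return used; }
            if (b >= 0xC2 && b <= 0xDF)      { st->acc = b & 0x1F; st->total = 2; }
            else if ((b & 0xF0) == 0xE0)     { st->acc = b & 0x0F; st->total = 3; }
            else if (b >= 0xF0 && b <= 0xF4) { st->acc = b & 0x07; st->total = 4; }
            else return fail(st);
            st->seen = 1;
            continue;
        }
        if ((b & 0xC0) != 0x80)               /* not a continuation byte */
            return fail(st);
        st->seen++;
        if (st->total == 4 && st->seen == 4) {
            /* "Surrogate shift state": the fourth byte completes the low half. */
            *out = (uint16_t)(0xDC00 | ((st->acc & 0x0F) << 6) | (b & 0x3F));
            st->acc = 0; st->total = 0; st->seen = 0;
            return used;
        }
        st->acc = (st->acc << 6) | (b & 0x3F);
        if (st->total == 4 && st->seen == 3) {
            /* acc holds bits 20..6 of the code point: enough for the high half. */
            if (st->acc < 0x400 || st->acc > 0x43FF)   /* overlong or > U+10FFFF */
                return fail(st);
            *out = (uint16_t)(0xD800 + ((st->acc >> 4) - 0x40));
            return used;                      /* state stays non-initial */
        }
        if (st->seen == st->total) {          /* complete 2- or 3-byte sequence */
            uint32_t cp = st->acc;
            int bad = st->total == 3 &&
                      (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF));
            st->acc = 0; st->total = 0; st->seen = 0;
            if (bad) return INVALID;
            *out = (uint16_t)cp;
            return used;
        }
    }
    return INCOMPLETE;
}
```

Fed the bytes F0 9F 98 80 (U+1F600) one at a time, this sketch returns
(size_t)-2 twice, then 1 with wc = 0xD83D, then 1 with wc = 0xDE00. Passing
all four bytes at once returns 3 with the high surrogate, and the next call
consumes the remaining byte. Neither return value is ever zero, so callers
that test the return code against zero do not mistake a surrogate for the
end of the string.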