Date: Tue, 28 Jul 2009 19:02:54 +0200
From: Corinna Vinschen
To: cygwin AT cygwin DOT com
Subject: Re: bug in mbrtowc?
Message-ID: <20090728170254.GX18621@calimero.vinschen.de>
In-Reply-To: <416096c60907280533u2d975655tb957bc5cf05f9040@mail.gmail.com>

On Jul 28 13:33, Andy Koppe wrote:
> 2009/7/28 Corinna Vinschen:
> >> >> Trouble is, the hack will also only work correctly if the whole UTF-8
> >> >> sequence for the non-BMP character is passed at once. If you pass the
> >> >> bytes one-by-one instead, and assuming the bug above wasn't there,
> >> >> you'd get this:
> >> >
> >> > Yes, I know.  The real trouble is, I don't know how that can be fixed
> >> > in a still sort-of-POSIXy way.
> >>
> >> The way I'd suggested is sort-of-POSIXy, but perhaps not enough,
> >> because apps that check the mbrtowc() return code (and not the written
> >> wc) against zero will interpret a low surrogate as string end. An
> >> alternative might be to just return an error when there's no compliant
> >> way to return the low surrogate. Do you think either of these are
> >> worth pursuing?
> >
> > I'm thinking of faking a valid return of 1 (or 2, or 3) after the third byte
> > has been read.  Three bytes are sufficient to create the first surrogate
> > half in wc.
>
> Great idea!
>
> I wouldn't even say it's fake, because as you say, you definitely have
> a high surrogate after three bytes. So just return the number of bytes
> actually used. It's also valid to leave it in a non-initial state
> after that; consider it the surrogate shift state or some such. And if
> the first byte in the next call isn't actually a valid fourth byte,
> just return an error.

I proposed a patch: http://sourceware.org/ml/newlib/2009/msg00781.html


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
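
For readers of the archive, here is a minimal sketch of the arithmetic behind "three bytes are sufficient to create the first surrogate half". It is not the code from the newlib patch; the function names are hypothetical, and the sketch assumes the bytes have already been validated as a 4-byte UTF-8 sequence (lead byte 11110uuu followed by continuation bytes).

```c
#include <stdint.h>

/* Hypothetical helper, not the newlib implementation.
   A 4-byte UTF-8 sequence has the bit layout
       11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
   The high surrogate depends only on the top 11 bits of the code point
   (uuuuu zzzz yy), all of which are present in the first three bytes. */
uint16_t high_surrogate_from_3(const unsigned char *b)
{
    uint32_t top = ((uint32_t)(b[0] & 0x07) << 8)   /* uuu             */
                 | ((uint32_t)(b[1] & 0x3F) << 2)   /* uuzzzz          */
                 | ((uint32_t)(b[2] >> 4) & 0x03);  /* top 2 bits of y */
    /* 0xD800 + ((cp - 0x10000) >> 10) equals 0xD7C0 + (cp >> 10). */
    return (uint16_t)(0xD7C0 + top);
}

/* Hypothetical helper: the low surrogate needs the low 4 bits of the
   third byte plus the 6 payload bits of the fourth byte. */
uint16_t low_surrogate_from_tail(unsigned char b2, unsigned char b3)
{
    return (uint16_t)(0xDC00
                      | ((uint32_t)(b2 & 0x0F) << 6)
                      | (uint32_t)(b3 & 0x3F));
}
```

For U+1F600 (bytes F0 9F 98 80) the first helper yields 0xD83D from the first three bytes alone, and the second yields 0xDE00 once the fourth byte arrives. Note that the low surrogate still needs four bits from the third byte, so a converter working byte by byte would presumably have to keep them in its conversion state between calls, which fits the "surrogate shift state" idea in the quoted text.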