Mail Archives: cygwin/2009/07/28/08:33:48

Date: Tue, 28 Jul 2009 13:33:29 +0100
Message-ID: <416096c60907280533u2d975655tb957bc5cf05f9040@mail.gmail.com>
Subject: Re: bug in mbrtowc?
From: Andy Koppe <andy DOT koppe AT gmail DOT com>
To: cygwin AT cygwin DOT com

2009/7/28 Corinna Vinschen:
>> >> Trouble is, the hack will also only work correctly if the whole UTF-8
>> >> sequence for the non-BMP character is passed at once. If you pass the
>> >> bytes one-by-one instead, and assuming the bug above wasn't there,
>> >> you'd get this:
>> >
>> > Yes, I know. The real trouble is, I don't know how that can be fixed
>> > in a still sort-of-POSIXy way.
>>
>> The way I'd suggested is sort-of-POSIXy, but perhaps not enough,
>> because apps that check the mbrtowc() return code (and not the written
>> wc) against zero will interpret a low surrogate as string end. An
>> alternative might be to just return an error when there's no compliant
>> way to return the low surrogate. Do you think either of these is
>> worth pursuing?
>
> I'm thinking of faking a valid return of 1 (or 2, or 3) after the third byte
> has been read. Three bytes are sufficient to create the first surrogate
> half in wc.

Great idea!

I wouldn't even say it's fake, because as you say, you definitely have
a high surrogate after three bytes. So just return the number of bytes
actually used. It's also valid to leave it in a non-initial state
after that; consider it the surrogate shift state or some such. And if
the first byte in the next call isn't actually a valid fourth byte,
just return an error.
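
To make that concrete, here is a minimal sketch of the idea, not the actual
Cygwin/newlib code. The names sketch_state and sketch_feed are made up for
illustration, the state struct is a stand-in for mbstate_t, and the function
takes exactly one byte per call and only handles 4-byte UTF-8 sequences. It
reports the high surrogate as soon as the third byte arrives, stays in a
non-initial state, and reports the low surrogate (or an error) on the next
byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state; a stand-in for mbstate_t. */
struct sketch_state {
    unsigned count;   /* bytes of the current 4-byte sequence seen so far */
    uint32_t bits;    /* code point bits accumulated so far */
};

/*
 * Feed one byte of a 4-byte UTF-8 sequence.
 * Returns 1 and stores a UTF-16 unit in *wc when a unit is complete,
 * (size_t)-2 when more bytes are needed, (size_t)-1 on an invalid byte.
 */
size_t sketch_feed(uint16_t *wc, unsigned char b, struct sketch_state *st)
{
    assert(wc != NULL && st != NULL);

    if (st->count == 0) {
        if ((b & 0xF8) != 0xF0)          /* only 4-byte lead bytes here */
            return (size_t)-1;
        st->bits = (uint32_t)(b & 0x07) << 18;
        st->count = 1;
        return (size_t)-2;
    }

    if ((b & 0xC0) != 0x80) {            /* not a continuation byte */
        st->count = 0;
        return (size_t)-1;
    }

    switch (st->count) {
    case 1:
        st->bits |= (uint32_t)(b & 0x3F) << 12;
        /* Reject overlong forms and values above U+10FFFF. */
        if (st->bits < 0x10000 || st->bits > 0x10FFFF) {
            st->count = 0;
            return (size_t)-1;
        }
        st->count = 2;
        return (size_t)-2;
    case 2:
        /* Bits 6..20 are now known, which covers the whole high surrogate. */
        st->bits |= (uint32_t)(b & 0x3F) << 6;
        st->count = 3;                   /* the "surrogate shift state" */
        *wc = (uint16_t)(0xD800 | ((st->bits - 0x10000) >> 10));
        return 1;
    default:
        /* Fourth byte: the low 10 bits of the code point are complete. */
        *wc = (uint16_t)(0xDC00 | ((st->bits | (b & 0x3F)) & 0x3FF));
        st->count = 0;                   /* back to the initial state */
        return 1;
    }
}
```

Feeding F0 9F 98 80 (U+1F600) one byte at a time gives -2, -2, then 1 with
0xD83D in wc, then 1 with 0xDE00. If the byte after the third one is not a
continuation byte, the call returns -1 and the state is reset.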

Andy
