Mail Archives: cygwin/2009/07/28/06:36:35

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Tue, 28 Jul 2009 12:36:11 +0200
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: bug in mbrtowc?
Message-ID: <20090728103611.GP18621@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
References: <416096c60907271456x5e8cb3f7y64433d542ec6cdcb AT mail DOT gmail DOT com> <20090728091413 DOT GJ18621 AT calimero DOT vinschen DOT de> <416096c60907280324q5555a9e4he636a7504f44ebf7 AT mail DOT gmail DOT com>
MIME-Version: 1.0
In-Reply-To: <416096c60907280324q5555a9e4he636a7504f44ebf7@mail.gmail.com>
User-Agent: Mutt/1.5.19 (2009-02-20)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On Jul 28 11:24, Andy Koppe wrote:
> 2009/7/28 Corinna Vinschen:
> > On Jul 27 22:56, Andy Koppe wrote:
> >> I've encountered what looks like a bug in mbrtowc's handling of UTF-8.
> >> Here's an example:
> >>
> >> #include <stdio.h>
> >> #include <locale.h>
> >> #include <stdlib.h>
> >> #include <wchar.h>
> >>
> >> int main(void) {
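> >>   wchar_t wc = 0;  /* initialised so the final printf is defined on failure */
> >>   size_t ret;
> >>   mbstate_t s = { 0 };
> >>   puts(setlocale(LC_CTYPE, "en_GB.UTF-8"));
> >>   /* mbrtowc returns size_t; cast so the argument matches %i */
> >>   printf("%i\n", (int) mbrtowc(&wc, "\xe2", 1, 0));
> >>   printf("%i\n", (int) mbrtowc(&wc, "\x94", 1, 0));
> >>   printf("%i\n", (int) mbrtowc(&wc, "\x84", 1, 0));
> >>   printf("%x\n", (unsigned) wc);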
> >>   return 0;
> >> }
> >>
> >> The sequence E2 94 84 should translate to U+2504. Instead, the second
> >> and third calls to mbrtowc report encoding errors. It does work
> >> correctly if the three bytes are passed to mbrtowc() in one go:
> >>
> >>   printf("%i\n", (int) mbrtowc(&wc, "\xe2\x94\x84", 3, 0));
> >
> > That's a bug in the newlib function __utf8_mbtowc.  I'm really surprised
> > that this bug has never been reported before, since it has been in the
> > code for years, probably ever since it was introduced in 2002.
> 
> I guess normally programs just pass whole strings to mbsrtowcs?
> 
> I've had a look at the code, but didn't grasp it enough to suggest a
> fix. I'd also wondered how mbrtowc() deals with non-BMP characters
> given that wchar_t is only 16 bits wide, and was quite pleased to see
> that it does have a special hack for turning them into UTF-16
> surrogates.
> 
> Trouble is, the hack will also only work correctly if the whole UTF-8
> sequence for the non-BMP character is passed at once. If you pass the
> bytes one-by-one instead, and assuming the bug above wasn't there,
> you'd get this:

Yes, I know.  The real trouble is, I don't know how that can be fixed
in a still sort-of-POSIXy way.  Typical POSIX code doesn't know about
UTF-16 and expects the wchar_t returned to be a complete character.
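
To illustrate why a 16-bit wchar_t cannot hold a non-BMP character in
one unit, here is a minimal sketch of the UTF-16 surrogate split.  The
function name to_utf16 is made up for this example and is not part of
newlib or Cygwin; it only shows the arithmetic, not how __utf8_mbtowc
is implemented.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
   Returns the number of 16-bit units written to out (1 or 2).
   Assumes cp is a valid scalar value; surrogate code points
   (U+D800..U+DFFF) passed as input are not rejected here. */
static int to_utf16(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF);

    if (cp < 0x10000) {
        /* BMP character: fits in a single unit. */
        out[0] = (uint16_t) cp;
        return 1;
    }

    /* Non-BMP: 20 bits remain after subtracting 0x10000.
       The high 10 bits go into the lead surrogate, the low 10 bits
       into the trail surrogate. */
    cp -= 0x10000;
    out[0] = (uint16_t) (0xD800 | (cp >> 10));
    out[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));
    return 2;
}
```

For U+1F600 this yields the two units D83D DE00, so a caller that
expects one complete character per wchar_t sees only half of it after
the first unit.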

So, for the time being, surrogates work if they are part of a string
given to mbstowcs/mbsrtowcs/mbsnrtowcs, but they need very special
care if the application uses mbrtowc.
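
One possible form of that special care, sketched under assumptions: an
application that receives bytes one at a time could collect a whole
UTF-8 sequence first and only then hand it to the conversion function
in a single call, which is the case reported as working in this thread.
The names utf8_acc, utf8_seq_len and utf8_acc_feed are hypothetical and
not part of any library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accumulator for the bytes of one UTF-8 sequence. */
struct utf8_acc {
    unsigned char buf[4];
    size_t have;   /* bytes collected so far */
    size_t need;   /* total length implied by the lead byte */
};

/* Length implied by a UTF-8 lead byte, or 0 if the byte cannot
   start a sequence. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

/* Feed one byte.  Returns the sequence length once acc->buf holds a
   whole sequence, 0 if more bytes are needed, and -1 on a byte that
   does not fit (the accumulator is then empty again).  Only the
   lead/continuation structure is checked; overlong forms and encoded
   surrogates are left for the real converter to reject. */
static int utf8_acc_feed(struct utf8_acc *acc, unsigned char byte)
{
    if (acc->have == 0) {
        acc->need = utf8_seq_len(byte);
        if (acc->need == 0)
            return -1;
    } else if ((byte & 0xC0) != 0x80) {
        acc->have = 0;
        return -1;
    }

    assert(acc->have < sizeof acc->buf);
    acc->buf[acc->have++] = byte;

    if (acc->have < acc->need)
        return 0;

    acc->have = 0;               /* ready for the next sequence */
    return (int) acc->need;
}
```

When utf8_acc_feed returns n > 0, the n bytes in acc->buf form one
complete sequence and can be passed to the converter in one go.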

If you can come up with a nice patch, send it to the newlib list,
please.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

