Mail Archives: cygwin/2009/07/28/05:14:39

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Tue, 28 Jul 2009 11:14:13 +0200
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: bug in mbrtowc?
Message-ID: <20090728091413.GJ18621@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
References: <416096c60907271456x5e8cb3f7y64433d542ec6cdcb AT mail DOT gmail DOT com>
MIME-Version: 1.0
In-Reply-To: <416096c60907271456x5e8cb3f7y64433d542ec6cdcb@mail.gmail.com>
User-Agent: Mutt/1.5.19 (2009-02-20)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On Jul 27 22:56, Andy Koppe wrote:
> I've encountered what looks like a bug in mbrtowc's handling of UTF-8.
> Here's an example:
> 
> #include <stdio.h>
> #include <locale.h>
> #include <stdlib.h>
> #include <wchar.h>
> 
> int main(void) {
>   wchar_t wc;
>   size_t ret;
>   mbstate_t s = { 0 };
>   puts(setlocale(LC_CTYPE, "en_GB.UTF-8"));
>   printf("%i\n", (int)mbrtowc(&wc, "\xe2", 1, 0));
>   printf("%i\n", (int)mbrtowc(&wc, "\x94", 1, 0));
>   printf("%i\n", (int)mbrtowc(&wc, "\x84", 1, 0));
>   printf("%x\n", (unsigned)wc);
>   return 0;
> }
> 
> The sequence E2 94 84 should translate to U+2504. Instead, the second
> and third calls to mbrtowc report encoding errors. It does work
> correctly if the three bytes are passed to mbrtowc() in one go:
> 
>   printf("%i\n", (int)mbrtowc(&wc, "\xe2\x94\x84", 3, 0));

That's a bug in the newlib function __utf8_mbtowc.  I'm really surprised
that it has never been reported before, since it has been in the code for
years, probably ever since the function was introduced in 2002.

I'll follow up on the newlib list.
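
For reference, here is a rough sketch of what a restartable UTF-8 decoder
has to do when it is fed one byte per call: keep the bits collected so far
in the conversion state, and report "incomplete" (the (size_t)-2 case of
mbrtowc) until the last continuation byte arrives.  This is not the newlib
code and not the eventual fix.  The names utf8_state, utf8_feed,
UTF8_INCOMPLETE and UTF8_INVALID are made up for illustration, and the
sketch does not reject overlong forms, surrogates or out-of-range values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state; this is NOT newlib's mbstate_t layout. */
typedef struct {
    uint32_t partial;   /* code point bits collected so far */
    int remaining;      /* continuation bytes still expected */
} utf8_state;

#define UTF8_INCOMPLETE (-2)   /* mirrors mbrtowc's (size_t)-2 */
#define UTF8_INVALID    (-1)   /* mirrors mbrtowc's (size_t)-1 */

/*
 * Feed a single byte to the decoder.
 * Returns 1 when the byte completes a character (stored in *out),
 * UTF8_INCOMPLETE when more bytes are needed, UTF8_INVALID on a
 * malformed sequence.  Illustrative only: no overlong or range checks.
 */
int utf8_feed(utf8_state *st, unsigned char byte, uint32_t *out)
{
    assert(st != NULL && out != NULL);

    if (st->remaining == 0) {
        /* Start of a new character. */
        if (byte < 0x80) {
            *out = byte;                    /* plain ASCII */
            return 1;
        }
        if ((byte & 0xE0) == 0xC0) {        /* 110xxxxx: 2-byte form */
            st->partial = byte & 0x1F;
            st->remaining = 1;
        } else if ((byte & 0xF0) == 0xE0) { /* 1110xxxx: 3-byte form */
            st->partial = byte & 0x0F;
            st->remaining = 2;
        } else if ((byte & 0xF8) == 0xF0) { /* 11110xxx: 4-byte form */
            st->partial = byte & 0x07;
            st->remaining = 3;
        } else {
            return UTF8_INVALID;            /* stray continuation byte */
        }
        return UTF8_INCOMPLETE;
    }

    /* In the middle of a character: only 10xxxxxx is acceptable. */
    if ((byte & 0xC0) != 0x80) {
        st->partial = 0;
        st->remaining = 0;
        return UTF8_INVALID;
    }

    /* The saved bits must survive between calls. */
    st->partial = (st->partial << 6) | (byte & 0x3F);
    if (--st->remaining > 0)
        return UTF8_INCOMPLETE;

    *out = st->partial;
    st->partial = 0;
    return 1;
}
```

Starting from a zeroed utf8_state and feeding E2, 94, 84 one at a time,
this returns UTF8_INCOMPLETE twice and then 1 with the decoded character
in *out, which is the pattern the testcase expects from mbrtowc.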


Thanks for the report and especially thanks for the testcase,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

