Date: Tue, 12 May 2009 18:54:04 +0200
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: newlib AT sourceware DOT org
Cc: cygwin AT cygwin DOT com
Subject: [Fwd: [1.7] wcwidth failing configure tests]
Message-ID: <20090512165404.GW21324@calimero.vinschen.de>
Reply-To: newlib AT sourceware DOT org
Mail-Followup-To: newlib AT sourceware DOT org, cygwin AT cygwin DOT com
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.19 (2009-02-20)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com

Forwarded to newlib.

----- Forwarded message from Eric Blake -----
> Date: Tue, 12 May 2009 16:02:04 +0000 (UTC)
> From: Eric Blake
> Subject:  [1.7] wcwidth failing configure tests
> To: cygwin AT cygwin DOT com
> 
> I noticed this failure in various configure scripts (findutils, coreutils, ...):
> 
> checking whether wcwidth works reasonably in UTF-8 locales... no
> 
> I've reduced it to a STC:
> 
> #include <locale.h>
> #include <wchar.h>
> int main ()
> {
>   int i = 0;
>   if (setlocale (LC_ALL, "fr_FR.UTF-8") != NULL)
>     {
>       if (wcwidth (0x0301) > 0)
>         i |= 1;
>       if (wcwidth (0x200B) > 0)
>         i |= 2;
>     }
>   return i;
> }
> 
> The return value should be 0 but is coming back as 3; 0x0301 is a combining 
> mark which should occupy no space on its own, and 0x200b is a 0-width space, 
> according to Unicode 5.1 (and earlier, to some extent).  And that probably 
> means that other places within wcwidth() are broken.
----- End forwarded message -----

wcwidth returns 1 whenever iswprint returns true.  A quick debugging
attempt shows that the entire range 0x0300..0x034f is marked as
printable in the u3 array in libc/ctype/utf8print.h.  All characters in
that range are combining characters, which are printable but have zero
width, so wcwidth reports 1 for them.

Likewise, 200b..200d are three zero-width characters which are also
printable.

Scanning the Unicode 5.1 standard, I see several characters and ranges
which are printable but have zero width:

0300..036f
0483..0489
200b..200f
20d0..20ea
3099..309a
fe20..fe23 (not sure about these.  Each is one half of a full combined
	    char and doesn't make sense alone, afaics)
feff
and a couple of musical symbols in the 0x1d1xx range

How can we fix this problem?  Should we hardcode a check for the above
character values in wcwidth?
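
To make the hardcoding option concrete, here is a minimal sketch of what
such a check might look like.  It is not the existing newlib code.  The
names zw_range, zw_table, is_zero_width and sketch_wcwidth are made up
for illustration, the table contains only the ranges listed above
(fe20..fe23 included despite the doubt noted there, the musical symbols
left out because their exact values are not given), and the fallback
just mirrors the "1 if iswprint" behaviour described earlier, so it
ignores double-width characters entirely.

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical table of printable-but-zero-width ranges, copied from the
   list in this mail.  Must stay sorted by first code point. */
struct zw_range { uint32_t first; uint32_t last; };

static const struct zw_range zw_table[] = {
  { 0x0300, 0x036f },
  { 0x0483, 0x0489 },
  { 0x200b, 0x200f },
  { 0x20d0, 0x20ea },
  { 0x3099, 0x309a },
  { 0xfe20, 0xfe23 },   /* uncertain, see the note in the list */
  { 0xfeff, 0xfeff }
  /* musical symbols in the 0x1d1xx range intentionally omitted */
};

/* Return 1 if c falls into one of the hardcoded zero-width ranges. */
int
is_zero_width (uint32_t c)
{
  size_t i;

  for (i = 0; i < sizeof zw_table / sizeof zw_table[0]; i++)
    {
      if (c < zw_table[i].first)
        return 0;               /* table is sorted, no later match */
      if (c <= zw_table[i].last)
        return 1;
    }
  return 0;
}

/* Assumed shape of a wcwidth with the extra check in front of the
   iswprint test.  Only an illustration of where the check would go. */
int
sketch_wcwidth (wint_t wc)
{
  if (wc == 0 || is_zero_width ((uint32_t) wc))
    return 0;
  return iswprint (wc) ? 1 : -1;
}
```

With a check like this in front of the iswprint test, both values from
Eric's test case (0x0301 and 0x200B) would get width 0.  The obvious
downside is that the table has to be maintained by hand for each new
Unicode version.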

And here's another question.  The utf8*.h files claim they have been
generated from the unicode.txt file of the Unicode 3.2 standard.  Do we
have the script which generated the utf8*.h files?  Can we regenerate
the files to match the current Unicode 5.1 standard?


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
