Date: Tue, 12 May 2009 18:54:04 +0200
From: Corinna Vinschen
To: newlib AT sourceware DOT org
Cc: cygwin AT cygwin DOT com
Subject: [Fwd: [1.7] wcwidth failing configure tests]
Message-ID: <20090512165404.GW21324@calimero.vinschen.de>

Forwarded to newlib.

----- Forwarded message from Eric Blake -----

> Date: Tue, 12 May 2009 16:02:04 +0000 (UTC)
> From: Eric Blake
> Subject: [1.7] wcwidth failing configure tests
> To: cygwin AT cygwin DOT com
>
> I noticed this failure in various configure scripts (findutils, coreutils, ...):
>
> checking whether wcwidth works reasonably in UTF-8 locales... no
>
> I've reduced it to a STC:
>
> #include <locale.h>
> #include <wchar.h>
> int main ()
> {
>   int i = 0;
>   if (setlocale (LC_ALL, "fr_FR.UTF-8") != NULL)
>     {
>       if (wcwidth (0x0301) > 0)
>         i |= 1;
>       if (wcwidth (0x200B) > 0)
>         i |= 2;
>     }
>   return i;
> }
>
> The return value should be 0 but is coming back as 3; 0x0301 is a combining
> mark which should occupy no space on its own, and 0x200b is a 0-width space,
> according to Unicode 5.1 (and earlier, to some extent). And that probably
> means that other places within wcwidth() are broken.

----- End forwarded message -----

wcwidth returns 1 if iswprint returns true. I had a quick debug attempt and it turns out that the entire range 0x0300..0x034f is marked as printable in the u3 array in libc/ctype/utf8print.h.
The characters in the range 0x0300..0x034f are all combining characters, which are printable but have zero width. Likewise, 200b..200d are all zero-width characters, yet all three are also printable.

Scanning the Unicode 5.1 standard, I see a number of such characters, which are printable but have zero width:

  0300..036f
  0483..0489
  200b..200f
  20d0..20ea
  3099..309a
  fe20..fe23  (not sure about these; each is one half of a full combined
               char and doesn't make sense alone, afaics)
  feff

and a couple of musical symbols in the 0x1d1xx range.

How can we fix this problem? Should we hardcode a check for the above character values in wcwidth?

And here's another question. The utf8*.h files claim they have been generated from the unicode.txt file of the Unicode 3.2 standard. Do we have the script which generated the utf8*.h files? Can we regenerate the files to match the current Unicode 5.1 standard?

Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
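
P.S. To make the "hardcode a check" idea concrete, here is a rough sketch of what such a check might look like. It is not the newlib implementation. The names zw_range, zero_width, is_zero_width and sketch_wcwidth are made up for illustration, the table holds only the ranges listed above (the musical symbols in the 0x1d1xx range are left out because their exact values are not given here), and the "printable" argument stands in for whatever iswprint would return in the current locale.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table of printable-but-zero-width ranges, taken from the
   list in this mail.  Bounds are inclusive and sorted in ascending order
   so the table can be binary-searched. */
struct zw_range { uint32_t first, last; };

static const struct zw_range zero_width[] = {
  { 0x0300, 0x036F },
  { 0x0483, 0x0489 },
  { 0x200B, 0x200F },
  { 0x20D0, 0x20EA },
  { 0x3099, 0x309A },
  { 0xFE20, 0xFE23 },
  { 0xFEFF, 0xFEFF },
};

/* Return 1 if code point c falls into one of the ranges above, else 0. */
static int
is_zero_width (uint32_t c)
{
  size_t lo = 0;
  size_t hi = sizeof zero_width / sizeof zero_width[0];

  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (c < zero_width[mid].first)
        hi = mid;
      else if (c > zero_width[mid].last)
        lo = mid + 1;
      else
        return 1;
    }
  return 0;
}

/* Hypothetical stand-in for wcwidth: check the zero-width table first,
   then fall back to the "1 if printable" rule described above.
   'printable' is assumed to be the result of iswprint for c. */
static int
sketch_wcwidth (uint32_t c, int printable)
{
  if (c == 0)
    return 0;                   /* the null character has width 0 */
  if (is_zero_width (c))
    return 0;
  return printable ? 1 : -1;    /* -1 for non-printable characters */
}
```

With this table, sketch_wcwidth (0x0301, 1) and sketch_wcwidth (0x200B, 1) both yield 0, which is what the configure test expects. A hand-maintained table like this would still go stale with each Unicode release, so regenerating the utf8*.h files is probably the better long-term fix.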