Date: Wed, 13 May 2009 21:38:16 +0200
From: Corinna Vinschen
To: newlib AT sourceware DOT org, cygwin AT cygwin DOT com
Subject: Re: [Fwd: [1.7] wcwidth failing configure tests]
Message-ID: <20090513193816.GA7650@calimero.vinschen.de>
In-Reply-To: <416096c60905131204r473ac1d3t4c811f7f0a4cb81f@mail.gmail.com>
Reply-To: newlib AT sourceware DOT org

On May 13 20:04, Andy Koppe wrote:
> 2009/5/12 Corinna Vinschen:
> >> Trouble is, there's the thorny issue of the "CJK Ambiguous Width"
> >> category of characters, which consists of things like Greek and
> >> Cyrillic letters as well as line drawing symbols. Those have a width
> >> of 1 in Western use, yet with CJK fonts they have a width of 2. That's
> >> why Markus Kuhn's code includes the mk_wcswidth_cjk() variant.
> >
> > We should use the standard variation alone, imho.
>
> I'm not sure that CJK users would be happy with that.
> See MinTTY issue 88 for my misguided attempts to dismiss this as a
> legacy issue:
> http://code.google.com/p/mintty/issues/detail?id=88
>
> In comment 8 on that, "deenheart" mentioned that he was working on a
> fix for wcwidth(). I don't know what he had in mind, but I'd suspect
> something based on an environment variable setting.
>
> > And we need some workaround for UTF-16 systems like Cygwin.
> > Unfortunately, surrogate pairs only work well as part of a string, not
> > as standalone chars.  So wcwidth would return -1 for each single char,
> > but wcswidth could be tweaked to handle them gracefully.
>
> Looking at the ranges in wcwidth.c, it might be possible to decide the
> width of a surrogate pair based on the high surrogate only, and then
> treat the low surrogate as a combining character with length 0.

How should that work?  The first half of the surrogate pair does not
have enough information to decide that.  For instance, take the ranges

  { 0x10A01, 0x10A03 }, { 0x10A05, 0x10A06 }

The information about the low 10 bits of the Unicode value is in the
second half of the pair.  From the first half alone you don't know
whether the char is 0x10A04 or one of the others.  So you need both
halves to make a decision.

A surrogate pair half alone is also always invalid.  That's something
you can't handle in wcwidth.


Corinna

-- 
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat
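
To make the surrogate pair argument above concrete, here is a minimal sketch in C. It is not code from newlib or from wcwidth.c. The helper names (to_surrogates, from_surrogates, in_ranges) are made up for illustration, and the table holds only the two ranges quoted in the mail. It uses the standard UTF-16 arithmetic: the high surrogate carries the upper 10 bits of (code point - 0x10000), the low surrogate the lower 10 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Only the two ranges quoted in the mail; not the full table. */
struct interval { uint32_t first, last; };
static const struct interval ranges[] = {
    { 0x10A01, 0x10A03 }, { 0x10A05, 0x10A06 }
};

static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Hypothetical helper: split a code point in 0x10000..0x10FFFF
   into its UTF-16 high and low surrogate. */
static void to_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;
    *high = (uint16_t)(0xD800 + (cp >> 10));   /* upper 10 bits */
    *low  = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* lower 10 bits */
}

/* Hypothetical helper: recombine both halves into the code point.
   Either half alone is not enough to do this. */
static uint32_t from_surrogates(uint16_t high, uint16_t low)
{
    assert(is_high_surrogate(high) && is_low_surrogate(low));
    return 0x10000 + (((uint32_t)(high - 0xD800) << 10)
                      | (uint32_t)(low - 0xDC00));
}

/* Return 1 if cp falls into one of the quoted ranges, else 0. */
static int in_ranges(uint32_t cp)
{
    size_t i;
    for (i = 0; i < sizeof ranges / sizeof ranges[0]; i++)
        if (cp >= ranges[i].first && cp <= ranges[i].last)
            return 1;
    return 0;
}
```

With these helpers, 0x10A03 and 0x10A04 both split into the same high surrogate 0xD802, with low surrogates 0xDE03 and 0xDE04.  Only the recombined value tells you that 0x10A03 is inside the ranges and 0x10A04 is not, so a decision based on the high surrogate alone cannot distinguish them.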