Date: Wed, 20 May 2009 18:52:34 +0200 (CEST)
From: Thomas Wolff
To: newlib AT sourceware DOT org, cygwin AT cygwin DOT com
Subject: Re: [Fwd: [1.7] wcwidth failing configure tests]

Corinna Vinschen wrote:
> On May 12 17:56, Andy Koppe wrote:
> > > And here's another question.  The utf8*.h files claim they have been
> > > generated from the unicode.txt file of the Unicode 3.2 standard.  Do we
> > > have the script which generated the utf8*.h files?  Can we regenerate
> > > the files to match the current Unicode 5.1 standard?

I've updated my editor mined to Unicode 5.1 data already. I can provide a
corresponding wcwidth function if that's desired. I also have scripts for
semi-automatic generation of this information; only "semi", as I said, so
they still need to be improved.

> > There's Markus Kuhn's wcwidth implementation, which says it's based on
> > Unicode 5.0:
> >
> > http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
>
> This looks nice.

I'm sure Markus will update to 5.1 one day too...

> > Trouble is, there's the thorny issue of the "CJK Ambiguous Width"
> > category of characters, which consists of things like Greek and
> > Cyrillic letters as well as line drawing symbols.
> > Those have a width of 1 in Western use, yet with CJK fonts they have a
> > width of 2. That's why Markus Kuhn's code includes the
> > mk_wcswidth_cjk() variant.
>
> We should use the standard variation alone, imho.
>
> And we need some workaround for UTF-16 systems like Cygwin.
> Unfortunately, surrogate pairs only work well as part of a string, not
> as standalone chars.  So wcwidth would return -1 for each single char,
> but wcswidth could be tweaked to handle them gracefully.

This brings me to the related question of how to output non-BMP characters.
Currently, the cygwin console displays each of them as two square boxes,
using two screen columns. This suggests that the two surrogates are probably
being output as separate characters. Could proper non-BMP character display
be achieved by simply combining the surrogates and outputting them to
Windows as a true Unicode character? (The Windows function would need to
accept 32-bit characters, and I don't know whether it does; the string
elements could stay as they are.) Just an idea which might lead to a simple
solution.

> On May 15 00:58, IWAMURO Motonori wrote:
> > 2009/5/13 Corinna Vinschen:
> > >> Trouble is, there's the thorny issue of the "CJK Ambiguous Width"
> > >> ... (see above)
> > > We should use the standard variation alone, imho.
> > I don't think so.
> >
> > 1) It is very very inconvenient for me :-)
> >
> > 2) Unicode Standard Annex #11
> > http://www.unicode.org/unicode/reports/tr11/ recommends:
> > > 5 Recommendations
> > (snip)
> > > When processing or displaying data
> > (snip)
> > > Ambiguous characters behave like wide or narrow characters depending
> > > on the context (language tag, script identification, associated
> > > font, source of data, or explicit markup; all can provide the
> > > context). If the context cannot be established reliably, they should
> > > be treated as narrow characters by default.
> >
> > The recommendation is independent of legacy encoding.
> >
> > I think that a new locale category that specifies the "context" is necessary.
> > Because the "context" influences only the display or text layout.
> >
> > However, there is no such standard now.
> >
> > Therefore, I propose to use *_cjk() when the language part of LC_CTYPE
> > is 'ja', 'ko', 'vi' or 'zh'.

The problems with this are:
1. As you say, there is no standard.
2. If you wish to handle character widths compliant with the terminal your
   application is running in, there is no guarantee that your assumption
   of CJK width (or the actual locale setting, if that model were
   implemented) does indeed reflect the terminal's width properties.
3. In mintty, you can dynamically change width properties by selecting
   different fonts; mintty changes CJK width behaviour according to certain
   font properties. A "static" configuration in your shell using a locale
   variable would not reflect this change.

I see two ways to handle this:
a) Ask Andy (author of mintty) to not do this switching; however, I don't
   know what display consequences that might have. On the other hand, other
   terminals don't switch either. Or maybe mintty could at least issue a
   warning on CJK width switching, or maintain two separate font lists, or...
b) Determine the actual CJK width behaviour dynamically. That's what mined
   does (in addition to other width property detection in general). That's
   why it can handle the alternative quite seamlessly.

> That would be fine with me, but tests for the actual language are not
> used anywhere in newlib, so that's something very new.

So I would suggest not to introduce it before the concept is sufficiently
discussed. And I'm not happy with the idea of a cygwin-specific solution
(or workaround).

Kind regards,
Thomas
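
To illustrate the idea of combining surrogates and of a string-level width
function that handles them, here is a minimal sketch in C. It is not code
from newlib, Cygwin, mined or mintty. The names is_high_surrogate,
is_low_surrogate, combine_surrogates, toy_width and utf16_width are
hypothetical, and toy_width is only a placeholder for a real width table
such as the one in Markus Kuhn's wcwidth.c. The sketch works on uint16_t
units rather than wchar_t so that it behaves the same on systems where
wchar_t is 32 bits wide.

```c
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical helpers, for illustration only. */

static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a valid high/low surrogate pair into one code point. */
static uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    return 0x10000u + ((((uint32_t)hi - 0xD800u) << 10)
                       | ((uint32_t)lo - 0xDC00u));
}

/* Placeholder width lookup (an assumption, not a real table):
   U+20000..U+3FFFD count as wide, everything else as narrow. */
static int toy_width(uint32_t cp)
{
    return (cp >= 0x20000u && cp <= 0x3FFFDu) ? 2 : 1;
}

/* Total width of n UTF-16 units; -1 if any surrogate is unpaired. */
static int utf16_width(const uint16_t *s, size_t n)
{
    int total = 0;
    size_t i = 0;

    while (i < n) {
        uint32_t cp = s[i++];

        if (is_high_surrogate((uint16_t)cp)) {
            /* A high surrogate must be followed by a low surrogate. */
            if (i >= n || !is_low_surrogate(s[i]))
                return -1;
            cp = combine_surrogates((uint16_t)cp, s[i++]);
        } else if (is_low_surrogate((uint16_t)cp)) {
            /* A low surrogate without a preceding high one is invalid. */
            return -1;
        }
        total += toy_width(cp);
    }
    return total;
}
```

With these definitions, the units 0x0041 0xD840 0xDC00 ('A' followed by
U+20000) count as 1 + 2 columns, while a lone 0xD840 yields -1, which
mirrors the per-character behaviour described above for single surrogates.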