From: Brian Inglis
Subject: Re: Unicode width data inconsistent/outdated
Reply-To: Brian DOT Inglis AT SystematicSw DOT ab DOT ca
To: cygwin AT cygwin DOT com
Message-ID: <401b6d26-35cb-3026-afde-6bd5d09b2d71@SystematicSw.ab.ca>
In-Reply-To: <20170807092820.GQ25551@calimero.vinschen.de>
Date: Mon, 7 Aug 2017 13:07:15 -0600

On 2017-08-07 03:28, Corinna Vinschen wrote:
> On Aug 5 21:06, Thomas Wolff wrote:
>> On 2017-08-04 at 19:01, Corinna Vinschen wrote:
>>> On Aug 3 21:44, Thomas Wolff wrote:
>>>> My attempt would be to base the functions on a common table of
>>>> character categories instead.
>>> Keep in mind that the table is not loaded into memory on demand, as on
>>> Linux. Rather it will be part of the Cygwin DLL, and worse, in the case
>>> of newlib, part of any target using the wctype functions.
>> Maybe we could change that (load on demand, or put them in a shared library
>> perhaps), but...
>
> That won't work for embedded targets, especially small ones.
>
> If you want to go that route, you would have to extend struct __locale_t
> or lc_ctype_T (in newlib/libc/locale/setlocale.h) to contain pointers to
> conversion tables (Cygwin-only), and the __set_lc_ctype_from_win function
> or a new function inside Cygwin (but called from __ctype_load_locale)
> could load the tables.
>
> Then you could create new iswXXX, towXXX, and wcwidth functions inside
> Cygwin using these tables, rather than relying on the newlib code.
>
> Alternatively, if RTEMS is interested as well, we may strive for a
> newlib solution which is opt-in. Loading tables (or even big tables at
> all) isn't a good solution for very small targets.
>
>>> The idea here is that the tables take less space than a full-fledged
>>> category table. The tables in utf8print.h and utf8alpha.h and the code
>>> in iswalpha and iswprint combined are 10K, code and data of the
>>> tolower/toupper functions are 7K, wcwidth 3K, so a total of 20K,
>>> covering Unicode 5.2 with 107K codepoints.
>>>
>>> A category table would have to contain the category bits for the entire
>>> Unicode codepoint range. The number of potential bits is > 8 as far as I
>>> know so it needs 2 bytes per char, but let's make that 1 byte for now.
>>> For Unicode 5.2 only the table would be at least 107K, and that would
>>> only cover the iswXXX functions.
>> I have a working version now, and it uses much less as the category table is
>> range-based.
>> Another table is needed for case conversion. Size estimates are as follows
>> (based on Unicode 5.2 for a fair comparison, going up a little bit for 10.0
>> of course):
>>
>> Categories: 2313 entries (10.0: 2715)
>>   each entry needs 9 bytes, total 20817 bytes
>>   I don't know whether that expands by some word-alignment.
>>   I could pack entries to 7 bytes, or even 6 bytes if that helps (total 16191
>>   or 13878).
>>
>> Case conversion: 2062 entries (10.0: 2621)
>>   each entry needs 12 bytes, total 24744
>>   packed 8 bytes, total 16496
>>
>> The Categories table could be boiled down to 1223 entries (penalty: double
>> runtime for iswupper and iswlower)
>> The Case conversion table could be transformed to a compact form
>> Case conversion compact: 1201 entries
>>   each entry needs 16 bytes, total 19216
>>   packed 12 or 11 (or even 10), total 14412 (or 12010)
>> So I think the increase is acceptable for the benefit of simple and
>> automatic generation
>
> So we're at 40K+ plus code then.
>
> newlib: embedded targets, looking for small sized solutions. Simple
> and automatic generation is not the main goal.
>
>> and also more efficient processing by some of the
>> functions. Also they would apply to more functions, e.g. iswdigit which
>> would confirm all Unicode digits, not just the ASCII ones.
>
> Don't do that. There's a collision with C99 if you define other
> characters than ASCII digits to return nonzero from iswdigit. Comment
> from inside Glibc:
>
> % The "digit" class must only contain the BASIC LATIN digits, says ISO C 99
> % (sections 7.25.2.1.5 and 5.2.1).
>
>>>>> int wcwidth(wint_t c);
>>>> Why not revert to wcwidth(wint_t)?
>>>> I think for cygwin it is the only solution that makes wcwidth work for
>>>> non-BMP characters and is also compatible (unlike some proposals discussed
>>>> later in the quoted thread).
>>> We can do this, but it may result in complaints from the other
>>> newlib consumers. If in doubt, use #ifdef __CYGWIN__
>> Which other platforms do actually use newlib?
>
> Lots of embedded and bare-metal targets.
>
>>>> Issue 2 is the handling of titlecase characters (e.g. "Nj" as one Unicode
>>>> character U+01CB). The current implementation considers them to be both
>>>> upper and lower (iswupper: return towlower (c) != c); I'd rather consider
>>>> them as neither upper nor lower (iswalpha (c) && towupper (c) == c).
>>>> https://linux.die.net/man/3/iswupper allows both interpretations:
>>>>> The wide-character class "upper" contains *at least* those characters wc
>>>>> which are equal to towupper(wc) and different from towlower(wc).
>>> SUSv4 says "The iswupper() [...] functions shall test whether wc is a
>>> wide-character code representing a character of class upper." Whatever
>>> does that correctly with a low footprint is fine.
>> The question here is how "character of class upper" is defined, and how to
>> interpret pre-Unicode assumptions in a Unicode context.
>
> In theory, do it as glibc does and you're fine.

Implementation considerations for handling the Unicode tables are described in
http://www.unicode.org/versions/Unicode10.0.0/ch05.pdf
and one implementation is shown in
https://www.strchr.com/multi-stage_tables

ICU icu4[cj] uses a folded trie of the properties: the unique property
combinations are indexed, strings of those indices are generated for fixed-size
groups of character codes, the unique values of those strings are then indexed,
and those indices are assigned to each character code group.
The result is a multi-level indexing operation that returns the required
property combination for each character.
https://slidegur.com/doc/4172411/folded-trie--efficient-data-structure-for-all-of-unicode

The FOX Toolkit uses a similar approach, splitting the 21-bit character code
into 7-bit groups, with two higher levels of 7-bit indices, and more tweaks to
eliminate redundancy.
ftp://ftp.fox-toolkit.org/pub/FOX_Unicode_Tables.pdf

--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada
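
To make the multi-stage idea concrete, here is a minimal two-stage sketch in C.
It is not ICU or FOX code, and it uses only one index level above the 7-bit
blocks, where FOX is described as using two. All names (toy_property,
build_tables, lookup_property, the PROP_* values) are hypothetical, and the
property data is a toy stand-in that classifies ASCII only. A real table would
be generated offline from the Unicode data files rather than built at run time.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BITS 7u                      /* low 7 bits index within a block */
#define BLOCK_SIZE (1u << BLOCK_BITS)      /* 128 code points per block */
#define MAX_CP     0x110000u               /* one past the last code point */
#define NUM_BLOCKS (MAX_CP >> BLOCK_BITS)  /* 8704 blocks */
#define MAX_UNIQUE 16u                     /* assumed ample for the toy data */

enum { PROP_NONE = 0, PROP_DIGIT = 1, PROP_UPPER = 2, PROP_LOWER = 3 };

/* Hypothetical stand-in for generated Unicode data: ASCII only. */
static uint8_t toy_property(uint32_t cp)
{
    if (cp >= 0x30u && cp <= 0x39u) return PROP_DIGIT;  /* '0'..'9' */
    if (cp >= 0x41u && cp <= 0x5Au) return PROP_UPPER;  /* 'A'..'Z' */
    if (cp >= 0x61u && cp <= 0x7Au) return PROP_LOWER;  /* 'a'..'z' */
    return PROP_NONE;
}

static uint8_t  stage1[NUM_BLOCKS];             /* block number -> unique block */
static uint8_t  stage2[MAX_UNIQUE][BLOCK_SIZE]; /* de-duplicated blocks */
static unsigned unique_blocks;

/* Fill each 128-entry block, then store it only if no identical block
   has been stored before; stage1 records which stored block to use. */
static void build_tables(void)
{
    uint8_t block[BLOCK_SIZE];
    uint32_t b, i;
    unsigned u;

    unique_blocks = 0;
    for (b = 0; b < NUM_BLOCKS; b++) {
        for (i = 0; i < BLOCK_SIZE; i++)
            block[i] = toy_property((b << BLOCK_BITS) | i);
        for (u = 0; u < unique_blocks; u++)
            if (memcmp(stage2[u], block, BLOCK_SIZE) == 0)
                break;
        if (u == unique_blocks) {
            assert(unique_blocks < MAX_UNIQUE);  /* toy capacity check */
            memcpy(stage2[unique_blocks++], block, BLOCK_SIZE);
        }
        stage1[b] = (uint8_t)u;
    }
}

/* Two array reads per lookup: high bits pick the block, low bits the entry. */
static uint8_t lookup_property(uint32_t cp)
{
    if (cp >= MAX_CP)
        return PROP_NONE;
    return stage2[stage1[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1u)];
}
```

After build_tables(), lookup_property(cp) costs two array reads and no search.
With this toy data only two distinct blocks survive (the ASCII block and the
all-zero block), so the stored data is the 8704-byte first stage plus 256 bytes
of blocks. Real Unicode data has far more distinct blocks, which is where the
extra index level and the redundancy tweaks mentioned above come in.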