Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com
Subject: Re: Unicode width data inconsistent/outdated
To: cygwin AT cygwin DOT com
References: <f3c1b415-7a26-8bbe-a67f-5619d356f058 AT towo DOT net> <20170726080859 DOT GA24312 AT calimero DOT vinschen DOT de> <5d3cb047-49f8-26a6-d816-387a71486e99 AT cygwin DOT com> <20170726095016 DOT GA25666 AT calimero DOT vinschen DOT de> <289bd98b-e644-888d-07f8-8965b6538373 AT towo DOT net> <20170728195826 DOT GI24013 AT calimero DOT vinschen DOT de> <1244bd24-bb27-d185-1f24-61beae02c2cd AT towo DOT net> <20170804170156 DOT GL25551 AT calimero DOT vinschen DOT de> <30486790-c59d-9a78-6000-b3c20fb86d9d AT towo DOT net> <20170807092820 DOT GQ25551 AT calimero DOT vinschen DOT de> <401b6d26-35cb-3026-afde-6bd5d09b2d71 AT SystematicSw DOT ab DOT ca> <9f7a8d16-6ebc-52ff-15ae-b1a52d23986b AT towo DOT net> <0f8f1535-ed48-d170-7e57-c554bec23942 AT SystematicSw DOT ab DOT ca>
From: Thomas Wolff <towo AT towo DOT net>
Message-ID: <4c342b2a-25e0-3fc4-a077-be2cc54d117c@towo.net>
Date: Tue, 8 Aug 2017 02:28:54 +0200
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
MIME-Version: 1.0
In-Reply-To: <0f8f1535-ed48-d170-7e57-c554bec23942@SystematicSw.ab.ca>

On 07.08.2017 at 23:29, Brian Inglis wrote:
> On 2017-08-07 13:30, Thomas Wolff wrote:
>> On 07.08.2017 at 21:07, Brian Inglis wrote:
>>> Implementation considerations for handling the Unicode tables described in
>>> http://www.unicode.org/versions/Unicode10.0.0/ch05.pdf
>>> and implemented in
>>> https://www.strchr.com/multi-stage_tables
>>>
>>> ICU icu4[cj] uses a folded trie of the properties, where the unique property
>>> combinations are indexed, strings of those indices are generated for fixed size
>>> groups of character codes, unique values of those strings are then indexed, and
>>> those indices assigned to each character code group. The result is a multi-level
>>> indexing operation that returns the required property combination for each
>>> character.
>>>
>>> https://slidegur.com/doc/4172411/folded-trie--efficient-data-structure-for-all-of-unicode
>>>
>>> The FOX Toolkit uses a similar approach, splitting the 21 bit character code
>>> into 7 bit groups, with two higher levels of 7 bit indices, and more tweaks to
>>> eliminate redundancy.
>>>
>>> ftp://ftp.fox-toolkit.org/pub/FOX_Unicode_Tables.pdf
>>>
>> Thanks for the interesting links, I'll check them out.
>> But such multi-level tables don't really help without a given procedure how to
>> update them (that's only available for the lowest level, not for the
>> code-embedded levels).
> Unicode estimates property tables can be reduced to 7-8KB using these
> techniques, including using minimal int sizes for indices and array elements e.g
> char, short if you can keep the indices small, rather than pointers.
>
> Creation scripts used by PCRE and Python projects are linked from the bottom of
> the second link above. Source and docs for these packages and ICU are available
> under Cygwin, and FOX Toolkit is available in some distros and by FTP.
>
>> Also, as I've demonstrated, my more straightforward and more efficient approach
>> will even use less total space than the multi-level approach if packed table
>> entries are used.
> Unicode recommends the double table index approach as a means of eliminating the
> massive redundancy that exists in char property entries and char groups, and
> using small integers instead of pointers, that can be optimized to meet
> conformance levels and platform speed and size limits, at the cost of an annual
> review of properties and rebuild. The amount of redundancy removed by this
> approach is estimated in the FOX Toolkit doc and ranges across orders of
> magnitude. Unfortunately none of these docs or sources quote sizes for any
> Unicode release!
>
> My own first take on these was to use run length encoded bitstrings for each
> binary property, similar to database bitmap indices, but the grouping of
> property blocks in Unicode, and their recommendation, persuaded me their
> approach was likely backed by a bunch of supporting corps' and devs' R&D, and is
> similar to those used for decades in database queries handling (lots of) small
> value set equivalence class columns to reduce memory pressure while speeding up
> selections.

I am not quite sure what you're trying to suggest or recommend now. The thing
is, I just wanted to get an update of the width data in the first place, which
is an easy and undisputed change; then Corinna pointed out that the ctype
functions are based on old Unicode data too, so I made an attempt to update
them as well.

I use the approach that I also use for two other projects (mined and mintty),
and I didn't mean this to become a research project for me :/
I am certainly willing to consider specs and all that to achieve a suitable
result, but I don't feel like implementing any fancy algorithm recommended by
Unicode with an unconvincing rationale, especially after I've calculated that
my method uses even less memory.
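
To make the two table layouts under discussion concrete, here is a minimal
Python sketch. It is not code from newlib, ICU, FOX, mined or mintty. It shows
a two-stage table with deduplicated 7 bit blocks, as in the quoted description,
next to a sorted range table searched by bisection. The range table is only an
assumed stand-in for the "more straightforward" approach, which this message
does not spell out. All names (RANGES, build_two_stage, lookup_two_stage,
lookup_ranges) are hypothetical, and the three ranges are a tiny, incomplete
excerpt picked for illustration, not a usable width table.

```python
from bisect import bisect_right

BLOCK_BITS = 7                   # low 7 bits select an entry inside a block
BLOCK_SIZE = 1 << BLOCK_BITS
MAX_CODE = 0x110000              # one past the last Unicode code point
DEFAULT = 1                      # value for every code point not listed below

# Hypothetical toy data: sorted, non-overlapping (first, last, width) ranges.
RANGES = [
    (0x0300, 0x036F, 0),         # combining diacritical marks
    (0x1100, 0x115F, 2),         # Hangul Jamo initial consonants
    (0x20000, 0x2FFFD, 2),       # supplementary ideographic plane
]


def build_two_stage(ranges):
    """Return (stage1, blocks); stage1[block number] indexes a unique block."""
    flat = bytearray([DEFAULT]) * MAX_CODE
    for first, last, value in ranges:
        flat[first:last + 1] = bytes([value]) * (last - first + 1)
    stage1, blocks, seen = [], [], {}
    for start in range(0, MAX_CODE, BLOCK_SIZE):
        block = bytes(flat[start:start + BLOCK_SIZE])
        if block not in seen:    # store each distinct block only once
            seen[block] = len(blocks)
            blocks.append(block)
        stage1.append(seen[block])
    return stage1, blocks


def lookup_two_stage(stage1, blocks, cp):
    """Two indexing steps: block number first, then offset within the block."""
    return blocks[stage1[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)]


_STARTS = [first for first, _, _ in RANGES]


def lookup_ranges(cp):
    """Binary search for the last range starting at or before cp."""
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return DEFAULT


STAGE1, BLOCKS = build_two_stage(RANGES)
```

Both lookups return the same value for any code point. The trade-off is
visible even on toy data: the two-stage form costs one first-stage entry per
128 code points (8704 of them) plus the distinct blocks, but answers in two
array reads, while the range form stores one entry per range and pays a
logarithmic search. Which one is smaller for real Unicode data depends on the
property and on how entries are packed, so nothing here settles the size
question raised in the thread.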

Thomas

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple