Mail Archives: cygwin/2017/08/07/17:29:47

Reply-To: Brian DOT Inglis AT SystematicSw DOT ab DOT ca
Subject: Re: Unicode width data inconsistent/outdated
To: cygwin AT cygwin DOT com
References: <f3c1b415-7a26-8bbe-a67f-5619d356f058 AT towo DOT net> <20170726080859 DOT GA24312 AT calimero DOT vinschen DOT de> <5d3cb047-49f8-26a6-d816-387a71486e99 AT cygwin DOT com> <20170726095016 DOT GA25666 AT calimero DOT vinschen DOT de> <289bd98b-e644-888d-07f8-8965b6538373 AT towo DOT net> <20170728195826 DOT GI24013 AT calimero DOT vinschen DOT de> <1244bd24-bb27-d185-1f24-61beae02c2cd AT towo DOT net> <20170804170156 DOT GL25551 AT calimero DOT vinschen DOT de> <30486790-c59d-9a78-6000-b3c20fb86d9d AT towo DOT net> <20170807092820 DOT GQ25551 AT calimero DOT vinschen DOT de> <401b6d26-35cb-3026-afde-6bd5d09b2d71 AT SystematicSw DOT ab DOT ca> <9f7a8d16-6ebc-52ff-15ae-b1a52d23986b AT towo DOT net>
From: Brian Inglis <Brian DOT Inglis AT SystematicSw DOT ab DOT ca>
Message-ID: <0f8f1535-ed48-d170-7e57-c554bec23942@SystematicSw.ab.ca>
Date: Mon, 7 Aug 2017 15:29:17 -0600
In-Reply-To: <9f7a8d16-6ebc-52ff-15ae-b1a52d23986b@towo.net>

On 2017-08-07 13:30, Thomas Wolff wrote:
> Am 07.08.2017 um 21:07 schrieb Brian Inglis:
>> Implementation considerations for handling the Unicode tables described in
>>     http://www.unicode.org/versions/Unicode10.0.0/ch05.pdf
>> and implemented in
>>     https://www.strchr.com/multi-stage_tables
>>
>> ICU icu4[cj] uses a folded trie of the properties, where the unique property
>> combinations are indexed, strings of those indices are generated for fixed size
>> groups of character codes, unique values of those strings are then indexed, and
>> those indices assigned to each character code group. The result is a multi-level
>> indexing operation that returns the required property combination for each
>> character.
>>
>> https://slidegur.com/doc/4172411/folded-trie--efficient-data-structure-for-all-of-unicode
>>
>>
>> The FOX Toolkit uses a similar approach, splitting the 21 bit character code
>> into 7 bit groups, with two higher levels of 7 bit indices, and more tweaks to
>> eliminate redundancy.
>>
>> ftp://ftp.fox-toolkit.org/pub/FOX_Unicode_Tables.pdf
>>
> Thanks for the interesting links, I'll check them out.
> But such multi-level tables don't really help without a given procedure how to
> update them (that's only available for the lowest level, not for the
> code-embedded levels).

Unicode estimates that property tables can be reduced to 7-8KB using these
techniques, which include using the smallest integer types that fit (e.g. char
or short) for indices and array elements, rather than pointers, provided the
indices can be kept small.
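
To make the two-stage idea concrete, here is a minimal sketch in Python. It is
not taken from any of the projects above: the block size of 256, the names
build_two_stage and lookup, and the toy_width property (two sample "wide"
ranges) are all assumptions chosen for illustration. Byte-sized array elements
stand in for C's unsigned char.

```python
from array import array

BLOCK_BITS = 8                  # assumed block size: 256 code points per block
BLOCK_SIZE = 1 << BLOCK_BITS
MAX_CP = 0x110000               # one past the last Unicode code point

def toy_width(cp):
    """Hypothetical property: 2 for two sample 'wide' ranges, else 1."""
    if 0x1100 <= cp <= 0x115F or 0x4E00 <= cp <= 0x9FFF:
        return 2
    return 1

def build_two_stage(prop):
    """Deduplicate fixed-size blocks of property values into two byte arrays."""
    stage1 = array('B')         # per block: index of its unique block (must stay < 256)
    stage2 = array('B')         # the unique blocks, concatenated
    seen = {}                   # block contents -> index of that block in stage2
    for start in range(0, MAX_CP, BLOCK_SIZE):
        block = bytes(prop(cp) for cp in range(start, start + BLOCK_SIZE))
        if block not in seen:
            seen[block] = len(seen)
            stage2.extend(block)
        stage1.append(seen[block])
    return stage1, stage2

def lookup(stage1, stage2, cp):
    """Two array reads: high bits pick the block, low bits pick the entry."""
    block_index = stage1[cp >> BLOCK_BITS]
    return stage2[(block_index << BLOCK_BITS) | (cp & (BLOCK_SIZE - 1))]

stage1, stage2 = build_two_stage(toy_width)
print(lookup(stage1, stage2, 0x4E00), len(stage1) + len(stage2))
```

With real property data the number of unique blocks is larger, and the first
stage may need a short rather than a char, which is the size trade-off
mentioned above.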

The creation scripts used by the PCRE and Python projects are linked from the
bottom of the second link above. Source and docs for these packages and ICU are
available under Cygwin, and the FOX Toolkit is available in some distros and by
FTP.

> Also, as I've demonstrated, my more straight-forward and more efficient approach
> will even use less total space than the multi-level approach if packed table
> entries are used.

Unicode recommends the double table index approach as a means of eliminating the
massive redundancy that exists in char property entries and char groups, using
small integers instead of pointers. The tables can be tuned to meet conformance
levels and platform speed and size limits, at the cost of an annual review of
properties and a rebuild. The amount of redundancy removed by this approach is
estimated in the FOX Toolkit doc and ranges across orders of magnitude.
Unfortunately, none of these docs or sources quote sizes for any Unicode
release!
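
Lacking published sizes, the block-level redundancy can at least be measured
locally. The sketch below is an illustration only: block_stats is a
hypothetical helper, the 128-code-point block matches the 7 bit groups of the
FOX split, and east_asian_width from Python's unicodedata module is just one
convenient property to try. The counts depend on the Unicode version bundled
with the Python in use, so no figures are quoted here.

```python
import unicodedata

BLOCK_BITS = 7                  # 128-code-point blocks (7 bit groups)
BLOCK_SIZE = 1 << BLOCK_BITS
MAX_CP = 0x110000               # one past the last Unicode code point

def block_stats(prop):
    """Return (total_blocks, unique_blocks) for a per-character property."""
    unique = set()
    total = 0
    for start in range(0, MAX_CP, BLOCK_SIZE):
        # A block is the tuple of property values for 128 consecutive code points.
        unique.add(tuple(prop(chr(cp)) for cp in range(start, start + BLOCK_SIZE)))
        total += 1
    return total, len(unique)

total, unique = block_stats(unicodedata.east_asian_width)
print(f"Unicode {unicodedata.unidata_version}: "
      f"{total} blocks, {unique} unique, ratio {total / unique:.1f}")
```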

My own first take on these was to use run-length-encoded bitstrings for each
binary property, similar to database bitmap indices. However, the grouping of
property blocks in Unicode, and Unicode's recommendation, persuaded me that
their approach was likely backed by R&D from a number of supporting corporations
and developers. It is also similar to techniques used for decades in database
query handling to deal with (lots of) columns of small-value-set equivalence
classes, reducing memory pressure while speeding up selections.
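
For comparison, a run-length-style encoding of one binary property might look
like the sketch below. It is an assumed variant, not a record of what I
actually wrote: it stores the positions where the bit flips (cumulative run
lengths) rather than the lengths themselves, so that a lookup is a binary
search. The names encode_runs and bit_at are hypothetical.

```python
from bisect import bisect_right

def encode_runs(bits):
    """Return the positions where the bit value flips, assuming it starts at 0."""
    flips = []
    prev = 0
    for i, b in enumerate(bits):
        if b != prev:
            flips.append(i)
            prev = b
    return flips

def bit_at(flips, i):
    """An odd number of flips at or before position i means the bit is set."""
    return bisect_right(flips, i) & 1

# Toy bitstring: 5 zeros, 3 ones, 4 zeros, 2 ones.
bits = [0] * 5 + [1] * 3 + [0] * 4 + [1] * 2
flips = encode_runs(bits)
print(flips, [bit_at(flips, i) for i in range(len(bits))] == bits)
```

Each property needs its own flip list and a search per lookup, whereas the
multi-stage table returns the whole property combination in a fixed number of
array reads.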

-- 
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

