Mail Archives: cygwin/2017/08/05/15:06:34
X-Recipient: | archive-cygwin AT delorie DOT com
|
Mailing-List: | contact cygwin-help AT cygwin DOT com; run by ezmlm
|
List-Id: | <cygwin.cygwin.com>
|
List-Subscribe: | <mailto:cygwin-subscribe AT cygwin DOT com>
|
List-Archive: | <http://sourceware.org/ml/cygwin/>
|
List-Post: | <mailto:cygwin AT cygwin DOT com>
|
List-Help: | <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
|
Sender: | cygwin-owner AT cygwin DOT com
|
Mail-Followup-To: | cygwin AT cygwin DOT com
|
Delivered-To: | mailing list cygwin AT cygwin DOT com
|
Authentication-Results: | sourceware.org; auth=none
|
X-Virus-Found: | No
|
X-HELO: | mout.kundenserver.de
|
Subject: | Re: Unicode width data inconsistent/outdated
|
To: | cygwin AT cygwin DOT com
|
References: | <f3c1b415-7a26-8bbe-a67f-5619d356f058 AT towo DOT net> <20170726080859 DOT GA24312 AT calimero DOT vinschen DOT de> <5d3cb047-49f8-26a6-d816-387a71486e99 AT cygwin DOT com> <20170726095016 DOT GA25666 AT calimero DOT vinschen DOT de> <289bd98b-e644-888d-07f8-8965b6538373 AT towo DOT net> <20170728195826 DOT GI24013 AT calimero DOT vinschen DOT de> <1244bd24-bb27-d185-1f24-61beae02c2cd AT towo DOT net> <20170804170156 DOT GL25551 AT calimero DOT vinschen DOT de>
|
From: | Thomas Wolff <towo AT towo DOT net>
|
Message-ID: | <30486790-c59d-9a78-6000-b3c20fb86d9d@towo.net>
|
Date: | Sat, 5 Aug 2017 21:06:10 +0200
|
User-Agent: | Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
|
MIME-Version: | 1.0
|
In-Reply-To: | <20170804170156.GL25551@calimero.vinschen.de>
|
X-IsSubscribed: | yes
|
On 04.08.2017 at 19:01, Corinna Vinschen wrote:
> On Aug 3 21:44, Thomas Wolff wrote:
>> On 28.07.2017 at 21:58, Corinna Vinschen wrote:
>>> On Jul 26 23:43, Thomas Wolff wrote:
>>>> On 26.07.2017 at 11:50, Corinna Vinschen wrote:
>>>>> On Jul 26 03:16, Yaakov Selkowitz wrote:
>>>>>> On 2017-07-26 03:08, Corinna Vinschen wrote:
>>>>>>> On Jul 26 08:49, Thomas Wolff wrote:
>>>>>>>> It would be good to keep wcwidth/wcswidth in sync with the installed
>>>>>>>> Unicode data version (package unicode-ucd).
>>>>>>>> Currently it seems to be hard-coded (in newlib/libc/string/wcwidth.c);
>>>>>>>> it refers to Unicode 5.0 while installed Unicode data suggest 9.0 would
>>>>>>>> be used.
>>>>>>>> I can provide some scripts to generate the respective tables if desired.
>>>>>>>> Thomas
>>>>>>> If you can update the newlib files this way and send matching patches
>>>>>>> to the newlib list, this would be highly appreciated.
>>>>>> Thomas, I just updated unicode-ucd to 10.0 for this purpose.
>>>> Thanks.
>>>>> Oh, and, btw, the comment in wcwidth.c isn't quite correct. The
>>>>> current state in newlib is on Unicode 5.2, see newlib/libc/ctype/towupper.c.
>>>> Oh, a number of other embedded tables. To make the tow* and isw* functions
>>>> more easily adaptable to Unicode updates, there will be some revisions to do
>>>> here. And the to* and is* ones (without 'w') even refer to locales in a way
>>>> I do not understand. Maybe I'll restrict my effort to wcwidth first...
>>> The to* and is* ones (without 'w') don't matter at all and you don't
>>> have to touch them.
>>>
>>> The Unicode stuff only affects the tow and isw functions.
>>>
>>> As for how to fetch the data, you may want to have a look into
>>> newlib/libc/ctype/utf8alpha.h and newlib/libc/ctype/utf8print.h. The
>>> header comments contain the awk scripts used to collect the data.
>> But there are no instructions to adapt the embedded conditional statements
>> referring to those data...
> Tables are ...
I had an idea of how the tables work. Yet there is no automatic
mechanism to generate the data-based conditionals in the code, which
would also need to be adapted for Unicode updates. Therefore:
>> My attempt would be to base the functions on a common table of character categories instead.
> Keep in mind that the table is not loaded into memory on demand, as on
> Linux. Rather it will be part of the Cygwin DLL and, worse, in the case
> of newlib, of any target using the wctype functions.
Maybe we could change that (load on demand, or put them in a shared
library perhaps), but...
> The idea here is that the tables take less space than a full-fledged
> category table. The tables in utf8print.h and utf8alpha.h and the code
> in iswalpha and iswprint combined are 10K, code and data of the
> tolower/toupper functions are 7K, wcwidth 3K, so a total of 20K,
> covering Unicode 5.2 with 107K codepoints.
>
> A category table would have to contain the category bits for the entire
> Unicode codepoint range. The number of potential bits is > 8 as far as I
> know so it needs 2 bytes per char, but let's make that 1 byte for now.
> For Unicode 5.2 only the table would be at least 107K, and that would
> only cover the iswXXX functions.
I have a working version now, and it uses much less space because the
category table is range-based.
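
To illustrate what "range-based" means here, the following is a minimal
sketch, not the actual proposed code. The names cat_range, cat_table and
lookup_category, the category codes, and the four table rows are all
assumptions made up for the example; a real table would be generated from
the Unicode data files. Each entry covers a whole run of code points, and
a lookup is a binary search over the sorted ranges:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical category codes; Unicode defines many more. */
enum cat { CAT_OTHER, CAT_LU, CAT_LL, CAT_ND };

/* One entry describes a whole run of code points with the same category. */
struct cat_range {
    uint32_t first;   /* first code point of the run */
    uint32_t last;    /* last code point of the run (inclusive) */
    uint8_t  cat;     /* category shared by the run */
};

/* Tiny hand-written excerpt, sorted by code point; not the generated table. */
static const struct cat_range cat_table[] = {
    { 0x0030, 0x0039, CAT_ND },   /* ASCII digits */
    { 0x0041, 0x005A, CAT_LU },   /* A-Z */
    { 0x0061, 0x007A, CAT_LL },   /* a-z */
    { 0x0660, 0x0669, CAT_ND },   /* Arabic-Indic digits */
};

/* Binary search for the range containing c; CAT_OTHER if none does. */
static enum cat lookup_category(uint32_t c)
{
    size_t lo = 0;
    size_t hi = sizeof cat_table / sizeof cat_table[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (c < cat_table[mid].first)
            hi = mid;
        else if (c > cat_table[mid].last)
            lo = mid + 1;
        else
            return (enum cat) cat_table[mid].cat;
    }
    return CAT_OTHER;
}
```

With such a table, a digit test reduces to lookup_category(c) == CAT_ND,
so it covers U+0660 as well as U+0030 without any hand-written
conditionals.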
Another table is needed for case conversion. Size estimates are as
follows (based on Unicode 5.2 for a fair comparison; the numbers go up
a little for 10.0, of course):

  Categories: 2313 entries (10.0: 2715)
    each entry needs 9 bytes, total 20817 bytes
    (I don't know whether that expands through word alignment.)
    packed to 7 bytes, or even 6 bytes if that helps: total 16191 or 13878
  Case conversion: 2062 entries (10.0: 2621)
    each entry needs 12 bytes, total 24744
    packed to 8 bytes, total 16496

The Categories table could be boiled down to 1223 entries (penalty:
double runtime for iswupper and iswlower).
The Case conversion table could be transformed to a compact form:

  Case conversion compact: 1201 entries
    each entry needs 16 bytes, total 19216
    packed to 12 or 11 (or even 10) bytes, total 14412 (or 12010)
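
On the word-alignment question, here is a small sketch under assumed
layouts. The post does not say how the 9 bytes are composed; the example
assumes two 32-bit code points plus one category byte, and a 7-byte packed
form that stores each code point in 3 bytes (code points fit in 21 bits).
All struct and function names are hypothetical:

```c
#include <stdint.h>

/* Assumed unpacked layout: 9 bytes of payload. On common ABIs the
   compiler pads this to 12 bytes so that array elements stay 4-byte
   aligned, but the exact size is implementation-defined. */
struct cat_entry {
    uint32_t first;
    uint32_t last;
    uint8_t  cat;
};

/* Assumed packed layout: only byte-sized members, so no padding is
   needed and an entry takes 7 bytes. */
struct cat_entry_packed {
    uint8_t first[3];   /* 24-bit little-endian code point */
    uint8_t last[3];    /* 24-bit little-endian code point */
    uint8_t cat;
};

/* Store the low 24 bits of v into b. */
static void put24(uint8_t b[3], uint32_t v)
{
    b[0] = (uint8_t) (v & 0xFF);
    b[1] = (uint8_t) ((v >> 8) & 0xFF);
    b[2] = (uint8_t) ((v >> 16) & 0xFF);
}

/* Read a 24-bit value back from b. */
static uint32_t get24(const uint8_t b[3])
{
    return (uint32_t) b[0] | ((uint32_t) b[1] << 8) | ((uint32_t) b[2] << 16);
}
```

Comparing sizeof (struct cat_entry) with sizeof (struct cat_entry_packed)
on the target shows whether the 9-byte estimate grows; the packed form
trades a few shifts per lookup for the smaller table.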
So I think the increase is acceptable for the benefit of simple and
automatic generation and also more efficient processing by some of the
functions. The tables would also serve more functions, e.g. iswdigit,
which would then accept all Unicode digits, not just the ASCII ones.
> ...
>> Also, there are 3 other issues:
>>
>> Issue 1 is about handling non-BMP characters by wcwidth.
>> This has been discussed before.
>> [...]
>> ...
>>
>>
>> While wcswidth works already (using internal __wcwidth), and the isw* and
>> tow* functions work as well because they use wint_t, wcwidth is the only
>> function (inconsistently insisting on wchar_t) that does not work.
> Trying to be close to the standard here.
>
>> But note https://linux.die.net/man/3/wcwidth which says
>>> Note that glibc before 2.2.5 used the prototype
>>> int wcwidth(wint_t c);
>> Why not revert to wcwidth(wint_t)?
>> I think for cygwin it is the only solution that makes wcwidth work for
>> non-BMP characters and is also compatible (unlike some proposals discussed
>> later in the quoted thread).
> We can do this, but it may result in complaints from the other
> newlib consumers. If in doubt, use #ifdef __CYGWIN__
Which other platforms actually use newlib?
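
To make the wchar_t limitation concrete, here is a sketch with made-up
names (wchar16, combine_surrogates, width_of_codepoint, width_of_utf16);
it is not newlib code, and the width rule is a toy covering only one wide
range. With a 16-bit wchar_t, a non-BMP character arrives as two
surrogates, so a function taking a single wchar_t never sees the whole
code point, while a string function or one taking a wider type can:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a 16-bit wchar_t as on Cygwin. */
typedef uint16_t wchar16;

static int is_high_surrogate(wchar16 u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(wchar16 u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a UTF-16 surrogate pair into one code point. */
static uint32_t combine_surrogates(wchar16 hi, wchar16 lo)
{
    return 0x10000u + (((uint32_t) (hi - 0xD800) << 10)
                       | (uint32_t) (lo - 0xDC00));
}

/* Hypothetical width function on full code points; toy rule only:
   the supplementary ideographic plane is wide, everything else is 1. */
static int width_of_codepoint(uint32_t c)
{
    if (c >= 0x20000 && c <= 0x2FFFD)
        return 2;
    return 1;
}

/* A string function can look at both halves of a surrogate pair. */
static int width_of_utf16(const wchar16 *s, size_t n)
{
    int total = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t c = s[i];
        if (is_high_surrogate(s[i]) && i + 1 < n
            && is_low_surrogate(s[i + 1])) {
            c = combine_surrogates(s[i], s[i + 1]);
            i++;    /* the low surrogate has been consumed */
        }
        total += width_of_codepoint(c);
    }
    return total;
}
```

A per-character function with a 16-bit parameter could only be handed
0xD840 or 0xDC00 separately, which is why a wider parameter type is the
point of the question above.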
>
>> Issue 2 is the handling of titlecase characters (e.g. "Nj" as one Unicode
>> character U+01CB). The current implementation considers them to be both
>> upper and lower (iswupper: return towlower (c) != c); I'd rather consider
>> them as neither upper nor lower (iswalpha (c) && towupper (c) == c).
>> https://linux.die.net/man/3/iswupper allows both interpretations:
>>> The wide-character class "upper" contains *at least* those characters wc
>>> which are equal to towupper(wc) and different from towlower(wc).
> Susv4 says "The iswupper() [...] functions shall test whether wc is a
> wide-character code representing a character of class upper." Whatever
> does that correctly with a low footprint is fine.
The question here is how "character of class upper" is defined, and how
to interpret pre-Unicode assumptions in a Unicode context.
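
The two readings can be compared on the titlecase example from Issue 2.
This is a toy sketch: the toy_* functions are hypothetical stand-ins that
know only U+01CA (NJ), U+01CB (Nj) and U+01CC (nj), and the two predicates
simply transcribe the expressions quoted above:

```c
#include <stdint.h>

/* Toy case mappings for U+01CA..U+01CC; everything else maps to itself. */
static uint32_t toy_towupper(uint32_t c)
{
    return (c == 0x01CB || c == 0x01CC) ? 0x01CA : c;
}

static uint32_t toy_towlower(uint32_t c)
{
    return (c == 0x01CA || c == 0x01CB) ? 0x01CC : c;
}

/* Toy alphabetic test: only the three letters above. */
static int toy_iswalpha(uint32_t c)
{
    return c >= 0x01CA && c <= 0x01CC;
}

/* Current behaviour as quoted: towlower (c) != c */
static int upper_current(uint32_t c)
{
    return toy_towlower(c) != c;
}

/* Suggested behaviour as quoted: iswalpha (c) && towupper (c) == c */
static int upper_proposed(uint32_t c)
{
    return toy_iswalpha(c) && toy_towupper(c) == c;
}
```

For U+01CA both predicates report upper and for U+01CC both report not
upper; they differ only on the titlecase U+01CB, which the current
expression counts as upper and the suggested one does not.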
>> Issue 3 is the special conversion jp2uc which seems to be half-bred; there
>> is no such handling for Chinese or Korean.
> This shouldn't matter to you, just keep it in place. It's a historical,
> low footprint conversion for japanese characters without pulling in the
> unicode stuff. Not used on Cygwin so just ignore.
I have since noticed that this is not active in Cygwin, but it's
broken anyway for multiple reasons:
* platforms for which wchar_t is not Unicode should be explicitly listed
* if used, the transformation needs to be applied to all non-Unicode
  locales (also Chinese, Korean, and even 8-bit locales such as *.CP1252)
* for towupper and towlower, the result must be back-transformed
  into the respective locale encoding
* in particular, the locale-specific _l functions inconsistently do not
  use the transformation but have this note:
> We're using a locale-independent representation of upper/lower case
> based on Unicode data. Thus, the locale doesn't matter.
So I'd suggest dropping that code unless someone would like to fix it.
Should I send my proposal to newlib AT sourceware DOT org or
cygwin-patches AT cygwin DOT com?
Thomas