Mail Archives: cygwin/2017/08/07/15:27:40

Subject: Re: Unicode width data inconsistent/outdated

To: cygwin AT cygwin DOT com

References: <f3c1b415-7a26-8bbe-a67f-5619d356f058 AT towo DOT net> <20170726080859 DOT GA24312 AT calimero DOT vinschen DOT de> <5d3cb047-49f8-26a6-d816-387a71486e99 AT cygwin DOT com> <20170726095016 DOT GA25666 AT calimero DOT vinschen DOT de> <289bd98b-e644-888d-07f8-8965b6538373 AT towo DOT net> <20170728195826 DOT GI24013 AT calimero DOT vinschen DOT de> <1244bd24-bb27-d185-1f24-61beae02c2cd AT towo DOT net> <20170804170156 DOT GL25551 AT calimero DOT vinschen DOT de> <30486790-c59d-9a78-6000-b3c20fb86d9d AT towo DOT net> <20170807092820 DOT GQ25551 AT calimero DOT vinschen DOT de>

From: Thomas Wolff <towo AT towo DOT net>

Message-ID: <3eb4ee2f-f62c-cb19-3e4b-10cc57852ba9@towo.net>

Date: Mon, 7 Aug 2017 21:27:16 +0200

In-Reply-To: <20170807092820.GQ25551@calimero.vinschen.de>

On 07.08.2017 at 11:28, Corinna Vinschen wrote:
> On Aug  5 21:06, Thomas Wolff wrote:
>> On 04.08.2017 at 19:01, Corinna Vinschen wrote:
>>> On Aug  3 21:44, Thomas Wolff wrote:
>>>> My attempt would be to base the functions on a common table of character categories instead.
>>> ...Keep in mind that the table is not loaded into memory on demand, as on
>>> Linux.  Rather, it will be part of the Cygwin DLL and, worse, in the case
>>> of newlib, part of any target using the wctype functions.
>> Maybe we could change that (load on demand, or put them in a shared library
>> perhaps), but...
> That won't work for embedded targets, especially small ones.
>
> If you want to go that route, you would have to extend struct __locale_t
> or lc_ctype_T (in newlib/libc/locale/setlocale.h) to contain pointers to
> conversion tables (Cygwin-only), and the __set_lc_ctype_from_win function
> or a new function inside Cygwin (but called from __ctype_load_locale)
> could load the tables.
>
> Then you could create new iswXXX, towXXX, and wcwidth functions inside
> Cygwin using these tables, rather than relying on the newlib code.
>
> Alternatively, if RTEMS is interested as well, we may strive for a
> newlib solution which is opt-in.  Loading tables (or even big tables at
> all) isn't a good solution for very small targets.
>
>>> The idea here is that the tables take less space than a full-fledged
>>> category table.  The tables in utf8print.h and utf8alpha.h and the code
>>> in iswalpha and iswprint combined are 10K, code and data of the
>>> tolower/toupper functions are 7K, wcwidth 3K, so a total of 20K,
>>> covering Unicode 5.2 with 107K codepoints.
>>>
>>> A category table would have to contain the category bits for the entire
>>> Unicode codepoint range.  The number of potential bits is > 8 as far as I
>>> know so it needs 2 bytes per char, but let's make that 1 byte for now.
>>> For Unicode 5.2 only the table would be at least 107K, and that would
>>> only cover the iswXXX functions.
>> I have a working version now, and it uses much less space, as the category
>> table is range-based.
>> Another table is needed for case conversion. Size estimates are as follows
>> (based on Unicode 5.2 for a fair comparison, going up a little bit for 10.0
>> of course):
>>
>> Categories: 2313 entries (10.0: 2715)
>> each entry needs 9 bytes, total 20817 bytes
>> I don't know whether that expands by some word-alignment.
>> I could pack entries to 7 bytes, or even 6 bytes if that helps (total 16191
>> or 13878).
>>
>> Case conversion: 2062 entries (10.0: 2621)
>> each entry needs 12 bytes, total 24744
>> packed 8 bytes, total 16496
>>
>> The Categories table could be boiled down to 1223 entries (penalty: double
>> runtime for iswupper and iswlower)
>> The Case conversion table could be transformed to a compact form
>> Case conversion compact: 1201 entries
>> each entry needs 16 bytes, total 19216
>> packed 12 or 11 (or even 10), total 14412 (or 12010)
>> So I think the increase is acceptable for the benefit of simple and
>> automatic generation
> So we're at 40K+ plus code then.
No, if I implement the packed versions, it's 19.3K, so even smaller than
the current implementation.
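
For reference, the arithmetic behind these figures can be checked with a small C sketch. The entry counts and per-entry sizes are the ones quoted above. Which combination yields 19.3K is an assumption on my part: the reduced category table at 6 bytes per entry plus the compact case conversion table at 10 bytes per entry happens to reproduce it. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Entry counts quoted in this thread, based on Unicode 5.2. */
#define CAT_ENTRIES          2313u  /* range-based category table */
#define CAT_ENTRIES_REDUCED  1223u  /* reduced variant, slower iswupper/iswlower */
#define CASE_ENTRIES         2062u  /* case conversion table */
#define CASE_ENTRIES_COMPACT 1201u  /* compact case conversion table */

/* Size of a table with `entries` records of `entry_bytes` bytes each,
   ignoring any padding the compiler might add for alignment. */
size_t table_bytes(size_t entries, size_t entry_bytes)
{
    assert(entry_bytes > 0);
    return entries * entry_bytes;
}

/* Unpacked estimate: 9-byte category entries plus 12-byte case entries. */
size_t unpacked_total(void)
{
    return table_bytes(CAT_ENTRIES, 9) + table_bytes(CASE_ENTRIES, 12);
}

/* Assumed combination behind "19.3K": reduced categories at 6 bytes
   plus compact case conversion at 10 bytes. */
size_t packed_total(void)
{
    return table_bytes(CAT_ENTRIES_REDUCED, 6)
         + table_bytes(CASE_ENTRIES_COMPACT, 10);
}
```

Under these assumptions, unpacked_total() gives 45561 bytes (the "40K+" case) and packed_total() gives 19348 bytes, against the roughly 20K quoted for the current code and data.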

> newlib: embedded targets, looking for small sized solutions.  Simple
> and automatic generation is not the main goal.
>
>> and also more efficient processing by some of the
>> functions. Also they would apply to more functions, e.g. iswdigit which
>> would confirm all Unicode digits, not just the ASCII ones.
> Don't do that.  There's a collision with C99 if you define other
> characters than ASCII digits to return nonzero from iswdigit.  ...
OK.
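
To illustrate what a range-based category table could look like, here is a minimal sketch, not the actual implementation. The struct layout, the category codes, the four-entry table excerpt and all function names are hypothetical. The lookup is a plain binary search over sorted ranges, and the digit test stays ASCII-only as requested above, even though the table knows about other digits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical category codes; a real table would cover all
   Unicode general categories. */
enum cat { CAT_OTHER = 0, CAT_LU, CAT_LL, CAT_ND };

/* One range of consecutive code points sharing a category. */
struct cat_range {
    uint32_t first;
    uint32_t last;
    uint8_t  cat;
};

/* Tiny illustrative excerpt, sorted by `first`; not the generated table. */
static const struct cat_range cat_table[] = {
    { 0x0030, 0x0039, CAT_ND },  /* ASCII digits */
    { 0x0041, 0x005A, CAT_LU },  /* A-Z */
    { 0x0061, 0x007A, CAT_LL },  /* a-z */
    { 0x0660, 0x0669, CAT_ND },  /* Arabic-Indic digits */
};

/* Binary search for the range containing cp; CAT_OTHER if none does. */
enum cat lookup_category(uint32_t cp)
{
    size_t lo = 0, hi = sizeof cat_table / sizeof cat_table[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < cat_table[mid].first)
            hi = mid;
        else if (cp > cat_table[mid].last)
            lo = mid + 1;
        else
            return (enum cat)cat_table[mid].cat;
    }
    return CAT_OTHER;
}

/* Table-driven classification. */
int sketch_iswupper(uint32_t cp)
{
    return lookup_category(cp) == CAT_LU;
}

/* Deliberately not table-driven: only ASCII digits qualify. */
int sketch_iswdigit(uint32_t cp)
{
    return cp >= 0x30 && cp <= 0x39;
}
```

With this excerpt, U+0665 is found as a decimal digit by lookup_category() but sketch_iswdigit() still returns 0 for it.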

>>>> Issue 3 is the special conversion jp2uc which seems to be half-bred; there
>>>> is no such handling for Chinese or Korean.
>>> This shouldn't matter to you, just keep it in place.  It's a historical,
>>> low footprint conversion for Japanese characters without pulling in the
>>> unicode stuff.  Not used on Cygwin so just ignore.
>> I had noticed meanwhile that this is not active in Cygwin, but it's broken
>> anyway for multiple reasons:
>>     * platforms for which wchar_t is not Unicode should be explicitly listed
>>     * if used, the transformation needs to be applied to all non-Unicode
>> locales (also Chinese, Korean, and even 8-bit locales such as *.CP1252)
>>     * for towupper and towlower, the result must be back-transformed into the
>> respective locale encoding
>>     * particularly the locale-specific _l functions inconsistently do not use
>> the transformation but have this note:
> No, no, no.  The functionality is restricted to certain use-cases and
> always was.  It was a paid-for customer extension back in the day and it
> was *sufficient* for the use-cases.  It's not clear how many newlib
> users are still using it, but it's not a good idea to remove it without
> checking first.  That means, ask on the newlib mailing list how many are
> using the historical jp2uc code, and if we don't get a reply within,
> say, a month, we can probably nuke it.
OK, let's send such a request after the holiday season.
But even if this is to persist as a special solution, it's still broken
and should be fixed.
Could we then replace the current table with calls to the iconvdata
functions? In that case, as I said, the back-conversion would be
available too, and I could fix that and add the missing handling in the
_l functions, for a consistent solution.
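
As a rough sketch of the round trip I mean (convert to Unicode, map the case, convert back to the locale encoding), assuming an 8-bit CP1252 locale: all function names below are hypothetical stand-ins, not the iconvdata interface, and only a handful of characters are mapped to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical forward conversion. The identity default is correct for
   0x00-0x7F and 0xA0-0xFF; the remaining 0x80-0x9F bytes are omitted
   from this sketch. */
uint32_t cp1252_to_unicode(uint8_t b)
{
    switch (b) {
    case 0x8A: return 0x0160;  /* S with caron, upper case */
    case 0x8C: return 0x0152;  /* OE ligature, upper case */
    case 0x9A: return 0x0161;  /* s with caron, lower case */
    case 0x9C: return 0x0153;  /* oe ligature, lower case */
    default:   return b;
    }
}

/* Hypothetical back-conversion; returns -1 if not representable here. */
int unicode_to_cp1252(uint32_t cp)
{
    switch (cp) {
    case 0x0160: return 0x8A;
    case 0x0152: return 0x8C;
    case 0x0161: return 0x9A;
    case 0x0153: return 0x9C;
    default:
        if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF))
            return (int)cp;
        return -1;
    }
}

/* Toy Unicode upper-case mapping covering only the characters used here. */
uint32_t unicode_toupper(uint32_t cp)
{
    if (cp >= 0x61 && cp <= 0x7A)
        return cp - 0x20;
    if (cp == 0x0161)
        return 0x0160;
    if (cp == 0x0153)
        return 0x0152;
    return cp;
}

/* Round trip: locale encoding -> Unicode -> upper case -> locale encoding.
   Falls back to the input if the result cannot be converted back. */
uint8_t cp1252_toupper(uint8_t b)
{
    uint32_t upper = unicode_toupper(cp1252_to_unicode(b));
    int back = unicode_to_cp1252(upper);
    return back < 0 ? b : (uint8_t)back;
}
```

Without the back-conversion step, cp1252_toupper(0x9C) would hand back the Unicode value 0x0152 instead of the CP1252 byte 0x8C, which is the inconsistency described above.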

Thomas

