Mail Archives: cygwin/2021/11/27/02:24:44

Message-ID: <528c7bd3-e39a-5b7a-5819-5a6b4e3c71c5@SystematicSw.ab.ca>
Date: Sat, 27 Nov 2021 00:24:02 -0700
Subject: Re: raise(-1) has stopped returning an error recently
To: cygwin AT cygwin DOT com
References: <YZsoj6UvpF6pcbtt AT slk1 DOT local DOT net>
<YZtwMZ1LUbx+b5+s AT calimero DOT vinschen DOT de>
<YZuVy5+nbzPtiqdw AT calimero DOT vinschen DOT de> <YZyl69ODRcBVnMed AT slk1 DOT local DOT net>
<YZy5bRsZuulb6FUV AT calimero DOT vinschen DOT de>
<42c9bb90-dd78-edfa-99ff-f65f7e000956 AT SystematicSw DOT ab DOT ca>
<YZ1tAfzwlW8C84z4 AT slk1 DOT local DOT net> <YZ4FGpEDDar45HC7 AT calimero DOT vinschen DOT de>
<643c1cb7-9b18-25cf-62b0-8085c8fab137 AT Shaw DOT ca>
<YZ+HkgPIwmCuTcJr AT calimero DOT vinschen DOT de>
From: Brian Inglis <Brian DOT Inglis AT SystematicSw DOT ab DOT ca>
Organization: Systematic Software
In-Reply-To: <YZ+HkgPIwmCuTcJr@calimero.vinschen.de>

On 2021-11-25 05:54, Corinna Vinschen via Cygwin wrote:
> On Nov 24 11:01, Brian Inglis via Cygwin wrote:
>> On 2021-11-24 02:25, Corinna Vinschen via Cygwin wrote:
>>>> On Tue, Nov 23, 2021 at 11:18:25AM -0700, Brian Inglis wrote:
>>>>> Do Cygwin and/or Windows support surrogate pairs in UTF-8?
>>>
>>> You mean UTF-16.  UTF-8 doesn't know surrogate pairs, UTF-16 does.
>>> Originally there was UCS-2, 16 bits, with only 65536 code points.
>>> However, Unicode left the BMP already with version 2.0 in 1996, so
>>> UTF-16 and surrogate pairs became necessary.  Windows as well as Cygwin
>>> support them.
>>
>> How does Cygwin support UTF-16 locales with surrogate pairs?
> 
> UTF-16 locales?  There's no such thing.  UTF-16 is just the 16 bit
> representation for Unicode, and as such, is independent of the locale.
> On the user side, Cygwin only supports UTF-8 as Unicode representation.
> Internally you can then convert them to wchar_t which is UTF-16.
> 
>> Are they the "native" locales inherited from Windows if others are not
>> specified e.g. UTF-8, some OEM SBCS or MBCS?
> 
> Just try `locale -av' and you'll see all supported locales and their
> respective default codeset.  All of them can be used with .utf8
> specifier to use UTF-8 instead of the default codeset.  Some of them
> use UTF-8 as default codeset anyway, e. g., fa_IR or yo_NG.
> 
>>>> There are 3 tests in surrogate-pair and only the 3rd one failed. So I guess
>>>> surrogate pairs in UTF-8 "mostly work".
>>>
>>> UTF-16.  The surrogate stuff is evil at times.  Have a look at the
>>> __utf8_wctomb function in
>>> https://sourceware.org/git/?p=newlib-cygwin.git;a=blob;f=newlib/libc/stdlib/wctomb_r.c
>>> Lone surrogate halves in an input stream are a problem, for instance.
>>
>> Thus the confusion with grep surrogate pair tests which appear to be running
>> under a UTF-8 locale: see attached surrogate pair extract from cygport
>> --debug grep.cygport check.
> 
> An STC in plain C might be helpful.

I think I may finally have got the point of the test, despite not knowing
much about the legacy UTF-16 encoding or surrogate pairs.

From what I can see:

𐐅  U+10405  f0 90 90 85  DESERET CAPITAL LETTER LONG OO

fails to match itself; presumably others do too.
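
As a cross-check on the UTF-8 bytes listed above, here is a minimal sketch
in POSIX shell arithmetic of how a supplementary-plane code point maps to
four UTF-8 bytes. The helper name utf8_4byte is made up for illustration;
it is not part of grep, newlib or Cygwin, and it assumes its argument is in
the range U+10000..U+10FFFF.

```shell
# Hypothetical helper: encode one code point in U+10000..U+10FFFF as UTF-8.
# The 21 bits are split 3 + 6 + 6 + 6 and tagged 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
utf8_4byte() {
    cp=$(($1))
    printf '%02x %02x %02x %02x\n' \
        $((0xF0 | (cp >> 18))) \
        $((0x80 | ((cp >> 12) & 0x3F))) \
        $((0x80 | ((cp >> 6) & 0x3F))) \
        $((0x80 | (cp & 0x3F)))
}

utf8_4byte 0x10405   # prints: f0 90 90 85
```

Note that no surrogates appear anywhere in this encoding; they only show up
once the character is converted to UTF-16.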

Presumably the character is converted internally to a UTF-16 surrogate
pair on platforms where wchar_t is UTF-16, including Cygwin, and the grep
comparison then fails, although the same comparison in bash succeeds:

$ printf '\U10405\n' | iconv -f utf-8 -t utf-16be | xxd -g2
00000000: d801 dc05 000a
$ printf '\U10405\n' > t
$ grep -f t t; echo $?
1
$ oo=$(printf '\U10405\n'); [ "$oo" = "$oo" ] && echo same || echo diff
same
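
For reference, the d801 dc05 pair in the iconv output above can be derived
by hand. The following is a minimal sketch in POSIX shell arithmetic; the
helper name utf16_surrogates is made up for illustration, and it assumes
its argument lies in U+10000..U+10FFFF.

```shell
# Hypothetical helper: split a supplementary-plane code point into its
# UTF-16 high and low surrogates.
utf16_surrogates() {
    v=$(($1 - 0x10000))                    # 20-bit offset above the BMP
    printf '%04x %04x\n' \
        $((0xD800 + (v >> 10))) \
        $((0xDC00 + (v & 0x3FF)))          # top 10 bits, then low 10 bits
}

utf16_surrogates 0x10405   # prints: d801 dc05
```

This only confirms the expected pair; it says nothing about where grep
goes wrong when comparing the two halves.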

-- 
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in binary units and prefixes, physical quantities in SI.]

