From: Brian Inglis via Cygwin
Reply-To: Brian DOT Inglis AT Shaw DOT ca
To: cygwin AT cygwin DOT com
Date: Tue, 20 Aug 2024 11:23:21 -0600
Subject: Re: Fwd: odd behavior of length(), match() and field splitting with multi-byte characters
Message-ID: <13e1b3da-1cf4-4d9c-bd0d-4ac20e2e5443@Shaw.ca>
References: <2562c4c9-d89e-4ba7-a3aa-f425d5e87842 AT comcast DOT net> <29569dbf-b5d7-43ea-8b3d-7f491da7ffa7 AT comcast DOT net>

There do seem to be anomalies in Cygwin handling of SMP (Supplementary
Multilingual Plane) characters, perhaps due to conversion to, or
misinterpretation as, UTF-16/UCS-2 surrogates?

char  code point  UTF-8 bytes  UTF-16 surrogates
🔍    U+01f50d    f0 9f 94 8d  d83d dd0d
🔎    U+01f50e    f0 9f 94 8e  d83d dd0e

$ wc -lwcmL <<< 🔎
1 0 3 5 0
$ wc -lwcmL <<< 🔍
1 0 3 5 0

On 2024-08-20 04:58, Ed Morton via Cygwin wrote:
> Is there any more information I can provide for someone to be able to look into
> this bug?
>
>     Ed.
>
> On 7/6/2024 7:26 AM, Ed Morton wrote:
>> I posted the below bug report to the GNU awk bugs mailing list,
>> https://lists.gnu.org/archive/html/bug-gawk/2024-07/msg00000.html, the
>> feedback there is that it's a cygwin or MSYS2 port issue, could you please
>> take a look? I'm also posting this at
>> https://github.com/msys2/mingw-packages/issues per the advice from the GNU bug
>> list.
>>
>> Regards,
>>
>>     Ed Morton.
>>
>> -------- Forwarded Message --------
>> Subject: odd behavior of length(), match() and field splitting with
>>          multi-byte characters
>> Date:    Mon, 1 Jul 2024 05:56:02 -0500
>> From:    Ed Morton
>> To:      bug-gawk AT gnu DOT org
>>
>>
>>
>> Configuration Information [Automatically generated, do not change]:
>> Machine: x86_64
>> OS: cygwin
>> Compiler: gcc
>> Compilation CFLAGS: -ggdb -O2 -pipe -Wall -Werror=format-security
>> -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong --param=ssp-buffer-size=4
>> -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/build=/usr/src/debug/gawk-5.3.0-1 -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/src/gawk-5.3.0=/usr/src/debug/gawk-5.3.0-1 -DNDEBUG
>> uname output: CYGWIN_NT-10.0-22631 TournaMart_2023 3.5.3-1.x86_64 2024-04-03
>> 17:25 UTC x86_64 Cygwin
>> Machine Type: x86_64-pc-cygwin
>>
>> Gawk Version: 5.3.0
>>
>> Attestation 1:
>>         I have read https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
>>         Yes
>>
>> Attestation 2:
>>         I have not modified the sources before building gawk.
>>         True
>>
>> Description:
>>         gawk is reporting odd lengths and matches of strings
>>         when multi-byte characters are involved.
>>
>> Repeat-By:
>>         Someone on StackOverflow asked about a couple of issues they saw that,
>> so far at least, no-one there can explain and seem to just be bugs.
>>
>>         1)
>> https://stackoverflow.com/questions/78676444/conditional-replacement-of-arbitrarily-large-strings-that-occur-at-arbitrary-dis#comment138715434_78676444 and
>> https://stackoverflow.com/questions/78676444/conditional-replacement-of-arbitrarily-large-strings-that-occur-at-arbitrary-dis#comment138720207_78676444:
>>
>>         If we output 4 characters, 2 of them multi-byte, as 10 bytes using:
>>
>>             $ echo '61F09F948DF09F948E62' | xxd -r -p > file1
>>             $
>>
>>         and run the following gawk command on it we get the output shown:
>>
>>             $ LC_ALL=en_US.utf8 gawk '{print(length($0))}' file1
>>             6
>>             $
>>
>>         i.e. 6 instead of 4. If we run
>>
>>             $ printf 'F0989A9F' | xxd -r -p | LC_ALL=en_US.utf8 awk -F '' '{print NF, length(); for (i=1; i<=NF; i++) print $i}' | cat -A
>>             2 2$
>>             M-pM-^XM-^Z$
>>             M-^_$
>>             $
>>
>>         it shows that what is intended to be a single 4-byte character is
>> being treated as 2 characters, one 3 bytes and the other 1 byte.
>>
>>         2)
>> https://stackoverflow.com/questions/78690533/why-does-the-match-function-not-work-in-this-particular-situation
>>
>>         If we create some input using:
>>
>>             $ echo '3C6469763E3C6469763E5F3C2F6469763E5F3C68313E6162636465665F3C2F68313E5F3C2F6469763E3C6469763EF09F93853C2F6469763E0A' | xxd -r -p > file2
>>
>>         and then run this on it we get the expected output shown:
>>
>>             $ LC_ALL=en_US.utf8 gawk '{match($0,/^.*_<h1>(.*)_<\/h1>.*$/,a); print a[1]}' file2
>>             abcdef
>>             $
>>
>>         but if we add the `IGNORECASE` flag we get a blank line output:
>>
>>             $ LC_ALL=en_US.utf8 gawk -vIGNORECASE=1 '{match($0,/^.*_<h1>(.*)_<\/h1>.*$/,a); print a[1]}' file2
>>
>>             $
>>
>>         unless we also remove the end-of-string anchor, `$`, from the end
>> of the regexp:
>>
>>             $ LC_ALL=en_US.utf8 gawk -vIGNORECASE=1 '{match($0,/^.*_<h1>(.*)_<\/h1>.*/,a); print a[1]}' file2
>>             abcdef
>>             $
>>
>

-- 
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer     but when there is no more to cut
                                -- Antoine de Saint-Exupéry

-- 
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
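
Addendum, as a rough check of the surrogate hypothesis at the top of this
message: the sketch below derives the UTF-16 surrogate pairs for the two code
points in the table and counts UTF-16 code units for the reported inputs. The
function names `surrogates` and `utf16_units` are made up for this
illustration, and the script only does arithmetic. It does not call gawk, wc,
or anything Cygwin-specific, so it shows that the reported numbers are
consistent with counting UTF-16 code units, not that this is what Cygwin
actually does internally.

```shell
#!/bin/sh
# Sketch only: pure arithmetic, no locale or Cygwin internals involved.

# Print the UTF-16 surrogate pair (hex) for a code point above U+FFFF.
surrogates() {
    v=$(( $1 - 0x10000 ))            # 20-bit offset into the supplementary planes
    hi=$(( 0xD800 + (v >> 10) ))     # high surrogate: top 10 bits
    lo=$(( 0xDC00 + (v & 0x3FF) ))   # low surrogate: bottom 10 bits
    printf '%04x %04x\n' "$hi" "$lo"
}

# Count UTF-16 code units for a list of code points:
# 2 for each code point at or above U+10000, 1 otherwise.
utf16_units() {
    n=0
    for cp in "$@"; do
        n=$(( n + ( $cp >= 0x10000 ? 2 : 1 ) ))
    done
    printf '%d\n' "$n"
}

surrogates 0x1F50D                      # prints: d83d dd0d
surrogates 0x1F50E                      # prints: d83d dd0e
utf16_units 0x61 0x1F50D 0x1F50E 0x62   # contents of file1; prints: 6
utf16_units 0x1F50E 0x0A                # wc here-string input; prints: 3
```

The 6 matches the length() that gawk reported for file1 (4 expected), and the
3 matches the character count in the wc output above (2 expected: one
character plus the newline).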