| Message-ID: | <c65a0cae-1a6f-4904-bc9e-620303fed9d7@comcast.net> |
| Date: | Tue, 20 Aug 2024 05:58:34 -0500 |
| Subject: | Re: Fwd: odd behavior of length(), match() and field splitting with |
| multi-byte characters | |
| To: | cygwin AT cygwin DOT com |
| From: | Ed Morton via Cygwin <cygwin AT cygwin DOT com> |
| Reply-To: | Ed Morton <mortoneccc AT comcast DOT net> |
Is there any more information I can provide for someone to be able to
look into this bug?
   Ed.
On 7/6/2024 7:26 AM, Ed Morton wrote:
> I posted the bug report below to the GNU awk bugs mailing list:
> https://lists.gnu.org/archive/html/bug-gawk/2024-07/msg00000.html
> The feedback there is that it's a Cygwin or MSYS2 port issue. Could you
> please take a look? I'm also posting this at
> https://github.com/msys2/mingw-packages/issues per the advice from the
> GNU bug list.
>
> Regards,
>
>    Ed Morton.
>
> -------- Forwarded Message --------
> Subject: odd behavior of length(), match() and field splitting with
> multi-byte characters
> Date: Mon, 1 Jul 2024 05:56:02 -0500
> From: Ed Morton
> To: bug-gawk AT gnu DOT org <bug-gawk AT gnu DOT org>
>
>
>
> Configuration Information [Automatically generated, do not change]:
> Machine: x86_64
> OS: cygwin
> Compiler: gcc
> Compilation CFLAGS: -ggdb -O2 -pipe -Wall -Werror=format-security
> -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong
> --param=ssp-buffer-size=4
> -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/build=/usr/src/debug/gawk-5.3.0-1
> -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/src/gawk-5.3.0=/usr/src/debug/gawk-5.3.0-1
> -DNDEBUG
> uname output: CYGWIN_NT-10.0-22631 TournaMart_2023 3.5.3-1.x86_64
> 2024-04-03 17:25 UTC x86_64 Cygwin
> Machine Type: x86_64-pc-cygwin
>
> Gawk Version: 5.3.0
>
> Attestation 1:
>        I have read
> https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
>        Yes
>
> Attestation 2:
>        I have not modified the sources before building gawk.
>        True
>
> Description:
>        gawk is reporting odd lengths and matches of strings
>        when multi-byte characters are involved.
>
> Repeat-By:
>        Someone on StackOverflow asked about a couple of issues they
> saw that, so far at least, no one there can explain and that seem to
> just be bugs.
>
>        1)
> https://stackoverflow.com/questions/78676444/conditional-replacement-of-arbitrarily-large-strings-that-occur-at-arbitrary-dis#comment138715434_78676444
> and
> https://stackoverflow.com/questions/78676444/conditional-replacement-of-arbitrarily-large-strings-that-occur-at-arbitrary-dis#comment138720207_78676444:
>
>        If we write 4 characters (two of them 4 bytes each) as 10 bytes using:
>
>            $ echo '61F09F948DF09F948E62' | xxd -r -p > file1
>            $
>
>        and run the following gawk command on it we get the output shown:
>
>            $ LC_ALL=en_US.utf8 gawk '{print(length($0))}' file1
>            6
>            $
>
>        i.e. 6 instead of 4. If we run
>
>            $ printf 'F0989A9F' | xxd -r -p | LC_ALL=en_US.utf8 awk -F '' '{print NF, length(); for (i=1; i<=NF; i++) print $i}' | cat -A
>            2 2$
>            M-pM-^XM-^Z$
>            M-^_$
>            $
>
>        it shows that what is intended to be a single 4-byte character
> is being treated as 2 characters, one 3 bytes and the other 1 byte.
>
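One reading that would fit both results above, offered as an assumption and not a confirmed diagnosis: the counts match the number of UTF-16 code units, which is what I would expect if characters outside the Basic Multilingual Plane were being stored as surrogate pairs in a 16-bit wchar_t. The sketch below does not use gawk at all. It only counts bytes, code points and UTF-16 code units for the same byte sequences with iconv. The function name `counts` and the use of octal escapes in place of xxd are my own choices for illustration.

```shell
#!/bin/sh
# counts FORMAT: print byte, code point and UTF-16 code unit counts for the
# bytes produced by printf FORMAT (octal escapes, so xxd is not needed).
counts() {
    bytes=$(printf "$1" | wc -c | tr -d ' ')
    # UTF-32LE uses exactly 4 bytes per code point (no BOM is written).
    utf32=$(printf "$1" | iconv -f UTF-8 -t UTF-32LE | wc -c | tr -d ' ')
    # UTF-16LE uses 2 bytes per code unit; code points above U+FFFF need 2 units.
    utf16=$(printf "$1" | iconv -f UTF-8 -t UTF-16LE | wc -c | tr -d ' ')
    echo "bytes=$bytes code_points=$((utf32 / 4)) utf16_units=$((utf16 / 2))"
}

counts 'a\360\237\224\215\360\237\224\216b'   # same bytes as file1
counts '\360\230\232\237'                      # same bytes as F0989A9F
```

This prints `bytes=10 code_points=4 utf16_units=6` for the file1 bytes and `bytes=4 code_points=1 utf16_units=2` for the second input, i.e. the UTF-16 unit counts are the 6 and 2 that gawk reported, where I expected 4 and 1.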
>        2)
> https://stackoverflow.com/questions/78690533/why-does-the-match-function-not-work-in-this-particular-situation
>
>        If we create some input using:
>
>            $ echo '3C6469763E3C6469763E5F3C2F6469763E5F3C68313E6162636465665F3C2F68313E5F3C2F6469763E3C6469763EF09F93853C2F6469763E0A' | xxd -r -p > file2
>
>        and then run this on it we get the expected output shown:
>
>            $ LC_ALL=en_US.utf8 gawk '{match($0,/^.*_<h1>(.*)_<\/h1>.*$/,a); print a[1]}' file2
>            abcdef
>            $
>
>        but if we add the `IGNORECASE` flag we get a blank line output:
>
>            $ LC_ALL=en_US.utf8 gawk -vIGNORECASE=1 '{match($0,/^.*_<h1>(.*)_<\/h1>.*$/,a); print a[1]}' file2
>
>            $
>
>        unless we also remove the end of string delimiter, `$`, from
> the end of the regexp:
>
>            $ LC_ALL=en_US.utf8 gawk -vIGNORECASE=1 '{match($0,/^.*_<h1>(.*)_<\/h1>.*/,a); print a[1]}' file2
>            abcdef
>            $
>
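In case it helps whoever looks at this, here is a small sketch for checking whether the second problem depends on the 4-byte character in the input. It builds file2 with the same bytes as above, plus a control file in which that character is replaced by an ASCII X, then runs the same match() call with IGNORECASE off and on against each file. The control file name `file2_ascii` and the function names `make_inputs` and `run_cases` are hypothetical, introduced only for this sketch. I have left out expected output on purpose, since the point is to compare the four results on the affected system.

```shell
#!/bin/sh
# make_inputs: write file2 (same bytes as in the report) and file2_ascii,
# a hypothetical control with the 4-byte character F0 9F 93 85 replaced by "X".
make_inputs() {
    printf '<div><div>_</div>_<h1>abcdef_</h1>_</div><div>\360\237\223\205</div>\n' > file2
    printf '<div><div>_</div>_<h1>abcdef_</h1>_</div><div>X</div>\n' > file2_ascii
}

# run_cases: run the regexp from the report on both files, with IGNORECASE
# set to 0 and to 1, and print the captured group in brackets.
run_cases() {
    for f in file2 file2_ascii; do
        for ic in 0 1; do
            LC_ALL=en_US.utf8 gawk -v IGNORECASE="$ic" '
                { match($0, /^.*_<h1>(.*)_<\/h1>.*$/, a)
                  printf "%s IGNORECASE=%s [%s]\n", FILENAME, IGNORECASE, a[1] }
            ' "$f"
        done
    done
}

make_inputs
if command -v gawk >/dev/null 2>&1; then
    run_cases
else
    echo "gawk not found; input files written only"
fi
```

If only the `file2 IGNORECASE=1` line shows empty brackets, that would suggest the failure is tied to the 4-byte character and not to the regexp itself.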