Message-ID: | <29569dbf-b5d7-43ea-8b3d-7f491da7ffa7@comcast.net> |
Date: | Sat, 6 Jul 2024 07:26:31 -0500 |
MIME-Version: | 1.0 |
User-Agent: | Mozilla Thunderbird |
Subject: | Fwd: odd behavior of length(), match() and field splitting with multi-byte characters |
References: | <2562c4c9-d89e-4ba7-a3aa-f425d5e87842 AT comcast DOT net> |
To: | cygwin AT cygwin DOT com |
In-Reply-To: | <2562c4c9-d89e-4ba7-a3aa-f425d5e87842@comcast.net> |
X-Forwarded-Message-Id: | <2562c4c9-d89e-4ba7-a3aa-f425d5e87842 AT comcast DOT net> |
List-Id: | General Cygwin discussions and problem reports <cygwin.cygwin.com> |
List-Unsubscribe: | <https://cygwin.com/mailman/options/cygwin>, |
<mailto:cygwin-request AT cygwin DOT com?subject=unsubscribe> | |
List-Archive: | <https://cygwin.com/pipermail/cygwin/> |
List-Post: | <mailto:cygwin AT cygwin DOT com> |
List-Help: | <mailto:cygwin-request AT cygwin DOT com?subject=help> |
List-Subscribe: | <https://cygwin.com/mailman/listinfo/cygwin>, |
<mailto:cygwin-request AT cygwin DOT com?subject=subscribe> | |
From: | Ed Morton via Cygwin <cygwin AT cygwin DOT com> |
Reply-To: | Ed Morton <mortoneccc AT comcast DOT net> |
I posted the bug report below to the GNU awk bugs mailing list:
https://lists.gnu.org/archive/html/bug-gawk/2024-07/msg00000.html

The feedback there is that it's a Cygwin or MSYS2 port issue. Could you please take a look? I'm also posting this at https://github.com/msys2/mingw-packages/issues per the advice from the GNU bug list.

Regards,

    Ed Morton.

-------- Forwarded Message --------
Subject: odd behavior of length(), match() and field splitting with multi-byte characters
Date: Mon, 1 Jul 2024 05:56:02 -0500
From: Ed Morton
To: bug-gawk AT gnu DOT org <bug-gawk AT gnu DOT org>

Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: cygwin
Compiler: gcc
Compilation CFLAGS: -ggdb -O2 -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong --param=ssp-buffer-size=4 -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/build=/usr/src/debug/gawk-5.3.0-1 -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/src/gawk-5.3.0=/usr/src/debug/gawk-5.3.0-1 -DNDEBUG
uname output: CYGWIN_NT-10.0-22631 TournaMart_2023 3.5.3-1.x86_64 2024-04-03 17:25 UTC x86_64 Cygwin
Machine Type: x86_64-pc-cygwin

Gawk Version: 5.3.0

Attestation 1:
       I have read https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
       Yes

Attestation 2:
       I have not modified the sources before building gawk.
       True

Description:
       gawk is reporting odd lengths and matches of strings
       when multi-byte characters are involved.

Repeat-By:
       Someone on StackOverflow asked about a couple of issues they saw that, so far at least, no one there can explain and that seem to just be bugs.

       1) https://stackoverflow.com/questions/78676444/conditional-replacement-of-arbitrarily-large-strings-that-occur-at-arbitrary-dis#comment138715434_78676444
       and
       https://stackoverflow.com/questions/78676444/conditional-replacement-of-arbitrarily-large-strings-that-occur-at-arbitrary-dis#comment138720207_78676444:

       If we output 4 characters, two of them 4-byte characters, as 10 bytes using:

           $ echo '61F09F948DF09F948E62' | xxd -r -p > file1
           $

       and run the following gawk command on it we get the output shown:

           $ LC_ALL=en_US.utf8 gawk '{print(length($0))}' file1
           6
           $

       i.e. 6 instead of 4. If we run

           $ printf 'F0989A9F' | xxd -r -p | LC_ALL=en_US.utf8 awk -F '' '{print NF, length(); for (i=1; i<=NF; i++) print $i}' | cat -A
           2 2$
           M-pM-^XM-^Z$
           M-^_$
           $

       it shows that what is intended to be a single 4-byte character is being treated as 2 characters, one of 3 bytes and the other of 1 byte.

       2) https://stackoverflow.com/questions/78690533/why-does-the-match-function-not-work-in-this-particular-situation

       If we create some input using:

           $ echo '3C6469763E3C6469763E5F3C2F6469763E5F3C68313E6162636465665F3C2F68313E5F3C2F6469763E3C6469763EF09F93853C2F6469763E0A' | xxd -r -p > file2

       and then run this on it we get the expected output shown:

           $ LC_ALL=en_US.utf8 gawk '{match($0,/^.*_<h1>(.*)_<\/h1>.*$/,a); print a[1]}' file2
           abcdef
           $

       but if we set the `IGNORECASE` variable we get a blank line output:

           $ LC_ALL=en_US.utf8 gawk -vIGNORECASE=1 '{match($0,/^.*_<h1>(.*)_<\/h1>.*$/,a); print a[1]}' file2

           $

       unless we also remove the end-of-string anchor, `$`, from the end of the regexp:

           $ LC_ALL=en_US.utf8 gawk -vIGNORECASE=1 '{match($0,/^.*_<h1>(.*)_<\/h1>.*/,a); print a[1]}' file2
           abcdef
           $

-- 
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
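
A possible cross-check for the counts in item 1 of the report, offered as a sketch rather than a diagnosis: the script below rebuilds the same two byte sequences with `printf` octal escapes and counts them two ways without involving gawk or any UTF-8 locale. The helper names `count_code_points` and `count_utf16_units` are hypothetical and are not part of any tool mentioned in the report. The idea that the odd numbers come from counting 16-bit units is an assumption; the report itself does not confirm a cause.

```shell
#!/bin/sh
# Sketch only: counts UTF-8 code points and UTF-16 code units for the two
# byte sequences from the report. Helper names are hypothetical.

# A code point starts at every byte that is NOT a UTF-8 continuation
# byte (0x80-0xBF, octal 200-277), so delete those bytes and count the rest.
count_code_points() {
    LC_ALL=C tr -d '\200-\277' < "$1" | wc -c | tr -d ' '
}

# A 4-byte UTF-8 sequence (lead byte 0xF0-0xF7, octal 360-367) needs a
# surrogate pair in UTF-16, i.e. one extra code unit per such lead byte.
count_utf16_units() {
    leads=$(LC_ALL=C tr -cd '\360-\367' < "$1" | wc -c | tr -d ' ')
    echo $(( $(count_code_points "$1") + leads ))
}

sample1=$(mktemp)
sample2=$(mktemp)
# Same bytes as the hex strings 61F09F948DF09F948E62 and F0989A9F.
printf '\141\360\237\224\215\360\237\224\216\142' > "$sample1"
printf '\360\230\232\237' > "$sample2"

echo "file1:  $(count_code_points "$sample1") code points, $(count_utf16_units "$sample1") UTF-16 units"
echo "4-byte: $(count_code_points "$sample2") code points, $(count_utf16_units "$sample2") UTF-16 units"
rm -f "$sample1" "$sample2"
```

With these inputs the first sample comes out as 4 code points and 6 UTF-16 code units, and the second as 1 code point and 2 code units. Those second figures equal the 6 and the `2 2` that gawk printed, which would be consistent with characters outside the Basic Multilingual Plane being counted as two 16-bit units each, but that remains an unverified guess here.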