From: Bruno Haible via Cygwin
Reply-To: Bruno Haible
To: cygwin AT cygwin DOT com
Subject: Re: character class "alpha"
Date: Mon, 31 Jul 2023 16:06:13 +0200
Message-ID: <5176597.IBPj4gxFZX@nimes>
References: <3884636 DOT 3uDm00564X AT nimes>

Corinna Vinschen wrote:
> I have a problem with the c32isalpha function.
>
> c32isalpha fails for the character U+FF11 FULLWIDTH DIGIT ONE,
> because it expects the character to be an alphabetic character.

This is not a big problem. You can see in the test-c32isalpha.c file that
this test is disabled for many platforms, in particular glibc. There's no
problem with disabling it on Cygwin as well.

> The Cygwin unicode information is automatically generated from the
> Unicode data file UnicodeData.txt, fresh from their homepage. iswalpha
> in newlib is checking for the Unicode categories, using the expression:
>
>   return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt
>       || cat == CAT_Lm || cat == CAT_Lo
>       || cat == CAT_Nl // Letter_Number
>       ;
>
> with CAT_foo being equivalent to Unicode category foo.
>
> Per UnicodeData.txt, ff11 is of category Nd, so it's a digit, not an
> alphabetic character.

This is not wrong.
However, see the comments in the generator of the gnulib tables:
https://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/gen-uni-tables.c;h=0dceedc06cd72f886807fd575a2c4dba99cd147a;hb=HEAD#l5789

  /* Consider all the non-ASCII digits as alphabetic.
     ISO C 99 forbids us to have them in category "digit",
     but we want iswalnum to return true on them.  */

Likewise in the generator of the glibc tables:
https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/unicode-gen/unicode_utils.py;h=5af03113a2f1f063769752ea426fcaf6f6ba9e95;hb=HEAD#l274

The original comment (from 2000) was:

  /* SUSV2 gives us some freedom for the "digit" category, but ISO C 99
     takes it away:
     7.25.2.1.5:
        The iswdigit function tests for any wide character that
        corresponds to a decimal-digit character (as defined in 5.2.1).
     5.2.1:
        the 10 decimal digits 0 1 2 3 4 5 6 7 8 9
   */
  return (ch >= 0x0030 && ch <= 0x0039);

The question is: In which category do you put these non-ASCII digits?
"print" and "graph", sure. But other than that? "punct" or "alnum"?
"punct" seems wrong. If you, like me, decide to put them in "alnum", then
they need to be in "alpha" or "digit" (per POSIX
https://pubs.opengroup.org/onlinepubs/9699919799/functions/iswalnum.html ).
But ISO C 23 § 7.4.1.5 + § 5.2.1 does not allow them in category "digit".

Bruno
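
P.S. To make the constraint concrete, here is a minimal sketch in C of the
choice described above. It is not the gnulib, glibc, or newlib code: the
sketch_* functions, the toy_category() lookup, and check_examples() are
hypothetical names made up for illustration, and the category table covers
only a handful of code points. It assumes "digit" is restricted to ASCII
0-9 (ISO C), "alnum" is the union of "alpha" and "digit" (POSIX), and
non-ASCII Nd characters are placed in "alpha".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a Unicode general-category lookup (hypothetical).
   A real table would be generated from UnicodeData.txt. */
enum toy_cat { CAT_Lu, CAT_Ll, CAT_Nd, CAT_Po, CAT_OTHER };

static enum toy_cat toy_category(uint32_t ch)
{
    if (ch >= 0x0041 && ch <= 0x005A) return CAT_Lu;  /* A..Z */
    if (ch >= 0x0061 && ch <= 0x007A) return CAT_Ll;  /* a..z */
    if (ch >= 0x0030 && ch <= 0x0039) return CAT_Nd;  /* ASCII digits */
    if (ch >= 0xFF10 && ch <= 0xFF19) return CAT_Nd;  /* FULLWIDTH DIGIT ZERO..NINE */
    if (ch == 0x0021) return CAT_Po;                  /* EXCLAMATION MARK */
    return CAT_OTHER;
}

/* ISO C: "digit" contains only the ten decimal digits 0..9. */
static bool sketch_isdigit(uint32_t ch)
{
    return ch >= 0x0030 && ch <= 0x0039;
}

/* Assumed policy: letters, plus every Nd character that ISO C does not
   allow in "digit". */
static bool sketch_isalpha(uint32_t ch)
{
    enum toy_cat c = toy_category(ch);
    return c == CAT_Lu || c == CAT_Ll
        || (c == CAT_Nd && !sketch_isdigit(ch));
}

/* POSIX: "alnum" is exactly "alpha" or "digit". */
static bool sketch_isalnum(uint32_t ch)
{
    return sketch_isalpha(ch) || sketch_isdigit(ch);
}

/* Spot checks for the cases discussed in this thread. */
void check_examples(void)
{
    assert(sketch_isdigit(0x0031));   /* '1' is a digit ...            */
    assert(!sketch_isalpha(0x0031));  /* ... and therefore not alpha   */
    assert(!sketch_isdigit(0xFF11));  /* U+FF11 may not be a digit     */
    assert(sketch_isalpha(0xFF11));   /* so it lands in alpha ...      */
    assert(sketch_isalnum(0xFF11));   /* ... which keeps it in alnum   */
    assert(!sketch_isalnum(0x0021));  /* '!' is in neither class       */
}
```

With a purely category-based alpha (letters and Nl only, as in the quoted
newlib expression), U+FF11 would be neither alpha nor digit, hence not
alnum either. That is the trade-off in question.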