Date: Mon, 31 Jul 2023 19:46:20 +0200
From: Corinna Vinschen via Cygwin
To: Bruno Haible
Cc: Corinna Vinschen, cygwin AT cygwin DOT com
Subject: Re: character class "alpha"

On Jul 31 16:06, Bruno Haible via Cygwin wrote:
> Corinna Vinschen wrote:
> > I have a problem with the c32isalpha function.
> >
> > c32isalpha fails for the character U+FF11 FULLWIDTH DIGIT ONE,
> > because it expects the character to be an alphabetic character.
>
> This is not a big problem.
> You can see in the test-c32isalpha.c file
> that this test is disabled for many platforms, in particular glibc.

Which is interesting, because I actually tried that today on glibc, and
for iswalpha (0xff11) it returns 1.  So it actually behaves as the
testcase expects.

> There's no problem with disabling it on Cygwin as well.

I'd rather make Cygwin do the same as glibc.

> > The Cygwin unicode information is automatically generated from the
> > Unicode data file UnicodeData.txt, fresh from their homepage.  iswalpha
> > in newlib is checking for the Unicode categories, using the expression:
> >
> >   return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt
> >       || cat == CAT_Lm || cat == CAT_Lo
> >       || cat == CAT_Nl // Letter_Number
> >       ;
> >
> > with CAT_foo being equivalent to Unicode category foo.
> >
> > Per UnicodeData.txt, ff11 is of category Nd, so it's a digit, not an
> > alphabetic character.
>
> This is not wrong. However, see the comments in the generator of the
> gnulib tables:
>
>   https://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/gen-uni-tables.c;h=0dceedc06cd72f886807fd575a2c4dba99cd147a;hb=HEAD#l5789
>
>   /* Consider all the non-ASCII digits as alphabetic.
>      ISO C 99 forbids us to have them in category "digit",
>      but we want iswalnum to return true on them.  */
>
> Likewise in the generator of the glibc tables:
>
>   https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/unicode-gen/unicode_utils.py;h=5af03113a2f1f063769752ea426fcaf6f6ba9e95;hb=HEAD#l274
>
> The original comment (from 2000) was:
>
>   /* SUSV2 gives us some freedom for the "digit" category, but ISO C 99
>      takes it away:
>      7.25.2.1.5:
>         The iswdigit function tests for any wide character that corresponds
>         to a decimal-digit character (as defined in 5.2.1).
>      5.2.1:
>         the 10 decimal digits 0 1 2 3 4 5 6 7 8 9
>    */
>   return (ch >= 0x0030 && ch <= 0x0039);
>
> The question is: In which category do you put these non-ASCII digits?
> "print" and "graph", sure.
> But other than that? "punct" or "alnum"?
> "punct" seems wrong. If you, like me, decide to put them in "alnum",
> then they need to be in "alpha" or "digit" (per POSIX
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/iswalnum.html ).
> But ISO C 23 § 7.4.1.5 + § 5.2.1 does not allow them in category "digit".

Thanks for the description.  It was clear to me that they don't belong
in the ISO C digit category, but other than that...

So, if we change the expression in iswalpha_l to something like

  return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt
      || cat == CAT_Lm || cat == CAT_Lo
      || cat == CAT_Nl // Letter_Number
      /* Also all digits not allowed to be called digits per ISO C 99 */
      || (cat == CAT_Nd && !(c >= (wint_t)'0' && c <= (wint_t)'9'))
      ;

we're good?


Thanks,
Corinna
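
For anyone who wants to poke at the proposed rule outside of newlib, here
is a rough standalone sketch.  It is not the newlib source: the enum is a
hypothetical stand-in for newlib's category constants, the function names
(is_alpha_proposed, is_digit_iso_c, check_examples) are made up for
illustration, and the categories in the checks are hard-coded from
UnicodeData.txt instead of being looked up in a table.

```c
#include <assert.h>
#include <stdbool.h>
#include <wchar.h>

/* Hypothetical stand-in for newlib's category enum; only the names
   quoted in the mail are modelled, plus one non-letter category. */
enum category {
  CAT_LC, CAT_Lu, CAT_Ll, CAT_Lt, CAT_Lm, CAT_Lo,
  CAT_Nl,   /* Letter_Number */
  CAT_Nd,   /* Decimal_Number */
  CAT_Po    /* Other_Punctuation */
};

/* Proposed "alpha" test: letters, letter numbers, and every decimal
   digit other than ASCII '0'..'9'. */
bool
is_alpha_proposed (enum category cat, wint_t c)
{
  return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt
      || cat == CAT_Lm || cat == CAT_Lo
      || cat == CAT_Nl
      || (cat == CAT_Nd && !(c >= (wint_t) '0' && c <= (wint_t) '9'));
}

/* "digit" stays restricted to the ten ASCII digits, per ISO C. */
bool
is_digit_iso_c (wint_t c)
{
  return c >= 0x0030 && c <= 0x0039;
}

/* Spot checks; the category arguments are assumptions taken from
   UnicodeData.txt, not computed here. */
void
check_examples (void)
{
  assert (is_alpha_proposed (CAT_Nd, 0xFF11));   /* FULLWIDTH DIGIT ONE */
  assert (!is_digit_iso_c (0xFF11));
  assert (!is_alpha_proposed (CAT_Nd, L'1'));    /* ASCII digit */
  assert (is_digit_iso_c (L'1'));
  assert (is_alpha_proposed (CAT_Lu, L'A'));
  assert (!is_alpha_proposed (CAT_Po, L'!'));
}
```

Under this rule U+FF11 lands in "alpha" (and therefore "alnum"), while
"digit" remains ASCII-only, which is consistent with iswalpha (0xff11)
returning 1 on glibc as reported above.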