Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com
Date: Sun, 21 Jul 2013 21:39:53 +0200
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: regex library fails git tests
Message-ID: <20130721193953.GC2661@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
References: <ksepor$cag$1 AT ger DOT gmane DOT org>
MIME-Version: 1.0
In-Reply-To: <ksepor$cag$1@ger.gmane.org>
User-Agent: Mutt/1.5.21 (2010-09-15)

On Jul 20 15:52, Mark Levedahl wrote:
> Current git fails two sets of tests on cygwin due apparently to
> problems in the regex library. One set of tests does language based
> word-matching, and has a common failure during regex compilation.
> The suffix clause ("|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+") is
> common to all of these, removing that clause eliminates the regcomp
> failure.
>
> A test case extracted from the git sources is below - this works
> correctly on Fedora 18, fails on Cygwin:
>
> $ gcc test-regex.c
> $ ./a.out
> failed regcomp() for pattern '[^<>= ]+|[^[:space:]]|[▒-▒][▒-▒]+'
>
> The failure disappears when the suffix clause is removed from pat_html.
>
> This is happening on a current installation:
> $ uname -a
> CYGWIN_NT-5.1 virt-winxp 1.7.21(0.267/5/3) 2013-07-15 12:17 i686 Cygwin

Thanks for the testcase.  The problem is this:

Cygwin's regex is taken from FreeBSD, so it's not identical to the glibc
implementation on Linux.  The FreeBSD implementation converts all input
chars to wchar_t and then handles everything, the pattern as well as the
input string, in wchar_t to be locale- and codeset-independent.

Your application does not call setlocale, so the locale is "C" or "POSIX"
and the codeset is ANSI_X3.4-1968 (aka ASCII).

The conversion to wchar_t is performed by calling the mbrtowc function.
This function behaves on Cygwin the same as on Linux:  If the current
locale's codeset is ASCII, and if the input character is >= 0x80,
mbrtowc returns -1 with errno set to EILSEQ.

This happens on Cygwin.  The regcomp routine converting the input string
to wchar_t calls mbrtowc, and mbrtowc returns -1 (EILSEQ) because the
input character in the bracket expression is >= 0x80.

Even though the mbrtowc functions behave the same in Cygwin and glibc,
the glibc implementation of regcomp apparently does not call mbrtowc
under all circumstances, namely not in the "C"/"POSIX" locale or if the
locale's codeset is ASCII.
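
The mbrtowc behaviour described above can be probed with a small helper.
This is only a sketch: convert_one_byte is a hypothetical name, not
something from git or Cygwin, and it assumes the program never calls
setlocale, so it runs in the "C" locale.  The result for bytes >= 0x80 is
left to the caller to inspect, since C libraries differ on this point.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not from the thread): convert a single byte with
 * mbrtowc() in the current locale.  Returns mbrtowc()'s result; on
 * failure ((size_t)-1) the errno value is stored in *err, otherwise
 * *err is set to 0. */
size_t convert_one_byte(unsigned char byte, wchar_t *out, int *err)
{
    mbstate_t state;
    char buf[1];
    size_t ret;

    assert(out != NULL && err != NULL);

    memset(&state, 0, sizeof state);    /* initial conversion state */
    buf[0] = (char)byte;

    errno = 0;
    ret = mbrtowc(out, buf, 1, &state);
    *err = (ret == (size_t)-1) ? errno : 0;
    return ret;
}
```

Calling convert_one_byte(0xc0, &wc, &err) in a program that never calls
setlocale shows whether the local C library rejects the byte; where the
codeset is ASCII, the description above leads one to expect (size_t)-1
with err == EILSEQ.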
glibc's regcomp therefore does not treat the chars >= 0x80 as invalid
characters.

So, what I did now was this:  I added a workaround to Cygwin's regcomp.
If the current codeset is ASCII, the characters in the pattern are
converted to wchar_t by simply using their unsigned value verbatim.
This allows compiling (and testing) the patterns in the git testcases.

However, please note that this behaviour, while being provided by glibc
and now by Cygwin, is *not* standards-compliant.  In the narrow sense
the characters beyond 0x7f are still invalid ASCII chars, and other
functions working with wchar_t strings won't be as forgiving when given
invalid input.

HTH,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
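
The "unsigned value verbatim" conversion mentioned in the message can be
illustrated roughly as follows.  This is a minimal sketch of the idea,
not Cygwin's actual regcomp code: widen_bytes_verbatim is a hypothetical
name, and the check for an ASCII codeset that would guard this path is
assumed to happen elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical sketch (not Cygwin's implementation): widen a
 * NUL-terminated byte string to wchar_t by taking each byte's unsigned
 * value as-is, without calling mbrtowc().  Writes at most dstlen - 1
 * characters plus a terminating L'\0' and returns the number of
 * characters written, not counting the terminator. */
size_t widen_bytes_verbatim(const char *src, wchar_t *dst, size_t dstlen)
{
    size_t i;

    assert(src != NULL && dst != NULL && dstlen > 0);

    for (i = 0; src[i] != '\0' && i + 1 < dstlen; i++) {
        /* The cast through unsigned char keeps 0x80..0xff positive
         * even where plain char is signed. */
        dst[i] = (wchar_t)(unsigned char)src[i];
    }
    dst[i] = L'\0';
    return i;
}
```

With this approach a pattern byte such as 0xc0 becomes the wide character
value 0xc0 instead of triggering EILSEQ, which matches the non-standard
but forgiving behaviour described in the message.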