Date: Sun, 21 Jul 2013 21:39:53 +0200
From: Corinna Vinschen
To: cygwin AT cygwin DOT com
Subject: Re: regex library fails git tests
Message-ID: <20130721193953.GC2661@calimero.vinschen.de>

On Jul 20 15:52, Mark Levedahl wrote:
> Current git fails two sets of tests on cygwin due apparently to
> problems in the regex library.
> One set of tests does language based
> word-matching, and has a common failure during regex compilation.
> The suffix clause ("|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+") is
> common to all of these, removing that clause eliminates the regcomp
> failure.
>
> A test case extracted from the git sources is below - this works
> correctly on Fedora 18, fails on Cygwin:
>
> $ gcc test-regex.c
> $ ./a.out
> failed regcomp() for pattern '[^<>= ]+|[^[:space:]]|[▒-▒][▒-▒]+'
>
> The failure disappears when the suffix clause is removed from pat_html.
>
> This is happening on a current installation:
> $ uname -a
> CYGWIN_NT-5.1 virt-winxp 1.7.21(0.267/5/3) 2013-07-15 12:17 i686 Cygwin

Thanks for the testcase.  The problem is this:

Cygwin's regex is taken from FreeBSD, so it's not identical to the
glibc implementation on Linux.  The FreeBSD implementation converts
all input chars to wchar_t and then handles everything, the pattern
as well as the input string, in wchar_t to be locale- and
codeset-independent.

Your application does not call setlocale, so the locale is "C" or
"POSIX" and the codeset is ANSI_X3.4-1968 (aka ASCII).

The conversion to wchar_t is performed by calling the mbrtowc
function.  This function behaves the same on Cygwin as on Linux:
If the current locale's codeset is ASCII, and if the input character
is >= 0x80, mbrtowc returns -1 with errno set to EILSEQ.

That is what happens on Cygwin.  The regcomp routine converting the
pattern to wchar_t calls mbrtowc, and mbrtowc returns -1 (EILSEQ)
because the characters in the bracket expression are >= 0x80.

Even though the mbrtowc functions behave the same in Cygwin and glibc,
the glibc implementation of regcomp apparently does not call mbrtowc
under all circumstances, namely not in the "C"/"POSIX" locale or if
the locale's codeset is ASCII.  Therefore it does not treat the chars
>= 0x80 as invalid characters.

So, what I did now was this: I added a workaround to Cygwin's regcomp.
If the current codeset is ASCII, the characters in the pattern are
converted to wchar_t by simply using their unsigned value verbatim.
This allows the patterns in the git testcases to be compiled (and
tested).

However, please note that this behaviour, while being provided by
glibc and now by Cygwin, is *not* standards-compliant.  In the narrow
sense the characters beyond 0x7f are still invalid ASCII chars, and
other functions working with wchar_t strings won't be as forgiving
when given invalid input.


HTH,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat
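
For readers who want to see the idea in code: the workaround described
in the message (use each pattern byte's unsigned value verbatim when
the codeset is ASCII, otherwise go through mbrtowc) might look roughly
like the sketch below.  This is not the actual Cygwin patch.  The
helper names codeset_is_ascii and pattern_next_wc are hypothetical,
and the codeset names other than ANSI_X3.4-1968 are assumptions about
what other platforms may report from nl_langinfo(CODESET).

```c
#include <langinfo.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: report whether the current locale's codeset is
   ASCII.  "ANSI_X3.4-1968" is the name mentioned in the message; the
   other two spellings are assumptions for other platforms. */
int codeset_is_ascii(void)
{
    const char *cs = nl_langinfo(CODESET);
    return strcmp(cs, "ANSI_X3.4-1968") == 0 ||
           strcmp(cs, "US-ASCII") == 0 ||
           strcmp(cs, "ASCII") == 0;
}

/* Hypothetical helper: fetch the next pattern character as wchar_t.
   Return values mirror mbrtowc: number of bytes consumed, 0 for the
   terminating NUL, (size_t)-1 on an invalid sequence, (size_t)-2 if
   no complete character is available within n bytes. */
size_t pattern_next_wc(wchar_t *wc, const char *s, size_t n,
                       mbstate_t *ps, int ascii_codeset)
{
    if (n == 0)
        return (size_t)-2;
    if (ascii_codeset) {
        /* Workaround path: take the byte's unsigned value verbatim,
           so bytes >= 0x80 no longer fail with EILSEQ. */
        *wc = (wchar_t)(unsigned char)*s;
        return (*s == '\0') ? 0 : 1;
    }
    /* Normal path: let the locale's converter decide. */
    return mbrtowc(wc, s, n, ps);
}
```

A caller would compute the flag once, e.g. with
`int ascii = codeset_is_ascii();`, and then walk the pattern with
pattern_next_wc.  With the flag set, the byte 0xc0 from the bracket
expression yields the wide character 0xc0 instead of an error; as the
message notes, that is lenient rather than standards-compliant.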