Mail Archives: cygwin/2013/07/21/15:40:19

Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com
X-Spam-SWARE-Status: No, score=0.2 required=5.0 tests=AWL,BAYES_50,RDNS_NONE,TW_EG,TW_NX autolearn=no version=3.3.1
Date: Sun, 21 Jul 2013 21:39:53 +0200
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: regex library fails git tests
Message-ID: <20130721193953.GC2661@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
References: <ksepor$cag$1 AT ger DOT gmane DOT org>
MIME-Version: 1.0
In-Reply-To: <ksepor$cag$1@ger.gmane.org>
User-Agent: Mutt/1.5.21 (2010-09-15)

On Jul 20 15:52, Mark Levedahl wrote:
> Current git fails two sets of tests on cygwin due apparently to
> problems in the regex library. One set of tests does language based
> word-matching, and has a common failure during regex compilation.
> The suffix clause ("|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+") is
> common to all of these, removing that clause eliminates the regcomp
> failure.
> 
> A test case extracted from the git sources is below - this works
> correctly on Fedora 18, fails on Cygwin:
> 
> $ gcc test-regex.c
> $ ./a.out
> failed regcomp() for pattern '[^<>=     ]+|[^[:space:]]|[▒-▒][▒-▒]+'
> 
> The failure disappears when the suffix clause is removed from pat_html.
> 
> This is happening on a current installation:
> $ uname -a
> CYGWIN_NT-5.1 virt-winxp 1.7.21(0.267/5/3) 2013-07-15 12:17 i686 Cygwin

Thanks for the testcase.  The problem is this:  Cygwin's regex is taken
from FreeBSD, so it's not identical to the glibc implementation on Linux.
The FreeBSD implementation converts all input chars to wchar_t and then
handles everything, the pattern as well as the input string, in wchar_t
to be locale- and codeset-independent.

Your application does not call setlocale, so the locale is "C" or "POSIX"
and the codeset is ANSI_X3.4-1968 (aka ASCII).  The conversion to wchar_t
is performed by calling the mbrtowc function.  This function behaves on
Cygwin the same as on Linux:  If the current locale's codeset is ASCII,
and if the input character is >= 0x80, mbrtowc returns (size_t)-1 with
errno set to EILSEQ.
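
To check this behaviour on a given system, a minimal sketch along the
following lines may help.  The helper name probe_byte is made up for
illustration and is not part of any library.  It feeds a single byte to
mbrtowc in whatever locale is currently active, so without a setlocale
call it runs in the "C" locale.  Other C libraries, or later releases,
may accept bytes >= 0x80 here rather than fail.

```c
#include <errno.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (illustration only): convert one byte with
 * mbrtowc() in the current locale.  Returns mbrtowc's result; if the
 * conversion fails with (size_t)-1, the errno value is stored in *err,
 * otherwise *err is set to 0. */
static size_t probe_byte(unsigned char byte, wchar_t *wc, int *err)
{
    mbstate_t st;
    char in = (char)byte;
    size_t ret;

    memset(&st, 0, sizeof st);      /* start in the initial shift state */
    errno = 0;
    ret = mbrtowc(wc, &in, 1, &st);
    *err = (ret == (size_t)-1) ? errno : 0;
    return ret;
}

/* Usage, e.g. from main():
 *     wchar_t wc; int err;
 *     size_t r = probe_byte(0xC0, &wc, &err);
 *     if (r == (size_t)-1 && err == EILSEQ)
 *         puts("0xC0 rejected in this locale");
 */
```

Calling probe_byte with an ASCII byte such as 'a' should always succeed
and return 1; the interesting case is any byte in the range 0x80..0xff.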

This is what happens on Cygwin.  While converting the pattern to wchar_t,
regcomp calls mbrtowc, and mbrtowc returns (size_t)-1 (EILSEQ) because the
bracket expression contains characters >= 0x80.

Even though the mbrtowc functions behave the same in Cygwin and glibc,
the glibc implementation of regcomp apparently does not call mbrtowc
under all circumstances, namely not in the "C"/"POSIX" locale or if the
locale's codeset is ASCII.  Therefore it does not treat the chars >= 0x80
as invalid characters.
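
The difference can be observed with a small sketch like the one below.
It is not the original test-regex.c, which is not reproduced in this
mail.  The helper name try_compile, the use of REG_EXTENDED, and the
assumption that the whitespace inside the first bracket expression is a
space followed by a tab are all mine.  The suffix clause is the one
quoted from the report.

```c
#include <regex.h>
#include <stddef.h>

/* Pattern without and with the suffix clause from the report.  Bytes
 * >= 0x80 are written as hex escapes so this source stays pure ASCII.
 * Assumption: the blank run in the first bracket is space + tab. */
static const char pat_prefix[] = "[^<>= \t]+|[^[:space:]]";
static const char pat_full[] =
    "[^<>= \t]+|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+";

/* Hypothetical helper (illustration only): compile the pattern as an
 * extended regex in the current locale and return regcomp()'s error
 * code, 0 on success. */
static int try_compile(const char *pattern)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED);

    if (rc == 0)
        regfree(&re);               /* only free a successfully compiled regex */
    return rc;
}
```

With no setlocale call, try_compile(pat_prefix) should succeed
everywhere, while the result of try_compile(pat_full) depends on whether
the regex implementation pushes the high bytes through mbrtowc.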

So, what I did now was this:  I added a workaround to Cygwin's regcomp.
If the current codeset is ASCII, the characters in the pattern are
converted to wchar_t by simply using their unsigned value verbatim.
This allows the patterns in the git testcases to be compiled (and tested).
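
The idea of the workaround can be sketched as follows.  This is not the
actual code in Cygwin's regcomp; the function names and the way the
codeset is detected via nl_langinfo(CODESET) are assumptions made for
illustration.  The codeset name compared against is the one mentioned
above, and other systems may report ASCII under a different name.

```c
#include <langinfo.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical check: is the current locale's codeset plain ASCII?
 * Assumption: the codeset is reported as "ANSI_X3.4-1968". */
static int codeset_is_ascii(void)
{
    return strcmp(nl_langinfo(CODESET), "ANSI_X3.4-1968") == 0;
}

/* Hypothetical helper (illustration only): widen one pattern character.
 * In an ASCII codeset the byte is taken verbatim as its unsigned value,
 * so bytes >= 0x80 no longer fail.  Otherwise the usual mbrtowc()
 * conversion is used and its result is returned unchanged. */
static size_t widen_pattern_char(const char *s, size_t n, int ascii,
                                 mbstate_t *st, wchar_t *out)
{
    if (ascii) {
        *out = (wchar_t)(unsigned char)*s;
        return 1;                   /* exactly one byte consumed */
    }
    return mbrtowc(out, s, n, st);
}

/* Usage, e.g. inside a pattern-scanning loop:
 *     len = widen_pattern_char(p, end - p, codeset_is_ascii(), &st, &wc);
 */
```

Passing the ASCII flag in as a parameter keeps the conversion itself
independent of how the codeset is detected.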

However, please note that this behaviour, while being provided by glibc
and now by Cygwin, is *not* standards-compliant.  Strictly speaking,
the characters beyond 0x7f are still invalid ASCII chars, and other
functions working with wchar_t strings won't be as forgiving when given
invalid input.


HTH,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
