Mailing-List: contact cygwin-help AT sourceware DOT cygnus DOT com; run by ezmlm List-Subscribe: List-Archive: List-Post: List-Help: , Sender: cygwin-owner AT sources DOT redhat DOT com Delivered-To: mailing list cygwin AT sources DOT redhat DOT com To: Eli Zaretskii Cc: dj AT redhat DOT com, gcc AT gcc DOT gnu DOT org, gdb AT sources DOT redhat DOT com, binutils AT sources DOT redhat DOT com, cygwin AT sources DOT redhat DOT com Subject: Re: Another RFC: regex in libiberty References: <200106080127 DOT VAA01308 AT greed DOT delorie DOT com> <9003-Fri08Jun2001100651+0300-eliz AT is DOT elta DOT co DOT il> From: Jim Blandy Date: 12 Jun 2001 00:49:28 -0500 In-Reply-To: "Eli Zaretskii"'s message of Fri, 08 Jun 2001 10:06:51 +0300 Message-ID: Lines: 93 X-Mailer: Gnus v5.3/Emacs 19.34 "Eli Zaretskii" writes: > One notorious problem with GNU regex is that it is quite slow for many > simple jobs, such as matching a simple regular expression with no > backtracking. It seems that the main reason for this slowness is the > fact that GNU regex supports null characters in strings. For > examnple, Sed 3.02 compiled with GNU regex is about 2-4 times slower > on simple jobs than the same Sed compiled with Spencer's regex > library. (The DJGPP port of Sed is actually distributed with two > executables, one build with GNU regex, the other with Spencer's, for > this very reason.) I'm suspicious of this assertion. I've worked on GNU regexp in the past, and I don't see any reason this should be so. However, I was messing around with regexps a lot when GNU regexp suddenly became slow on certain regexps. I looked into the cause, and it turned out that this was because GNU regexp had been made to comply with the POSIX regexp specification. POSIX regexp semantics require that the regexp match the longest possible string (I may have the details wrong, but it's something like that). For backtracking regexp engines (the GNU, Henry Spencer, and Perl regexp matchers are all of this design), that innocent-sounding constraint basically requires insanely slow behavior. GNU regexp has a flag that allows you to turn this behavior off, and get the traditional, faster, non-POSIX-compliant behavior. So I don't see any reason the GNU regexp library couldn't serve all the GPL'd software's needs. ---- The details, for the curious: Suppose you have a regexp like this (assume the obvious metacharacters) (FOObar|FOO)(barbar)+ ---a-- -b- ---c-- <= labels for pieces of the regexp which you're matching against the string FOObarbarbarbar 0 3 6 9 12 and let's suppose you're calling the regexp library in a manner which asks "does a prefix of this string match this regexp?" (That is, you're not asking "does this regexp match this entire string?") The traditional behavior is for the regexp engine to match part ---a-- of the regexp against data[0..5], match one repetition of part ---c-- against data[6..8], and say it's done. The Perl regexp matcher will return this match. But this isn't the behavior POSIX requires. POSIX says you must return the *longest possible* match. So a POSIX-compliant matcher must match -b- against data[0..2], and then two repetitions of ---c-- against data[3..8] and data[9..14]. This is a longer match. To find the longest match, in general, a backtracking matcher has to generate every possible match, and return the longest one it found. This is what GNU regexp does. So, just how bad is this? Well, suppose you're matching a regexp like: .*.*.*.*.*.* against a string like aaaaaaaaaaaaaaaaaaaa To generate every possible match, you have to choose every possible way to divide up those twenty a's amongst six .* patterns. I think this is 20 choose 5, or 1.9 million, matches you have to try. In general, I think the time to match POSIXly can increase exponentially in the length of your regexp, given a long enough data string. If you have a smart regexp compiler (I understand Perl is pretty clever), then you could probably handle this regexp with a bit more aplomb. But I'll bet that I can find a regexp where you really do have to try all matchings, no matter how smart your regexp compiler is. (So think of that the next time you propose some innocent-sounding constraint, like "longest match always"!) Anyway, the outcome is that all the really popular regexp matchers either don't implement the POSIX behavior, or provide options to turn it off. For GNU regexp, you can use the RE_NO_POSIX_BACKTRACKING flag, and you'll get the traditional not-always-the-longest-match nice fast behavior. Perl simply documents the traditional behavior ("The [Perl regexp] engine thinks locally and acts globally," as the Camel book puts it). -- Want to unsubscribe from this list? Check out: http://cygwin.com/ml/#unsubscribe-simple