Mail Archives: cygwin/2001/06/12/01:52:21
"Eli Zaretskii" <eliz AT is DOT elta DOT co DOT il> writes:
> One notorious problem with GNU regex is that it is quite slow for many
> simple jobs, such as matching a simple regular expression with no
> backtracking. It seems that the main reason for this slowness is the
> fact that GNU regex supports null characters in strings. For
> examnple, Sed 3.02 compiled with GNU regex is about 2-4 times slower
> on simple jobs than the same Sed compiled with Spencer's regex
> library. (The DJGPP port of Sed is actually distributed with two
> executables, one build with GNU regex, the other with Spencer's, for
> this very reason.)
I'm suspicious of this assertion. I've worked on GNU regexp in the
past, and I don't see any reason this should be so.
However, I was messing around with regexps a lot when GNU regexp
suddenly became slow on certain regexps. I looked into the cause, and
it turned out that this was because GNU regexp had been made to comply
with the POSIX regexp specification. POSIX regexp semantics require
that the regexp match the longest possible string (I may have the
details wrong, but it's something like that). For backtracking regexp
engines (the GNU, Henry Spencer, and Perl regexp matchers are all of
this design), that innocent-sounding constraint basically requires
insanely slow behavior.
GNU regexp has a flag that allows you to turn this behavior off, and
get the traditional, faster, non-POSIX-compliant behavior. So I don't
see any reason the GNU regexp library couldn't serve all the GPL'd
software's needs.
----
The details, for the curious:
Suppose you have a regexp like this (assume the obvious
metacharacters)
(FOObar|FOO)(barbar)+
---a-- -b- ---c-- <= labels for pieces of the regexp
which you're matching against the string
FOObarbarbarbar
0 3 6 9 12
and let's suppose you're calling the regexp library in a manner which
asks "does a prefix of this string match this regexp?" (That is,
you're not asking "does this regexp match this entire string?")
The traditional behavior is for the regexp engine to match part ---a--
of the regexp against data[0..5], match one repetition of part ---c--
against data[6..8], and say it's done. The Perl regexp matcher will
return this match.
But this isn't the behavior POSIX requires. POSIX says you must
return the *longest possible* match. So a POSIX-compliant matcher
must match -b- against data[0..2], and then two repetitions of ---c--
against data[3..8] and data[9..14]. This is a longer match.
To find the longest match, in general, a backtracking matcher has to
generate every possible match, and return the longest one it found.
This is what GNU regexp does.
So, just how bad is this? Well, suppose you're matching a regexp like:
.*.*.*.*.*.*
against a string like
aaaaaaaaaaaaaaaaaaaa
To generate every possible match, you have to choose every possible
way to divide up those twenty a's amongst six .* patterns. I think
this is 20 choose 5, or 1.9 million, matches you have to try. In
general, I think the time to match POSIXly can increase exponentially
in the length of your regexp, given a long enough data string.
If you have a smart regexp compiler (I understand Perl is pretty
clever), then you could probably handle this regexp with a bit more
aplomb. But I'll bet that I can find a regexp where you really do
have to try all matchings, no matter how smart your regexp compiler
is.
(So think of that the next time you propose some innocent-sounding
constraint, like "longest match always"!)
Anyway, the outcome is that all the really popular regexp matchers
either don't implement the POSIX behavior, or provide options to turn
it off. For GNU regexp, you can use the RE_NO_POSIX_BACKTRACKING
flag, and you'll get the traditional not-always-the-longest-match nice
fast behavior. Perl simply documents the traditional behavior ("The
[Perl regexp] engine thinks locally and acts globally," as the Camel
book puts it).
--
Want to unsubscribe from this list?
Check out: http://cygwin.com/ml/#unsubscribe-simple
- Raw text -