Mail Archives: cygwin/2009/11/06/08:52:11

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Fri, 6 Nov 2009 14:51:52 +0100
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file
Message-ID: <20091106135152.GK26344@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
References: <26224019 DOT post AT talk DOT nabble DOT com> <4AF393C6 DOT 3000505 AT tlinx DOT org> <20091106033243 DOT GB30410 AT ednor DOT casa DOT cgf DOT cx> <4AF42027 DOT 80604 AT towo DOT net>
MIME-Version: 1.0
In-Reply-To: <4AF42027.80604@towo.net>
User-Agent: Mutt/1.5.20 (2009-06-14)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On Nov  6 14:09, Thomas Wolff wrote:
> Christopher Faylor wrote:
> >On Thu, Nov 05, 2009 at 07:11:02PM -0800, Linda Walsh wrote:
> >>aputerguy wrote:
> >>>Running grep on a 20MB file with ~100,000 matches takes an incredible almost
> >>>8 minutes under Cygwin 1.7 while taking just 0.2 seconds under Cygwin 1.5
> >>>(on a 2nd machine).
> >>I've seen nasty behavior with grep that isn't cygwin specific.  Try
> >>"pcregrep" and see if you have the same issue.
> >>
> >>I found it to be about ~100 times faster under _some_ searches though
> >>2-3x is more typical.  The gnu re-parser isn't real efficient under
> >>some circumstances.
> >>
> >>If you find a big difference, you might also want to report it to the
> >>bug-grep AT gnu DOT org mailing list, but last time I did, they told me
> >>"that's the way it is" due to some posix conformance thing...
> >
> >The fact that it behaves differently between Cygwin 1.5 and 1.7 would
> >suggest that this isn't a grep problem.
> This is likely to be triggered by the transition to UTF-8 as a
> default charset. The same problem is observed on Linux, with grep as
> well as with sed.
> That's why I have changed most of my shell scripts to use something like
> LC_ALL=C grep or LC_ALL=C sed
> where possible. Please try this.

Or try LANG=C.ASCII since LANG=C will still return UTF-8 as charset
when calling nl_langinfo(CODESET).
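
To see what a given locale setting actually resolves to, something like
the following could be used.  This is only a minimal sketch: the helper
name describe_locale is made up for illustration, and CODESET is the
POSIX item name for the charset query.  It reports the charset and
MB_CUR_MAX, which is the value that decides whether the multibyte code
path is taken.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: switch to the named locale (use "" for the
   locale given by the environment) and write a short description of
   it into buf.  Returns 0 on success, -1 if the locale is unavailable. */
int describe_locale(const char *name, char *buf, size_t buflen)
{
    if (setlocale(LC_ALL, name) == NULL)
        return -1;

    /* nl_langinfo(CODESET) names the charset of the current locale;
       MB_CUR_MAX is the maximum number of bytes per character in it. */
    snprintf(buf, buflen, "charset=%s MB_CUR_MAX=%zu",
             nl_langinfo(CODESET), (size_t)MB_CUR_MAX);
    return 0;
}
```

Calling describe_locale("", buf, sizeof buf) once with LANG=C and once
with LANG=C.ASCII in the environment, and printing buf, should show
whether the two settings really differ on a given system.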

> The problem *is* with grep (and sed), however, because there is no
> good reason that UTF-8 should give us a penalty of being 100 times
> slower on most search operations, this is just poor programming of
> grep and sed.

The penalty on Linux is much smaller, about 15-20%.  It looks like
grep is calling malloc for every input line if MB_CUR_MAX is > 1.
Then it calls mbrtowc on every character of the input line to
determine, for each byte, whether it is a single-byte character or
the start of a multibyte sequence.  Then, for each potential match,
it checks whether the match begins on such a start byte and ignores
all other matches.  Eventually, it calls free, and the game starts
over for the next line.
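
The per-line pattern described above can be sketched roughly as
follows.  This is not grep's actual code, just an illustration under
my reading of it; the names mark_char_starts and match_is_valid are
hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Build one flag per byte of the line: 1 if a character starts at that
   byte, 0 if the byte continues a multibyte sequence.  The result is
   malloc'ed once per line and must be freed by the caller. */
unsigned char *mark_char_starts(const char *line, size_t len)
{
    assert(line != NULL);

    unsigned char *starts = malloc(len ? len : 1);
    if (starts == NULL)
        return NULL;
    memset(starts, 0, len ? len : 1);

    mbstate_t st;
    memset(&st, 0, sizeof st);

    size_t i = 0;
    while (i < len) {
        /* One mbrtowc call per character of the input line. */
        size_t n = mbrtowc(NULL, line + i, len - i, &st);
        starts[i] = 1;
        if (n == 0 || n == (size_t)-1 || n == (size_t)-2) {
            /* NUL, invalid or incomplete sequence: treat the byte as a
               character of its own and reset the conversion state. */
            memset(&st, 0, sizeof st);
            n = 1;
        }
        i += n;
    }
    return starts;
}

/* A potential match is kept only if it begins on a start byte. */
int match_is_valid(const unsigned char *starts, size_t offset)
{
    return starts[offset] != 0;
}
```

For every line, the caller would run mark_char_starts, filter the
candidate matches through match_is_valid, and then free the array, so
the cost is one malloc/free pair plus one mbrtowc call per character.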

It appears that either our malloc is that slow, or the mbrtowc call.
But I can't really believe the latter.  The function should be quite
fast, as far as I can see...
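
One way to tell the two suspects apart would be to time them
separately.  The following is a rough sketch only, with hypothetical
function names, and clock() is a coarse CPU-time measure, so the
numbers are only good for comparing orders of magnitude.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <wchar.h>

/* CPU seconds spent on `iterations` malloc/free pairs of `size` bytes. */
double time_malloc_free(size_t iterations, size_t size)
{
    assert(size > 0);
    clock_t t0 = clock();
    for (size_t i = 0; i < iterations; i++) {
        volatile char *p = malloc(size);
        if (p != NULL)
            p[0] = (char)i;   /* touch the block so the pair is not optimized away */
        free((void *)p);
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* CPU seconds spent walking `line` with mbrtowc, `iterations` times. */
double time_mbrtowc(size_t iterations, const char *line, size_t len)
{
    volatile size_t sink = 0;   /* keeps the loop from being dropped */
    clock_t t0 = clock();
    for (size_t it = 0; it < iterations; it++) {
        mbstate_t st;
        memset(&st, 0, sizeof st);
        for (size_t i = 0; i < len; ) {
            size_t n = mbrtowc(NULL, line + i, len - i, &st);
            if (n == 0 || n == (size_t)-1 || n == (size_t)-2) {
                /* Treat NUL, invalid or incomplete input as one byte. */
                memset(&st, 0, sizeof st);
                n = 1;
            }
            sink += n;
            i += n;
        }
    }
    (void)sink;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Running both after setlocale(LC_ALL, "") with a typical input line and
an iteration count close to the number of lines in the file should
indicate which of the two dominates.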


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

