Mail Archives: cygwin/2009/11/06/08:10:16

Message-ID: <4AF42027.80604@towo.net>
Date: Fri, 06 Nov 2009 14:09:59 +0100
From: Thomas Wolff <towo AT towo DOT net>
User-Agent: Thunderbird 2.0.0.23 (Windows/20090812)
MIME-Version: 1.0
To: cygwin AT cygwin DOT com
Subject: Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file
References: <26224019 DOT post AT talk DOT nabble DOT com> <4AF393C6 DOT 3000505 AT tlinx DOT org> <20091106033243 DOT GB30410 AT ednor DOT casa DOT cgf DOT cx>
In-Reply-To: <20091106033243.GB30410@ednor.casa.cgf.cx>

Christopher Faylor wrote:
> On Thu, Nov 05, 2009 at 07:11:02PM -0800, Linda Walsh wrote:
>   
>> aputerguy wrote:
>>     
>>> Running grep on a 20MB file with ~100,000 matches takes an incredible almost
>>> 8 minutes under Cygwin 1.7 while taking just 0.2 seconds under Cygwin 1.5
>>> (on a 2nd machine).
>>>       
>> I've seen nasty behavior with grep that isn't cygwin specific.  Try
>> "pcregrep" and see if you have the same issue.
>>
>> I found it to be about 100 times faster under _some_ searches, though
>> 2-3x is more typical.  The gnu re-parser isn't very efficient under
>> some circumstances.
>>
>> If you find a big difference, you might also want to report it to the
>> bug-grep AT gnu DOT org mailing list, but last time I did, they told me
>> "that's the way it is" due to some posix conformance thing...
>>     
>
> The fact that it behaves differently between Cygwin 1.5 and 1.7 would
> suggest that this isn't a grep problem.
>   
This is likely triggered by the transition to UTF-8 as the default
charset. The same problem can be observed on Linux, with grep as well as
with sed.
That's why I have changed most of my shell scripts to use something like
LC_ALL=C grep or LC_ALL=C sed
where possible. Please try this.
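
A minimal sketch of the workaround, assuming a POSIX shell. The temporary file, its contents, and the patterns are hypothetical and only there to make the example self-contained. Setting LC_ALL=C as a prefix affects just that one command, so the rest of the script keeps its normal locale.

```shell
#!/bin/sh
# Hypothetical sample data; the file contents are an assumption for illustration.
tmpfile=$(mktemp)
printf 'foo 1\nbar 2\nfoo 3\n' > "$tmpfile"

# Force the C locale for this single command, so grep compares bytes
# instead of decoding multibyte UTF-8 characters.
match_count=$(LC_ALL=C grep -c 'foo' "$tmpfile")
echo "matches: $match_count"

# The same prefix works for sed.
replaced=$(LC_ALL=C sed 's/foo/baz/' "$tmpfile" | head -n 1)
echo "$replaced"

rm -f "$tmpfile"
```

Note that in the C locale, patterns and character classes are matched bytewise, so this only fits searches that do not depend on non-ASCII character handling (hence "where possible").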

The problem *is* with grep (and sed), however: there is no good
reason for UTF-8 to make most search operations 100 times slower.
This is just poor programming in grep and sed.

Thomas
