Date: Fri, 6 Nov 2009 14:51:52 +0100
From: Corinna Vinschen
To: cygwin AT cygwin DOT com
Subject: Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

On Nov  6 14:09, Thomas Wolff wrote:
> Christopher Faylor wrote:
> >On Thu, Nov 05, 2009 at 07:11:02PM -0800, Linda Walsh wrote:
> >>aputerguy wrote:
> >>>Running grep on a 20MB file with ~100,000 matches takes an incredible almost
> >>>8 minutes under Cygwin 1.7 while taking just 0.2 seconds under Cygwin 1.5
> >>>(on a 2nd machine).
> >>I've seen nasty behavior with grep that isn't cygwin specific.  Try
> >>"pcregrep" and see if you have the same issue.
> >>
> >>I found it to be about ~100 times faster under _some_ searches though
> >>2-3x is more typical.  The gnu re-parser isn't real efficient under
> >>some circumstances.
> >>
> >>If you find a big difference, you might also want to report it to the
> >>bug-grep AT gnu DOT org mailing list, but last time I did, they told me
> >>"that's the way it is" due to some posix conformance thing...
> >
> >The fact that it behaves differently between Cygwin 1.5 and 1.7 would
> >suggest that this isn't a grep problem.
> This is likely to be triggered by the transition to UTF-8 as a
> default charset. The same problem is observed on Linux, with grep as
> well as with sed.
> That's why I have changed most of my shell scripts to use something like
> LC_ALL=C grep or LC_ALL=C sed
> where possible. Please try this.

Or try LANG=C.ASCII since LANG=C will still return UTF-8 as charset when
calling nl_langinfo(CHARSET).

> The problem *is* with grep (and sed), however, because there is no
> good reason that UTF-8 should give us a penalty of being 100 times
> slower on most search operations, this is just poor programming of
> grep and sed.

The penalty on Linux is much smaller, about 15-20%.  It looks like
grep is calling malloc for every input line if MB_CUR_MAX is > 1.
Then it evaluates for each byte in the line whether the byte is a
single byte or the start of a multibyte sequence, using mbrtowc on
every character of the input line.  Then, for each potential match, it
checks if it's the start byte of a multibyte sequence and ignores all
other matches.  Eventually, it calls free, and the game starts over for
the next line.

It appears that either our malloc is that slow, or the mbrtowc call.
But I can't really believe the latter.  The function should be quite
fast, as far as I can see...


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
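
For illustration only, the per-line pattern described above (allocate a
buffer, classify every byte with mbrtowc, accept only matches that begin
on a character boundary, then free) might look roughly like the sketch
below.  This is not grep's actual source; the names build_start_map and
match_is_valid are hypothetical, and treating invalid or incomplete
sequences as single bytes is an assumption made for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, NOT taken from grep: build a per-byte map for one
   input line.  map[i] is 1 where a character starts and 0 for the
   trailing bytes of a multibyte sequence.  The caller frees the result,
   so there is one malloc/free pair per input line. */
unsigned char *build_start_map(const char *line, size_t len)
{
    assert(line != NULL);

    size_t alloc = len ? len : 1;          /* avoid a zero-sized malloc */
    unsigned char *map = malloc(alloc);
    if (map == NULL)
        return NULL;
    memset(map, 0, alloc);

    mbstate_t state;
    memset(&state, 0, sizeof state);

    size_t i = 0;
    while (i < len) {
        /* One mbrtowc call per character of the line. */
        size_t n = mbrtowc(NULL, line + i, len - i, &state);
        if (n == (size_t)-1 || n == (size_t)-2 || n == 0) {
            /* Assumption: invalid, incomplete, or NUL input is treated
               as a single byte, and the conversion state is reset. */
            memset(&state, 0, sizeof state);
            n = 1;
        }
        map[i] = 1;
        i += n;
    }
    return map;
}

/* Hypothetical check: a byte-level match at offset off only counts if it
   begins on a character boundary. */
int match_is_valid(const unsigned char *map, size_t off)
{
    return map[off] != 0;
}
```

With a single-byte locale such as LC_ALL=C every byte is a character
start, so a matcher can skip this bookkeeping entirely, which would be
consistent with the workaround quoted above.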