Message-ID: | <4AF42B15.9050100@byu.net> |
Date: | Fri, 06 Nov 2009 06:56:37 -0700 |
From: | Eric Blake <ebb9 AT byu DOT net> |
User-Agent: | Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.23) Gecko/20090812 Thunderbird/2.0.0.23 Mnenhy/0.7.6.666 |
MIME-Version: | 1.0 |
To: | cygwin AT cygwin DOT com, Grep Development List <bug-grep AT gnu DOT org> |
Subject: | Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file |
References: | <26224019 DOT post AT talk DOT nabble DOT com> <4AF393C6 DOT 3000505 AT tlinx DOT org> <20091106033243 DOT GB30410 AT ednor DOT casa DOT cgf DOT cx> <4AF42027 DOT 80604 AT towo DOT net> <20091106135152 DOT GK26344 AT calimero DOT vinschen DOT de> |
In-Reply-To: | <20091106135152.GK26344@calimero.vinschen.de> |
X-IsSubscribed: | yes |
Mailing-List: | contact cygwin-help AT cygwin DOT com; run by ezmlm |
List-Id: | <cygwin.cygwin.com> |
List-Subscribe: | <mailto:cygwin-subscribe AT cygwin DOT com> |
List-Archive: | <http://sourceware.org/ml/cygwin/> |
List-Post: | <mailto:cygwin AT cygwin DOT com> |
List-Help: | <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs> |
Sender: | cygwin-owner AT cygwin DOT com |
Mail-Followup-To: | cygwin AT cygwin DOT com |
Delivered-To: | mailing list cygwin AT cygwin DOT com |
According to Corinna Vinschen on 11/6/2009 6:51 AM:
>> The problem *is* with grep (and sed), however, because there is no
>> good reason that UTF-8 should give us a penalty of being 100 times
>> slower on most search operations, this is just poor programming of
>> grep and sed.
>
> The penalty on Linux is much smaller, about 15-20%.  It looks like
> grep is calling malloc for every input line if MB_CUR_MAX is > 1.
> Then it evaluates for each byte in the line whether the byte is a
> single byte or the start of a multibyte sequence using mbrtowc on
> every character on the input line.  Then, for each potential match,
> it checks if it's the start byte of a multibyte sequence and ignores
> all other matches.  Eventually, it calls free, and the game starts
> over for the next line.

Adding bug-grep, since this slowdown caused by additional mallocs is
definitely the sign of a poor algorithm that could be improved by
reusing existing buffers.

--
Don't work too hard, make some time for fun as well!

Eric Blake             ebb9 AT byu DOT net
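
For illustration only, here is a rough sketch of what "reusing existing buffers" could look like for the per-line character-start scan described in the quoted text. This is not grep's actual code: the names start_map, mark_char_starts and start_map_free are made up for this example, and the way invalid bytes are handled (each counted as a one-byte character) is an assumption. The point is that the scratch buffer lives across input lines and only grows when a longer line shows up, so the allocator is not called once per line.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical scratch buffer that is kept alive across input lines. */
struct start_map {
    unsigned char *starts;  /* starts[i] != 0 if byte i begins a character */
    size_t cap;             /* allocated size of starts, in bytes */
};

/* Mark which bytes of line[0..len) begin a character in the current locale.
 * The buffer is grown only when a line is longer than any seen so far;
 * shorter lines reuse the existing allocation.
 * Returns 0 on success, -1 if the buffer could not be grown. */
static int mark_char_starts(struct start_map *map, const char *line, size_t len)
{
    assert(map != NULL && (line != NULL || len == 0));

    if (len > map->cap) {
        size_t new_cap = map->cap ? map->cap : 64;
        while (new_cap < len)
            new_cap *= 2;
        unsigned char *p = realloc(map->starts, new_cap);
        if (p == NULL)
            return -1;
        map->starts = p;
        map->cap = new_cap;
    }
    if (len == 0)
        return 0;

    memset(map->starts, 0, len);

    mbstate_t st;
    memset(&st, 0, sizeof st);

    size_t i = 0;
    while (i < len) {
        size_t n = mbrtowc(NULL, line + i, len - i, &st);
        map->starts[i] = 1;
        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Assumption: an invalid or truncated sequence is treated as a
             * single one-byte character, and the shift state is reset. */
            memset(&st, 0, sizeof st);
            n = 1;
        } else if (n == 0) {
            n = 1;  /* embedded NUL byte still occupies one byte */
        }
        i += n;
    }
    return 0;
}

/* Release the scratch buffer once all input has been processed. */
static void start_map_free(struct start_map *map)
{
    free(map->starts);
    map->starts = NULL;
    map->cap = 0;
}
```

A caller would zero-initialize one struct start_map before the read loop, call mark_char_starts once per line, consult starts[offset] for each candidate match offset, and call start_map_free after the last line. Whether this removes most of the slowdown would need measuring; the mbrtowc scan itself is still done per line in this sketch.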