X-Recipient: archive-cygwin AT delorie DOT com X-Spam-Check-By: sourceware.org Date: Fri, 6 Nov 2009 15:26:44 +0100 From: Corinna Vinschen To: cygwin AT cygwin DOT com Cc: bug-grep AT gnu DOT org Subject: Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file Message-ID: <20091106142644.GL26344@calimero.vinschen.de> Mail-Followup-To: cygwin AT cygwin DOT com, bug-grep AT gnu DOT org References: <26224019 DOT post AT talk DOT nabble DOT com> <4AF393C6 DOT 3000505 AT tlinx DOT org> <20091106033243 DOT GB30410 AT ednor DOT casa DOT cgf DOT cx> <4AF42027 DOT 80604 AT towo DOT net> <20091106135152 DOT GK26344 AT calimero DOT vinschen DOT de> <4AF42B15 DOT 9050100 AT byu DOT net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4AF42B15.9050100@byu.net> User-Agent: Mutt/1.5.20 (2009-06-14) Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: cygwin-owner AT cygwin DOT com Mail-Followup-To: cygwin AT cygwin DOT com Delivered-To: mailing list cygwin AT cygwin DOT com On Nov 6 06:56, Eric Blake wrote: > According to Corinna Vinschen on 11/6/2009 6:51 AM: > >> The problem *is* with grep (and sed), however, because there is no > >> good reason that UTF-8 should give us a penalty of being 100times > >> slower on most search operations, this is just poor programming of > >> grep and sed. > > > > The penalty on Linux is much smaller, about 15-20%. It looks like > > grep is calling malloc for every input line if MB_CUR_MAX is > 1. > > Then it evaluates for each byte in the line whether the byte is a > > single byte or the start of a multibyte sequence using mbrtowc on > > every charatcer on the input line. Then, for each potential match, > > it checks if it's the start byte of a multibyte sequence and ignores > > all other matches. Eventually, it calls free, and the game starts > > over for the next line. > > Adding bug-grep, since this slowdown caused by additional mallocs is > definitely the sign of a poor algorithm that could be improved by reusing > existing buffers. I created a simple testcase: ==== SNIP === #include #include #include #include int main (int argc, char **argv) { const char in[] = "The quick brown fox jumps over the lazy dog"; int line, i; mbstate_t mbs; size_t mbclen; size_t size = sizeof (in); wchar_t wc; int lines = argc > 1 ? atoi (argv[1]) : 1000; int do_malloc = 1; int do_mbrtowc = 1; if (argc > 2) do_malloc = atoi (argv[2]); if (argc > 3) do_mbrtowc = atoi (argv[3]); printf ("with malloc: %d, with mbrtowc: %d\n", do_malloc, do_mbrtowc); memset (&mbs, 0, sizeof mbs); for (line = 0; line < lines; ++line) { char *x; if (do_malloc) x = malloc (size); if (do_mbrtowc) for (i = 0; i < size; i += mbclen) if ((int)(mbclen = mbrtowc(&wc, in + i, size - i, &mbs)) <= 0) break; if (do_malloc) free (x); } return 0; } ==== SNAP ==== Under Cygwin (tcsh time output): $ setenv LANG en_US.UTF-8 $ time ./mb 1000000 1 0 with malloc: 1, with mbrtowc: 0 0.328u 0.031s 0:00.34 102.9% 0+0k 0+0io 1834pf+0w $ time ./mb 1000000 0 1 with malloc: 0, with mbrtowc: 1 1.921u 0.092s 0:02.09 96.1% 0+0k 0+0io 1827pf+0w $ time ./mb 1000000 1 1 with malloc: 1, with mbrtowc: 1 2.062u 0.140s 0:02.15 102.3% 0+0k 0+0io 1839pf+0w Running on the same CPU under Linux: $ setenv LANG en_US.UTF-8 $ time ./mb 1000000 1 0 with malloc: 1, with mbrtowc: 0 0.088u 0.004s 0:00.09 88.8% 0+0k 0+0io 0pf+0w $ time ./mb 1000000 0 1 with malloc: 0, with mbrtowc: 1 1.836u 0.000s 0:01.85 98.9% 0+0k 0+0io 0pf+0w $ time ./mb 1000000 1 1 with malloc: 1, with mbrtowc: 1 1.888u 0.000s 0:01.93 97.4% 0+0k 0+0io 0pf+0w So, while Linux is definitely faster, the number are still comparable for 1 million iterations. That still doens't explain why grep is a multitude slower when using UTF-8 as charset. Puzzled, Corinna -- Corinna Vinschen Please, send mails regarding Cygwin to Cygwin Project Co-Leader cygwin AT cygwin DOT com Red Hat -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple