X-Recipient: archive-cygwin AT delorie DOT com X-SWARE-Spam-Status: No, hits=-2.3 required=5.0 tests=AWL,BAYES_00 X-Spam-Check-By: sourceware.org Message-ID: <4AF439F0.8060203@towo.net> Date: Fri, 06 Nov 2009 16:00:00 +0100 From: Thomas Wolff User-Agent: Thunderbird 2.0.0.23 (Windows/20090812) MIME-Version: 1.0 To: cygwin AT cygwin DOT com, bug-grep AT gnu DOT org Subject: Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file References: <26224019 DOT post AT talk DOT nabble DOT com> <4AF393C6 DOT 3000505 AT tlinx DOT org> <20091106033243 DOT GB30410 AT ednor DOT casa DOT cgf DOT cx> <4AF42027 DOT 80604 AT towo DOT net> <20091106135152 DOT GK26344 AT calimero DOT vinschen DOT de> <4AF42B15 DOT 9050100 AT byu DOT net> <20091106142644 DOT GL26344 AT calimero DOT vinschen DOT de> In-Reply-To: <20091106142644.GL26344@calimero.vinschen.de> Content-Type: multipart/mixed; boundary="------------080905000306010604040903" X-IsSubscribed: yes Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm List-Id: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: cygwin-owner AT cygwin DOT com Mail-Followup-To: cygwin AT cygwin DOT com Delivered-To: mailing list cygwin AT cygwin DOT com --------------080905000306010604040903 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Corinna Vinschen wrote: > On Nov 6 06:56, Eric Blake wrote: > >> According to Corinna Vinschen on 11/6/2009 6:51 AM: >> >>>> The problem *is* with grep (and sed), however, because there is no >>>> good reason that UTF-8 should give us a penalty of being 100times >>>> slower on most search operations, this is just poor programming of >>>> grep and sed. >>>> >>> The penalty on Linux is much smaller, about 15-20%. On my Linux system, the penalty is a factor between 6 and 8. >>> It looks like >>> grep is calling malloc for every input line if MB_CUR_MAX is > 1. >>> Then it evaluates for each byte in the line whether the byte is a >>> single byte or the start of a multibyte sequence using mbrtowc on >>> every charatcer on the input line. Then, for each potential match, >>> it checks if it's the start byte of a multibyte sequence and ignores >>> all other matches. Eventually, it calls free, and the game starts >>> over for the next line. >>> >> Adding bug-grep, since this slowdown caused by additional mallocs is >> definitely the sign of a poor algorithm that could be improved by reusing >> existing buffers. >> > > I created a simple testcase: > > ==== SNIP === > ... > ==== SNAP ==== > I extended your test program to demonstrate the inefficiency of the standard mbrtowc function. Instead I use a function from my editor (mined) to extract a Unicode character from a UTF-8 sequence. This is the simple case only, not converting character sets other than UTF-8 but that's the same thing mbrtowc does in the sample invocation. Program attached. Results below. > Under Cygwin (tcsh time output): > > $ setenv LANG en_US.UTF-8 > $ time ./mb 1000000 1 0 > with malloc: 1, with mbrtowc: 0 > 0.328u 0.031s 0:00.34 102.9% 0+0k 0+0io 1834pf+0w > $ time ./mb 1000000 0 1 > with malloc: 0, with mbrtowc: 1 > 1.921u 0.092s 0:02.09 96.1% 0+0k 0+0io 1827pf+0w > $ time ./mb 1000000 1 1 > with malloc: 1, with mbrtowc: 1 > 2.062u 0.140s 0:02.15 102.3% 0+0k 0+0io 1839pf+0w > > Running on the same CPU under Linux: > > $ setenv LANG en_US.UTF-8 > $ time ./mb 1000000 1 0 > with malloc: 1, with mbrtowc: 0 > 0.088u 0.004s 0:00.09 88.8% 0+0k 0+0io 0pf+0w > $ time ./mb 1000000 0 1 > with malloc: 0, with mbrtowc: 1 > 1.836u 0.000s 0:01.85 98.9% 0+0k 0+0io 0pf+0w > $ time ./mb 1000000 1 1 > with malloc: 1, with mbrtowc: 1 > 1.888u 0.000s 0:01.93 97.4% 0+0k 0+0io 0pf+0w > > So, while Linux is definitely faster, the number are still comparable > for 1 million iterations. That still doens't explain why grep is a > multitude slower when using UTF-8 as charset. > Results of mbrtowc vs. utftouni on Linux: thw[en_US.UTF-8]@scotty:~/tmp: locale charmap UTF-8 thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 1 0 with malloc: 0, with mbrtowc: 1, with utftouni: 0 real 0m2.897s user 0m2.836s sys 0m0.012s thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 0 1 with malloc: 0, with mbrtowc: 0, with utftouni: 1 real 0m0.030s user 0m0.028s sys 0m0.000s thw[en_US.UTF-8]@scotty:~/tmp: Results of mbrtowc vs. utftouni on cygwin: demsn702[C.UTF-8]@stbm8186:/H/tools: time ./uu.exe 1000000 0 1 0 with malloc: 0, with mbrtowc: 1, with utftouni: 0 real 0m3.034s user 0m2.974s sys 0m0.030s demsn702[C.UTF-8]@stbm8186:/H/tools: time ./uu.exe 1000000 0 0 1 with malloc: 0, with mbrtowc: 0, with utftouni: 1 real 0m0.110s user 0m0.070s sys 0m0.030s demsn702[C.UTF-8]@stbm8186:/H/tools: The conclusion is, as long as calling mbrtowc is as inefficient, a program caring about performance should not use it. Thomas --------------080905000306010604040903 Content-Type: text/plain; name="uu.c" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="uu.c" #include #include #include #include void utf8_info (u, length, ucs) char * u; int * length; unsigned long * ucs; { char * textpoi = u; unsigned char c = * textpoi; int utfcount; unsigned long unichar; if ((c & 0x80) == 0x00) { utfcount = 1; unichar = c; } else if ((c & 0xE0) == 0xC0) { utfcount = 2; unichar = c & 0x1F; } else if ((c & 0xF0) == 0xE0) { utfcount = 3; unichar = c & 0x0F; } else if ((c & 0xF8) == 0xF0) { utfcount = 4; unichar = c & 0x07; } else if ((c & 0xFC) == 0xF8) { utfcount = 5; unichar = c & 0x03; } else if ((c & 0xFE) == 0xFC) { utfcount = 6; unichar = c & 0x01; } else if (c == 0xFE) { /* illegal UTF-8 code */ utfcount = 1; unichar = 0; } else if (c == 0xFF) { /* illegal UTF-8 code */ utfcount = 1; unichar = 0; } else { /* illegal UTF-8 sequence character */ utfcount = 1; unichar = 0; } * length = utfcount; utfcount --; textpoi ++; while (utfcount > 0 && (* textpoi & 0xC0) == 0x80) { unichar = (unichar << 6) | (* textpoi & 0x3F); utfcount --; textpoi ++; } if (utfcount > 0) { /* too short UTF-8 sequence */ unichar = 0; * length -= utfcount; } * ucs = unichar; } void utftouni (wchar_t * wpoi, char * s) { unsigned long c; int l; int i = 0; while (* s) { utf8_info (s, & l, & wpoi [i++]); s += l; } } int main (int argc, char **argv) { char in[] = "The quick brown fox jumps over the lazy dog"; int line, i; mbstate_t mbs; size_t mbclen; size_t size = sizeof (in); wchar_t wc; int lines = argc > 1 ? atoi (argv[1]) : 1000; int do_malloc = 1; int do_mbrtowc = 1; int do_utftouni = 1; if (argc > 2) do_malloc = atoi (argv[2]); if (argc > 3) do_mbrtowc = atoi (argv[3]); if (argc > 4) do_utftouni = atoi (argv[4]); printf ("with malloc: %d, with mbrtowc: %d, with utftouni: %d\n", do_malloc, do_mbrtowc, do_utftouni); memset (&mbs, 0, sizeof mbs); for (line = 0; line < lines; ++line) { char *x; if (do_malloc) x = malloc (size); if (do_mbrtowc) for (i = 0; i < size; i += mbclen) if ((int)(mbclen = mbrtowc(&wc, in + i, size - i, &mbs)) <= 0) break; if (do_utftouni) utftouni (&wc, in); if (do_malloc) free (x); } return 0; } --------------080905000306010604040903 Content-Type: text/plain; charset=us-ascii -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple --------------080905000306010604040903--