X-Recipient: archive-cygwin AT delorie DOT com X-Spam-Check-By: sourceware.org Date: Sun, 8 Nov 2009 14:56:02 +0100 From: Corinna Vinschen To: cygwin AT cygwin DOT com, bug-grep AT gnu DOT org Subject: Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file Message-ID: <20091108135602.GA26344@calimero.vinschen.de> Mail-Followup-To: cygwin AT cygwin DOT com, bug-grep AT gnu DOT org References: <26224019 DOT post AT talk DOT nabble DOT com> <4AF393C6 DOT 3000505 AT tlinx DOT org> <20091106033243 DOT GB30410 AT ednor DOT casa DOT cgf DOT cx> <4AF42027 DOT 80604 AT towo DOT net> <20091106135152 DOT GK26344 AT calimero DOT vinschen DOT de> <814i7o$49eric AT dmzms99802 DOT na DOT baesystems DOT com> <806a89db0911061422l290ff84u3d58cbbe1d3eface AT mail DOT gmail DOT com> <1257632832 DOT 5773 DOT 48 DOT camel AT fast> <26249599 DOT post AT talk DOT nabble DOT com> <20091108103038 DOT GY26344 AT calimero DOT vinschen DOT de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20091108103038.GY26344@calimero.vinschen.de> User-Agent: Mutt/1.5.20 (2009-06-14) Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: cygwin-owner AT cygwin DOT com Mail-Followup-To: cygwin AT cygwin DOT com Delivered-To: mailing list cygwin AT cygwin DOT com On Nov 8 11:30, Corinna Vinschen wrote: > On Nov 7 15:26, aputerguy wrote: > > > > Changing LC_ALL also solved the problem for me. > > But it begs the question of how many other basic and take-for-granted > > functions might be affected by this apparent UTF-8 slowdown. And again we, > > are not talking about some minor overhead, we are talking about a slowdown > > of 1500X or 150,000%!!!! > > Yeah, that's really still strange to me. In my testing, the multibyte > to widechar conversion performed by grep in case of UTF-8 took only > 1.5 up to 4 seconds for 10 times the number of input lines as in your > case. It still puzzles me where the time is wasted in grep. Got it. The problem is this. Grep reads the file in chunks > pagesize. Pagesize is 64K on Cygwin. For each buffer read into memory, the grepbuf() function calls the execute() function as long as it returns a match. The execute() function calls check_multibyte_string() for the entire buffer(!), then calls kwsexec() to find a match. If a match has been found, it free's the memory allocated by check_multibyte_string() and returns to grepbuf. Then grepbuf() will call execute again with the pointers into the buffer moved to the next line. Let's make an example. Assume the buffer is 100K, which is not unusual when running under Cygwin. Assume further that the file consists of 100,000 lines with the text "The quick brown fox jumped over the lazy dog". Each line is 45 bytes, so the buffer contains somwhat more than 2200 lines. Now let's search for the expression "dog". The first call to execute will call check_multibyte_string() for the entire buffer of 100000 bytes. Then it finds a match in the first line, free's the check_multibyte_string() memory and returns to grepbuf. grepbuf calls execute with the start pointer moved to the second line in the buffer, so execute() calls check_multibyte_string() for the remainder of the buffer, which is 99955 bytes. It find a match in the first line, free's the check_multibyte_string buffer, returns to grepbuf, which calls execute, which calls check_multibyte_string() with a buffer of 99910 bytes, and so on... Every invocation of check_multibyte_string() calls mbrtowc() in a loop for the entire buffer given as argument. In our example, that means that mbrtowc() is called (hold your breath) 111,161,115 times for each 100K of input file! No wonder grep takes 3 or 4 minutes to grep this very example on Cygwin. I really think there's some room for optimization left in this algorithm. Btw., the check for mmap in grep's configure file is broken. It tries to mmap to a fixed address formerly allocated via malloc(). This doesn't work on Windows. An autoconf run with a newer version of autoconf would be nice. Corinna -- Corinna Vinschen Please, send mails regarding Cygwin to Cygwin Project Co-Leader cygwin AT cygwin DOT com Red Hat -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple