delorie.com/archives/browse.cgi   search  
Mail Archives: cygwin/2009/11/08/08:56:23

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Sun, 8 Nov 2009 14:56:02 +0100
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com, bug-grep AT gnu DOT org
Subject: Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file
Message-ID: <20091108135602.GA26344@calimero.vinschen.de>
Mail-Followup-To: cygwin AT cygwin DOT com, bug-grep AT gnu DOT org
References: <26224019 DOT post AT talk DOT nabble DOT com> <4AF393C6 DOT 3000505 AT tlinx DOT org> <20091106033243 DOT GB30410 AT ednor DOT casa DOT cgf DOT cx> <4AF42027 DOT 80604 AT towo DOT net> <20091106135152 DOT GK26344 AT calimero DOT vinschen DOT de> <814i7o$49eric AT dmzms99802 DOT na DOT baesystems DOT com> <806a89db0911061422l290ff84u3d58cbbe1d3eface AT mail DOT gmail DOT com> <1257632832 DOT 5773 DOT 48 DOT camel AT fast> <26249599 DOT post AT talk DOT nabble DOT com> <20091108103038 DOT GY26344 AT calimero DOT vinschen DOT de>
MIME-Version: 1.0
In-Reply-To: <20091108103038.GY26344@calimero.vinschen.de>
User-Agent: Mutt/1.5.20 (2009-06-14)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On Nov  8 11:30, Corinna Vinschen wrote:
> On Nov  7 15:26, aputerguy wrote:
> > 
> > Changing LC_ALL also solved the problem for me.
> > But it begs the question of how many other basic and take-for-granted
> > functions might be affected by this apparent UTF-8 slowdown. And again we,
> > are not talking about some minor overhead, we are talking about a slowdown
> > of 1500X or 150,000%!!!!
> 
> Yeah, that's really still strange to me.  In my testing, the multibyte
> to widechar conversion performed by grep in case of UTF-8 took only
> 1.5 up to 4 seconds for 10 times the number of input lines as in your
> case.  It still puzzles me where the time is wasted in grep.

Got it.  The problem is this.

Grep reads the file in chunks > pagesize.  Pagesize is 64K on Cygwin.
For each buffer read into memory, the grepbuf() function calls the
execute() function as long as it returns a match.

The execute() function calls check_multibyte_string() for the entire
buffer(!), then calls kwsexec() to find a match.  If a match has been
found, it free's the memory allocated by check_multibyte_string()
and returns to grepbuf.  Then grepbuf() will call execute again with
the pointers into the buffer moved to the next line.

Let's make an example.  Assume the buffer is 100K, which is not unusual
when running under Cygwin.  Assume further that the file consists of
100,000 lines with the text "The quick brown fox jumped over the lazy dog".
Each line is 45 bytes, so the buffer contains somwhat more than 2200
lines.  Now let's search for the expression "dog".

The first call to execute will call check_multibyte_string() for the
entire buffer of 100000 bytes.  Then it finds a match in the first line,
free's the check_multibyte_string() memory and returns to grepbuf.
grepbuf calls execute with the start pointer moved to the second line in
the buffer, so execute() calls check_multibyte_string() for the remainder
of the buffer, which is 99955 bytes.  It find a match in the first line,
free's the check_multibyte_string buffer, returns to grepbuf, which calls
execute, which calls check_multibyte_string() with a buffer of 99910
bytes, and so on...

Every invocation of check_multibyte_string() calls mbrtowc() in a loop
for the entire buffer given as argument.  In our example, that means
that mbrtowc() is called (hold your breath)

  111,161,115

times for each 100K of input file!  No wonder grep takes 3 or 4 minutes
to grep this very example on Cygwin.

I really think there's some room for optimization left in this algorithm.

Btw., the check for mmap in grep's configure file is broken.  It tries
to mmap to a fixed address formerly allocated via malloc().  This doesn't
work on Windows.  An autoconf run with a newer version of autoconf would
be nice.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

- Raw text -


  webmaster     delorie software   privacy  
  Copyright © 2019   by DJ Delorie     Updated Jul 2019