delorie.com/archives/browse.cgi   search  
Mail Archives: cygwin/2009/11/06/09:27:04

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Fri, 6 Nov 2009 15:26:44 +0100
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Cc: bug-grep AT gnu DOT org
Subject: Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file
Message-ID: <20091106142644.GL26344@calimero.vinschen.de>
Mail-Followup-To: cygwin AT cygwin DOT com, bug-grep AT gnu DOT org
References: <26224019 DOT post AT talk DOT nabble DOT com> <4AF393C6 DOT 3000505 AT tlinx DOT org> <20091106033243 DOT GB30410 AT ednor DOT casa DOT cgf DOT cx> <4AF42027 DOT 80604 AT towo DOT net> <20091106135152 DOT GK26344 AT calimero DOT vinschen DOT de> <4AF42B15 DOT 9050100 AT byu DOT net>
MIME-Version: 1.0
In-Reply-To: <4AF42B15.9050100@byu.net>
User-Agent: Mutt/1.5.20 (2009-06-14)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On Nov  6 06:56, Eric Blake wrote:
> According to Corinna Vinschen on 11/6/2009 6:51 AM:
> >> The problem *is* with grep (and sed), however, because there is no
> >> good reason that UTF-8 should give us a penalty of being 100times
> >> slower on most search operations, this is just poor programming of
> >> grep and sed.
> > 
> > The penalty on Linux is much smaller, about 15-20%.  It looks like
> > grep is calling malloc for every input line if MB_CUR_MAX is > 1.
> > Then it evaluates for each byte in the line whether the byte is a
> > single byte or the start of a multibyte sequence using mbrtowc on
> > every charatcer on the input line.  Then, for each potential match,
> > it checks if it's the start byte of a multibyte sequence and ignores
> > all other matches.  Eventually, it calls free, and the game starts
> > over for the next line.
> 
> Adding bug-grep, since this slowdown caused by additional mallocs is
> definitely the sign of a poor algorithm that could be improved by reusing
> existing buffers.

I created a simple testcase:

==== SNIP ===
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <string.h>

int main (int argc, char **argv)
{
  const char in[] = "The quick brown fox jumps over the lazy dog";
  int line, i;
  mbstate_t mbs;
  size_t mbclen;
  size_t size = sizeof (in);
  wchar_t wc;
  int lines = argc > 1 ? atoi (argv[1]) : 1000;
  int do_malloc = 1;
  int do_mbrtowc = 1;

  if (argc > 2)
    do_malloc = atoi (argv[2]);
  if (argc > 3)
    do_mbrtowc = atoi (argv[3]);

  printf ("with malloc: %d, with mbrtowc: %d\n", do_malloc, do_mbrtowc);

  memset (&mbs, 0, sizeof mbs);
  for (line = 0; line < lines; ++line)
    {
      char *x;
      if (do_malloc) x = malloc (size);
      if (do_mbrtowc)
        for (i = 0; i < size; i += mbclen)
          if ((int)(mbclen = mbrtowc(&wc, in + i, size - i, &mbs)) <= 0)
            break;
      if (do_malloc) free (x);
    }
  return 0;
}
==== SNAP ====

Under Cygwin (tcsh time output):

  $ setenv LANG en_US.UTF-8
  $ time ./mb 1000000 1 0
  with malloc: 1, with mbrtowc: 0
  0.328u 0.031s 0:00.34 102.9%    0+0k 0+0io 1834pf+0w
  $ time ./mb 1000000 0 1
  with malloc: 0, with mbrtowc: 1
  1.921u 0.092s 0:02.09 96.1%     0+0k 0+0io 1827pf+0w
  $ time ./mb 1000000 1 1
  with malloc: 1, with mbrtowc: 1
  2.062u 0.140s 0:02.15 102.3%    0+0k 0+0io 1839pf+0w

Running on the same CPU under Linux:

  $ setenv LANG en_US.UTF-8
  $ time ./mb 1000000 1 0
  with malloc: 1, with mbrtowc: 0
  0.088u 0.004s 0:00.09 88.8%     0+0k 0+0io 0pf+0w
  $ time ./mb 1000000 0 1
  with malloc: 0, with mbrtowc: 1
  1.836u 0.000s 0:01.85 98.9%     0+0k 0+0io 0pf+0w
  $ time ./mb 1000000 1 1
  with malloc: 1, with mbrtowc: 1
  1.888u 0.000s 0:01.93 97.4%     0+0k 0+0io 0pf+0w

So, while Linux is definitely faster, the number are still comparable
for 1 million iterations.  That still doens't explain why grep is a
multitude slower when using UTF-8 as charset.


Puzzled,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

- Raw text -


  webmaster     delorie software   privacy  
  Copyright © 2019   by DJ Delorie     Updated Jul 2019