delorie.com/archives/browse.cgi   search  
Mail Archives: cygwin/2009/11/06/10:00:28

X-Recipient: archive-cygwin AT delorie DOT com
X-SWARE-Spam-Status: No, hits=-2.3 required=5.0 tests=AWL,BAYES_00
X-Spam-Check-By: sourceware.org
Message-ID: <4AF439F0.8060203@towo.net>
Date: Fri, 06 Nov 2009 16:00:00 +0100
From: Thomas Wolff <towo AT towo DOT net>
User-Agent: Thunderbird 2.0.0.23 (Windows/20090812)
MIME-Version: 1.0
To: cygwin AT cygwin DOT com, bug-grep AT gnu DOT org
Subject: Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file
References: <26224019 DOT post AT talk DOT nabble DOT com> <4AF393C6 DOT 3000505 AT tlinx DOT org> <20091106033243 DOT GB30410 AT ednor DOT casa DOT cgf DOT cx> <4AF42027 DOT 80604 AT towo DOT net> <20091106135152 DOT GK26344 AT calimero DOT vinschen DOT de> <4AF42B15 DOT 9050100 AT byu DOT net> <20091106142644 DOT GL26344 AT calimero DOT vinschen DOT de>
In-Reply-To: <20091106142644.GL26344@calimero.vinschen.de>
X-IsSubscribed: yes
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

--------------080905000306010604040903
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Corinna Vinschen wrote:
> On Nov  6 06:56, Eric Blake wrote:
>   
>> According to Corinna Vinschen on 11/6/2009 6:51 AM:
>>     
>>>> The problem *is* with grep (and sed), however, because there is no
>>>> good reason that UTF-8 should give us a penalty of being 100times
>>>> slower on most search operations, this is just poor programming of
>>>> grep and sed.
>>>>         
>>> The penalty on Linux is much smaller, about 15-20%.  
On my Linux system, the penalty is a factor between 6 and 8.
>>> It looks like
>>> grep is calling malloc for every input line if MB_CUR_MAX is > 1.
>>> Then it evaluates for each byte in the line whether the byte is a
>>> single byte or the start of a multibyte sequence using mbrtowc on
>>> every charatcer on the input line.  Then, for each potential match,
>>> it checks if it's the start byte of a multibyte sequence and ignores
>>> all other matches.  Eventually, it calls free, and the game starts
>>> over for the next line.
>>>       
>> Adding bug-grep, since this slowdown caused by additional mallocs is
>> definitely the sign of a poor algorithm that could be improved by reusing
>> existing buffers.
>>     
>
> I created a simple testcase:
>
> ==== SNIP ===
> ...
> ==== SNAP ====
>   
I extended your test program to demonstrate the inefficiency of the 
standard mbrtowc function. Instead I use a function from my editor 
(mined) to extract a Unicode character from a UTF-8 sequence. This is 
the simple case only, not converting character sets other than UTF-8 but 
that's the same thing mbrtowc does in the sample invocation. Program 
attached. Results below.
> Under Cygwin (tcsh time output):
>
>   $ setenv LANG en_US.UTF-8
>   $ time ./mb 1000000 1 0
>   with malloc: 1, with mbrtowc: 0
>   0.328u 0.031s 0:00.34 102.9%    0+0k 0+0io 1834pf+0w
>   $ time ./mb 1000000 0 1
>   with malloc: 0, with mbrtowc: 1
>   1.921u 0.092s 0:02.09 96.1%     0+0k 0+0io 1827pf+0w
>   $ time ./mb 1000000 1 1
>   with malloc: 1, with mbrtowc: 1
>   2.062u 0.140s 0:02.15 102.3%    0+0k 0+0io 1839pf+0w
>
> Running on the same CPU under Linux:
>
>   $ setenv LANG en_US.UTF-8
>   $ time ./mb 1000000 1 0
>   with malloc: 1, with mbrtowc: 0
>   0.088u 0.004s 0:00.09 88.8%     0+0k 0+0io 0pf+0w
>   $ time ./mb 1000000 0 1
>   with malloc: 0, with mbrtowc: 1
>   1.836u 0.000s 0:01.85 98.9%     0+0k 0+0io 0pf+0w
>   $ time ./mb 1000000 1 1
>   with malloc: 1, with mbrtowc: 1
>   1.888u 0.000s 0:01.93 97.4%     0+0k 0+0io 0pf+0w
>
> So, while Linux is definitely faster, the number are still comparable
> for 1 million iterations.  That still doens't explain why grep is a
> multitude slower when using UTF-8 as charset.
>   
Results of mbrtowc vs. utftouni on Linux:

thw[en_US.UTF-8]@scotty:~/tmp: locale charmap
UTF-8
thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 1 0
with malloc: 0, with mbrtowc: 1, with utftouni: 0

real    0m2.897s
user    0m2.836s
sys     0m0.012s
thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 0 1
with malloc: 0, with mbrtowc: 0, with utftouni: 1

real    0m0.030s
user    0m0.028s
sys     0m0.000s
thw[en_US.UTF-8]@scotty:~/tmp:


Results of mbrtowc vs. utftouni on cygwin:

demsn702[C.UTF-8]@stbm8186:/H/tools: time ./uu.exe 1000000 0 1 0
with malloc: 0, with mbrtowc: 1, with utftouni: 0

real    0m3.034s
user    0m2.974s
sys     0m0.030s
demsn702[C.UTF-8]@stbm8186:/H/tools: time ./uu.exe 1000000 0 0 1
with malloc: 0, with mbrtowc: 0, with utftouni: 1

real    0m0.110s
user    0m0.070s
sys     0m0.030s
demsn702[C.UTF-8]@stbm8186:/H/tools:


The conclusion is, as long as calling mbrtowc is as inefficient, a 
program caring about performance should not use it.

Thomas

--------------080905000306010604040903
Content-Type: text/plain;
 name="uu.c"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="uu.c"

#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <string.h>

void utf8_info (u, length, ucs)
  char * u;
  int * length;
  unsigned long * ucs;
{
  char * textpoi = u;
  unsigned char c = * textpoi;
  int utfcount;
  unsigned long unichar;

	if ((c & 0x80) == 0x00) {
		utfcount = 1;
		unichar = c;
	} else if ((c & 0xE0) == 0xC0) {
		utfcount = 2;
		unichar = c & 0x1F;
	} else if ((c & 0xF0) == 0xE0) {
		utfcount = 3;
		unichar = c & 0x0F;
	} else if ((c & 0xF8) == 0xF0) {
		utfcount = 4;
		unichar = c & 0x07;
	} else if ((c & 0xFC) == 0xF8) {
		utfcount = 5;
		unichar = c & 0x03;
	} else if ((c & 0xFE) == 0xFC) {
		utfcount = 6;
		unichar = c & 0x01;
	} else if (c == 0xFE) {
		/* illegal UTF-8 code */
		utfcount = 1;
		unichar = 0;
	} else if (c == 0xFF) {
		/* illegal UTF-8 code */
		utfcount = 1;
		unichar = 0;
	} else {
		/* illegal UTF-8 sequence character */
		utfcount = 1;
		unichar = 0;
	}

	* length = utfcount;

	utfcount --;
	textpoi ++;
	while (utfcount > 0 && (* textpoi & 0xC0) == 0x80) {
		unichar = (unichar << 6) | (* textpoi & 0x3F);
		utfcount --;
		textpoi ++;
	}
	if (utfcount > 0) {
		/* too short UTF-8 sequence */
		unichar = 0;
		* length -= utfcount;
	}

	* ucs = unichar;
}

void utftouni (wchar_t * wpoi, char * s)
{
  unsigned long c;
  int l;
  int i = 0;

  while (* s) {
	utf8_info (s, & l, & wpoi [i++]);
	s += l;
  }
}

int main (int argc, char **argv)
{
  char in[] = "The quick brown fox jumps over the lazy dog";
  int line, i;
  mbstate_t mbs;
  size_t mbclen;
  size_t size = sizeof (in);
  wchar_t wc;
  int lines = argc > 1 ? atoi (argv[1]) : 1000;
  int do_malloc = 1;
  int do_mbrtowc = 1;
  int do_utftouni = 1;

  if (argc > 2)
    do_malloc = atoi (argv[2]);
  if (argc > 3)
    do_mbrtowc = atoi (argv[3]);
  if (argc > 4)
    do_utftouni = atoi (argv[4]);

  printf ("with malloc: %d, with mbrtowc: %d, with utftouni: %d\n", do_malloc, do_mbrtowc, do_utftouni);

  memset (&mbs, 0, sizeof mbs);
  for (line = 0; line < lines; ++line)
    {
      char *x;
      if (do_malloc) x = malloc (size);
      if (do_mbrtowc)
        for (i = 0; i < size; i += mbclen)
          if ((int)(mbclen = mbrtowc(&wc, in + i, size - i, &mbs)) <= 0)
            break;
      if (do_utftouni)
        utftouni (&wc, in);
      if (do_malloc) free (x);
    }
  return 0;
}


--------------080905000306010604040903
Content-Type: text/plain; charset=us-ascii

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
--------------080905000306010604040903--

- Raw text -


  webmaster     delorie software   privacy  
  Copyright © 2019   by DJ Delorie     Updated Jul 2019