Mail Archives: cygwin/2009/11/06/09:27:04
On Nov 6 06:56, Eric Blake wrote:
> According to Corinna Vinschen on 11/6/2009 6:51 AM:
> >> The problem *is* with grep (and sed), however, because there is no
> >> good reason that UTF-8 should give us a penalty of being 100times
> >> slower on most search operations, this is just poor programming of
> >> grep and sed.
> >
> > The penalty on Linux is much smaller, about 15-20%. It looks like
> > grep is calling malloc for every input line if MB_CUR_MAX is > 1.
> > Then it evaluates for each byte in the line whether the byte is a
> > single byte or the start of a multibyte sequence using mbrtowc on
> > every charatcer on the input line. Then, for each potential match,
> > it checks if it's the start byte of a multibyte sequence and ignores
> > all other matches. Eventually, it calls free, and the game starts
> > over for the next line.
>
> Adding bug-grep, since this slowdown caused by additional mallocs is
> definitely the sign of a poor algorithm that could be improved by reusing
> existing buffers.
I created a simple testcase:
==== SNIP ===
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <string.h>
int main (int argc, char **argv)
{
const char in[] = "The quick brown fox jumps over the lazy dog";
int line, i;
mbstate_t mbs;
size_t mbclen;
size_t size = sizeof (in);
wchar_t wc;
int lines = argc > 1 ? atoi (argv[1]) : 1000;
int do_malloc = 1;
int do_mbrtowc = 1;
if (argc > 2)
do_malloc = atoi (argv[2]);
if (argc > 3)
do_mbrtowc = atoi (argv[3]);
printf ("with malloc: %d, with mbrtowc: %d\n", do_malloc, do_mbrtowc);
memset (&mbs, 0, sizeof mbs);
for (line = 0; line < lines; ++line)
{
char *x;
if (do_malloc) x = malloc (size);
if (do_mbrtowc)
for (i = 0; i < size; i += mbclen)
if ((int)(mbclen = mbrtowc(&wc, in + i, size - i, &mbs)) <= 0)
break;
if (do_malloc) free (x);
}
return 0;
}
==== SNAP ====
Under Cygwin (tcsh time output):
$ setenv LANG en_US.UTF-8
$ time ./mb 1000000 1 0
with malloc: 1, with mbrtowc: 0
0.328u 0.031s 0:00.34 102.9% 0+0k 0+0io 1834pf+0w
$ time ./mb 1000000 0 1
with malloc: 0, with mbrtowc: 1
1.921u 0.092s 0:02.09 96.1% 0+0k 0+0io 1827pf+0w
$ time ./mb 1000000 1 1
with malloc: 1, with mbrtowc: 1
2.062u 0.140s 0:02.15 102.3% 0+0k 0+0io 1839pf+0w
Running on the same CPU under Linux:
$ setenv LANG en_US.UTF-8
$ time ./mb 1000000 1 0
with malloc: 1, with mbrtowc: 0
0.088u 0.004s 0:00.09 88.8% 0+0k 0+0io 0pf+0w
$ time ./mb 1000000 0 1
with malloc: 0, with mbrtowc: 1
1.836u 0.000s 0:01.85 98.9% 0+0k 0+0io 0pf+0w
$ time ./mb 1000000 1 1
with malloc: 1, with mbrtowc: 1
1.888u 0.000s 0:01.93 97.4% 0+0k 0+0io 0pf+0w
So, while Linux is definitely faster, the number are still comparable
for 1 million iterations. That still doens't explain why grep is a
multitude slower when using UTF-8 as charset.
Puzzled,
Corinna
--
Corinna Vinschen Please, send mails regarding Cygwin to
Cygwin Project Co-Leader cygwin AT cygwin DOT com
Red Hat
--
Problem reports: http://cygwin.com/problems.html
FAQ: http://cygwin.com/faq/
Documentation: http://cygwin.com/docs.html
Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple
- Raw text -