Mail Archives: cygwin/2009/11/06/10:25:15

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Fri, 6 Nov 2009 16:24:54 +0100
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file
Message-ID: <20091106152454.GN26344@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
References: <26224019 DOT post AT talk DOT nabble DOT com> <4AF393C6 DOT 3000505 AT tlinx DOT org> <20091106033243 DOT GB30410 AT ednor DOT casa DOT cgf DOT cx> <4AF42027 DOT 80604 AT towo DOT net> <20091106135152 DOT GK26344 AT calimero DOT vinschen DOT de> <4AF42B15 DOT 9050100 AT byu DOT net> <20091106142644 DOT GL26344 AT calimero DOT vinschen DOT de> <4AF439F0 DOT 8060203 AT towo DOT net>
MIME-Version: 1.0
In-Reply-To: <4AF439F0.8060203@towo.net>
User-Agent: Mutt/1.5.20 (2009-06-14)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On Nov  6 16:00, Thomas Wolff wrote:
> Corinna Vinschen wrote:
> >I created a simple testcase:
> >
> >==== SNIP ===
> >...
> >==== SNAP ====
> I extended your test program to demonstrate the inefficiency of the
> standard mbrtowc function. [...]
> >Under Cygwin (tcsh time output):
> >
> >  $ setenv LANG en_US.UTF-8
> >  $ time ./mb 1000000 1 0
> >  with malloc: 1, with mbrtowc: 0
> >  0.328u 0.031s 0:00.34 102.9%    0+0k 0+0io 1834pf+0w
> >  $ time ./mb 1000000 0 1
> >  with malloc: 0, with mbrtowc: 1
> >  1.921u 0.092s 0:02.09 96.1%     0+0k 0+0io 1827pf+0w
> >  $ time ./mb 1000000 1 1
> >  with malloc: 1, with mbrtowc: 1
> >  2.062u 0.140s 0:02.15 102.3%    0+0k 0+0io 1839pf+0w
> >
> >Running on the same CPU under Linux:
> >
> >  $ setenv LANG en_US.UTF-8
> >  $ time ./mb 1000000 1 0
> >  with malloc: 1, with mbrtowc: 0
> >  0.088u 0.004s 0:00.09 88.8%     0+0k 0+0io 0pf+0w
> >  $ time ./mb 1000000 0 1
> >  with malloc: 0, with mbrtowc: 1
> >  1.836u 0.000s 0:01.85 98.9%     0+0k 0+0io 0pf+0w
> >  $ time ./mb 1000000 1 1
> >  with malloc: 1, with mbrtowc: 1
> >  1.888u 0.000s 0:01.93 97.4%     0+0k 0+0io 0pf+0w
> >
> >So, while Linux is definitely faster, the numbers are still comparable
> >for 1 million iterations.  That still doesn't explain why grep is
> >many times slower when using UTF-8 as charset.
> Results of mbrtowc vs. utftouni on Linux:
> 
> thw[en_US.UTF-8]@scotty:~/tmp: locale charmap
> UTF-8
> thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 1 0
> with malloc: 0, with mbrtowc: 1, with utftouni: 0
> 
> real    0m2.897s
> user    0m2.836s
> sys     0m0.012s
> thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 0 1
> with malloc: 0, with mbrtowc: 0, with utftouni: 1
> 
> real    0m0.030s
> user    0m0.028s
> sys     0m0.000s
> thw[en_US.UTF-8]@scotty:~/tmp:
> [...]
> The conclusion is that, as long as calling mbrtowc is this inefficient,
> a program that cares about performance should not use it.

That's sort of an unfair test.  Your utftouni function doesn't take care
of mbstate, error, and surrogate pair handling.
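
To make the difference concrete, here is a minimal sketch of the checks a
conforming decoder has to perform.  It is neither newlib's code nor the
utftouni function from the test program; the names utf8_decode and
to_utf16 are hypothetical, and the explicit -2 return value only stands
in for the partial-input case that mbrtowc handles through its mbstate_t
argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not newlib code: decode one UTF-8 sequence from
   s[0..n-1] into *cp.  Returns the number of bytes consumed (1..4),
   -1 for an invalid sequence, or -2 if the input ends in the middle of
   a sequence (mbrtowc would save the partial input in its mbstate_t). */
static int
utf8_decode (const unsigned char *s, size_t n, uint32_t *cp)
{
  /* Smallest code point that legitimately needs a sequence of each length. */
  static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
  uint32_t c;
  int len, i;

  assert (s != NULL && cp != NULL);
  if (n == 0)
    return -2;
  if (s[0] < 0x80)
    {
      *cp = s[0];               /* plain ASCII */
      return 1;
    }
  if ((s[0] & 0xe0) == 0xc0)
    { len = 2; c = s[0] & 0x1f; }
  else if ((s[0] & 0xf0) == 0xe0)
    { len = 3; c = s[0] & 0x0f; }
  else if ((s[0] & 0xf8) == 0xf0)
    { len = 4; c = s[0] & 0x07; }
  else
    return -1;                  /* stray continuation byte or 0xf8..0xff */
  for (i = 1; i < len; i++)
    {
      if ((size_t) i >= n)
        return -2;              /* input ends in the middle of a sequence */
      if ((s[i] & 0xc0) != 0x80)
        return -1;              /* not a continuation byte */
      c = (c << 6) | (s[i] & 0x3f);
    }
  if (c < min_cp[len])
    return -1;                  /* overlong encoding */
  if (c > 0x10ffff || (c >= 0xd800 && c <= 0xdfff))
    return -1;                  /* beyond Unicode, or an encoded surrogate */
  *cp = c;
  return len;
}

/* With a 16-bit wchar_t (as on Cygwin), code points above 0xffff have to
   be returned as a UTF-16 surrogate pair.  Returns the units written. */
static int
to_utf16 (uint32_t cp, uint16_t out[2])
{
  if (cp < 0x10000)
    {
      out[0] = (uint16_t) cp;
      return 1;
    }
  cp -= 0x10000;
  out[0] = (uint16_t) (0xd800 | (cp >> 10));    /* high surrogate */
  out[1] = (uint16_t) (0xdc00 | (cp & 0x3ff));  /* low surrogate */
  return 2;
}
```

Every one of these branches is work that a decoder assuming well-formed,
complete input can skip, which is part of why the two timings are not
directly comparable.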

Having said that, I just experimented further with mbrtowc, and I was
able to speed up mbrtowc and wcrtomb calls on Cygwin by almost 50 per
cent, just by reducing the function call depth in newlib.  That depth
is a result of the reentrancy and isolation efforts.

Talking about your implementation, if you could come up with a faster
implementation of newlib's __utf8_wctomb/__utf8_mbtowc, it would
certainly be another welcome performance boost.
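
For the wctomb direction, a replacement would have roughly the following
shape.  This is only a sketch under assumptions: utf8_encode is a
hypothetical name, it does not use newlib's actual __utf8_wctomb
signature, it takes a full code point rather than a wchar_t, and no
claim is made about how its speed compares with the existing code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not newlib's __utf8_wctomb: encode code point cp
   as UTF-8 into out (room for 4 bytes).  Returns the number of bytes
   written, or -1 if cp is a surrogate or lies beyond U+10FFFF. */
static int
utf8_encode (uint32_t cp, unsigned char out[4])
{
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xc0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3f));
      return 2;
    }
  if (cp < 0x10000)
    {
      if (cp >= 0xd800 && cp <= 0xdfff)
        return -1;              /* lone surrogates are not encodable */
      out[0] = (unsigned char) (0xe0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
      out[2] = (unsigned char) (0x80 | (cp & 0x3f));
      return 3;
    }
  if (cp > 0x10ffff)
    return -1;                  /* outside the Unicode range */
  out[0] = (unsigned char) (0xf0 | (cp >> 18));
  out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3f));
  out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
  out[3] = (unsigned char) (0x80 | (cp & 0x3f));
  return 4;
}
```

A version suitable for newlib would additionally have to combine a
surrogate pair arriving in two separate calls, since wchar_t is 16 bits
wide on Cygwin.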


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

