Date: Fri, 6 Nov 2009 16:24:54 +0100
From: Corinna Vinschen
To: cygwin AT cygwin DOT com
Subject: Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file
Message-ID: <20091106152454.GN26344@calimero.vinschen.de>
In-Reply-To: <4AF439F0.8060203@towo.net>

On Nov  6 16:00, Thomas Wolff wrote:
> Corinna Vinschen wrote:
> >I created a simple testcase:
> >
> >==== SNIP ===
> >...
> >==== SNAP ====
> I extended your test program to demonstrate the inefficiency of the
> standard mbrtowc function. [...]
> >Under Cygwin (tcsh time output):
> >
> >  $ setenv LANG en_US.UTF-8
> >  $ time ./mb 1000000 1 0
> >  with malloc: 1, with mbrtowc: 0
> >  0.328u 0.031s 0:00.34 102.9% 0+0k 0+0io 1834pf+0w
> >  $ time ./mb 1000000 0 1
> >  with malloc: 0, with mbrtowc: 1
> >  1.921u 0.092s 0:02.09 96.1% 0+0k 0+0io 1827pf+0w
> >  $ time ./mb 1000000 1 1
> >  with malloc: 1, with mbrtowc: 1
> >  2.062u 0.140s 0:02.15 102.3% 0+0k 0+0io 1839pf+0w
> >
> >Running on the same CPU under Linux:
> >
> >  $ setenv LANG en_US.UTF-8
> >  $ time ./mb 1000000 1 0
> >  with malloc: 1, with mbrtowc: 0
> >  0.088u 0.004s 0:00.09 88.8% 0+0k 0+0io 0pf+0w
> >  $ time ./mb 1000000 0 1
> >  with malloc: 0, with mbrtowc: 1
> >  1.836u 0.000s 0:01.85 98.9% 0+0k 0+0io 0pf+0w
> >  $ time ./mb 1000000 1 1
> >  with malloc: 1, with mbrtowc: 1
> >  1.888u 0.000s 0:01.93 97.4% 0+0k 0+0io 0pf+0w
> >
> >So, while Linux is definitely faster, the numbers are still comparable
> >for 1 million iterations.  That still doesn't explain why grep is
> >many times slower when using UTF-8 as charset.
> Results of mbrtowc vs. utftouni on Linux:
>
> thw[en_US.UTF-8]@scotty:~/tmp: locale charmap
> UTF-8
> thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 1 0
> with malloc: 0, with mbrtowc: 1, with utftouni: 0
>
> real 0m2.897s
> user 0m2.836s
> sys 0m0.012s
> thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 0 1
> with malloc: 0, with mbrtowc: 0, with utftouni: 1
>
> real 0m0.030s
> user 0m0.028s
> sys 0m0.000s
> thw[en_US.UTF-8]@scotty:~/tmp:
> [...]
> The conclusion is, as long as calling mbrtowc is this inefficient, a
> program caring about performance should not use it.

That's sort of an unfair test.  Your utftouni function doesn't handle
mbstate, errors, or surrogate pairs.
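To make the point about error handling concrete, here is a minimal
sketch in C of what a stateless UTF-8 decoder has to check beyond the
plain bit shuffling.  It is neither the utftouni function from the
quoted test nor newlib's __utf8_mbtowc; the name utf8_decode_sketch and
its return codes are made up for illustration.  It returns a 32-bit
code point, so it sidesteps the mbstate and surrogate pair handling
that mbrtowc needs where wchar_t is only 16 bits wide, as on Cygwin.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only.
   Decodes one UTF-8 sequence from s (n bytes available) into *cp.
   Returns the number of bytes consumed, 0 if n == 0,
   -1 for an invalid sequence, -2 for a truncated but so far valid one. */
static int utf8_decode_sketch(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs 'len' bytes. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t c;
    int len, i;

    if (n == 0)
        return 0;

    if (s[0] < 0x80) {                  /* ASCII fast path */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; c = s[0] & 0x07;
    } else {
        return -1;                      /* stray continuation or 0xF8..0xFF */
    }

    for (i = 1; i < len; i++) {
        if ((size_t)i >= n)
            return -2;                  /* input ends mid-sequence */
        if ((s[i] & 0xC0) != 0x80)
            return -1;                  /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }

    if (c < min_cp[len])
        return -1;                      /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF)
        return -1;                      /* UTF-16 surrogate, not valid in UTF-8 */
    if (c > 0x10FFFF)
        return -1;                      /* beyond the Unicode range */

    *cp = c;
    return len;
}
```

A fairer comparison would time a decoder that performs at least these
checks against mbrtowc, rather than one that assumes well-formed input.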
Having said that, I just experimented further with mbrtowc, and I was
able to speed up mbrtowc and wcrtomb calls on Cygwin by almost 50 percent,
just by reducing the function call depth in newlib.  That depth is a
result of newlib's reentrancy and isolation efforts.

As for your implementation: if you could come up with a faster version
of newlib's __utf8_wctomb/__utf8_mbtowc, it would certainly be another
welcome performance boost.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat