Mail Archives: cygwin/2009/11/06/11:30:50

X-Recipient: archive-cygwin AT delorie DOT com
X-SWARE-Spam-Status: No, hits=-2.4 required=5.0 tests=AWL,BAYES_00
X-Spam-Check-By: sourceware.org
Message-ID: <4AF44F26.2030603@towo.net>
Date: Fri, 06 Nov 2009 17:30:30 +0100
From: Thomas Wolff <towo AT towo DOT net>
User-Agent: Thunderbird 2.0.0.23 (Windows/20090812)
MIME-Version: 1.0
To: cygwin AT cygwin DOT com
Subject: Re: [1.7] BUG - GREP slows to a crawl with large number of matches on a single file
References: <26224019 DOT post AT talk DOT nabble DOT com> <4AF393C6 DOT 3000505 AT tlinx DOT org> <20091106033243 DOT GB30410 AT ednor DOT casa DOT cgf DOT cx> <4AF42027 DOT 80604 AT towo DOT net> <20091106135152 DOT GK26344 AT calimero DOT vinschen DOT de> <4AF42B15 DOT 9050100 AT byu DOT net> <20091106142644 DOT GL26344 AT calimero DOT vinschen DOT de> <4AF439F0 DOT 8060203 AT towo DOT net> <20091106152454 DOT GN26344 AT calimero DOT vinschen DOT de>
In-Reply-To: <20091106152454.GN26344@calimero.vinschen.de>
X-IsSubscribed: yes
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

Corinna Vinschen wrote:
> On Nov  6 16:00, Thomas Wolff wrote:
>   
>>> ...
>>>       
>> I extended your test program to demonstrate the inefficiency of the
>> standard mbrtowc function. [...]
>>     
I later had to correct:
> Anyway, corrected results are still by a factor of 3 to 4 in favor of 
> my algorithm. 
Corinna wrote:
> That's sort of an unfair test.  Your utftouni function doesn't care for
> mbstate, error, and surrogate pair handling.
>   
This is a question of use cases:
* mbstate is needed e.g. if you feed results of read() which possibly 
come in arbitrary chunks directly into mbtowc(); it's not needed if you 
only transform complete lines of text at once. The stdlib function is a 
little bit too generic (and thus complicated, too) for many applications.
* error handling is there in my function, though simplified: for the 
test case all incorrect sequences are mapped to 0, but they could just as 
well return an error indication without performance impact.
* surrogate pair handling is only needed if you pass the string from/to 
the Windows API. It's not needed for POSIX applications (provided 
wchar_t is sufficiently wide). So if wchar_t can be extended in 
the newlib API, it might be useful to have two implementations: one for 
applications (w/o surrogates), one for cygwin itself.
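
A minimal sketch of what such a use-case-specific decoder might look 
like, assuming complete lines are converted at once (so no mbstate) and 
the result type is wide enough for any code point (so no surrogate 
pairs). This is not the utftouni function from the test program; the 
name utf8_decode, its signature, and the use of uint32_t as a stand-in 
for a sufficiently wide wchar_t are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the utftouni function from this thread.
   Decodes one UTF-8 sequence starting at s, with n bytes available,
   into *cp. Returns the number of bytes consumed, or 0 if the sequence
   is invalid or truncated. It keeps no state between calls and never
   produces surrogate pairs. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;
    unsigned char ch;

    if (n == 0)
        return 0;
    ch = s[0];

    if (ch < 0x80) {                    /* ASCII fast path */
        *cp = ch;
        return 1;
    } else if ((ch & 0xe0) == 0xc0) {   /* two-byte sequence */
        len = 2; c = ch & 0x1f; min = 0x80;
    } else if ((ch & 0xf0) == 0xe0) {   /* three-byte sequence */
        len = 3; c = ch & 0x0f; min = 0x800;
    } else if ((ch & 0xf8) == 0xf0) {   /* four-byte sequence */
        len = 4; c = ch & 0x07; min = 0x10000;
    } else {
        return 0;                       /* stray continuation or invalid lead byte */
    }

    if (n < len)
        return 0;                       /* truncated input */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80)
            return 0;                   /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3f);
    }

    /* Reject overlong forms, values above U+10FFFF, and encoded surrogates. */
    if (c < min || c > 0x10ffff || (c >= 0xd800 && c <= 0xdfff))
        return 0;

    *cp = c;
    return len;
}
```

A caller would loop over a complete line, advancing by the returned 
length and treating a return value of 0 as an error indication.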

> Having said that, I just experimented further with mbrtowc, and I was
> able to speed up mbrtowc and wcrtomb calls on Cygwin by a factor of
> almost 50 per cent, just by reducing the function call depth in newlib,
> which is the result of reentrancy and isolation efforts.
>   
Great! That comes close to my corrected results  :-[

> Talking about your implementation, if you could come up with a faster
> implementation of newlib's __utf8_wctomb/__utf8_mbtowc, it would
> certainly be another welcome performance boost.
>   
A quick look at those functions doesn't reveal much potential, except for 
tiny optimizations like
-  if (ch >= 0xe0 && ch <= 0xef) /* three-byte sequence */
+  if ((ch & 0xf0) == 0xe0) /* three-byte sequence */
But even that, given the way the compiler optimizes expressions, is 
probably not an improvement.
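
The two forms of the three-byte lead check can be compared directly. 
The sketch below assumes ch is an unsigned byte; the function names are 
hypothetical and not taken from newlib. Note that the mask form needs 
its own parentheses, because == binds tighter than & in C.

```c
/* Range form, as in the existing check quoted above. */
static int is_lead3_range(unsigned char ch)
{
    return ch >= 0xe0 && ch <= 0xef;
}

/* Mask form. Without the inner parentheses the expression would parse
   as ch & (0xf0 == 0xe0), which is always 0. */
static int is_lead3_mask(unsigned char ch)
{
    return (ch & 0xf0) == 0xe0;
}

/* Returns 1 if both forms agree for every possible byte value,
   0 otherwise. */
static int lead3_forms_agree(void)
{
    int b;

    for (b = 0; b < 256; b++) {
        if (is_lead3_range((unsigned char)b) != is_lead3_mask((unsigned char)b))
            return 0;
    }
    return 1;
}
```

Since both forms accept exactly the bytes 0xe0 through 0xef, any 
difference would come only from the code the compiler generates.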

Also, I remember that your recent tweaks to the wide character 
functions fixed some trouble, so this code is better left untouched.

My main point was that, depending on the use case, some applications 
would be better off using less generic, optimized functions.
grep and sed would certainly be well advised to do that.

Thomas

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
