Message-ID: <4AF45152.5060505@towo.net>
Date: Fri, 06 Nov 2009 17:39:46 +0100
From: Thomas Wolff <towo@towo.net>
User-Agent: Thunderbird 2.0.0.23 (Windows/20090812)
MIME-Version: 1.0
To: bug-grep@gnu.org, bug-sed@gnu.org
CC: cygwin@cygwin.com
Subject: Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file
References: <26224019.post@talk.nabble.com>  <4AF393C6.3000505@tlinx.org>  <20091106033243.GB30410@ednor.casa.cgf.cx>  <4AF42027.80604@towo.net>  <20091106135152.GK26344@calimero.vinschen.de>  <4AF42B15.9050100@byu.net>  <20091106142644.GL26344@calimero.vinschen.de>  <4AF439F0.8060203@towo.net> <20091106152454.GN26344@calimero.vinschen.de> <4AF44F26.2030603@towo.net>
In-Reply-To: <4AF44F26.2030603@towo.net>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

[forgot to CC to bug-grep before, so I'm resending this, with one more 
comment, and leaving out cygwin-specific parts]

Corinna Vinschen wrote:
> On Nov  6 16:00, Thomas Wolff wrote:
>>> ...
>> I extended your test program to demonstrate the inefficiency of the
>> standard mbrtowc function. [...]
I later had to correct:
> Anyway, corrected results are still by a factor of 3 to 4 in favor of 
> my algorithm. 
Corinna wrote:
> That's sort of an unfair test.  Your utftouni function doesn't care for
> mbstate, error, and surrogate pair handling.
>   
This is a question of use cases:
* mbstate is needed e.g. if you feed the results of read(), which may
arrive in arbitrary chunks, directly into mbrtowc(); it's not needed if
you only convert complete lines of text at once. The stdlib function is
more generic (and thus more complicated) than many applications need.
* error handling is there in my function, in simplified form: for the
test case, all incorrect sequences are mapped to 0, but they could just
as well return an error indication without performance impact.
* surrogate pair handling is only needed if you pass the string from/to
the Windows API. It's not needed for POSIX applications (provided
wchar_t is sufficiently wide). So if wchar_t can be widened in the
newlib API, it might be useful to have two implementations: one for
applications (w/o surrogates), one for cygwin itself.
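
To make the three points above concrete, here is a minimal sketch of a
stateless UTF-8 decoder. It is NOT the utftouni function from the test
program (that code is not shown in this message); the name utf8_decode,
its signature, and the convention of returning 0 on error are all
assumptions made for illustration. It expects a complete buffer (so no
mbstate), reports invalid input through the return value, and yields
full code points instead of surrogate pairs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stateless decoder (illustrative only, not utftouni).
 * Decodes one code point from the complete buffer s[0..n).
 * Returns the number of bytes consumed (1..4) and stores the code
 * point in *cp, or returns 0 on an invalid or truncated sequence. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;

    if (n == 0)
        return 0;

    if (s[0] < 0x80) {                 /* ASCII fast path */
        *cp = s[0];
        return 1;
    }

    if ((s[0] & 0xE0) == 0xC0) {       /* 110xxxxx: 2-byte sequence */
        len = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) { /* 1110xxxx: 3-byte sequence */
        len = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) { /* 11110xxx: 4-byte sequence */
        len = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                      /* stray continuation or bad lead byte */
    }

    if (n < len)
        return 0;                      /* truncated: caller passed an incomplete line */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                  /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, values beyond U+10FFFF, and encoded
     * surrogates (U+D800..U+DFFF), none of which are valid UTF-8. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;

    *cp = c;
    return len;
}
```

A caller would loop over a complete line, advancing by the returned
length and treating a return value of 0 as an encoding error. Whether
this is actually faster than mbrtowc on a given platform would have to
be measured.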


[...]

My main point was that, depending on the use case, some applications
would be better off using less generic, optimized functions.
The rather dogmatic suggestion (as seen in the "locale scene") that
everybody should use the stdlib wide character functions is often
misleading.
grep and sed would certainly be well advised to reconsider their use of
these functions.
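
As one small example of a "less generic" function: counting the
characters in a complete line of valid UTF-8 does not require decoding
at all, since every character contributes exactly one byte that is not
a continuation byte. The sketch below is purely illustrative; the name
utf8_count_chars is hypothetical, and it assumes the input has already
been validated.

```c
#include <stddef.h>

/* Hypothetical helper (illustrative only): counts code points in a
 * complete, valid UTF-8 buffer by counting the bytes that are not
 * continuation bytes (bit pattern 10xxxxxx). */
static size_t utf8_count_chars(const char *s, size_t n)
{
    size_t count = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```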

Thomas
