Message-ID: <b54f8ffa-feea-424b-a8b3-9dfaf4adf00d@SystematicSW.ab.ca>
Date: Sun, 1 Sep 2024 00:16:20 -0600
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: grepping a large file through a pipe takes eons
Content-Language: en-CA
To: cygwin@cygwin.com
References: <CAK-n8j6cjd5mHah6y1EVgbRsXLrdbati-j1QS1r1+aDc8jwg=g@mail.gmail.com>
 <20240901042425.702a5242c4bd5573ae993497@nifty.ne.jp>
Organization: Systematic Software
In-Reply-To: <20240901042425.702a5242c4bd5573ae993497@nifty.ne.jp>
From: Brian Inglis via Cygwin <cygwin@cygwin.com>
Reply-To: cygwin@cygwin.com
Cc: Brian Inglis <Brian.Inglis@SystematicSW.ab.ca>
Content-Type: text/plain; charset="utf-8"; Format="flowed"

On 2024-08-31 13:24, Takashi Yano via Cygwin wrote:
> On Sat, 31 Aug 2024 09:59:11 -0600
> Jim Reisert AD1C wrote:
>> Something has changed in the last month or two.  I have a very large
>> file I am trying to grep (465 MB):
>>
>> -rwxrw----+ 1 jjrei jjrei 465092052 Aug 31 09:39 all_spots.txt
>>
>>
>> If I grep for something near the end of the file, the results return right away:
>>
>> # time grep -n N0FUL all_spots.txt
>>
>> 17027336:N0FUL,20240615,20240615,1
>> 17027337:N0FUL,20240629,20240629,1
>>
>> real    0m0.190s
>> user    0m0.078s
>> sys     0m0.078s
>>
>>
>> If I pipe the file through cat, grep takes much longer:
>>
>> # time cat all_spots.txt | grep -n N0FUL
>>
>> 17027336:N0FUL,20240615,20240615,1
>> 17027337:N0FUL,20240629,20240629,1
>>
>>
>> real    1m4.934s
>> user    0m0.031s
>> sys     0m0.124s
> 
> Thanks for the report. This seems to be a regression of cygwin 3.5.4.
> I'll submit a patch for this issue shortly.

Remember that many Unix-derived utilities use mmap-ed files when available,
letting the paging system handle file I/O so that reads, writes, and searches
become high-speed memory operations.
It would be worth your while to time grepping all the files directly versus
cat-ing them into one file and grepping that.
In either case, it will usually be faster to operate directly on files.
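
A rough way to try the file-versus-pipe comparison on your own system is
sketched below. The generated file contents, the line count, and the
elapsed helper are all made up for illustration; only grep -n and the
N0FUL pattern come from the original report. Timing uses whole seconds from
date, so it only shows large differences like the one reported.

```shell
#!/bin/sh
# Sketch: compare grep reading a file directly with grep reading the same
# data through a pipe. File contents and size are assumptions for the test.
f=$(mktemp)

# Build a test file whose only match is on the last line.
awk 'BEGIN {
    for (i = 1; i < 200000; i++) print "SPOT" i ",20240615,20240615,1"
    print "N0FUL,20240629,20240629,1"
}' > "$f"

# elapsed CMD...: run CMD, pass its output through, and report the elapsed
# whole seconds on stderr (hypothetical helper, not a standard utility).
elapsed() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "elapsed: $((end - start))s: $*" >&2
    return $status
}

# Direct read of the file.
direct=$(elapsed grep -n N0FUL "$f")

# Same data, but fed to grep through a pipe.
piped=$(elapsed sh -c 'cat "$1" | grep -n N0FUL' sh "$f")

# Both runs should report the same match.
[ "$direct" = "$piped" ] && echo "same matches: $direct"

rm -f "$f"
```

If the piped run takes far longer than the direct one on a large file, that
points at the pipe handling rather than at grep itself.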

$ ls -1gloU /var/log/*.log | awk '{t+=$3};END{print int(NR/1024+0.5) "k files",int(t/1024/1024+0.5) "MB"}'
26k files 59MB

$ time grep -h -e cygwin -- /var/log/*.log > /tmp/grep.log

real    0m8.996s
user    0m1.015s
sys     0m7.983s

$ time cat -- /var/log/*.log > /tmp/var.log && grep -h -e cygwin -- /tmp/var.log > /tmp/cat-grep.log

real    0m9.557s
user    0m0.953s
sys     0m8.609s

$ wc -lc -- /tmp/var.log /tmp/*grep.log
   708552 61905630 /tmp/var.log
    35481  5652354 /tmp/cat-grep.log
    35481  5652354 /tmp/grep.log
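
Equal line and byte counts suggest the two outputs match, but do not prove it;
cmp can confirm they are byte-for-byte identical. The sketch below uses a
temporary directory with two small stand-in log files (the names and contents
are assumptions, not the real /var/log data) so it can be run anywhere.

```shell
#!/bin/sh
# Sketch: check that grepping files directly and grepping their
# concatenation produce identical output. All paths are stand-ins.
d=$(mktemp -d)
printf 'a cygwin line\nsomething else\n' > "$d/1.log"
printf 'more cygwin here\n' > "$d/2.log"

# Grep the files directly.
grep -h -e cygwin -- "$d"/*.log > "$d/grep.out"

# Concatenate first, then grep the single file. The combined file gets a
# .txt suffix so that it is not picked up by the *.log glob.
cat -- "$d"/*.log > "$d/all.txt" && grep -h -e cygwin -- "$d/all.txt" > "$d/cat-grep.out"

# cmp -s is silent and reports only through its exit status.
if cmp -s "$d/grep.out" "$d/cat-grep.out"; then
    result=identical
else
    result=different
fi
echo "$result"

rm -rf "$d"
```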

-- 
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer     but when there is no more to cut
                                 -- Antoine de Saint-Exupéry


