Mail Archives: cygwin/2003/02/14/20:43:57

Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm

List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>

List-Archive: <http://sources.redhat.com/ml/cygwin/>

List-Post: <mailto:cygwin AT cygwin DOT com>

List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sources.redhat.com/ml/#faqs>

Sender: cygwin-owner AT cygwin DOT com

Mail-Followup-To: cygwin AT cygwin DOT com

Delivered-To: mailing list cygwin AT cygwin DOT com

Message-ID: <3E4D9B42.6040003@serv.net>

Date: Fri, 14 Feb 2003 17:43:30 -0800

From: L Anderson <lowella AT serv DOT net>

Organization: TBD

User-Agent: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.0.2) Gecko/20021120 Netscape/7.01

X-Accept-Language: en,ru

MIME-Version: 1.0

To: cygwin AT cygwin DOT com

Subject: Re: Wget ignores robot.txt entry

References: <5 DOT 2 DOT 0 DOT 9 DOT 2 DOT 20030213182750 DOT 01e97e98 AT pop3 DOT cris DOT com>

Randall R Schulz wrote:
> Lowell,
> 
> What's in your "~/.wgetrc" file? If it contains this:
> 
> robots = off
> 
> Then wget will not respect a "robots.txt" file on the host from which it 
> is retrieving files.
> 
> Before I learned of this option (accessible _only_ via this directive in 
> the .wgetrc file), I did something too clever by half to get robots.txt 
> ignored, so I know that wget does respect it.
> 

I have only two wgetrc-related files, as follows:

/etc/wgetrc
/usr/doc/wget-1.8.2/sample.wgetrc

NB: I use Win98, and these are under my Cygwin directory i:\cygwin (i.e., 
under /cygdrive/i).

I have never changed either file; I just accepted the defaults installed by 
setup.  However, the two files differ only by a few lines, which are just 
comments anyway; i.e., doing:

$ diff /etc/wgetrc /usr/doc/wget-1.8.2/sample.wgetrc
73,74c73,74
< # You can set the default proxy for Wget to use.  It will override the
< # value in the environment.
---
 > # You can set the default proxies for Wget to use for http and ftp.
 > # They will override the value in the environment.
75a76
 > #ftp_proxy = http://proxy.yoyodyne.com:18023/

shows this.  Moreover,

$ grep robot /etc/wgetrc
# Setting this to off makes Wget not download /robots.txt.  Be sure to
# know *exactly* what /robots.txt is and how it is used before changing
#robots = on

shows the only references to "robot" are also comments.
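
The grep above matches commented lines too. A minimal sketch of a narrower check that reports only active "robots" directives follows. The function name is made up for illustration, and the list of candidate files (the system file, ~/.wgetrc, and the file named by $WGETRC if set) is an assumption about this setup:

```shell
#!/bin/sh
# Print every active (uncommented) "robots" directive found in the given
# wgetrc files, prefixed with the file name. Unreadable files are skipped.
active_robots_settings() {
    for f in "$@"; do
        [ -r "$f" ] || continue
        # /dev/null as a second operand makes grep print the file name.
        grep -Ei '^[[:space:]]*robots[[:space:]]*=' "$f" /dev/null || true
    done
}

# Assumed candidate files: the system file named in the message, the
# per-user file, and the file named by $WGETRC if that variable is set.
active_robots_settings /etc/wgetrc "$HOME/.wgetrc" ${WGETRC:+"$WGETRC"}
```

If this prints nothing, none of those files overrides the default.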

The stated default for wget is "robots=on", which I have seen honored for 
quite a number of other downloads, and since I didn't use "-e 
robots=off", that can't explain it.  The only other thing I have found 
that might be related is not under my control, and I haven't yet figured 
out how to check it.  The wget documentation states:

"
The second, less known mechanism, enables the author of an individual 
document to specify whether they want the links from the file to be 
followed by a robot. This is achieved using the META tag, like this:

<meta name="robots" content="nofollow">

This is explained in some detail at 
<http://www.robotstxt.org/wc/meta-user.html>. Wget supports this method 
of robot exclusion in addition to the usual /robots.txt exclusion.
"

Perhaps there is a counterpart to the above, i.e., <meta name="robots" 
content="follow">, that's being invoked, and someone from Red Hat could 
look into this and rule it out.
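
One way to check the META possibility locally might be to search the mirrored pages for a robots META tag. This is only a sketch. It assumes the mirror landed in a directory named cygwin.com (wget -m normally creates a host-named directory), and the function name is hypothetical:

```shell
#!/bin/sh
# List every line in the files under directory $1 that contains a
# <meta name="robots" ...> tag, prefixed with the file name.
find_meta_robots() {
    # /dev/null as an extra operand makes grep always print file names.
    find "$1" -type f \
        -exec grep -iE "<meta[^>]*name=[\"']?robots" /dev/null {} + || true
}

# Assumed location of the mirror; adjust to wherever wget saved the pages.
if [ -d cygwin.com ]; then
    find_meta_robots cygwin.com
fi
```

Each hit shows the tag's content attribute, so a "follow" or "nofollow" value would be visible directly.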

Thanks (and still puzzled)!

Lowell Anderson

> Randall Schulz
> 
> 
> At 18:14 2003-02-13, L Anderson wrote:
> 
>> Using the latest of things Cygwin, I downloaded some stuff with wget 
>> from <http://cygwin.com> to peruse off-line and noticed a problem I 
>> can't explain:
>>
>> The <http://cygwin.com/robots.txt> file has the entries:
>>
>> User-agent: *
>> Disallow: /snapshots/
>> Disallow: /cgi-bin/
>> Disallow: /cgi2-bin/
>>
>> so wget should not download /cgi-bin/.
>>
>> However, "wget -o cygwincom.log -m -p --no-parent -X /cygwin,/ml 
>> http://cygwin.com/" downloads /cgi-bin anyway.
>>
>> NB. "wget -o cygwincom.log -m -p --no-parent -X /cgi-bin,/cygwin,/ml 
>> http://cygwin.com/" doesn't download /cgi-bin.
>>
>> I ran a validity check on <http://cygwin.com/robots.txt> and found no 
>> errors.
>>
>> Is this a bug in wget or am I doing something wrong?
>>
>> Thanks,
>>
>> Lowell Anderson
> 
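
For reference, the prefix rule that the robots.txt entries quoted above are expected to enforce can be sketched as follows. This is a simplified illustration, not wget's actual implementation: it ignores User-agent grouping and treats every Disallow line as applying to all robots. The function name and the test paths are made up:

```shell
#!/bin/sh
# Succeed (exit status 0) if path $2 begins with any Disallow prefix
# listed in robots.txt file $1. Simplification: User-agent sections are
# ignored, so every Disallow line is treated as applying to everyone.
is_disallowed() {
    awk -v p="$2" '
        tolower($1) == "disallow:" && $2 != "" && index(p, $2) == 1 { hit = 1 }
        END { exit (hit ? 0 : 1) }
    ' "$1"
}

# Local copy of the entries quoted in the original message.
robots=$(mktemp)
cat > "$robots" <<'EOF'
User-agent: *
Disallow: /snapshots/
Disallow: /cgi-bin/
Disallow: /cgi2-bin/
EOF

for path in /cgi-bin/example /ml/example; do
    if is_disallowed "$robots" "$path"; then
        echo "$path: disallowed"
    else
        echo "$path: allowed"
    fi
done
rm -f "$robots"
```

Under this rule anything below /cgi-bin/ should be skipped, which is why the first wget command in the original report looks wrong.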


