Mail Archives: cygwin/2003/02/15/00:35:45

Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sources.redhat.com/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sources.redhat.com/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com
Message-Id: <5.2.0.9.2.20030214213133.0296cb10@pop3.cris.com>
X-Sender: rrschulz AT pop3 DOT cris DOT com
Date: Fri, 14 Feb 2003 21:35:20 -0800
To: cygwin AT cygwin DOT com
From: Randall R Schulz <rrschulz AT cris DOT com>
Subject: Re: Wget ignores robot.txt entry
In-Reply-To: <3E4D9B42.6040003@serv.net>
References: <5 DOT 2 DOT 0 DOT 9 DOT 2 DOT 20030213182750 DOT 01e97e98 AT pop3 DOT cris DOT com>
Mime-Version: 1.0

Lowell,

Max Bowsher reported:

>Or, on the command line -erobots=off :-)
>
>Whilst this does control whether wget downloads robots.txt, a quick 
>test confirms that even when it does get robots.txt, it still wanders 
>into cgi-bin.
>
>I'd suggest taking this to the wget list, except wget is currently 
>maintainer-less, and, it appears, bitrotted.
>
>Max.
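
To check what a given robots.txt actually forbids, independently of wget, one option is Python's standard urllib.robotparser module. This is only a sketch: the robots.txt contents and the example.com URLs below are hypothetical (the real file from this thread is not shown), and the result says what a conforming parser concludes, not what wget 1.8.2 does internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; substitute the text served by the real host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise the rules above.
    for url in (
        "http://example.com/index.html",
        "http://example.com/cgi-bin/search.cgi",
    ):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, "Wget", url) else "disallowed"
        print(url, verdict)
```

If this reports a cgi-bin URL as disallowed while wget still retrieves it, the rules themselves are fine and the problem lies on the wget side.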


As for this:

>Perhaps there is a counterpart to the above, i.e., <meta name="robots" 
>content="follow">, that's being invoked, and someone from Red Hat could 
>look into this and rule it out.

You should realize that for open source programs like wget, the 
recommended practice is to examine the source yourself.

Randall Schulz


At 17:43 2003-02-14, L Anderson wrote:

>Randall R Schulz wrote:
>>Lowell,
>>What's in your "~/.wgetrc" file? If it contains this:
>>robots = off
>>Then wget will not respect a "robots.txt" file on the host from which 
>>it is retrieving files.
>>Before I learned of this option (accessible _only_ via this directive 
>>in the .wgetrc file), I did something too clever by half to get 
>>robots.txt ignored, so I know that wget does respect it.
>
>I have only two wgetrc related files as follows:
>
>/etc/wgetrc
>/usr/doc/wget-1.8.2/sample.wgetrc
>
>NB: I use win98 and these are under my cygwin directory i:\cygwin 
>(i.e. /cygdrive/i).
>
>I have never changed either file--I just accept the defaults installed 
>by setup.  The two files differ by a few lines, but those lines are 
>only comments, as this shows:
>
>$ diff /etc/wgetrc /usr/doc/wget-1.8.2/sample.wgetrc
>73,74c73,74
>< # You can set the default proxy for Wget to use.  It will override the
>< # value in the environment.
>---
> > # You can set the default proxies for Wget to use for http and ftp.
> > # They will override the value in the environment.
>75a76
> > #ftp_proxy = http://proxy.yoyodyne.com:18023/
>
>Moreover,
>
>$ grep robot /etc/wgetrc
># Setting this to off makes Wget not download /robots.txt.  Be sure to
># know *exactly* what /robots.txt is and how it is used before changing
>#robots = on
>
>shows the only references to "robot" are also comments.
>
>The stated default for wget is "robots=on" which I have seen honored 
>for quite a number of other downloads and since I didn't use "-e 
>robots=off", that can't explain it.  The only other thing I have found 
>that might be related is not under my control and I haven't yet 
>figured out how to check it.  The wget documentation states:
>
>"
>The second, less known mechanism, enables the author of an individual 
>document to specify whether they want the links from the file to be 
>followed by a robot. This is achieved using the META tag, like this:
>
><meta name="robots" content="nofollow">
>
>This is explained in some detail at 
><http://www.robotstxt.org/wc/meta-user.html>. Wget supports this 
>method of robot exclusion in addition to the usual /robots.txt exclusion.
>"
>
>Perhaps there is a counterpart to the above, i.e., <meta name="robots" 
>content="follow">, that's being invoked, and someone from Red Hat could 
>look into this and rule it out.
>
>Thanks (and still puzzled)!
>
>Lowell Anderson
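
The META-tag mechanism quoted above can be checked without reading wget's source by scanning a saved copy of the page for robots directives. The sketch below uses Python's standard html.parser module; the class and function names are made up for illustration, and treating "none" as implying "nofollow" follows the usual robots META convention rather than anything confirmed about wget's behavior.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None (e.g. <meta name>), so normalize them.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html: str) -> set:
    """Return the set of robots META directives found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


def may_follow_links(html: str) -> bool:
    """True unless the page asks robots not to follow its links."""
    directives = robots_directives(html)
    return "nofollow" not in directives and "none" not in directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="nofollow"></head></html>'
    print(robots_directives(page), may_follow_links(page))
```

A page with no robots META tag, or one saying content="follow", yields may_follow_links() == True, which matches the default behavior of following links.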


--
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
Bug reporting:         http://cygwin.com/bugs.html
Documentation:         http://cygwin.com/docs.html
FAQ:                   http://cygwin.com/faq/
