Mail Archives: cygwin/2003/02/15/00:35:45
Lowell,
Max Bowsher reported:
>Or, on the command line -erobots=off :-)
>
>Whilst this does control whether wget downloads robots.txt, a quick
>test confirms that even when it does get robots.txt, it still wanders
>into cgi-bin.
>
>I'd suggest taking this to the wget list, except wget is currently
>maintainer-less and, it appears, bitrotted.
>
>Max.
As for this:
>Perhaps there is a counterpart to the above, i.e., <meta name="robots"
>content="follow">, that's being invoked, and someone from Red Hat could
>look into this and rule it out.
You should realize that for open source programs like wget, the
recommended practice is to examine the source yourself.
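Short of reading the source, a rough first check is to see whether the
page in question carries a robots META tag at all. The following is
only a sketch: robots_meta is a hypothetical helper name, the URL in
the usage comment is a placeholder, and it assumes a grep that
supports -o (GNU grep does).

```shell
# Hypothetical helper: print any robots META tags found in HTML on stdin.
# -i matches regardless of case (<META NAME=...> vs <meta name=...>);
# -o prints only the matching tag rather than the whole line.
robots_meta() {
    grep -i -o '<meta[^>]*name="robots"[^>]*>'
}

# Usage against a live page (placeholder URL):
#   wget -q -O - http://www.example.com/index.html | robots_meta
```

No output means no such tag was found in that document, which would
suggest the META mechanism is not what is steering wget there.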
Randall Schulz
At 17:43 2003-02-14, L Anderson wrote:
>Randall R Schulz wrote:
>>Lowell,
>>What's in your "~/.wgetrc" file? If it contains this:
>>robots = off
>>Then wget will not respect a "robots.txt" file on the host from which
>>it is retrieving files.
>>Before I learned of this option (accessible _only_ via this directive
>>in the .wgetrc file), I did something too clever by half to get
>>robots.txt ignored, so I know that wget does respect it.
>
>I have only two wgetrc related files as follows:
>
>/etc/wgetrc
>/usr/doc/wget-1.8.2/sample.wgetrc
>
>NB: I use win98 and these are under my cygwin directory i:\cygwin
>(i.e. /cygdrive/i).
>
>I have never changed either file; I just accepted the defaults
>installed by setup. The two files differ only in a few lines, which
>are just comments anyway, i.e., doing:
>
>$ diff /etc/wgetrc /usr/doc/wget-1.8.2/sample.wgetrc
>73,74c73,74
>< # You can set the default proxy for Wget to use. It will override the
>< # value in the environment.
>---
> > # You can set the default proxies for Wget to use for http and ftp.
> > # They will override the value in the environment.
>75a76
> > #ftp_proxy = http://proxy.yoyodyne.com:18023/
>
>shows this. Moreover,
>
>$ grep robot /etc/wgetrc
># Setting this to off makes Wget not download /robots.txt. Be sure to
># know *exactly* what /robots.txt is and how it is used before changing
>#robots = on
>
>shows the only references to "robot" are also comments.
>
>The stated default for wget is "robots=on", which I have seen honored
>for quite a number of other downloads, and since I didn't use "-e
>robots=off", that can't explain it. The only other thing I have found
>that might be related is not under my control, and I haven't yet
>figured out how to check it. The wget documentation states:
>
>"
>The second, less known mechanism, enables the author of an individual
>document to specify whether they want the links from the file to be
>followed by a robot. This is achieved using the META tag, like this:
>
><meta name="robots" content="nofollow">
>
>This is explained in some detail at
><http://www.robotstxt.org/wc/meta-user.html>. Wget supports this
>method of robot exclusion in addition to the usual /robots.txt exclusion.
>"
>
>Perhaps there is a counterpart to the above, i.e., <meta name="robots"
>content="follow">, that's being invoked, and someone from Red Hat could
>look into this and rule it out.
>
>Thanks (and still puzzled)!
>
>Lowell Anderson
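
The grep above matches comment lines as well as live settings. A
slightly stricter check, sketched here with a hypothetical helper name
(active_robots) and an approximate pattern, lists only uncommented
"robots" directives in whichever wgetrc files exist:

```shell
# Hypothetical helper: show uncommented "robots = ..." lines in wgetrc files.
# -s silences errors for files that do not exist (e.g. no ~/.wgetrc);
# the extra /dev/null argument makes grep prefix each match with its file name.
active_robots() {
    grep -E -i -s '^[[:space:]]*robots[[:space:]]*=' "$@" /dev/null
}

# Usage:
#   active_robots /etc/wgetrc "$HOME/.wgetrc"
```

If this prints nothing, neither file overrides the default, and only a
command-line "-e robots=off" could have turned the robots handling off.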
--
Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple
Bug reporting: http://cygwin.com/bugs.html
Documentation: http://cygwin.com/docs.html
FAQ: http://cygwin.com/faq/