Mail Archives: cygwin/2003/02/14/20:43:57
Randall R Schulz wrote:
> Lowell,
>
> What's in your "~/.wgetrc" file? If it contains this:
>
> robots = off
>
> Then wget will not respect a "robots.txt" file on the host from which it
> is retrieving files.
>
> Before I learned of this option (accessible _only_ via this directive in
> the .wgetrc file), I did something too clever by half to get robots.txt
> ignored, so I know that wget does respect it.
>
I have only two wgetrc related files as follows:
/etc/wgetrc
/usr/doc/wget-1.8.2/sample.wgetrc
NB: I use win98 and these are under my cygwin directory i:\cygwin (i.e.
/cygdrive/i).
I have never changed either file; I just accepted the defaults installed
by setup. The two files do differ by a few lines, but those are just
comments anyway, i.e. doing:
$ diff /etc/wgetrc /usr/doc/wget-1.8.2/sample.wgetrc
73,74c73,74
< # You can set the default proxy for Wget to use. It will override the
< # value in the environment.
---
> # You can set the default proxies for Wget to use for http and ftp.
> # They will override the value in the environment.
75a76
> #ftp_proxy = http://proxy.yoyodyne.com:18023/
shows this. Moreover,
$ grep robot /etc/wgetrc
# Setting this to off makes Wget not download /robots.txt. Be sure to
# know *exactly* what /robots.txt is and how it is used before changing
#robots = on
shows the only references to "robot" are also comments.
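The grep above can also be done mechanically. Here is a minimal Python sketch that returns the value a wgetrc file actually sets for a command, skipping comment lines. It assumes a simplified reading of the wgetrc syntax ("name = value", "#" starts a comment, names matched case-insensitively). The function name effective_wgetrc_setting is made up for illustration and is not part of wget.

```python
def effective_wgetrc_setting(text, name):
    """Return the last uncommented value assigned to `name`, or None.

    Simplified, assumed reading of wgetrc syntax: one "name = value"
    per line, lines starting with "#" are comments.
    """
    value = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or comment: sets nothing
        key, sep, val = stripped.partition("=")
        if sep and key.strip().lower() == name.lower():
            value = val.strip()  # a later assignment overrides an earlier one
    return value


# The three lines that grep found in /etc/wgetrc, all of them comments.
sample = """\
# Setting this to off makes Wget not download /robots.txt. Be sure to
# know *exactly* what /robots.txt is and how it is used before changing
#robots = on
"""

print(effective_wgetrc_setting(sample, "robots"))
```

With only commented-out lines the function returns None, meaning the file leaves the robots setting at wget's built-in default.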
The stated default for wget is "robots=on", which I have seen honored
for quite a number of other downloads, and I didn't use "-e robots=off",
so neither setting can explain it. The only other thing I have found
that might be related is not under my control, and I haven't yet figured
out how to check it. The wget documentation states:
"
The second, less known mechanism, enables the author of an individual
document to specify whether they want the links from the file to be
followed by a robot. This is achieved using the META tag, like this:
<meta name="robots" content="nofollow">
This is explained in some detail at
<http://www.robotstxt.org/wc/meta-user.html>. Wget supports this method
of robot exclusion in addition to the usual /robots.txt exclusion.
"
Perhaps there is a counterpart to the above, i.e., <meta name="robots"
content="follow">, that's being invoked; maybe someone from Red Hat could
look into this and rule it out.
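One way to check this without server access would be to save the page and list whatever robots META directives it declares. The sketch below uses Python's standard html.parser module. The names RobotsMetaParser and robots_meta_directives are hypothetical, and the sample page is made up rather than taken from cygwin.com. It only reports what a page declares and says nothing about how wget treats a "follow" value.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names already lowercased.
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_meta_directives(html):
    """Return the set of robots META directives found in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Made-up sample page using the tag quoted from the wget documentation.
page = '<html><head><meta name="robots" content="nofollow"></head></html>'
print(robots_meta_directives(page))
```

An empty set means the page declares no robots META tag at all, which would rule this mechanism out for that page.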
Thanks (and still puzzled)!
Lowell Anderson
> Randall Schulz
>
>
> At 18:14 2003-02-13, L Anderson wrote:
>
>> Using the latest of things Cygwin, I downloaded some stuff with wget
>> from <http://cygwin.com> to peruse off-line and noticed a problem I
>> can't explain:
>>
>> The <http://cygwin.com/robots.txt> file has the entries:
>>
>> User-agent: *
>> Disallow: /snapshots/
>> Disallow: /cgi-bin/
>> Disallow: /cgi2-bin/
>>
>> so wget should not download /cgi-bin/.
>>
>> However, "wget -o cygwincom.log -m -p --no-parent -X /cygwin,/ml
>> http://cygwin.com/" downloads /cgi-bin anyway.
>>
>> NB. "wget -o cygwincom.log -m -p --no-parent -X /cgi-bin,/cygwin,/ml
>> http://cygwin.com/" doesn't download /cgi-bin.
>>
>> I ran a validity check on <http://cygwin.com/robots.txt> and found no
>> errors.
>>
>> Is this a bug in wget or am I doing something wrong?
>>
>> Thanks,
>>
>> Lowell Anderson
>
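As a cross-check independent of wget, the robots.txt rules quoted in the original message can be evaluated offline with Python's standard urllib.robotparser module. This is only a sketch. The helper name allowed and the test paths are made up, and the "Wget" user-agent string is an assumption. Python's matching may also differ in detail from wget's own parser, so it shows what the rules say, not what wget 1.8.2 does.

```python
from urllib.robotparser import RobotFileParser

# The rules as quoted from http://cygwin.com/robots.txt in the message above.
RULES = """\
User-agent: *
Disallow: /snapshots/
Disallow: /cgi-bin/
Disallow: /cgi2-bin/
"""


def allowed(path, rules=RULES, agent="Wget"):
    """Return True if `agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())  # parse local text; no network access
    return parser.can_fetch(agent, "http://cygwin.com" + path)


# Hypothetical paths, chosen only to exercise the rules.
for path in ("/faq/", "/cgi-bin/example", "/snapshots/example"):
    print(path, allowed(path))
```

Under these rules anything below /cgi-bin/ is disallowed for every user agent, which matches the expectation that wget should have skipped it.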