Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com
Message-ID: <3E4D9B42.6040003@serv.net>
Date: Fri, 14 Feb 2003 17:43:30 -0800
From: L Anderson
Organization: TBD
User-Agent: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.0.2) Gecko/20021120 Netscape/7.01
X-Accept-Language: en,ru
MIME-Version: 1.0
To: cygwin AT cygwin DOT com
Subject: Re: Wget ignores robot.txt entry
References: <5 DOT 2 DOT 0 DOT 9 DOT 2 DOT 20030213182750 DOT 01e97e98 AT pop3 DOT cris DOT com>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit

Randall R Schulz wrote:
> Lowell,
>
> What's in your "~/.wgetrc" file? If it contains this:
>
> robots = off
>
> Then wget will not respect a "robots.txt" file on the host from which it
> is retrieving files.
>
> Before I learned of this option (accessible _only_ via this directive in
> the .wgetrc file), I did something too clever by half to get robots.txt
> ignored, so I know that wget does respect it.
>

I have only two wgetrc-related files:

/etc/wgetrc
/usr/doc/wget-1.8.2/sample.wgetrc

NB: I use Win98 and these are under my cygwin directory i:\cygwin
(i.e. /cygdrive/i).

I have never changed either file--I just accept the default installed by
setup. However, the two files differ by a few lines, which are just
comments anyway. Running:

$ diff /etc/wgetrc /usr/doc/wget-1.8.2/sample.wgetrc
73,74c73,74
< # You can set the default proxy for Wget to use. It will override the
< # value in the environment.
---
> # You can set the default proxies for Wget to use for http and ftp.
> # They will override the value in the environment.
75a76
> #ftp_proxy = http://proxy.yoyodyne.com:18023/

shows this. Moreover,

$ grep robot /etc/wgetrc
# Setting this to off makes Wget not download /robots.txt. Be sure to
# know *exactly* what /robots.txt is and how it is used before changing
#robots = on

shows that the only references to "robot" are also comments.

The stated default for wget is "robots=on", which I have seen honored for
quite a number of other downloads, and since I didn't use
"-e robots=off", that can't explain it.

The only other thing I have found that might be related is not under my
control, and I haven't yet figured out how to check it. The wget
documentation states:

" The second, less known mechanism, enables the author of an individual
document to specify whether they want the links from the file to be
followed by a robot. This is achieved using the META tag, like this:

This is explained in some detail at . Wget supports this method of robot
exclusion in addition to the usual /robots.txt exclusion. "

Perhaps there is a counterpart to the above that's being invoked, and
someone from Red Hat could look into it and rule it out.

Thanks (and still puzzled)!

Lowell Anderson

> Randall Schulz
>
>
> At 18:14 2003-02-13, L Anderson wrote:
>
>> Using the latest of things Cygwin, I downloaded some stuff with wget
>> from to peruse off-line and noticed a problem I
>> can't explain:
>>
>> The file has the entries:
>>
>> User-agent: *
>> Disallow: /snapshots/
>> Disallow: /cgi-bin/
>> Disallow: /cgi2-bin/
>>
>> so wget should not download /cgi-bin/.
>>
>> However, "wget -o cygwincom.log -m -p --no-parent -X /cygwin,/ml
>> http://cygwin.com/" downloads /cgi-bin anyway.
>>
>> NB. "wget -o cygwincom.log -m -p --no-parent -X /cgi-bin,/cygwin,/ml
>> http://cygwin.com/" doesn't download /cgi-bin
>>
>> I ran a validity check on and found no
>> errors.
>>
>> Is this a bug in wget or am I doing something wrong?
>>
>> Thanks,
>>
>> Lowell Anderson
>

--
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
Bug reporting:         http://cygwin.com/bugs.html
Documentation:         http://cygwin.com/docs.html
FAQ:                   http://cygwin.com/faq/
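
For anyone wanting to double-check how the robots.txt entries quoted in this thread ought to be read, independently of wget, here is a minimal sketch using Python's standard urllib.robotparser module. It only evaluates the four quoted lines against a few paths; it does not reproduce wget's own parser, so it can confirm what the rules mean but not why wget fetched /cgi-bin. The helper name is_allowed, the "Wget" user-agent string, and the sample paths are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt entries as quoted in this thread (nothing fetched over the network).
ROBOTS_TXT = """\
User-agent: *
Disallow: /snapshots/
Disallow: /cgi-bin/
Disallow: /cgi2-bin/
"""


def is_allowed(path, user_agent="Wget"):
    """Return True if the quoted rules permit user_agent to fetch path.

    Hypothetical helper: the user-agent string and base URL are assumptions.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, "http://cygwin.com" + path)


if __name__ == "__main__":
    # Sample paths chosen for illustration only.
    for path in ("/", "/ml/", "/snapshots/", "/cgi-bin/", "/cgi2-bin/"):
        print(path, "allowed" if is_allowed(path) else "disallowed")
```

If the rules are read the way the original post expects, the /cgi-bin/ path is reported as disallowed for any user agent, which would suggest the rules themselves are fine and the question lies with how wget applied them during this particular run.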