Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com
Message-Id: <5.2.0.9.2.20030214213133.0296cb10@pop3.cris.com>
X-Sender: rrschulz AT pop3 DOT cris DOT com
Date: Fri, 14 Feb 2003 21:35:20 -0800
To: cygwin AT cygwin DOT com
From: Randall R Schulz
Subject: Re: Wget ignores robot.txt entry
In-Reply-To: <3E4D9B42.6040003@serv.net>
References: <5 DOT 2 DOT 0 DOT 9 DOT 2 DOT 20030213182750 DOT 01e97e98 AT pop3 DOT cris DOT com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format=flowed

Lowell,

Max Bowsher reported:

>Or, on the command line -erobots=off :-)
>
>Whilst this does control whether wget downloads robots.txt, a quick
>test confirms that even when it does get robots.txt, it still wanders
>into cgi-bin.
>
>I'd suggest taking this to the wget list, except wget is currently
>maintainer-less, and, it appears, bitrotted.
>
>Max.

As for this:

>Perhaps there is a counterpart to the above, i.e., <meta name="robots"
>content="follow"> that's being invoked and someone from Redhat could
>check into and rule this out.

You should realize that for open source programs like wget, the
recommended practice is to examine the source yourself.

Randall Schulz


At 17:43 2003-02-14, L Anderson wrote:
>Randall R Schulz wrote:
>>Lowell,
>>
>>What's in your "~/.wgetrc" file? If it contains this:
>>
>>robots = off
>>
>>Then wget will not respect a "robots.txt" file on the host from which
>>it is retrieving files.
>>
>>Before I learned of this option (accessible _only_ via this directive
>>in the .wgetrc file), I did something too clever by half to get
>>robots.txt ignored, so I know that wget does respect it.
>
>I have only two wgetrc related files as follows:
>
>/etc/wgetrc
>/usr/doc/wget-1.8.2/sample.wgetrc
>
>NB: I use win98 and these are under my cygwin directory i:\cygwin
>(i.e. /cygdrive/i).
>
>I have never changed either file--I just accept the default installed
>by setup. However, the two files differ by a few lines which are just
>comments anyway, i.e. doing:
>
>$ diff /etc/wgetrc /usr/doc/wget-1.8.2/sample.wgetrc
>73,74c73,74
>< # You can set the default proxy for Wget to use. It will override the
>< # value in the environment.
>---
> > # You can set the default proxies for Wget to use for http and ftp.
> > # They will override the value in the environment.
>75a76
> > #ftp_proxy = http://proxy.yoyodyne.com:18023/
>
>shows this. Moreover,
>
>$ grep robot /etc/wgetrc
># Setting this to off makes Wget not download /robots.txt. Be sure to
># know *exactly* what /robots.txt is and how it is used before changing
>#robots = on
>
>shows the only references to "robot" are also comments.
>
>The stated default for wget is "robots=on", which I have seen honored
>for quite a number of other downloads, and since I didn't use "-e
>robots=off", that can't explain it. The only other thing I have found
>that might be related is not under my control and I haven't yet
>figured out how to check it. The wget documentation states:
>
>"
>The second, less known mechanism, enables the author of an individual
>document to specify whether they want the links from the file to be
>followed by a robot. This is achieved using the META tag, like this:
>
><meta name="robots" content="nofollow">
>
>This is explained in some detail at
>[URL missing]. Wget supports this
>method of robot exclusion in addition to the usual /robots.txt exclusion.
>"
>
>Perhaps there is a counterpart to the above, i.e., <meta name="robots"
>content="follow"> that's being invoked and someone from Redhat could
>check into and rule this out.
>
>Thanks (and still puzzled)!
>
>Lowell Anderson

--
Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple
Bug reporting: http://cygwin.com/bugs.html
Documentation: http://cygwin.com/docs.html
FAQ: http://cygwin.com/faq/