Message-ID: <449CAF28.BB172C1A@dessent.net>
Date: Fri, 23 Jun 2006 20:19:04 -0700
From: Brian Dessent
To: cygwin AT cygwin DOT com
Subject: autossh broken with current openssh/cygwin

I'm not sure whether it is due to changes in openssh or changes in Cygwin, but the current autossh package fails to work. Instead of detecting that the connection is alive, it seems to continuously time out and recycle the ssh process. Here is a representative testcase:

$ AUTOSSH_FIRST_POLL=5 AUTOSSH_POLL=5 AUTOSSH_DEBUG=yes autossh -M 30000 -N dessent.net
autossh: PID 3204: short poll time: adjusting net timeouts to 2500
autossh: PID 3204: checking for grace period, tries = 0
autossh: PID 3204: starting ssh (count 1)
autossh: PID 3204: ssh child pid is 4160
autossh: PID 4160: execing /usr/bin/ssh
autossh: PID 3204: check on child 4160
autossh: PID 3204: set alarm for 5 secs
autossh: PID 3204: timeout on io poll, looping to accept again
autossh: PID 3204: too many loops without data
autossh: PID 3204: error on poll: Socket operation on non-socket
autossh: PID 3204: port down, restarting ssh
autossh: PID 3204: checking for grace period, tries = 0
autossh: PID 3204: starting ssh (count 2)
autossh: PID 4728: execing /usr/bin/ssh
autossh: PID 3204: ssh child pid is 4728
autossh: PID 3204: check on child 4728
autossh: PID 3204: set alarm for 5 secs
autossh: PID 3204: not what I sent: "booch autossh 3204 122720421 " : ""
autossh: PID 3204: too many loops without data
autossh: PID 3204: error on poll: Interrupted system call
autossh: PID 3204: port down, restarting ssh
autossh: PID 3204: checking for grace period, tries = 0
autossh: PID 3204: starting ssh (count 3)
autossh: PID 3204: ssh child pid is 5520
autossh: PID 5520: execing /usr/bin/ssh
autossh: PID 3204: check on child 5520
autossh: PID 3204: set alarm for 5 secs
autossh: PID 3204: not what I sent: "booch autossh 3204 840588297 " : ""

(This continues on and on indefinitely.)

I have verified with netcat that the port 30000/30001 pair can indeed transfer data successfully. I tried building autossh 1.4 from source, but that does not cure the problem.

I stepped through it, and the problem seems to be in conn_send_and_receive(). It calls poll(), sees that the write handle is ready for writing, sends the test string, sets 'ntopoll' to 1, and calls poll() a second time. Here you would expect poll() to return 1 with fd 0 ready for reading after a brief pause, but it just times out, and conn_send_and_receive() returns 1, which results in the error "timeout on io poll, looping to accept again". I think everything after that is just cascading failure. It seems to try to re-accept the data channel, but I don't think this succeeds, as the channel never went away to begin with.

So, anyway, I can't tell if this is a problem with the logic in autossh, a problem with openssh, or a problem caused by a change in Cygwin (I use the current snapshots). The end result is that on the default settings autossh recycles its ssh every 10 minutes, which just fills up the logs with data. I'm not sure when this regressed, but I know that I've used autossh for quite a while without noticing this problem until recently.

Brian
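
The send-then-poll sequence described above can be sketched outside autossh roughly as follows. This is only an illustration in Python, not autossh's C code: the function name send_and_receive(), the relay_once() helper, the payload strings, and the use of socketpair() in place of the forwarded 30000/30001 ports are all assumptions made for the example. It needs a platform where select.poll() is available.

```python
import select
import socket
import threading


def send_and_receive(wsock, rsock, payload, timeout_ms):
    """Send payload on wsock, then poll rsock until it comes back.

    Hypothetical stand-in for the logic described in the message;
    it is not taken from the autossh sources.
    """
    poller = select.poll()

    # First poll: wait until the write side is ready, then send.
    poller.register(wsock, select.POLLOUT)
    if not poller.poll(timeout_ms):
        return "timeout-write"
    wsock.sendall(payload)
    poller.unregister(wsock)

    # Second poll: wait for the test string to arrive on the read side.
    poller.register(rsock, select.POLLIN)
    received = b""
    while len(received) < len(payload):
        if not poller.poll(timeout_ms):
            return "timeout"  # the case reported as "timeout on io poll"
        chunk = rsock.recv(len(payload) - len(received))
        if not chunk:
            return "closed"
        received += chunk
    return "ok" if received == payload else "mismatch"


def relay_once(src, dst):
    """Forward one chunk from src to dst (stands in for the ssh forwards)."""
    data = src.recv(4096)
    if data:
        dst.sendall(data)


# Healthy loop: whatever is written to out_a is relayed to back_b.
out_a, out_b = socket.socketpair()
back_a, back_b = socket.socketpair()
worker = threading.Thread(target=relay_once, args=(out_b, back_a))
worker.start()
healthy = send_and_receive(out_a, back_b, b"host autossh 1234 42\n", 2500)
worker.join()

# Stalled loop: nothing relays the data, so the second poll times out.
out_c, out_d = socket.socketpair()
back_c, back_d = socket.socketpair()
stalled = send_and_receive(out_c, back_d, b"host autossh 1234 43\n", 200)

print(healthy, stalled)
```

In the stalled case the write succeeds and only the read-side poll expires, which is the same shape as the behaviour seen in the debugger; the sketch says nothing about why the real data never arrives.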