Subject: RE: cygwin 1.5.20-1, spinning pdksh, 100% CPU
Date: Wed, 8 Aug 2007 14:10:57 -0400
Message-ID: <76087731258D2545B1016BB958F00ADA1239A5@STEELPO.steeleye.com>
In-Reply-To: <76087731258D2545B1016BB958F00ADA1234D7@STEELPO.steeleye.com>
From: "Ernie Coskrey"

> -----Original Message-----
> From: cygwin-owner AT cygwin DOT com
> [mailto:cygwin-owner AT cygwin DOT com] On Behalf Of Ernie Coskrey
> Sent: Tuesday, July 31, 2007 3:40 PM
> To: cygwin AT cygwin DOT com
> Subject: cygwin 1.5.20-1, spinning pdksh, 100% CPU
>
> I've run into a problem with cygwin 1.5.20-1 and pdksh
> 5.2.14. We've got a pdksh.exe process that is spinning,
> using all the CPU.
>
> This scenario is very hard to reproduce, but has happened on
> our test systems occasionally. It occurred recently, and I
> currently have gdb attached to the process and have the
> symbols loaded. I see that pdksh is continually calling
> "sigsuspend()", which is immediately returning from
> cancelable_wait due to the fact that the signal_arrived event
> is set. I also see that pdksh is waiting for a subprocess to
> complete, and has a handle to the PID of that process -
> however the process has long since terminated.
>
> It appears that something went wrong during delivery of SIGCHLD.
>
> I've got two questions related to this:
>
> - have there been changes between 1.5.20-1 and 1.5.24-2, or
> the latest snapshot, that might have fixed this issue? We've
> done some limited testing with 1.5.24-2 and haven't seen this
> happen yet, but as I said, it only happens rarely.
> - is there anything I can look at in gdb to help identify
> what the issue is?
>
> Any suggestions would be appreciated!
>
> ---------
> Ernie Coskrey

I've discovered an interesting piece of information that I think is
related to this. I'm hoping it might ring a bell with someone on the
list.

Looking at _main_tls->stack[], with a breakpoint set in
handle_sigsuspend just after the cancelable_wait() call, I see the
following entries:

0x6109186f
0x4132ac

0x6109186f is "sigdelayed()", which is the routine that should have
been called to deliver the signal and reset the signal_arrived event.
0x4132ac is j_waitj (in pdksh).

So, somehow, when this problem occurs, "sigdelayed" gets pushed onto
the stack *before* j_waitj does, and as a result _sigbe never calls
sigdelayed. I don't think there's ever a case where sigdelayed should
be at _main_tls->stack[0]. However it got there, I believe this is the
cause of the problem.

Ernie Coskrey

--
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
Problem reports:       http://cygwin.com/problems.html
Documentation:         http://cygwin.com/docs.html
FAQ:                   http://cygwin.com/faq/