Subject: RE: cygwin 1.5.20-1, spinning pdksh, 100% CPU
Date: Thu, 9 Aug 2007 11:43:31 -0400
Message-ID: <76087731258D2545B1016BB958F00ADA123A37@STEELPO.steeleye.com>
In-Reply-To: <76087731258D2545B1016BB958F00ADA1239A5@STEELPO.steeleye.com>
From: "Ernie Coskrey"

> -----Original Message-----
> From: cygwin-owner AT cygwin DOT com
> [mailto:cygwin-owner AT cygwin DOT com] On Behalf Of Ernie Coskrey
> Sent: Wednesday, August 08, 2007 2:11 PM
> To: cygwin AT cygwin DOT com
> Subject: RE: cygwin 1.5.20-1, spinning pdksh, 100% CPU
>
> > -----Original Message-----
> > From: cygwin-owner AT cygwin DOT com
> > [mailto:cygwin-owner AT cygwin DOT com] On Behalf Of Ernie Coskrey
> > Sent: Tuesday, July 31, 2007 3:40 PM
> > To: cygwin AT cygwin DOT com
> > Subject: cygwin 1.5.20-1, spinning pdksh, 100% CPU
> >
> > I've run into a problem with cygwin 1.5.20-1 and pdksh 5.2.14. We've
> > got a pdksh.exe process that is spinning, using all the CPU.
> >
> > This scenario is very hard to reproduce, but has happened on our test
> > systems occasionally. It occurred recently, and I currently have gdb
> > attached to the process and have the symbols loaded. I see that pdksh
> > is continually calling "sigsuspend()", which is immediately returning
> > from cancelable_wait due to the fact that the signal_arrived event is
> > set.
> > I also see that pdksh is waiting for a subprocess to complete,
> > and has a handle to the PID of that process - however the process has
> > long since terminated.
> >
> > It appears that something went wrong during delivery of SIGCHLD.
> >
> > I've got two questions related to this:
> >
> > - have there been changes between 1.5.20-1 and 1.5.24-2, or the latest
> >   snapshot, that might have fixed this issue? We've done some limited
> >   testing with 1.5.24-2 and haven't seen this happen yet, but as I said
> >   it only happens rarely.
> > - is there anything I can look at in gdb to help identify what the
> >   issue is?
> >
> > Any suggestions would be appreciated!
> >
> > ---------
> > Ernie Coskrey
>
> I've discovered an interesting piece of information that I think is
> related to this. I'm hoping this might ring a bell with someone on the
> list.
>
> Looking at _main_tls->stack[], when I've set a breakpoint in
> handle_sigsuspend just after the cancelable_wait() call, I see the
> following entries:
>
> 0x6109186f
> 0x4132ac
>
> 0x6109186f is "sigdelayed()", which is the routine that should have
> been called to deliver the signal and reset the signal_arrived event.
> 0x4132ac is j_waitj (in pdksh).
>
> So, somehow, when this problem occurs, "sigdelayed" gets pushed onto
> the stack *before* j_waitj does. So, _sigbe never calls sigdelayed.
>
> I don't think there's ever a case where sigdelayed should be at
> _main_tls->stack[0]. However it got there, I believe this is the cause
> of the problem.
>
> Ernie Coskrey

Well, I think that I may have found the cause of this issue, and I
believe that the problem exists in 1.5.24-2. Please take a look at what
I think is the solution, and let me know if I'm mistaken.

I believe that the problem is in _sigbe, at the very end of the
assembler code. _sigbe decrements the lock *before* it decrements incyg.
This leaves a very small window where another thread - possibly the sig
thread that's doing setup_handler() - can acquire the lock, see that
incyg is still set to 1, and act accordingly. In setup_handler, this
will cause the thread to go into _cygtls::interrupt_setup, which pushes
sigdelayed onto the tls stack. But since we're not really in Cygwin code
when this happens, sigdelayed() never gets executed and you end up
spinning as we're seeing.

I'll post a patch to cygwin-patches.

Ernie Coskrey
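
To make the window described above concrete, here is a minimal sketch in C that replays the interleaving step by step in a single thread. It is only a toy model, not Cygwin code: struct tls_model, try_setup_handler() and stranded_after_sigbe_exit() are hypothetical names, the fields merely echo the names used in this message, and the "other" ordering (clear incyg first, then release the lock) is my assumption about what a fix would look like, not the actual patch.

```c
#include <assert.h>

/* Hypothetical, simplified model of the per-thread state discussed in
   this message. This is NOT the layout of Cygwin's _cygtls. */
struct tls_model {
    int stacklock;          /* 1 while the signal-stack lock is held     */
    int incyg;              /* 1 while the thread counts as "in Cygwin"  */
    const char *stack[4];   /* stand-in for the tls return-address stack */
    int depth;
};

/* Stand-in for the sig thread's setup_handler(). Returns 0 if the lock
   was busy (the caller should retry), 1 if the attempt completed. */
static int try_setup_handler(struct tls_model *t)
{
    if (t->stacklock)
        return 0;                       /* lock held by the exiting thread */
    t->stacklock = 1;
    if (t->incyg) {
        /* Thread looks like it is inside Cygwin: defer delivery by
           pushing sigdelayed, expecting _sigbe to pick it up later. */
        assert(t->depth < 4);
        t->stack[t->depth++] = "sigdelayed";
    }
    /* else: the thread is in user code and would be interrupted
       directly; nothing is pushed in this model. */
    t->stacklock = 0;
    return 1;
}

/* Replay the tail of _sigbe with the sig thread arriving between its
   last two steps. Returns 1 if sigdelayed ends up stranded on the stack
   while the thread is already back in user code. */
static int stranded_after_sigbe_exit(int unlock_before_clearing_incyg)
{
    struct tls_model t = { .stacklock = 1, .incyg = 1 };

    /* First exit step. */
    if (unlock_before_clearing_incyg)
        t.stacklock = 0;
    else
        t.incyg = 0;

    /* The sig thread shows up exactly in the window. */
    int completed = try_setup_handler(&t);

    /* Second exit step. */
    if (unlock_before_clearing_incyg)
        t.incyg = 0;
    else
        t.stacklock = 0;

    /* If the lock was busy, the sig thread retries once it is free. */
    if (!completed)
        try_setup_handler(&t);

    return t.depth > 0 && !t.incyg;
}
```

Calling stranded_after_sigbe_exit(1) replays the order described in this message (lock released first) and leaves sigdelayed on the model stack with nobody left to call it. Calling it with 0 replays the assumed alternative order, where the sig thread either finds the lock busy or sees incyg already cleared, so nothing is stranded.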