Mail Archives: cygwin/2007/08/09/11:46:17

X-Spam-Check-By: sourceware.org
MIME-Version: 1.0
Subject: RE: cygwin 1.5.20-1, spinning pdksh, 100% CPU
Date: Thu, 9 Aug 2007 11:43:31 -0400
Message-ID: <76087731258D2545B1016BB958F00ADA123A37@STEELPO.steeleye.com>
In-Reply-To: <76087731258D2545B1016BB958F00ADA1239A5@STEELPO.steeleye.com>
From: "Ernie Coskrey" <Ernie DOT Coskrey AT steeleye DOT com>
To: <cygwin AT cygwin DOT com>

> -----Original Message-----
> From: cygwin-owner AT cygwin DOT com 
> [mailto:cygwin-owner AT cygwin DOT com] On Behalf Of Ernie Coskrey
> Sent: Wednesday, August 08, 2007 2:11 PM
> To: cygwin AT cygwin DOT com
> Subject: RE: cygwin 1.5.20-1, spinning pdksh, 100% CPU
> 
> > -----Original Message-----
> > From: cygwin-owner AT cygwin DOT com
> > [mailto:cygwin-owner AT cygwin DOT com] On Behalf Of Ernie Coskrey
> > Sent: Tuesday, July 31, 2007 3:40 PM
> > To: cygwin AT cygwin DOT com
> > Subject: cygwin 1.5.20-1, spinning pdksh, 100% CPU
> > 
> >  
> > I've run into a problem with cygwin 1.5.20-1 and pdksh 5.2.14.  We've
> > got a pdksh.exe process that is spinning, using all the CPU.
> >
> > This scenario is very hard to reproduce, but has happened on our test
> > systems occasionally.  It occurred recently, and I currently have gdb
> > attached to the process and have the symbols loaded.  I see that pdksh
> > is continually calling "sigsuspend()", which is immediately returning
> > from cancelable_wait due to the fact that the signal_arrived event is
> > set.  I also see that pdksh is waiting for a subprocess to complete,
> > and has a handle to the PID of that process - however the process has
> > long since terminated.
> >  
> > It appears that something went wrong during delivery of SIGCHLD.
> >  
> > I've got two questions related to this:
> >  
> > - have there been changes between 1.5.20-1 and 1.5.24-2, or the latest
> > snapshot, that might have fixed this issue?  We've done some limited
> > testing with 1.5.24-2 and haven't seen this happen yet, but as I said
> > it only happens rarely.
> > - is there anything I can look at in gdb to help identify what the
> > issue is?
> >  
> > Any suggestions would be appreciated!
> >  
> > ---------
> > Ernie Coskrey
> 
> I've discovered an interesting piece of information that I 
> think is related to this.  I'm hoping this might ring a bell 
> with someone on the list.
> 
> Looking at _main_tls->stack[], when I've set a breakpoint in 
> handle_sigsuspend just after the cancelable_wait() call, I 
> see the following entries:
> 
>     0x6109186f  0x4132ac
> 
> 0x6109186f is "sigdelayed()", which is the routine that 
> should have been called to deliver the signal and reset the 
> signal_arrived event.
> 0x4132ac is j_waitj (in pdksh).
> 
> So, somehow, when this problem occurs, "sigdelayed" gets 
> pushed onto the stack *before* j_waitj does.  So, _sigbe 
> never calls sigdelayed.
> 
> I don't think there's ever a case where sigdelayed should be 
> at _main_tls->stack[0].  Whatever put it there is, I believe, 
> the cause of this problem.
> 
> Ernie Coskrey
> 

Well, I think that I may have found the cause of this issue, and I
believe that the problem exists in 1.5.24-2.  Please take a look at what
I think is the solution, and let me know if I'm mistaken.

I believe that the problem is in _sigbe, at the very end of the
assembler code.  _sigbe decrements the lock *before* it decrements
incyg.  This leaves a very small window where another thread - possibly
the sig thread that's doing setup_handler() - can acquire the lock, see
that incyg is still set to 1, and conclude that the main thread is still
executing Cygwin code.  In setup_handler, this will cause the thread to
go into _cygtls::interrupt_setup, which pushes sigdelayed onto the tls
stack.  But since we're not really in Cygwin code when this happens,
sigdelayed() never gets executed, the signal_arrived event is never
reset, and pdksh ends up spinning in sigsuspend() as we're seeing.
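
To illustrate the ordering problem in isolation, here is a minimal
single-threaded sketch in C.  It is a toy model, not the Cygwin source:
struct tls_model, signal_thread_try and stranded_after_return are
made-up names, the lock is a plain int, and the "signal thread" is just
a function call placed in the window between the two exit steps.  It
also assumes that when incyg is 0 the signal is delivered by some other
path that does not touch the tls stack.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the per-thread state discussed in this message.
   The field names echo the ones mentioned here; the layout is invented. */
struct tls_model {
    int  lock;              /* 1 while the tls stack lock is held */
    int  incyg;             /* 1 while the thread claims to be in Cygwin code */
    bool sigdelayed_queued; /* sigdelayed was pushed onto the model stack */
};

enum exit_order { RELEASE_LOCK_FIRST, CLEAR_INCYG_FIRST };

/* One attempt by the signal thread.  Returns false if the lock was busy. */
static bool signal_thread_try(struct tls_model *t)
{
    if (t->lock)
        return false;                /* busy: the caller retries later */
    t->lock = 1;
    if (t->incyg)
        t->sigdelayed_queued = true; /* defer delivery via the tls stack */
    /* else: assumed to be delivered some other way (not modelled) */
    t->lock = 0;
    return true;
}

/* Replays the end of the return path with the signal thread scheduled
   between the two exit steps.  Returns true if sigdelayed is left
   stranded, i.e. queued although the thread has already left Cygwin. */
bool stranded_after_return(enum exit_order order)
{
    struct tls_model t = { 1, 1, false }; /* lock held, still "in Cygwin" */
    bool handled;

    /* First exit step. */
    if (order == RELEASE_LOCK_FIRST)
        t.lock = 0;
    else
        t.incyg = 0;

    /* The window: the signal thread gets to run here. */
    handled = signal_thread_try(&t);

    /* Second exit step. */
    if (order == RELEASE_LOCK_FIRST)
        t.incyg = 0;
    else
        t.lock = 0;

    /* If the lock was busy during the window, retry now that it is free. */
    if (!handled)
        handled = signal_thread_try(&t);

    assert(handled && t.lock == 0 && t.incyg == 0);
    return t.sigdelayed_queued;
}
```

In this model, stranded_after_return(RELEASE_LOCK_FIRST) returns true:
the signal thread gets the lock while incyg is still 1 and queues
sigdelayed for a thread that is about to leave.  With CLEAR_INCYG_FIRST
it returns false, because the signal thread cannot get the lock until
incyg has already been cleared.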

I'll post a patch to cygwin-patches.

Ernie Coskrey

--
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
Problem reports:       http://cygwin.com/problems.html
Documentation:         http://cygwin.com/docs.html
FAQ:                   http://cygwin.com/faq/

