Subject: RE: cygwin 1.5.20-1, spinning pdksh, 100% CPU
Date: Wed, 8 Aug 2007 14:10:57 -0400
Message-ID: <76087731258D2545B1016BB958F00ADA1239A5@STEELPO.steeleye.com>
In-Reply-To: <76087731258D2545B1016BB958F00ADA1234D7@STEELPO.steeleye.com>
From: "Ernie Coskrey" <Ernie.Coskrey@steeleye.com>
To: <cygwin@cygwin.com>

> -----Original Message-----
> From: cygwin-owner@cygwin.com 
> [mailto:cygwin-owner@cygwin.com] On Behalf Of Ernie Coskrey
> Sent: Tuesday, July 31, 2007 3:40 PM
> To: cygwin@cygwin.com
> Subject: cygwin 1.5.20-1, spinning pdksh, 100% CPU
> 
>  
> I've run into a problem with cygwin 1.5.20-1 and pdksh 
> 5.2.14.  We've got a pdksh.exe process that is spinning, 
> using all the CPU.
>  
> This scenario is very hard to reproduce, but has happened on 
> our test systems occasionally.  It occurred recently, and I 
> currently have gdb attached to the process and have the 
> symbols loaded.  I see that pdksh is continually calling 
> "sigsuspend()", which is immediately returning from 
> cancelable_wait due to the fact that the signal_arrived event 
> is set.  I also see that pdksh is waiting for a subprocess to 
> complete, and has a handle to the PID of that process - 
> however the process has long since terminated.
>  
> It appears that something went wrong during delivery of SIGCHLD.
>  
> I've got two questions related to this:
>  
> - have there been changes between 1.5.20-1 and 1.5.24-2, or 
> the latest snapshot, that might have fixed this issue?  We've 
> done some limited testing with 1.5.24-2 and haven't seen this 
> happen yet, but as I said, it only happens rarely.
> - is there anything I can look at in gdb to help identify 
> what the issue is?
>  
> Any suggestions would be appreciated!
>  
> ---------
> Ernie Coskrey 

I've discovered an interesting piece of information that I think is
related to this.  I'm hoping this might ring a bell with someone on the
list.

Looking at _main_tls->stack[], when I've set a breakpoint in
handle_sigsuspend just after the cancelable_wait() call, I see the
following entries:

    0x6109186f  0x4132ac

0x6109186f is "sigdelayed()", which is the routine that should have been
called to deliver the signal and reset the signal_arrived event.
0x4132ac is j_waitj (in pdksh).

So, somehow, when this problem occurs, "sigdelayed" gets pushed onto the
stack *before* j_waitj does.  As I understand it, _sigbe returns to
whatever entry is on top of that stack, so it returns to j_waitj and
never calls sigdelayed, which leaves the signal_arrived event set.

I don't think there's ever a case where sigdelayed should be at
_main_tls->stack[0].  Whatever put it there is, I believe, the cause of
this problem.
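
To make the ordering argument concrete, here is a toy model of the
return-address stack.  It is only a sketch, not Cygwin code: it assumes
that _sigbe simply pops the most recently pushed entry and returns to
it, and the helper names (sigbe_return_target, run) are made up for
illustration.  The two addresses are the ones from the gdb session
above.

```python
# Toy model of the per-thread return-address stack discussed above.
# NOT Cygwin code: the pop-and-return behaviour of _sigbe is an assumption,
# and sigbe_return_target()/run() are hypothetical helper names.

SIGDELAYED = 0x6109186f  # address gdb showed for sigdelayed()
J_WAITJ = 0x4132ac       # address gdb showed for j_waitj (in pdksh)


def sigbe_return_target(stack):
    """Model _sigbe: pop the most recently pushed address and return to it."""
    return stack.pop()


def run(push_order):
    """Push addresses in the given order, then perform one _sigbe-style pop.

    Returns (popped address, True if it was sigdelayed, remaining stack).
    """
    stack = []
    for addr in push_order:
        stack.append(addr)
    target = sigbe_return_target(stack)
    return target, target == SIGDELAYED, stack


# Expected order: the caller's return address first, sigdelayed on top.
expected = run([J_WAITJ, SIGDELAYED])

# Order seen in the spinning process: sigdelayed sits at stack[0].
observed = run([SIGDELAYED, J_WAITJ])

print(hex(expected[0]), expected[1])   # 0x6109186f True
print(hex(observed[0]), observed[1])   # 0x4132ac False
print([hex(a) for a in observed[2]])   # ['0x6109186f']
```

In the second case the pop lands on j_waitj and sigdelayed is left
stranded at the bottom of the stack, which matches what the stack dump
shows.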

Ernie Coskrey
