Date: Wed, 27 Apr 2022 15:19:11 +0300
From: Alexey Izbyshev
To: Takashi Yano
Cc: cygwin AT cygwin DOT com
Subject: Re: Deadlock of the process tree when running make

Hi, Takashi,

On 2022-04-27 14:22, Takashi Yano wrote:
> Hi Alexey,
>
> On Sat, 16 Apr 2022 16:21:34 +0300
> Alexey Izbyshev wrote:
>> On 2022-04-16 12:39, Takashi Yano wrote:
>> > I am not sure yet what is essential, but the current code closes
>> > pseudo console only if there is no other process which is attaching
>> > to the pseudo console. I wonder why javac.exe is remaining as
>> > zombie. The parent bash.exe calls ClosePseudoConsole() when
>> > child non-cygwin app is terminated, i.e., after WaitForSingleObject()
>> > for child process handle returns.
>> > https://www.cygwin.com/git/?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/spawn.cc;h=81dba5a941e919ea2514013069aef22c6fad8004;hb=7ac0767053e278f0ce9811bf6f77278bd2f49c20#l1009
>> >
>> > What does the "zombie" mean? Is it listed in the process list of
>> > ProcessHacker? I still suspect that the zombie javac.exe holds
>> > the hWritePipe handle leaked from parent bash.exe.
>> >
>> By "zombie" I meant the same thing as in the Linux kernel: a data
>> structure that remains after a process terminated, but hasn't been
>> waited for yet (I don't know how this is implemented in Cygwin). So
>> there is no javac.exe process in ProcessHacker, but "ps" and similar
>> tools in Cygwin still list "javac".
>>
>> I'm now trying to create a small reproducer that I can share, and I've
>> had a first small success this night: I could get a very similar hang
>> with a simple Makefile and a script with Cygwin 3.3.4.
>> Here is the tree:
>>
>> make(14479)-+-bash(14484)---bash(14611)
>>             |-bash(14515)---bash(14618)
>>             |-bash(14491)---bash(14500)---bash(14612)
>>             |-bash(14501)---bash(14510)---bash(14605)
>>             |-bash(14505)---bash(14607)
>>             |-bash(14494)---bash(14617)
>>             |-bash(14506)---bash(14513)---bash(14610)
>>             |-bash(14512)---bash(14518)---bash(14615)
>>             |-bash(14486)---bash(14495)---bash(14606)
>>             |-bash(14483)---bash(14490)---bash(14609)
>>             |-bash(14509)---bash(14614)
>>             |-bash(14489)---bash(14608)
>>             |-bash(14499)---bash(14613)
>>             |-bash(14481)---bash(14485)---python(14588)
>>             |-bash(14496)---bash(14504)---bash(14616)
>>             `-bash(14482)---bash(14604)
>>
>>
>> "python" is a zombie, just as "javac" is in the original case. There is
>> also a single "conhost.exe" again, and all of its 5 threads are doing
>> the same things as in the original case (including the signal pipe
>> thread trying to EnterCriticalSection()). The only difference is that
>> leaf bash.exe are trying to acquire pcon mutex at a different point
>> [1], but I guess this difference is not important.
>>
>> I'll try this reproducer with your patched DLL as well as on another
>> machine and share it in case of success.
>>
>> Thanks,
>> Alexey
>>
>> [1]
>> https://www.cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/spawn.cc;h=81dba5a941e919ea2514013069aef22c6fad8004;hb=cygwin-3_3_4-release#l697
>
> Is there any progress on this?

During the last week I reproduced the hang on a vanilla 3.3.4 Cygwin
with a small test multiple times. In one case, the hanging state is even
minimal, i.e. there is only a bash.exe waiting in ClosePseudoConsole()
after its native child terminated and a conhost.exe, but no other
processes trying to acquire pcon mutex. Conhost.exe signal-pipe thread
is also blocked at the same EnterCriticalSection() call in all cases.

However, I couldn't reproduce the hang with your patched DLL[1] with the
same test running for multiple days.
I can't explain how your change of handle inheritability can affect the
double-unlock bug in conhost.exe that I referenced earlier, so either
I'm missing something or I've been very unlucky with reproducing. I was
going to investigate the conhost.exe logic and state further (in
particular, why one of its threads still reads from "\Device\ConDrv"
after all console clients have detached) and then reply to you, but I
haven't been able to do it yet.

If you want to try to reproduce the hang yourself with 3.3.4, here is
one of the small tests that I used (it looks strange because it's the
result of minimizing other code):

$ cat Makefile
T := $(shell echo {1..16})

all: $(T)

$(T):
	@./test.sh $@

$ cat test.sh
#!/bin/bash

set -eu

(
for ((i = 0; i < 10; i++)); do
    python -c ""
done
)

$ while make -j16; do echo $((i++)); done

The test can still take multiple hours to hang on my machine.

If I get any new interesting data, I'll share it.

Thank you,
Alexey

[1] https://tyan0.yr32.net/cygwin/x86/test/cygwin1-20220418.dll.xz
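
Since the loop above simply stops making progress when the hang occurs,
an unattended run may benefit from a time limit. The following is only a
sketch and not part of the original test: the helper name
run_with_watchdog and the 600-second limit are hypothetical, and it
assumes the "timeout" utility from GNU coreutils is installed.

```shell
#!/bin/bash
# Sketch only: run_with_watchdog and the time limit are hypothetical,
# chosen for illustration; they are not part of the original reproducer.

# Run a command under a time limit and report a suspected hang on timeout.
# Returns the command's exit status, or 124 if the limit was reached.
run_with_watchdog() {
    local limit=$1
    shift
    local rc=0
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        # GNU timeout exits with status 124 when the time limit is reached.
        echo "suspected hang: '$*' exceeded ${limit}s" >&2
    fi
    return "$rc"
}

# Usage (commented out so that sourcing this file has no side effects):
# i=0
# while run_with_watchdog 600 make -j16; do echo $((i++)); done
```

Note that timeout sends SIGTERM when the limit expires, which may
disturb the state one wants to inspect in a debugger. For that purpose a
variant that only logs the event and keeps waiting would probably be the
safer choice.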