From: Alexey Izbyshev
To: Takashi Yano
Cc: cygwin AT cygwin DOT com
Subject: Re: Deadlock of the process tree when running make
Date: Thu, 14 Apr 2022 02:17:38 +0300

On 2022-04-13 19:48, Alexey Izbyshev wrote:
> On 2022-04-11 13:10, Alexey Izbyshev wrote:
> What's probably not normal is the behavior of the hanging conhost.exe.
> I've compared the points where conhost.exe is blocked, and all but one
> threads in the model case are doing the same things as in the hanging
> case, but the remaining thread is blocked in
> ReadFile("\Device\NamedPipe\") (i.e. the read end of "hWritePipe" of
> pcon) instead of trying to enter a critical section like thread 1
> above. So now I'm starting to doubt that it's a cygwin bug and not
> some conhost.exe bug.
>
> I'll try to poke around the hanging conhost.exe some more, and also
> may be will try to create a faster reproducer.
>
I've studied the conhost.exe hang, and it does indeed look like a
conhost.exe bug.

TLDR: https://github.com/microsoft/terminal/pull/12181

The full story:

I dumped conhost.exe, opened the dump in windbg and looked at the stack
trace of the hanging thread:

    ntdll!NtWaitForAlertByThreadId+0x14
    ntdll!RtlpWaitOnAddressWithTimeout+0x81
    ntdll!RtlpWaitOnAddress+0xae
    ntdll!RtlpWaitOnCriticalSection+0xfd
    ntdll!RtlpEnterCriticalSectionContended+0x1c4
    ntdll!RtlEnterCriticalSection+0x42
    conhost!Microsoft::Console::Render::Renderer::_PaintFrameForEngine+0x54
    conhost!Microsoft::Console::Render::Renderer::TriggerTeardown+0x19e60
    conhost!Microsoft::Console::Interactivity::ServiceLocator::RundownAndExit+0x21
    conhost!Microsoft::Console::PtySignalInputThread::_GetData+0x65
    conhost!Microsoft::Console::PtySignalInputThread::_InputThread+0x25
    kernel32!BaseThreadInitThunk+0x14
    ntdll!RtlUserThreadStart+0x21

By looking at the assembly, I've found that it hangs *after* ReadFile()
on the pipe completes, so the problem is definitely not a leak of
hWritePipe in bash.exe or elsewhere.

Using the function names, I've found this issue:
https://github.com/microsoft/terminal/issues/1810.
This is a different one, but the discussion and the patch show that
synchronization on startup/shutdown is a disaster.

Then I looked at the code and identified that the hang happens while
attempting to lock the console at [1]. After studying how this lock is
used in other parts of the code, I noticed that
PtySignalInputThread::_Shutdown() (which is further up in the call stack
of the hanging function) uses ProcessCtrlEvents() incorrectly, because
the latter unconditionally unlocks the console, but the lock is never
taken by this thread at this point. Then I looked at a more recent
version of the code and discovered the patch to _Shutdown() which I
referenced above.

I've also verified that the assembly of _Shutdown() (which is inlined
into PtySignalInputThread::_GetData()) corresponds to the unpatched
version (i.e. without the LockConsole() call):

    call conhost!CloseConsoleProcessState (00007ff6`22e7013c)
    call conhost!ProcessCtrlEvents (00007ff6`22e262a0)
    mov  ecx,6Dh
    call conhost!Microsoft::Console::Interactivity::ServiceLocator::RundownAndExit (00007ff6`22e3c730)

I'm not sure why this bug is not triggered more frequently, but one
possible reason, as indicated by comment [2], is that the bad path is
only taken if there are live clients after ClosePseudoConsole() is
called, which is probably rare.

A potential workaround on the Cygwin side would be to ensure that the
pseudoconsole doesn't have clients before calling ClosePseudoConsole(),
but I don't know whether that's possible.

[1] https://github.com/microsoft/terminal/blob/9b92986b49bed8cc41fde4d6ef080921c41e6d9e/src/renderer/base/renderer.cpp#L75
[2] https://github.com/microsoft/terminal/blob/9b92986b49bed8cc41fde4d6ef080921c41e6d9e/src/host/PtySignalInputThread.cpp#L205

Alexey
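
For illustration only, here is a minimal sketch of the locking mistake
described above. It is not conhost code: the function names are
hypothetical models of _Shutdown() and ProcessCtrlEvents(), and Python's
threading.RLock stands in for the console lock. One assumed difference
matters: RLock reports an unbalanced release with RuntimeError, whereas
the Windows documentation for LeaveCriticalSection says that leaving a
critical section the thread does not own may cause another thread to
wait indefinitely in EnterCriticalSection, which would match the stack
trace above.

```python
import threading

# Hypothetical stand-in for conhost's global console lock.
console_lock = threading.RLock()


def process_ctrl_events():
    """Model of a callee whose contract is: the caller holds the lock,
    and this function releases it before returning."""
    # ... pending control events would be handled here ...
    console_lock.release()


def shutdown_unpatched():
    """Calls the callee without taking the lock first (the reported bug)."""
    process_ctrl_events()


def shutdown_patched():
    """Takes the lock first, so the callee's release is balanced."""
    console_lock.acquire()
    process_ctrl_events()


def unbalanced_release(fn):
    """Return True if fn() releases a lock the calling thread does not hold."""
    try:
        fn()
    except RuntimeError:  # RLock.release() raises this when not owned
        return True
    return False


if __name__ == "__main__":
    print("unpatched misuses lock:", unbalanced_release(shutdown_unpatched))
    print("patched misuses lock:", unbalanced_release(shutdown_patched))
```

The patched variant mirrors the shape of the fix in the pull request
(take the lock before calling the function that releases it); whether
the real patch does anything beyond that is not modeled here.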