Date: Mon, 25 Nov 2024 21:23:45 +0900
From: Takashi Yano via Cygwin
To: cygwin AT cygwin DOT com
Subject: Re: SIGKILL may no longer work after many SIGCONT/SIGSTOP signals

On Sun, 24 Nov 2024 01:15:09 +0900
Takashi Yano wrote:
> On Sat, 23 Nov 2024 16:53:21 +0100
> Christian Franke wrote:
> > Takashi Yano via Cygwin wrote:
> > > On Wed, 20 Nov 2024 22:43:08 +0900
> > > Takashi Yano wrote:
> > >> On Tue, 19 Nov 2024 18:21:52 +0900
> > >> Takashi Yano wrote:
> > >>> On Tue, 12 Nov 2024 10:53:58 +0100
> > >>> Christian Franke wrote:
> > >>>> Found with 'stress-ng --cpu-sched' from current stress-ng upstream HEAD:
> > >>>>
> > >>>> Testcase (attached):
> > >>>>
> > >>>> $ gcc -O2 -o manysignals manysignals.c
> > >>>>
> > >>>> $ ./manysignals
> > >>>> fork() = 1833
> > >>>> ...
> > >>>> fork() = 1848
> > >>>> ...
> > >>>> kill(1833, 17)
> > >>>> ...
> > >>>> kill(1848, 17)
> > >>>> kill(1833, 9)
> > >>>> ...
> > >>>> kill(1848, 9)
> > >>>> waitpid(1833, ., 0)
> > >>>>
> > >>>>
> > >>>> Run this in second terminal:
> > >>>>
> > >>>> $ watch "ps | sed -n '1p;/manysignals/{/sed/d;p}'"
> > >>>>
> > >>>> If 'S' appear in the first column, the child processes likely reached
> > >>>> the final SIGSTOP state. This takes some time. The parent process may
> > >>>> still hang in first waitpid() but should not.
> > >>>>
> > >>>> If the parent process is aborted with ^C, child processes may be stopped
> > >>>> or left behind. Occasionally a child process that can not be stopped by
> > >>>> Cygwin (kill -9) is left behind.
> > >>>>
> > >>>> Tested with ancient (i7-2600K) and more recent (i7-14700K) CPU :-)
> > >>>>
> > >>>>
> > >>>> Unrelated to the above, but related to 'stress-ng --cpu-sched' which
> > >>>> uses sched_get/setscheduler():
> > >>>>
> > >>>> - sched_getscheduler() always returns SCHED_FIFO. As far as I understand
> > >>>> Linux sched(7), this is a non-preemptive real-time policy. The
> > >>>> preemptive SCHED_RR would possibly a more reasonable value.
> > >>>> Unfortunately SCHED_OTHER cannot be used because it would require to
> > >>>> ignore the priority.
> > >>>>
> > >>>> - sched_setscheduler() always fails with ENOSYS. It IMO should allow to
> > >>>> set 'param->sched_priority' if 'policy' is equal to the value returned
> > >>>> by sched_getscheduler().
> > >>> Thanks for the report and the test case. I'm now looking into
> > >>> the issue. Please wait a while.
> > >> Hopefully, I have found the cause.
> > >>
> > >> The deadlock happens between main thread and wait_sig thread.
> > >> The main thread is waiting for the wait_sig thread triggering
> > >> wakeup event while the wait_sig thread is waiting previous
> > >> signal being processed by main thread.
> > >>
> > >> Let me consider how to fix that.
> > > I'd like to report my progress for this issue.
> > >
> > > The patch attached almost solves the problem. ...
> >
> > Compile error if applied to current git main (3dbc8c3):
> >
> >  ../../../../winsup/cygwin/exceptions.cc:1487:21: error: ‘struct
> > _cygtls’ has no member named ‘sig’
> >   1487 |   while (_main_tls->sig)
> >        |                     ^~~
>
> This is because the latest Corinna's commit changes the name 'sig'
> to 'current_sig'.
>
> commit 3dbc8c3fbdc99d3f0f68fab8ba2a814ecdc27e17
> Cygwin: cygtls: rename sig to current_sig
>
> > > However, your test
> > > case is paused for tens of seconds, then ends normally.
> >
> > I guess this is as expected. The processing of the
> > SIGSTOP/SIGCONT/.../SIGSTOP/SIGKILL sequence of each child process take
> > some time because all are locked to a single core.
>
> I feel it's too slow even if 16 processes (with wait_sig threads) are
> executed in one CPU core.
>
> > > If the code:
> > > cpu_set_t cpus; CPU_ZERO(&cpus);
> > > CPU_SET(0, &cpus);
> > > if (sched_setaffinity(getpid(), sizeof(cpus), &cpus))
> > > perror("setaffinity");
> > >
> > > for (;;)
> > > sched_yield();
> > > is changed to just:
> > > for (;;) sleep(1);
> > > the test case runs without pause.
> >
> > The pause will possibly reappear if the number of child processes is
> > increased to some multiple of the available cores.
>
> I tested with np = 16*32 without sched_setaffinity() call, the pause
> does not happen. My CPU is Threadripper 1950X 16-core 32-thread.
>
> > > I think there still is a bug in the signal handling.

I have just submitted 6 patches for this issue.
With these patches, the problem reported no longer occurs
in my environment.

-- 
Takashi Yano
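
For readers without the attachment: manysignals.c itself is not reproduced in this thread. The following is only a rough sketch of a test of that shape, pieced together from the output and the child-loop snippet quoted above. The function name run_signal_storm(), the fixed limit of 64 children and the 'rounds' parameter are assumptions made for illustration, not taken from the original test case.

```c
#define _GNU_SOURCE             /* for cpu_set_t, CPU_SET, sched_setaffinity */
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child body: pin to CPU 0 and spin on sched_yield(), as in the quoted
   snippet. Never returns; the parent ends it with SIGKILL. */
static void child_loop(void)
{
  cpu_set_t cpus;
  CPU_ZERO(&cpus);
  CPU_SET(0, &cpus);
  if (sched_setaffinity(getpid(), sizeof(cpus), &cpus))
    perror("setaffinity");
  for (;;)
    sched_yield();
}

/* Hypothetical helper (name and parameters are assumptions): fork nproc
   children, send each 'rounds' SIGSTOP/SIGCONT pairs, then a final
   SIGSTOP followed by SIGKILL, and reap them all.
   Returns the number of children reported as terminated by SIGKILL,
   or -1 if nproc is out of range. */
int run_signal_storm(int nproc, int rounds)
{
  pid_t pids[64];
  int killed = 0;

  if (nproc < 1 || nproc > 64)
    return -1;

  for (int i = 0; i < nproc; i++) {
    pids[i] = fork();
    if (pids[i] == 0)
      child_loop();             /* child: does not return */
    if (pids[i] < 0) {
      perror("fork");
      nproc = i;                /* continue with the children we have */
      break;
    }
  }

  /* The SIGSTOP/SIGCONT/.../SIGSTOP/SIGKILL sequence from the report. */
  for (int r = 0; r < rounds; r++)
    for (int i = 0; i < nproc; i++) {
      kill(pids[i], SIGSTOP);
      kill(pids[i], SIGCONT);
    }
  for (int i = 0; i < nproc; i++)
    kill(pids[i], SIGSTOP);
  for (int i = 0; i < nproc; i++)
    kill(pids[i], SIGKILL);

  /* On an affected system the first waitpid() is where the parent hangs. */
  for (int i = 0; i < nproc; i++) {
    int status;
    if (waitpid(pids[i], &status, 0) == pids[i]
        && WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
      killed++;
  }
  return killed;
}
```

Calling run_signal_storm(16, 1000) from main() and printing the result would roughly correspond to the 16-process case discussed above; the round count here is arbitrary. If the return value equals the number of children, every child was reaped after SIGKILL.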