From: Dan Bonachea
Date: Sun, 20 Jan 2019 15:33:03 -0500
Subject: Bug: Incorrect signal behavior in multi-threaded processes
To: cygwin AT cygwin DOT com
Cc: gasnet-devel AT lbl DOT gov, Dan Bonachea

I'm writing to report some POSIX compliance problems with Cygwin signal
handling in the presence of multiple pthreads that our group has
encountered in our parallel scientific computing codes.
A minimal test program is copied below and also available here:
https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=589

I believe the test program is fully compliant with ISO C 99 and POSIX
1003.1-2016. In a nutshell, it registers one signal handler, spawns a
number of pthreads, and then synchronously generates a signal from
exactly one thread while others sit in a pthread_barrier_wait. The
"throwing" thread and signal number can be varied from the command
line, and diagnostic output indicates what happened.

As a basis for comparison, here are a few examples of the test program
running on x86_64/Linux-3.10.0(Scientific Linux 7.4)/gcc-4.8.5
demonstrating what I believe to be the *correct*/POSIX-required
behavior:

$ ./thread-signal 1 11 # "th#1 sends sig 11 (SIGSEGV) via null deref"
Running test with 5 threads and thread 1 sending signal=11
Spawning pthreads..
thread 1 (0x7f8dd0b13700): Hello
thread 4 (0x7f8dcf310700): Hello
thread 2 (0x7f8dd0312700): Hello
thread 3 (0x7f8dcfb11700): Hello
thread 0 (0x7f8dd131a740): Hello
thread 1 (0x7f8dd0b13700): sending signal 11..
sig_handler: ENTERING
sig_handler: running on thread 0x7f8dd0b13700
sig_handler: calling _exit()

$ ./thread-signal 1 6 # "th#1 sends sig 6 (SIGABRT) via abort()"
Running test with 5 threads and thread 1 sending signal=6
Spawning pthreads..
thread 1 (0x7f1a2451d700): Hello
thread 2 (0x7f1a23d1c700): Hello
thread 0 (0x7f1a24d24740): Hello
thread 3 (0x7f1a2351b700): Hello
thread 4 (0x7f1a22d1a700): Hello
thread 1 (0x7f1a2451d700): sending signal 6..
sig_handler: ENTERING
sig_handler: running on thread 0x7f1a2451d700
sig_handler: calling _exit()

$ ./thread-signal 1 2 # "th#1 sends sig 2 via raise(SIGINT)"
Running test with 5 threads and thread 1 sending signal=2
Spawning pthreads..
thread 1 (0x7f2a29a3f700): Hello
thread 2 (0x7f2a2923e700): Hello
thread 0 (0x7f2a2a246740): Hello
thread 3 (0x7f2a28a3d700): Hello
thread 4 (0x7f2a2823c700): Hello
thread 1 (0x7f2a29a3f700): sending signal 2..
sig_handler: ENTERING
sig_handler: running on thread 0x7f2a29a3f700
sig_handler: calling _exit()

This output indicates that in all cases on Linux, the unique thread
generating the signal jumps to the pre-registered signal handler while
other threads remain stalled at the barrier, as required by POSIX
signalling semantics (e.g. see raise() on p.1765 of POSIX 1003.1-2016).

The test program and commands above demonstrate substantially the same,
correct behavior on ALL of the following platform combinations:

 * Linux-3.10/{i686,x86_64}/{gcc-4.8.5,gcc-8.2.0,clang-7.0.0}
 * Solaris-11.3/x86_64/{gcc-7.2.0,SunStudio-12.5}
 * FreeBSD-12.0/x86_64/clang-6.0.1
 * MicrosoftWSL-Ubuntu18.04/x86_64/{gcc-7.3.0,clang-6.0.0}
   - This notably runs on Microsoft Windows! (10.0.17763.288)

Unfortunately the observed behavior on Cygwin (various versions)
deviates far from our expectations and (based on my understanding) from
the behavior required by current POSIX specs.

Here is example output from Cygwin 2.11.1(0.329/5/3) 2018-09-05 on
Windows 10, build 17763.288 with gcc 7.3.0:

$ ./thread-signal 1 11 # "th#1 sends sig 11 (SIGSEGV) via null deref"
Running test with 5 threads and thread 1 sending signal=11
Spawning pthreads..
thread 1 (0x600048770): Hello
thread 2 (0x600048870): Hello
thread 3 (0x600048970): Hello
thread 0 (0x600000010): Hello
thread 4 (0x600048a70): Hello
thread 1 (0x600048770): sending signal 11..

$ ./thread-signal 1 6 # "th#1 sends sig 6 (SIGABRT) via abort()"
Running test with 5 threads and thread 1 sending signal=6
Spawning pthreads..
thread 1 (0x600048770): Hello
thread 2 (0x600048870): Hello
thread 3 (0x600048970): Hello
thread 4 (0x600048a70): Hello
thread 0 (0x600000010): Hello
thread 1 (0x600048770): sending signal 6..
sig_handler: ENTERING
Abort

$ ./thread-signal 1 2 # "th#1 sends sig 2 via raise(SIGINT)"
Running test with 5 threads and thread 1 sending signal=2
Spawning pthreads..
thread 1 (0x600048770): Hello
thread 2 (0x600048870): Hello
thread 3 (0x600048970): Hello
thread 0 (0x600000010): Hello
thread 4 (0x600048a70): Hello
thread 1 (0x600048770): sending signal 2..
sig_handler: ENTERING
sig_handler: ERROR - signal delivered to wrong thread!
thread 1 (0x600048770): ERROR: STILL ALIVE!
sig_handler: running on thread 0x600000010
sig_handler: calling _exit()

The second case in particular (abort() called by one non-primordial
thread) appears to have non-deterministic/racing behavior. The evidence
seems to indicate the SIGABRT is delivered to the primordial thread (the
wrong thread) via the signal handler and concurrently also delivered to
the SIG_DFL handler of other threads who then race to invoke abortive
process termination (which should not be reachable in any correct
execution of the program).

It's worth noting POSIX 1003.1-2016 sec XRAT.B.2.4.1 (p.3577)
specifically requires that any given signal should be delivered to
exactly one thread. Also the spec for abort (p.565) requires the signal
to be delivered as if by `raise(SIGABRT)` (p.1765) aka.
`pthread_kill(pthread_self(),SIGABRT)` (p.1657), which implies any
registered SIGABRT handler should run only on the thread which called
abort().

The choice of SIGINT in the third example is arbitrary, and
representative of similar deliver-to-wrong-thread behavior also observed
on Cygwin for all of the following signals:

  HUP, INT, QUIT, ILL, EMT, TRAP, FPE, BUS, SYS, PIPE, ALRM, TERM, URG,
  TSTP, CONT, CHLD, TTIN, TTOU, IO, USR1, USR2, and RTMIN..RTMAX

All of which consequently appear to be unreliable for thread-specific
signalling in Cygwin programs.

Note that in all cases examined, generating the signal from the
"primordial" thread 0 (by changing the 1 to a 0 in the commands above)
yields nominally correct behavior; in that case, the signal handler is
correctly invoked by the primordial thread and the others remain
undisturbed.
However it appears the primordial thread is the ONLY thread that enjoys
the special status of POSIX-compliant signal behavior on Cygwin.

Substantially similar broken behavior has been observed for
NON-primordial threads on ALL of the following Cygwin version
combinations (spread across three different workstations):

 * Cygwin64-2.11.1(0.329/5/3)-{win7,win10}-{gcc-7.3.0,clang-5.0.1}
 * Cygwin64-2.10.0(0.325/5/3)-{win7,win10}-{gcc-6.4.0,clang-5.0.1}
 * Cygwin64-2.6.0(0.304/5/3)-win7-{gcc-5.4.0,clang-3.8.1}

Possibly of note, a 32-bit version of Cygwin (i686 2.11.1(0.329/5/3))
correctly handles SIGSEGV, but fails all the other cases in
substantially the same manner as Cygwin64.

In case you're wondering why we care: The SIGABRT and SIGSEGV
misbehaviors are particularly problematic for our distributed-memory
codes that register fatal signal handlers to ensure correct tear-down of
a multi-process job if/when any process crashes or aborts (e.g. due to
an assertion failure). Cygwin unfortunately makes it effectively
impossible to reliably handle abort()'s or SIGSEGV's generated by
programming errors in a multi-threaded program, unless one can arrange
to only generate the signal from the primordial thread (impractical for
our applications).

Searching around the Cygwin lists I find some evidence that tangentially
similar problems with signals and multithreading have been discussed
before, but perhaps not adequately isolated/demonstrated.

Is there any hope of this situation ever improving?

Thanks for your consideration.
-Dan Bonachea

Test program code below, also available for download at:
https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=589

=====================================================================
// Thread/signal tester by Dan Bonachea
// compile with a command like:
//   gcc -D_GNU_SOURCE -std=c99 -pedantic -pthread thread-signal.c -o thread-signal
// usage:
//   thread-signal
//
// page numbers in comments below refer to POSIX IEEE Std 1003.1-2016

#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// Utilities
typedef void (*sig_handler_t)(int); // signal handler function pointer

unsigned long long thidtollu(pthread_t thid) { // map pthread_t to a unique value
  // non-portable but sufficient on all systems of interest
  return (unsigned long long)(uintptr_t)thid;
}

pthread_barrier_t barrier_object;
void barrier(void) {
  int res = pthread_barrier_wait(&barrier_object); // p.1595
  assert(res == 0 || res == PTHREAD_BARRIER_SERIAL_THREAD);
}

#define FD_STDOUT 1
#define FD_STDERR 2
void writeout(const char *msg) { // signal-safe string output and flush
  int sz = strlen(msg)+1;
  int res = write(FD_STDOUT, msg, sz);
  if (res != sz) {
    const char err[] = "write failed!\n";
    write(FD_STDERR, err, sizeof(err));
    _exit(-1);
  }
  (void)fsync(FD_STDOUT);
}

#ifndef NUMTHREAD
#define NUMTHREAD 5
#endif

// state variables
int sigid = SIGSEGV;
int sender = 1;
volatile sig_atomic_t sender_aid = 0;
volatile sig_atomic_t errs = 0;

// registered signal handler function
void sig_handler(int signum) { // p.494 defines permitted calls
  pthread_t thid = pthread_self();
  writeout("sig_handler: ENTERING\n");
  sig_atomic_t my_aid = (sig_atomic_t)thidtollu(thid);
  if (my_aid != sender_aid) {
    errs++;
    writeout("sig_handler: ERROR - signal delivered to wrong thread!\n");
  }
  #if !STRICT
  // sprintf technically forbidden, but doesn't affect behavior in practice
  {
    char tmp[200];
    sprintf(tmp,"sig_handler: running on thread 0x%llx\n",thidtollu(thid));
    writeout(tmp);
  }
  #endif
  writeout("sig_handler: calling _exit()\n");
  _exit(errs);
}

struct thinfo {
  pthread_t thid;
  int idx;
} thread_info[NUMTHREAD];

// thread entry point
void * thread_main(void *arg) {
  struct thinfo *myinfo = arg;
  pthread_t thid = pthread_self();
  assert(pthread_equal(thid, myinfo->thid));

  printf("thread %i (0x%llx): Hello\n",myinfo->idx, thidtollu(thid));
  fflush(NULL);

  if (myinfo->idx == sender) { // this thread will send the signal
    sender_aid = (sig_atomic_t)thidtollu(thid); // record for signal handler
  }

  barrier(); // wait for all threads

  if (myinfo->idx == sender) { // this thread sends the signal
    printf("thread %i (0x%llx): sending signal %i..\n",
           myinfo->idx, thidtollu(thid), sigid);
    fflush(NULL);
    switch (sigid) {
      case SIGABRT:
        abort(); // p.565
        break;
      case SIGSEGV: {
        int *nullpt = NULL;
        *nullpt = 0; // SEGV
      }
      break;
      default: {
        int res = raise(sigid); // p.1765
        if (res) {
          errs++;
          printf("thread %i (0x%llx): ERROR: raise failed: %i %s\n",
                 myinfo->idx, thidtollu(thid), res, strerror(res));
          fflush(NULL);
        }
      }
    }
    errs++;
    printf("thread %i (0x%llx): ERROR: STILL ALIVE!\n",myinfo->idx, thidtollu(thid));
    fflush(NULL);
  }

  barrier(); // wait for all threads

  return NULL;
}

// process entry point
int main(int argc, char **argv) {
  if (argc > 1) sender = atoi(argv[1]);
  if (argc > 2) sigid = atoi(argv[2]);
  printf("Running test with %i threads and thread %i sending signal=%i\n",
         NUMTHREAD,sender,sigid);
  fflush(NULL);

  int ret = pthread_barrier_init(&barrier_object, NULL, NUMTHREAD); // p.1593
  assert(!ret);

  // establish a signal handler
  sig_handler_t init = signal(sigid, sig_handler); // p.1971
  assert(init == SIG_DFL || init == SIG_IGN);
  // ensure it is registered
  sig_handler_t res = signal(sigid, sig_handler);
  assert(res == sig_handler);

  printf("Spawning pthreads..\n");
  fflush(NULL);
  for (int i=1; i < NUMTHREAD; i++) { // create threads
    thread_info[i].idx = i;
    int res = pthread_create(&(thread_info[i].thid), NULL,
                             thread_main, &(thread_info[i])); // p.1633
    assert(!res);
  }

  // primordial thread is "thread 0"
  thread_info[0].idx = 0;
  thread_info[0].thid = pthread_self();
  thread_main(&(thread_info[0]));

  // should never reach this point for a catchable signal
  for (int i=1; i < NUMTHREAD; i++) { // join threads
    int res = pthread_join(thread_info[i].thid, NULL); // p.1649
    assert(!res);
  }

  printf("all threads exited!\n");
  errs++;
  return errs;
}

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple