From: Joe Buehler
Reply-To: joseph DOT buehler AT spirentcom DOT com
Organization: Spirent Communications
Date: Fri, 19 Jul 2002 18:37:16 -0400
To: cygwin-developers AT cygwin DOT com
Subject: Re: cygwin hang problem
Message-ID: <3D38949C.3090200@hekimian.com>
References: <3D32FC00 DOT 5090108 AT hekimian DOT com> <20020719050925 DOT GA24259 AT redhat DOT com> <3D37F0E5 DOT 50F3669B AT yahoo DOT com> <20020719141242 DOT GB27697 AT redhat DOT com>

I have "hang" problems, and I have core dumps.

It's very inconsistent -- I have a dual-processor NT machine that has been running continuous builds for about 3 days without stopping. I have a single-processor XP machine that ran for less than a day and hung with a core dump.

I spent the last few hours examining that core dump (generated using dumper.exe) and it appears that the cygwin dll jumped to never-never land after calling CopySid() in cygsid::assign() in security.h.
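As a rough illustration of what a SID copy such as the CopySid() call involves, here is a small portable sketch. It is only a model under stated assumptions: ModelSid, model_sid_length and model_copy_sid are made-up names, not the winnt.h or Cygwin declarations, and nothing here calls the real CopySid(). It assumes the documented SID layout: an 8-byte fixed header followed by one 32-bit value per sub-authority, with at most 15 sub-authorities.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical, simplified stand-in for the Win32 SID layout.
// The names are illustrative only; they are not taken from winnt.h.
constexpr std::size_t kMaxSubAuthorities = 15;

struct ModelSid {
    std::uint8_t  revision;
    std::uint8_t  sub_authority_count;
    std::uint8_t  identifier_authority[6];
    std::uint32_t sub_authority[kMaxSubAuthorities];
};

// Bytes actually occupied by a SID: the 8-byte fixed header plus
// 4 bytes for each sub-authority in use.
inline std::size_t model_sid_length(const ModelSid& sid) {
    return 8 + 4 * static_cast<std::size_t>(sid.sub_authority_count);
}

// Copy src into dst when it fits. Returns false, leaving dst untouched,
// if the SID is malformed or the destination buffer is too small.
inline bool model_copy_sid(void* dst, std::size_t dst_len,
                           const ModelSid& src) {
    if (src.sub_authority_count > kMaxSubAuthorities)
        return false;
    const std::size_t len = model_sid_length(src);
    if (len > dst_len)
        return false;
    std::memcpy(dst, &src, len);
    return true;
}
```

In this model the copy is just a length check followed by a byte copy of 8 + 4*n bytes, where n is the sub-authority count.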
Here's the trace from gdb for the thread that caused the dump (I modified handle_exceptions to wait for dumper to do the dump):

#0  0x77e72e9f in _libkernel32_a_iname ()
#1  0x61013ef5 in try_to_debug (waitloop=true) at /usr/local/cygwin-src/src/winsup/cygwin/exceptions.cc:396
#2  0x6101453e in handle_exceptions (e=0x72f6f8, in=0x72f714) at /usr/local/cygwin-src/src/winsup/cygwin/exceptions.cc:537
#3  0x77f833a0 in _libkernel32_a_iname ()
#4  0x77f83372 in _libkernel32_a_iname ()
#5  0x77f510a6 in _libkernel32_a_iname ()
#6  0x610d8b8c in cygsid::operator= (this=0x72fa5c, nsid=0x61610294) at /usr/local/cygwin-src/src/winsup/cygwin/security.h:47
#7  0x61070c37 in __sec_user (sa_buf=0x72fae8, sid2=0x0, inherit=0) at /usr/local/cygwin-src/src/winsup/cygwin/sec_helper.cc:473
#8  0x610db98c in sec_user_nih (sa_buf=0x72fae8 "", sid=0x0) at /usr/local/cygwin-src/src/winsup/cygwin/security.h:214
#9  0x61082da5 in getsem (p=0x0, str=0x610eb248 "cygwin1S3-2002-07-11 10:28.sigcatch.23002002-07-11 10:28", init=0, max=2147483647) at /usr/local/cygwin-src/src/winsup/cygwin/sigproc.cc:948
#10 0x6108352b in wait_sig () at /usr/local/cygwin-src/src/winsup/cygwin/sigproc.cc:1091
#11 0x61007961 in thread_stub (arg=0x610e23a0) at /usr/local/cygwin-src/src/winsup/cygwin/debug.cc:98
#12 0x77e802ed in _libkernel32_a_iname ()

Note that the "str" argument in frame 9 is not correct...

Here is a trace for the main thread:

#0  0x7ffe0304 in ?? ()
#1  0x77e79d6a in _libkernel32_a_iname ()
#2  0x610a000f in wait4 (intpid=-1, status=0x22cdbc, options=2, r=0x0) at /usr/local/cygwin-src/src/winsup/cygwin/wait.cc:86
#3  0x6109fd3d in waitpid (intpid=-1, status=0x22cdbc, options=2) at /usr/local/cygwin-src/src/winsup/cygwin/wait.cc:32
#4  0x00419148 in job_waitsafe (sig=0) at /usr/local/ast-src/src/cmd/ksh93/sh/jobs.c:201
#5  0x0041abcb in job_wait (pid=3568) at /usr/local/ast-src/src/cmd/ksh93/sh/jobs.c:1215
#6  0x00413c7d in sh_exec (t=0xa05e4d8, flags=0) at /usr/local/ast-src/src/cmd/ksh93/sh/xec.c:709
#7  0x004144f7 in sh_exec (t=0xa05e528, flags=0) at /usr/local/ast-src/src/cmd/ksh93/sh/xec.c:952
#8  0x004144db in sh_exec (t=0xa05e590, flags=0) at /usr/local/ast-src/src/cmd/ksh93/sh/xec.c:951
#9  0x00414f8e in sh_exec (t=0xa05e3e8, flags=4) at /usr/local/ast-src/src/cmd/ksh93/sh/xec.c:1193
#10 0x0041447f in sh_exec (t=0xa05e910, flags=4) at /usr/local/ast-src/src/cmd/ksh93/sh/xec.c:940
#11 0x00414faa in sh_exec (t=0xa05e298, flags=4) at /usr/local/ast-src/src/cmd/ksh93/sh/xec.c:1194
#12 0x0041447f in sh_exec (t=0xa05e998, flags=5) at /usr/local/ast-src/src/cmd/ksh93/sh/xec.c:940
#13 0x004140cc in sh_exec (t=0xa05ee70, flags=5) at /usr/local/ast-src/src/cmd/ksh93/sh/xec.c:833
#14 0x00413ef0 in sh_exec (t=0xa05ee80, flags=4) at /usr/local/ast-src/src/cmd/ksh93/sh/xec.c:779
#15 0x00401f7f in exfile (iop=0xa053950, fno=6) at /usr/local/ast-src/src/cmd/ksh93/sh/main.c:520
#16 0x00401810 in sh_main (ac=2, av=0x616107ac, userinit=0) at /usr/local/ast-src/src/cmd/ksh93/sh/main.c:318
#17 0x0040106a in main (argc=2, argv=0x616107ac) at /usr/local/ast-src/src/cmd/ksh93/sh/pmain.c:33
#18 0x610065b0 in dll_crt0_1 () at /usr/local/cygwin-src/src/winsup/cygwin/dcrt0.cc:774
#19 0x61006a59 in _dll_crt0 () at /usr/local/cygwin-src/src/winsup/cygwin/dcrt0.cc:872
#20 0x61006ab1 in dll_crt0 (uptr=0x0) at /usr/local/cygwin-src/src/winsup/cygwin/dcrt0.cc:885
#21 0x0045c44e in cygwin_crt0 ()
#22 0x0040103c in mainCRTStartup ()
#23 0x77e7eb69 in _libkernel32_a_iname ()

It is difficult to tell exactly what happened. It looks like the CopySid call did not return; based on the stack, something may have gone wrong with the DLL linkage code that loads advapi32 and calls the real CopySid. It did not get to the point where it overwrites the original mov, call instruction sequence in the DLL linkage code.

One interesting point I haven't figured out yet is that the exception address passed to Cygwin's exception handler is almost exactly 2x (as in left shift 1) the address of the _win32_CopySid@12 code.

I checked the IA32 exception handler stack format and it looks like the exception that NT got was due to a jump to the weeds -- the EIP pushed on the stack is the same as the exception address passed to the Cygwin exception handler.

I notice a comment in the source about replacing CopySid with memcpy. Does anyone remember why this was done? Is there something flaky about CopySid?

Something else I wonder about -- wait_sig() is still setting up, and the main thread is in waitpid() -- perhaps a signal came in while the signal handler was still setting up? I haven't looked at that stuff and don't know how it works.

Sorry if I am missing anything obvious -- I am learning Cygwin internals as I go, and this is a very knotty problem.

Enough for now, time to go home...

Joe Buehler