Date: Fri, 06 Apr 2001 11:53:35 +0200
From: "Eli Zaretskii"
Sender: halo1 AT zahav DOT net DOT il
To: "Nimrod A. Abing"
Message-Id: <2593-Fri06Apr2001115334+0300-eliz@is.elta.co.il>
X-Mailer: Emacs 20.6 (via feedmail 8.3.emacs20_6 I) and Blat ver 1.8.6
CC: djgpp-workers AT delorie DOT com, Charles Sandmann
In-reply-to: <3.0.1.32.20010406113549.006c38e0@wingate> (n_abing AT ns DOT roxas-online DOT net DOT ph)
Subject: Re: That crash message from the core dumper.
References: <3 DOT 0 DOT 1 DOT 32 DOT 20010406113549 DOT 006c38e0 AT wingate>
Reply-To: djgpp-workers AT delorie DOT com
Errors-To: nobody AT delorie DOT com
X-Mailing-List: djgpp-workers AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com
Precedence: bulk

> Date: Fri, 06 Apr 2001 11:35:49 +0800
> From: "Nimrod A. Abing"
>
> Eli, you wanted a disassembly of __dj_movedata+33:
>
> [--cut here--]
> (gdb) disas __dj_movedata+33
> Dump of assembler code for function big_move:
> 0x8eba : mov %cl,%al
> 0x8ebc : shr $0x2,%ecx
> 0x8ebf : and $0x3,%al
> 0x8ec1 : repz movsl %ds:(%esi),%es:(%edi)
> 0x8ec3 : mov %al,%cl
> End of assembler dump.
> (gdb)
> [--cut here--]
>
> This is disas from the test program sigabrt.exe. The results are what you
> expected I believe.

Yes: it crashes on this instruction:

  0x8ec1 : repz movsl %ds:(%esi),%es:(%edi)

So the question remains: what is the problem with the value of ESI and
maybe also EDI that causes the crash? See below.

> As for the AV software causing the crash, it is very
> possible and the only probable cause. When I disable real-time scanning,
> everything works fine, core dumps go without any errors.

I don't argue with facts; I agree that the AV somehow causes the
crashes. What I don't understand is HOW it causes the crashes, and
the key to that is to understand exactly why the crash happens.
In any case, your initial hypothesis that the segment loaded into ES is
somehow corrupted is not true: inside __dj_movedata, ES is loaded with
the _dos_ds selector, so ES's value printed in the crash message is
perfectly normal.

The reason I'm trying so hard to understand this crash is that I'm not
sure it is limited to AV software. It's possible that there's a real
bug somewhere in the core dumper, which will show in other
circumstances as well. I think we will not be able to dismiss this
case until we gain some insight into why it happens. Right now, I'm
clueless, and I don't like that ;-).

> Division by Zero at eip=0000157eExiting due to signal SIGSEGV
> An error occured while writing core file. (signal: 14, progress number: 11)
> Page fault at eip=00008e91, error=0004
> eax=00000000 ebx=00004000 ecx=00001000 edx=0000f620 esi=00030000 edi=0000f620
> ebp=002f5cf4 esp=002f5ce4 program=C:\PROJECTS\PMDB\COREDUMP\SIGFPE1.EXE
> cs: sel=00f7 base=830bf000 limit=002f5fff
> ds: sel=00ff base=830bf000 limit=002f5fff
> es: sel=010f base=00000000 limit=0010ffff
> fs: sel=010f base=00000000 limit=0010ffff
> gs: sel=010f base=00000000 limit=0010ffff
> ss: sel=00ff base=830bf000 limit=002f5fff
> App stack: [002f6000..00276000] Exceptn stack: [0000fd40..0000de00]
>
> Call frame traceback EIPs:
> 0x00008e91 ___dj_movedata+33

Yes, this is the same crash.

Since this is a Page Fault, and the error code is 4 (a user-mode read
from a page that is not present), the primary suspect is the value of
ESI, which points to the address where the data is read from. Can you
please see if the value shown above, 0x30000, is valid given the
address and the size of the chunk of memory the code is trying to dump
at this stage?

As for the value of EDI, it should be compared with the value of
_go32_info_block.linear_address_of_transfer_buffer. Can you post the
address of the transfer buffer in that specific program when AV is
enabled?

> As for this line in the crash message:
>
> ``An error occured while writing core file.
> (signal: 14, progress number: 11)''
>
> This was part of GF's original code and I decided to keep it while the core
> dumper is still in testing stage. So if it says ``progress number: 11'',
> egrep -n "progress = 11" will tell me where to start looking. As for the
> ``signal: 14'' this is an exception number for SIGSEGV, maybe I should
> rewrite it to say ``exception'' or ``DJGPP signal''

I'd say "exception" is more accurate.

> Eli, it gets weirder all the time. When I gdb (gdb 4.18 and 5.0) the test
> program (with AV software running in the background), the SIGSEGV does
> *not* happen. This is unqualified weirdness if you ask me.

One more reason to dig deeper into this, I'd say.

> About this AV thing, I guess it's caused by the real-time scanner when it
> tries to read and examine the instructions used by the program.

That's possible, but it still doesn't explain why the program crashes.

> Maybe it tries to move the chunk of memory (stupidly) to another
> location to examine it. But then again, we would never know because
> we don't have the source code for the AV scanner, eh?

The key to this problem is that the DJGPP program crashes. So
something inside _our_ code causes the crash. We need to try to
understand what that something is.

Charles, is it possible for another program, such as an antivirus, to
cause a Page Fault by something its code does, but have Windows abort
our program instead? In other words, what could be a reason for a
program to get a Page Fault if the instruction is a perfectly valid
one and all the registers hold valid values?

Nimrod, one thing to try is to set up a signal handler for SIGSEGV
around the code which dumps one chunk of memory, and have that SIGSEGV
handler longjmp to restart the dumping of that same chunk of memory.
This could help if the problem is not permanent and does not originate
in the core dumper's own code.
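
The retry idea in the last paragraph could look roughly like the sketch below. It is only an illustration under stated assumptions, not the core dumper's actual code: the names dump_chunk_with_retry, chunk_writer, MAX_RETRIES and demo_flaky_writer are all hypothetical, and the "fault" in the demo writer is simulated with raise() rather than produced by a real bad memory access. It uses only standard C (signal, setjmp, longjmp); on POSIX hosts sigsetjmp/siglongjmp would be the safer pair, because a plain longjmp out of a handler may leave SIGSEGV blocked.

```c
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>

/* All names below are hypothetical; nothing here is taken from the
   real core dumper sources. */
#define MAX_RETRIES 3

typedef int (*chunk_writer)(const void *chunk, size_t size);

static jmp_buf retry_point;

/* On SIGSEGV, abandon the interrupted copy and jump back to setjmp(). */
static void segv_handler(int sig)
{
    (void)sig;
    longjmp(retry_point, 1);
}

/* Dump one chunk; if SIGSEGV hits while writing, restart the same chunk,
   giving up after MAX_RETRIES restarts.  Returns the writer's result,
   or -1 if the chunk kept faulting. */
int dump_chunk_with_retry(chunk_writer write_chunk, const void *chunk,
                          size_t size, int *retries_used)
{
    volatile int retries = 0;   /* volatile: must survive the longjmp */
    void (*old_handler)(int);
    int result;

    assert(write_chunk != NULL);
    old_handler = signal(SIGSEGV, segv_handler);

    if (setjmp(retry_point) != 0) {
        /* We got here from segv_handler: count the restart and re-arm
           the handler in case the system reset it to the default. */
        retries++;
        signal(SIGSEGV, segv_handler);
        if (retries > MAX_RETRIES) {
            signal(SIGSEGV, old_handler);
            if (retries_used) *retries_used = retries;
            return -1;
        }
    }

    result = write_chunk(chunk, size);

    signal(SIGSEGV, old_handler);
    if (retries_used) *retries_used = retries;
    return result;
}

/* Demo writer: simulates a transient fault on its first call only. */
static int demo_calls = 0;

static int demo_flaky_writer(const void *chunk, size_t size)
{
    (void)chunk;
    (void)size;
    if (demo_calls++ == 0)
        raise(SIGSEGV);         /* stand-in for the real Page Fault */
    return 0;
}
```

A call such as dump_chunk_with_retry(demo_flaky_writer, buf, len, &n) runs the writer twice and reports one restart. Since every restart begins the chunk from scratch, the real writer would have to be safe to repeat, for example by seeking back to the chunk's offset in the core file first. If the crash goes away after a restart, that would point to a transient condition rather than a bad ESI or EDI computed by the dumper.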