Date: Sun, 2 Aug 1998 15:53:18 +0300 (IDT) From: Eli Zaretskii To: "Gurunandan R. Bhat" cc: djgpp AT delorie DOT com Subject: Re: fsdb crashes post emacs In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Precedence: bulk On Wed, 29 Jul 1998, Gurunandan R. Bhat wrote: > I have a problem in running fsdb immediately after exiting from > emacs (19.34 latest binaries from simtel) and *only* after exiting from > emacs. I have a plain DOS 6.22 machine (comspec=4dos.com). fsdb myprog.exe > crashes even before the fsdb screen comes up with the following traceback: > > ----------------------------------------------------------- > Cannot continue from exception, exiting due to signal 0123 > General Protection Fault at eip=00012360 > eax=0000007c ebx=00000004 ecx=fffffffc edx=06a2fcef esi=00000048 edi=00069270 > ebp=0006920c esp=00069204 program=D:\DJGPP\BIN\FSDB.EXE > cs: sel=00a7 base=10000000 limit=0009ffff > ds: sel=00af base=10000000 limit=0009ffff > es: sel=00af base=10000000 limit=0009ffff > fs: sel=00cf base=00016300 limit=000040ff > gs: sel=00bf base=00000000 limit=ffffffff > ss: sel=00af base=10000000 limit=0009ffff > > Call frame traceback EIPs: > 0x00012360 0x12360 > 0x00011e4d 0x11e4d > 0x00007569 0x7569 > 0x00008126 0x8126 > 0x0000a348 0xa348 > 0x0000186f 0x186f > 0x00011862 0x11862 This is a longish message, since I decided to use this crash as a case study to explain how to approach such problems. Here's the short summary: there seems to be either a bug in library function `malloc' or inside FSDB that causes these crashes. More debugging is required, which could only be done on the machine where these crashes happen, to find the actual cause; specific recommendations on how to continue debugging are at the end of this message. The rest of this message is a detailed explanation of how to debug such problems. Many people think this is some kind of black art which is unavailable to the uninitiated. I will try to demystify this by demonstrating that there's no black art here, it only takes a couple of tools, and a determined person who isn't afraid of mundane work. Hopefully, this will be useful to somebody in the future. First, honor is due to Gurunandan R. Bhat for posting a complete and accurate report about the crash. Using the posted info, it was much easier to investigate the problem. We begin with the crash dump. It's a bit unusual, so I will first explain the non-standard parts. > exiting due to signal 0123 0123 in hex is 291 decimal. Looking inside the header , we will find this line: #define SIGSEGV 291 So this program apparently died because of SIGSEGV. > Cannot continue from exception, exiting due to signal 0123 This message comes from the DJGPP signal-handling code. It means that the program installed a handler for SIGSEGV, the handler was called when the disaster struck, and it (the handler) attempted to return. This isn't allowed (since returning to the locus of the exception will just trigger the same exception again), so DJGPP is aborting the program. This is also the reason for 0123 being printed instead of the usual SIGSEGV (the message is printed by a code that is different from the usual one: most programs don't install exception handlers which return). Why did the program install a handler for SIGSEGV? Because FSDB, the program that crashed, is a debugger. DJGPP debuggers typically want to catch exceptions in the debugged program, so they install handlers for these exceptions. This crash happens *after* FSDB already installed its handlers, but *before* it got opportunity to run the debuggee. So it ends up catching its own exception. At this point, I would usually turn to run the offending program and try to locate the bug. But in this case, I couldn't reproduce the problem on my machine. So we are left with using the posted information, and with its help, find the exact place and cause of the crash. The plan is to look at the machine code, find the instruction that triggered the crash, and then correlate it with the sources of the program and try to guess what might have been the reason(s) for this calamity. > General Protection Fault at eip=00012360 So we know that it crashed at EIP = 0x12360. But how do we peek at that address to display machine instructions nearby? A debugger such as GDB will help. Here: C:\> gdb c:/djgpp/bin/fsdb.exe [GDB's usual blurb omitted] (gdb) x/i 0x12360 0x12360: movl (%edx),%eax The GDB command `x' displays the contents of a memory address, and `/i' tells it to display it as an instruction. This instruction is a pointer dereference: a 32-bit value whose address is in EDX is loaded into EAX. An instruction as simple as that can only crash if EDX were zero, i.e. if it were a NULL pointer dereference, or if EDX is garbled. Now, what was that registers' dump saying? > edx=06a2fcef Aha! This is indeed garbled: it is way beyond the limit of the DS segment: > ds: sel=00af base=10000000 limit=0009ffff So somebody dereferenced a garbled pointer. But where in the source is that? That's where the other, symyfied traceback will help us (did I already say kudos to Gurunandan R. Bhat for providing all that?): > 0x00013748 _malloc+192 It seems that the crash is inside `malloc'. But since the EIP here is quite different from the original 0x00012360, let's make sure that the original crash was also inside `malloc' (otherwise we will be chasing a wild goose). The easiest way to do that is to disassemble a program (any program) compiled with -g. Observe: C:\> gcc -g -o hello hello.c C:\> gdb hello (gdb) breakpoint malloc Breakpoint 1 at 0x288d (gdb) disassemble 0x288d Dump of assembler code for function malloc: 0x2884 : pushl %ebp 0x2885 : movl %esp,%ebp .... End of assembler dump (gdb) As an aside: why did I use the `breakpoint' command? Because `disassemble' needs the address inside the function, and `breakpoint' prints the address of the function's first executable source line. And how did I knew that `malloc' will be present in even the trivial "Hello, world" program? Well, that's an insider's knowledge of the startup code: it calls `malloc' a few times, and so every DJGPP program has `malloc' linked in. But even if we needed to disassemble another function, we could just arrange for the test program to call it. Back to our whodunit. So `malloc' starts at 0x2884 in this program. Our crash point is at offset 192 (decimal) from the beginning. Let's look there: ..... 0x3484 : xorl %eax,%eax 0x3486 : jmp 0x349a 0x3488 : movl (%edx),%eax ..... Bullseye! Here's our friend "movl (%edx),%eax" again. For still more certainty, we'll look in the vicinity of this instruction inside FSDB. This is somewhat difficult without disassembled listing like the one above: the `disassemble' command doesn't work on stripped programs which lack debugging symbols, so we need to use the `x' command and compute the address for it, which requires to calculate the length of each instruction. But armed with the listing above, it's easy to see the size of every instruction: it is the difference of its address and the address of the next instruction. So we can now do this: C:\> gdb c:/djgpp/bin/fsdb.exe (gdb) x/i 0x12360 0x12360: movl (%edx),%eax Okay, already been there. Now let's display the previous instruction, whose address is 2 bytes before (see the disassembled `malloc' above): (gdb) x/i 0x1235e 0x1235e: jmp 0x12372 and the one before that: (gdb) x/i 0x1235c 0x1235c: xorl %eax,%eax These are exactly the instructions from the disassembled listing, give or take the address where it `jmp's, which is different because absolute addresses are always different in different programs. We can now be sure that the crash indeed happened inside `malloc'. (Actually, I looked at a few more instructions: better be safe than sorry.) On to the last part of the plan: we now want to find out where inside `malloc' did it crash, and why. This requires to correlate the source of `malloc' and its disassembled listing. One way to do that is to compile the source of `malloc' with the `-S' option and look at the produced assembly. However, a better way is to walk through the assembly listing looking at the source, and correlate them by ``decompiling'' the assembly in your head, since "gcc -S" might produce a different listing due to different optimization switches (to know what switches were used to compile the library, you will need to wade through the library build procedure, which isn't easy). Anyway, I will save you some trouble. Here's the offending code from the `malloc' source, with the crash spot marked: /* * If nothing in hash bucket right now, * request more memory from the system. */ if ((op = nextf[bucket]) == NULL) { morecore(bucket); if ((op = nextf[bucket]) == NULL) return (NULL); } /* remove from linked list */ nextf[bucket] = op->ov_next; <<<<<< it crashes here op->ov_magic = MAGIC; op->ov_index = bucket; The crashed instruction, "movl (%edx),%eax", is the translation of "op->ov_next". GCC translates it into a single pointer dereference because the `ov_next' member is the first member of the structure: union overhead { union overhead *ov_next; /* when free */ struct { unsigned char ovu_magic; /* magic number */ unsigned char ovu_index; /* bucket # */ } ovu; }; Since the offset of the first member is always zero, GCC avoids adding that offset after dereferencing the structure's address. The garbled pointer in EDX is therefore the variable `op', and its value is computed a few lines above: if ((op = nextf[bucket]) == NULL) { This line translates into the following 3 instructions: 0x3468 : movl 0xa140(,%ebx,4),%edx 0x346f : testl %edx,%edx 0x3471 : jne 0x3488 The first of these instructions uses the index in EBX (which apparently holds the value of `bucket') to access the array nextf[] whose address happens to be 0xa140. Maybe the index is itself garbled? Back to the registers' dump: > ebx=00000004 Nope, `buckets' has a reasonable value of 4. To be sure, let's look at FSDB sources and see whether 4 is okay for this call to `malloc'. Here's the relevant part of the stack dump: > 0x00013748 _malloc+192 > 0x00013229 _xmalloc+21 > 0x000080a5 _initdisplay+133, line 2147 of fullscr.c Line 2147 of fullscr.c says this: code_pane_pos = xmalloc ((toplines + 2) * sizeof (word32)); where `toplines' is computed like this (inside `init_screen', lines 308 and 321 of screen.c): rows = ScreenRows (); toplines = (rows / 2) + 4; Assuming the standard 25-line display, `toplines' is 16 and `xmalloc' will be called with the argument (16 + 2) * 4 = 72. I will save you this trouble, but I actually went through the `malloc' code to verify that, when called with an argument of 72, it will end up with `buckets' equal to 4. So there's nothing wrong here. There are only one alternative left: somebody garbles nextf[4]. It can be either some bug in `malloc' whereby a garbled value is put into nextf[4], and `malloc' tries later to treat that like a valid pointer; or FSDB itself garbles `malloc's data structure, either in one of the previous calls to `free' or `realloc', or in some other code which inadvertently runs awry. To find out which is the actual cause, we need access to a machine where this problem actually happens. We need to run FSDB either under a debugger, or with debugging print-out, and find out who and why clobbers the nextf[] array. Since debugging a program which is itself a debugger is tricky, I would suggest to add calls to `cprintf' inside `malloc' and friends, in the places where they access nextf[], and find out what code clobbers it. The reason for this bug must be pretty easy to find when we know that. And this concludes today's lesson. We haven't find the villain yet, but we've got pretty close, which is good enough for a debugging session which cannot even reproduce the problem.