From: Michiel de Bondt Newsgroups: comp.os.msdos.djgpp Subject: Re: how to use inline push and pop Date: Fri, 11 May 2001 16:35:23 +0200 Organization: University of Nijmegen Lines: 210 Message-ID: <3AFBF8AB.C42331EB@sci.kun.nl> References: NNTP-Posting-Host: fanth.sci.kun.nl Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Trace: wnnews.sci.kun.nl 989591723 3280 131.174.132.54 (11 May 2001 14:35:23 GMT) X-Complaints-To: usenet AT sci DOT kun DOT nl NNTP-Posting-Date: Fri, 11 May 2001 14:35:23 +0000 (UTC) X-Mailer: Mozilla 4.75 [en] (X11; U; SunOS 5.7 sun4u) X-Accept-Language: en To: djgpp AT delorie DOT com DJ-Gateway: from newsgroup comp.os.msdos.djgpp Reply-To: djgpp AT delorie DOT com Eli Zaretskii wrote: > On 10 May 2001, Hans-Bernhard Broeker wrote: > > > You really should rethink your strategy all the way since you started > > at square one. What you're doing here, actually, is to avoid the usual > > overhead involved in recursive function calling --- but you're doing > > it the wrong way. To avoid having more than absolutely necessary > > pushed to the stack in recursive calls, use the following recipe, > > instead: > > > > 1) Remove as many as possible of the arguments of the recursive function > This is really impossible. > > 2) Remove as many as possible of the recursive function's local variables > > in scope at the time of the resursion call(s). Block-scope variables > > of blocks without a recursion call in them can stay as they are. > > 3) Replace them by a manually implemented stack of 'status structs', where > > the status struct holds *exactly* those variables that have to be > > preserved across a recursion call. The six most-frequently used vars? > > 4) Compile the whole stuff with -save-temps -fverbose-asm -O6 and see if > > you can really beat GCC in assembly hacking :-) I always compile -O9, but the compiler does -O3 in that case. I shall read in the manual what -fverbose-asm does. > > And I would like to point out, again, that all this complicated procedure > and the additional complexity it imposes on the code (meaning more bugs > and more slow-down) might be _totally_ unnecessary! > > There's a popular belief that recursive code is terribly slow, but > experience shows that this is mostly a myth. Recursive code _might_ be > slow, but in many cases it isn't. Because recursive code is usually > smaller, it fits better into the CPU caches. It is also simpler, so you > have less probability for bugs, and it lends itself better to compiler > optimizations. > I have once seen the opposite: faster recursive code.. Further, you can say that tail recursion is the only recursion that can _really_ be removed: for other recursion, some stack is needed: the computers stack or your own stack. > > In other words, I _strongly_ recommend to write a recursive implementation > in a high-level language such as C or C++, compile it with optimizations, > and _profile_ it, before you decide to go to such lengths. There are good > chances that the code will run fast enough to make any of this completely > unnecessary. I removed the calls to cprintf with alarm signals. As a consequence, the compilers code is only about 4 percent slower than mine, probably since the compiler optimized the relatively rare calls to cprintf. So you are right: it is fast enough. On the other hand, the procedures bottleneck is the memory (!= L2-cache), and it is only the L1-cache that is used many times too often by the compilers code. Half of my memory is very old (fast page), the other half is sd-ram. I have a pentium MMX. On another computer, the difference might be larger. What do you mean with profile? Fine-tuning the code within the language itself? Or within legal asm? I discovered that the base pointer can be used as well, with -fno-frame-pointer. This makes an extra register available and my code can be speeded up in another way. It is only hard to get the desired var in the base pointer, since "register long something asm ("%ebp")" is always allowed, but results in incorrect code even if the frame-pointer removal option is selected (and the frame-pointer is indeed pushed and popped). I reserved %ebx and %esi, and used a, b, c, d, S and D in the constraints of my intel asm macros instead of r. I started using many intel inline asms when I discovered that my C instructions were not translated to the one-liners I had in mind. The code gcc generates looks terrible. See e.g. the following examples: C-code: T.Byte += dd[2] (union {long Long; unsigned char Byte; } T;) gcc-output: movb %cl, %al addb _dd+2, %al movb %al, %cl one-liner: addb _dd+2, %cl C-code: C = (long) SC.first_field[R] gcc-output: sall $2, %ebp movl _SC(%ebp), %esi one-liner: movl _SC+0(,%ebp,4) %esi C-code: M = ((STB *) M)->first_field[P] gcc-output: leal (%edi, %edi), %edx movw (%edx, %ebx), %bx andl $65535, %ebx one-liner movw 0(%ebx, %edx, 2), %bx andl $65535, %ebx C-code: CA[C + M] = L gcc-output: leal (%ebx, %esi), %eax movb %cl, _CA(%eax) one-liner: movb %cl, _CA(%ebx, %esi) In order to get the memory offsets, I had to make string equivalents of enum types and structure fields with #define. Further, I made the number of bytes a multiple of 4 in one of my structures. In a derived structure (gcc accepts inheritance of structures), I started with an offset equal to the size of the parent structure. I did not find a constraint for memory offsets in the manual. Further, the expression (some_long & som,e_imm_expr) of an if was precomputed. This is not smart, since in order to get the condition jump, a test has to be done anyway. I solved this with a dummy asm with constraint "+g". The compiler did not allow to start the asm with : "+g": I had to write "" : "+g". But still, the condition if (some_long & some_imm_expr) was translated as movl some_long, %edx; testb $some_value, %dl. So I added -65536 to the immediate expression: only the lower word of some_long is used. The idea was that the compiler would write the asm I would have written, since it is optimizing. But instead, the compiler made the code slower with precomputations, stored on the stack. So I tried to "control" the compiler. With success, but it has probably taken more time than writing the intel inline myself, as you see. But now, I know how to do that for my other procedures. Further, I have to admit that the code of gcc is not very much slower than mine, it just looks ugly. I intend to do the recursion manually, with computed gotos. For this purpose, I will make my own stacks. Only the return address and sometimes one variable must be stored. I will need at least one stack to store a search path, so I will not need an extra index register. The six most important variables are maintained in the recursion. For the code pushall; do_something; popall, I will make a large intel asm that starts with pushal and end with popal. This will only result in a redundant test instruction, since flags can not be exchanged between asms and ifs. For the general case, I have to do something else anyway, since pushal and popal are intel instructions. A sparc has very many regs, so it might be a good idea to use other locals. I still believe that pushl and popl are general instructions, except at maybe some very special computers. Maybe, these nemonics do not exist on some RISC-computers, but the assembler will probably translate the instructions to a movl and an incl or decl of the register that is used as stack pointer. Computed gotos are absolutely save, aren't they. The only disadvantage is that my code is not portable to other compilers, but everyone can download the gcc for the computer involved. I hope that computed gotos are near or short jumps, since far jumps take very many clock cycles. So instead of the assembler output, my C-code will look ugly, at least, that is what you might think. But not as ugly as I intended before. Best regards, Michiel