From: Michiel de Bondt <michielb AT sci DOT kun DOT nl>
Newsgroups: comp.os.msdos.djgpp
Subject: Re: how to use inline push and pop
Date: Fri, 11 May 2001 16:35:23 +0200
Organization: University of Nijmegen
Lines: 210
Message-ID: <3AFBF8AB.C42331EB@sci.kun.nl>
References: <Pine DOT SUN DOT 3 DOT 91 DOT 1010510171303 DOT 7067H-100000 AT is>
NNTP-Posting-Host: fanth.sci.kun.nl
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Trace: wnnews.sci.kun.nl 989591723 3280 131.174.132.54 (11 May 2001 14:35:23 GMT)
X-Complaints-To: usenet AT sci DOT kun DOT nl
NNTP-Posting-Date: Fri, 11 May 2001 14:35:23 +0000 (UTC)
X-Mailer: Mozilla 4.75 [en] (X11; U; SunOS 5.7 sun4u)
X-Accept-Language: en
To: djgpp AT delorie DOT com
DJ-Gateway: from newsgroup comp.os.msdos.djgpp
Reply-To: djgpp AT delorie DOT com

Eli Zaretskii wrote:

> On 10 May 2001, Hans-Bernhard Broeker wrote:
>
> > You really should rethink your strategy all the way since you started
> > at square one. What you're doing here, actually, is to avoid the usual
> > overhead involved in recursive function calling --- but you're doing
> > it the wrong way. To avoid having more than absolutely necessary
> > pushed to the stack in recursive calls, use the following recipe,
> > instead:
> >
> > 1) Remove as many as possible of the arguments of the recursive function
>

This is really impossible.

> > 2) Remove as many as possible of the recursive function's local variables
> >    in scope at the time of the resursion call(s). Block-scope variables
> >    of blocks without a recursion call in them can stay as they are.
> > 3) Replace them by a manually implemented stack of 'status structs', where
> >    the status struct holds *exactly* those variables that have to be
> >    preserved across a recursion call.

The six most-frequently used vars?

> > 4) Compile the whole stuff with -save-temps -fverbose-asm -O6 and see if
> >    you can really beat GCC in assembly hacking :-)

I always compile -O9, but the compiler does -O3 in that case. I shall read in
the
manual what -fverbose-asm does.

>
> And I would like to point out, again, that all this complicated procedure
> and the additional complexity it imposes on the code (meaning more bugs
> and more slow-down) might be _totally_ unnecessary!
>
> There's a popular belief that recursive code is terribly slow, but
> experience shows that this is mostly a myth.  Recursive code _might_ be
> slow, but in many cases it isn't.  Because recursive code is usually
> smaller, it fits better into the CPU caches.  It is also simpler, so you
> have less probability for bugs, and it lends itself better to compiler
> optimizations.
>

I have once seen the opposite: faster recursive code.. Further, you can say
that
tail recursion is the only recursion that can _really_ be removed: for other
recursion,
some stack is needed: the computers stack or your own stack.

>
> In other words, I _strongly_ recommend to write a recursive implementation
> in a high-level language such as C or C++, compile it with optimizations,
> and _profile_ it, before you decide to go to such lengths.  There are good
> chances that the code will run fast enough to make any of this completely
> unnecessary.

I removed the calls to cprintf with alarm signals. As a consequence, the
compilers
code is only about 4 percent slower than mine, probably since the compiler
optimized the relatively rare calls to cprintf. So you are right: it is fast
enough.
On the other hand, the procedures bottleneck is the memory (!= L2-cache), and
it
is only the L1-cache that is used many times too often by the compilers code.
Half
of my memory is very old (fast page), the other half is sd-ram. I have a
pentium
MMX. On another computer, the difference might be larger.

What do you mean with profile? Fine-tuning the code within the language itself?

Or within legal asm?

I discovered that the base pointer can be used as well, with
-fno-frame-pointer.
This makes an extra register available and my code can be speeded up in another
way.
It is only hard to get the desired var in the base pointer, since "register
long something
asm ("%ebp")" is always allowed, but results in incorrect code even if the
frame-pointer
removal option is selected (and the frame-pointer is indeed pushed and popped).
I reserved
%ebx and %esi, and used a, b, c, d, S and D in the constraints of my intel asm
macros instead
of r.

I started using many intel inline asms when I discovered that my C instructions
were
not translated to the one-liners I had in mind. The code gcc generates looks
terrible.
See e.g. the following examples:


C-code:
T.Byte += dd[2]
(union {long Long; unsigned char Byte; } T;)

gcc-output:
movb %cl, %al
addb _dd+2, %al
movb %al,  %cl

one-liner:
addb _dd+2, %cl


C-code:
C = (long) SC.first_field[R]

gcc-output:
sall $2, %ebp
movl _SC(%ebp), %esi

one-liner:
movl _SC+0(,%ebp,4) %esi


C-code:
M = ((STB *) M)->first_field[P]

gcc-output:
leal (%edi, %edi), %edx
movw (%edx, %ebx), %bx
andl $65535, %ebx

one-liner
movw 0(%ebx, %edx, 2), %bx
andl $65535, %ebx


C-code:
CA[C + M] = L

gcc-output:
leal (%ebx, %esi), %eax
movb %cl, _CA(%eax)

one-liner:
movb %cl, _CA(%ebx, %esi)


In order to get the memory offsets, I had to make string equivalents of enum
types and structure
fields with #define. Further, I made the number of bytes a multiple of 4 in one
of my structures.
In a derived structure (gcc accepts inheritance of structures), I started with
an offset equal to
the size of the parent structure. I did not find a constraint for memory
offsets in the manual.

Further, the expression (some_long & som,e_imm_expr) of an if was precomputed.
This is not
smart, since in order to get the condition jump, a test has to be done anyway.
I solved
this with a dummy asm with constraint "+g". The compiler did not allow to start
the asm
with : "+g": I had to write "" : "+g". But still, the condition if (some_long &
some_imm_expr)
was translated as movl some_long, %edx; testb $some_value, %dl. So I added
-65536 to
the immediate expression: only the lower word of some_long is used.

The idea was that the compiler would write the asm I would have written, since
it is optimizing.
But instead, the compiler made the code slower with precomputations, stored on
the stack.
So I tried to "control" the compiler. With success, but it has probably taken
more time than
writing the intel inline myself, as you see. But now, I know how to do that for
my other procedures. Further, I have to admit that the code of gcc is not very
much slower than mine, it just looks ugly.

I intend to do the recursion manually, with computed gotos. For this purpose, I
will make my own
stacks. Only the return address and sometimes one variable must be stored. I
will need at least
one stack to store a search path, so I will not need an extra index register.
The six most
important variables are maintained in the recursion. For the code pushall;
do_something; popall,
I will make a large intel asm that starts with pushal and end with popal. This
will only result in a
redundant test instruction, since flags can not be exchanged between asms and
ifs. For the
general case, I have to do something else anyway, since pushal and popal are
intel instructions.
A sparc has very many regs, so it might be a good idea to use other locals. I
still believe that
pushl and popl are general instructions, except at maybe some very special
computers. Maybe,
these nemonics do not exist on some RISC-computers, but the assembler will
probably translate
the instructions to a movl and an incl or decl of the register that is used as
stack pointer.
Computed gotos are absolutely save, aren't they. The only disadvantage is that
my code is not
portable to other compilers, but everyone can download the gcc for the computer
involved.
I hope that computed gotos are near or short jumps, since far jumps take very
many clock cycles.
So instead of the assembler output, my C-code will look ugly, at least, that is
what you might think.
But not as ugly as I intended before.


Best regards, Michiel