Message-Id: <3.0.32.19990927171139.00c74200@pop.xs4all.nl>
X-Sender: diep AT pop DOT xs4all DOT nl
X-Mailer: Windows Eudora Pro Version 3.0 (32)
Date: Mon, 27 Sep 1999 17:11:42 +0200
To: pgcc AT delorie DOT com
From: Vincent Diepeveen
Subject: a C routine to optimize GCC/PGCC for
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Reply-To: pgcc AT delorie DOT com

Hello,

I've written a short routine. Don't pay attention to the variable names; the names of the globals were picked at random. What matters is the code and how it gets optimized into assembler!

int *board,*sweep,*Pindex,*snelbord;

or something like:

int board[64],sweep[20],Pindex[64],snelbord[64];

Whichever of these two definitions is used (pointers or arrays) should not matter for the optimization of 'tryalles'.

int tryalles(void){
  int ut,*va,ua,summation=0;
  for( ua = 0 ; ua < 16 ; ua++ ) {
    va = Pindex;
    ut = 0;
    if( !sweep[snelbord[ua]] )
      ut = board[ua];
    va += ut;
    summation += *va;
  }
  return(summation);
}

The assembler output for it:

        .align 4
.globl tryalles
        .type tryalles,@function
tryalles:
        pushl %ebp
        movl %esp,%ebp
        pushl %edi
        pushl %esi
        pushl %ebx
        xorl %esi,%esi
        movl $sweep,%edi
        xorl %ecx,%ecx
        movl $15,%ebx
        .p2align 4,,7
.L6:
        movl snelbord(%ecx),%eax
        xorl %edx,%edx
        sall $2,%eax
        cmpl $0,(%eax,%edi)
        jne .L7
        movl board(%ecx),%edx
.L7:
        addl Pindex(,%edx,4),%esi
        addl $4,%ecx
        decl %ebx
        jns .L6
        movl %esi,%eax
        popl %ebx
        popl %esi
        popl %edi
        movl %ebp,%esp
        popl %ebp
        ret
.Lfe1:
        .size tryalles,.Lfe1-tryalles
        .align 4

A branch misprediction costs 10-15 clocks, which is a major penalty. A few extra instructions to avoid it are cheap by comparison, since they can be executed at a rate of 3 instructions per clock!

I would like to zoom in on the loop body. First, the C code:

    va = Pindex;
    ut = 0;
    if( !sweep[snelbord[ua]] )
      ut = board[ua];
    va += ut;
    summation += *va;

Now, here is how it is currently translated into 32-bit assembler:

.L6:
        movl snelbord(%ecx),%eax
        xorl %edx,%edx
        sall $2,%eax            <== what do we need this shift instruction for?
        cmpl $0,(%eax,%edi)
        jne .L7                 <== We don't want this JNE!
        movl board(%ecx),%edx
.L7:
        addl Pindex(,%edx,4),%esi
        addl $4,%ecx
        decl %ebx
        jns .L6

Looking at an example like this, I immediately see that it could use more registers than Intel provides...

...however, I would like to replace the following 3 instructions with Pentium Pro instructions. I don't care how many instructions they get replaced with, because in many cases my program suffers a huge penalty here!

        cmpl $0,(%eax,%edi)
        jne .L7                 <== We don't want this JNE!
        movl board(%ecx),%edx
.L7:

Is it really that hard to replace the above with Pentium Pro instructions?

Greetings,
Vincent

Vincent Diepeveen
diep AT xs4all DOT nl
---
...and furthermore, I am of the opinion that Dap should be beamed out into the universe...
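For comparison, here is a sketch of the same loop with the branch removed at the C level. It uses a bit-mask select, a standard branch-free idiom and not something taken from the post: the comparison yields 0 or 1, and subtracting that from 0u gives either 0 or all bits set, which then masks board[ua]. Unlike the original, it always loads board[ua]; that is harmless here because ua stays below 16, inside every array as declared above. The names tryalles_branchless() and fill_example(), and the test data, are hypothetical and only serve to compare the two versions. On the Pentium Pro and later, a compiler targeting that CPU (in current GCC, for example, -march=i686 or later) may also turn a plain ternary such as `ut = sweep[snelbord[ua]] ? 0 : board[ua];` into a CMOV, but whether it does, and whether the mask version compiles without any conditional jump, is up to the compiler.

```c
/* Globals sized as in the array declaration from the post. */
int board[64], sweep[20], Pindex[64], snelbord[64];

/* Original routine from the post (branchy version). */
int tryalles(void)
{
    int ut, *va, ua, summation = 0;
    for (ua = 0; ua < 16; ua++) {
        va = Pindex;
        ut = 0;
        if (!sweep[snelbord[ua]])
            ut = board[ua];
        va += ut;
        summation += *va;
    }
    return summation;
}

/* Hypothetical branch-free variant: the index is selected with a
   bit mask instead of a conditional jump. */
int tryalles_branchless(void)
{
    int ua, summation = 0;
    for (ua = 0; ua < 16; ua++) {
        /* 0u - 1u == all bits set; 0u - 0u == 0 */
        unsigned mask = 0u - (unsigned)(sweep[snelbord[ua]] == 0);
        /* board[ua] when sweep[...] is zero, otherwise 0 */
        int ut = (int)((unsigned)board[ua] & mask);
        summation += Pindex[ut];
    }
    return summation;
}

/* Hypothetical helper: fills the globals with deterministic test data
   whose indices stay in range (snelbord[] < 20, board[] < 64). */
void fill_example(unsigned seed)
{
    int i;
    for (i = 0; i < 64; i++) {
        seed = seed * 1103515245u + 12345u;   /* simple LCG step */
        board[i]    = (int)((seed >> 16) % 64u);
        snelbord[i] = (int)((seed >> 8) % 20u);
        Pindex[i]   = (int)(seed % 1000u);
    }
    for (i = 0; i < 20; i++)
        sweep[i] = (i % 3 == 0);   /* entries 0, 3, 6, ... are nonzero */
}
```

After calling fill_example() with any seed, tryalles() and tryalles_branchless() should return the same sum; if every sweep[] entry is nonzero, both reduce to 16 * Pindex[0].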