From: Till Harbaum
Newsgroups: comp.os.msdos.djgpp
Subject: Re: cmpl takes 14 clk cycles on a Pentium ???
Date: 13 Feb 1998 10:05:11 +0100
Organization: TU Braunschweig, Informatik (Bueltenweg), Germany
Lines: 30
Distribution: world
References: <199802130328 DOT TAA12256 AT adit DOT ap DOT net>
NNTP-Posting-Host: flens.ibr.cs.tu-bs.de
Mime-Version: 1.0 (generated by tm-edit 7.106)
Content-Type: text/plain; charset=US-ASCII
To: djgpp AT delorie DOT com
DJ-Gateway: from newsgroup comp.os.msdos.djgpp
Precedence: bulk

Nate Eldredge writes:
>
> At 04:46 2/12/1998 -0500, Mario Deschenes wrote:
> >Hi everyone,
> >
> > I'm using RDTSC to profile a routine and I got something strange. My
> >routine looks like:
> [snipped]
> This is somewhat of a shot in the dark, since I am no hardware guru. (Btw,
> you know that `align X' aligns to the nearest 2^X boundary, right?) What
> seems most likely to me is some kind of caching or prefetch issue. Perhaps
> when the target of the jump is close enough, it is already in a cache and is
> fetched faster. But when it's farther away, a new chunk has to be fetched
> from real memory, which is slower.
>

I think this is the right idea. Most modern CPUs do some kind of burst
read-ahead of their code. This means: if the CPU reads an instruction
word, it initiates a burst transfer from RAM to cache and also reads some
data it will likely need in the future (the 68040, for example, always
reads 4 quadwords, even if it doesn't need them).

The behaviour of the code also depends on the implementation of the
branch prediction unit of the CPU (which covers a big area of the Pentium
die, I think, so it should be very good :-).

Switch off all caching and see whether you still measure those
differences.

Ciao,
Till
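
To make the two points above a bit more concrete (the 2^X alignment, and the idea that a nearby jump target may already sit in a line that has been fetched), here is a minimal C sketch. It only does address arithmetic. The 32-byte line size is an assumption based on the original Pentium's cache line, the names align_up and same_line are made up for this example, and real fetch and prefetch behaviour is more involved than this simple check.

```c
#include <stdint.h>
#include <stdbool.h>

/* ASSUMPTION: 32-byte code cache lines, as on the original Pentium.
   Other CPUs use other sizes. */
#define LINE_SIZE 32u

/* Round addr up to the next 2^x boundary (x must be below 32).
   This mirrors an `align X' that takes the exponent, not the byte count. */
uint32_t align_up(uint32_t addr, unsigned x)
{
    uint32_t mask = ((uint32_t)1 << x) - 1u;
    return (addr + mask) & ~mask;
}

/* True if both addresses fall into the same line of line_size bytes.
   A target in the same line as the jump was fetched together with it. */
bool same_line(uint32_t a, uint32_t b, uint32_t line_size)
{
    return (a / line_size) == (b / line_size);
}
```

For example, a jump at 0x101C and a target at 0x1024 are only 8 bytes apart, yet same_line(0x101C, 0x1024, LINE_SIZE) is false because a line boundary lies between them. align_up(0x1024, 5) would move such a target to 0x1040, the start of a fresh line. Whether this explains the measured cycle counts is something only a test with caching switched off can tell.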