From: leathm AT solwarra DOT gbrmpa DOT gov DOT au (Leath Muller)
Message-Id: <199705210249.MAA13772@solwarra.gbrmpa.gov.au>
Subject: P5 Profiling
To: djgpp AT delorie DOT com
Date: Wed, 21 May 1997 12:49:22 +1000 (EST)
Content-Type: text
Precedence: bulk

I'm not sure where, but I thought I saw someone asking for Pentium
profiling code. Kevin Baca posted this about 6 or so months ago, and it
is excellent. I am posting it here for reference for people who want it.
(An exact copy of the post is appended. I hope this doesn't stress
people too much WRT bandwidth!)

Leathal.

---
Subject: P5 Profiling
Date: Fri, 6 Sep 1996 11:02:27 +0000
From: "Kevin Baca"
To: djgpp AT delorie DOT com

Yesterday I asked if anyone was interested in seeing the code I use to
profile programs on Pentium machines. I got a few replies, so here it
is.

Note: The opcodes used in the macros below (RDMSR and WRMSR) will
generate a General Protection Fault (GPF) unless they are executed at
ring 0. They run fine under Win95 DOS shells, but under plain DOS you
need to use a DPMI provider that allows your program to run in ring 0.
CWSDPMI does NOT allow your program to run in ring 0, but CWSDPR0 does.
Look at my batch file below to see how to tell your program to use
CWSDPR0. Be aware that CWSDPR0 does not support virtual memory.

I haven't tried this stuff under any DPMI providers other than Win95
and CWSDPR0. I'd be interested to know if it works under others (e.g.
QEMM and Windows NT).

Below you will find the source for the p5prof.h header file, an example
C program that uses the macros, and a batch file for compiling the
example.
-Kevin

Include this header file in the C file where you want to profile your
code:

----------------------------cut here-------------------------------
/*********************************************************
 *
 * File: p5prof.h
 * By:   Kevin Baca
 *
 *********************************************************/

/*********************************************************
 * This file provides macros to profile your code.
 * Here's how they work...
 *
 * As you may or may not know, the Pentium class of
 * processors provides extremely fine grained profiling
 * capabilities through the use of what are called
 * Machine Specific Registers (MSRs). These registers
 * can provide information about almost any aspect of
 * CPU performance down to a single cycle.
 *
 * The MSRs of interest for profiling are specified by
 * indices 0x10, 0x11, 0x12, and 0x13. Here is a brief
 * description of each of these registers:
 *
 * MSR 0x10
 *   This register is simply a cycle counter.
 *
 * MSR 0x11
 *   This register controls what type of profiling data
 *   will be gathered.
 *
 * MSRs 0x12 and 0x13
 *   These registers gather the profiling data specified in
 *   MSR 0x11.
 *
 * Each MSR is 64 bits wide. For the Pentium processor,
 * only the lower 32 bits of MSR 0x11 are valid. Bits 0-15
 * specify what data will be gathered in MSR 0x12. Bits 16-31
 * specify what data will be gathered in MSR 0x13. Both sets
 * of bits have the same format:
 *
 *   Bits 0-5 specify which hardware event will be tracked.
 *   Bit 6, if set, indicates events will be tracked in
 *     rings 0-2.
 *   Bit 7, if set, indicates events will be tracked in
 *     ring 3.
 *   Bit 8, if set, indicates cycles should be counted for
 *     the specified event. If clear, it indicates the
 *     number of events should be counted.
 *
 * Two instructions are provided for manipulating the MSRs:
 * RDMSR (Read Machine Specific Register) and WRMSR
 * (Write Machine Specific Register). These opcodes were
 * originally undocumented and therefore most assemblers don't
 * recognize them.
 * Their byte codes are provided in the macros below.
 *
 * RDMSR takes the MSR index in ecx and returns the profile data
 * in edx:eax.
 *
 * WRMSR takes the MSR index in ecx and the profiling criteria
 * in edx:eax.
 *
 * Having only two profiling registers limits profiling to
 * gathering two types of information at a time. The register
 * usage can, however, be combined in interesting ways.
 * For example, you can set one register to gather the
 * number of a specific type of event while the other gathers
 * the number of cycles for the same event. Or you can
 * gather the number of two separate events while using
 * MSR 0x10 to gather the number of cycles.
 *
 * The enumerated list provides somewhat readable labels for
 * the types of events that can be tracked.
 *
 * For more information, get hold of appendix H from the
 * Intel Pentium programmer's manual (I don't remember the
 * order number) or go to
 * http://green.kaist.ac.kr/jwhahn/art3.htm.
 * That's an article by Terje Mathisen where I got most of
 * my information.
 *
 * You may use this code however you wish. I hope it's
 * useful and I hope I got everything right.
 *
 * -Kevin
 *
 * kbaca AT skygames DOT com
 *
 *********************************************************/

#ifdef __GNUC__

/* The opcodes are emitted as raw bytes: 0x0F,0x31 = RDTSC,
 * 0x0F,0x32 = RDMSR, 0x0F,0x30 = WRMSR.
 * Each 64-bit result is stored high dword first: dst[0] = edx,
 * dst[1] = eax. Input registers are not repeated in the clobber
 * lists, and "memory" is listed because the macros store through
 * the destination pointers. */

#define RDTSC(_dst) \
__asm__ __volatile__( \
    ".byte 0x0F,0x31\n\t" \
    "movl %%edx,(%%edi)\n\t" \
    "movl %%eax,4(%%edi)" \
    : : "D" (_dst) : "eax", "edx", "memory")

#define RDMSR(_msri, _msrd) \
__asm__ __volatile__( \
    ".byte 0x0F,0x32\n\t" \
    "movl %%edx,(%%edi)\n\t" \
    "movl %%eax,4(%%edi)" \
    : : "c" (_msri), "D" (_msrd) : "eax", "edx", "memory")

#define WRMSR(_msri, _msrd) \
__asm__ __volatile__( \
    "xorl %%edx,%%edx\n\t" \
    ".byte 0x0F,0x30" \
    : : "c" (_msri), "a" (_msrd) : "edx")

#define RDMSR_0x12_0x13(_msr12, _msr13) \
__asm__ __volatile__( \
    "movl $0x12,%%ecx\n\t" \
    ".byte 0x0F,0x32\n\t" \
    "movl %%edx,(%%edi)\n\t" \
    "movl %%eax,4(%%edi)\n\t" \
    "movl $0x13,%%ecx\n\t" \
    ".byte 0x0F,0x32\n\t" \
    "movl %%edx,(%%esi)\n\t" \
    "movl %%eax,4(%%esi)" \
    : : "D" (_msr12), "S" (_msr13) : "eax", "ecx", "edx", "memory")

#define ZERO_MSR_0x12_0x13() \
__asm__ __volatile__( \
    "xorl %%edx,%%edx\n\t" \
    "xorl %%eax,%%eax\n\t" \
    "movl $0x12,%%ecx\n\t" \
    ".byte 0x0F,0x30\n\t" \
    "movl $0x13,%%ecx\n\t" \
    ".byte 0x0F,0x30" \
    : : : "eax", "ecx", "edx")

#elif defined(__WATCOMC__)

extern void RDTSC(unsigned int *dst);
#pragma aux RDTSC =\
    "db 0x0F,0x31"\
    "mov [edi],edx"\
    "mov [4+edi],eax"\
    parm [edi]\
    modify [eax edx edi];

extern void RDMSR(unsigned int msri, unsigned int *msrd);
#pragma aux RDMSR =\
    "db 0x0F,0x32"\
    "mov [edi],edx"\
    "mov [4+edi],eax"\
    parm [ecx] [edi]\
    modify [eax ecx edx edi];

extern void WRMSR(unsigned int msri, unsigned int msrd);
#pragma aux WRMSR =\
    "xor edx,edx"\
    "db 0x0F,0x30"\
    parm [ecx] [eax]\
    modify [eax ecx edx];

extern void RDMSR_0x12_0x13(unsigned int *msr12, unsigned int *msr13);
#pragma aux RDMSR_0x12_0x13 =\
    "mov ecx,0x12"\
    "db 0x0F,0x32"\
    "mov [edi],edx"\
    "mov [4+edi],eax"\
    "mov ecx,0x13"\
    "db 0x0F,0x32"\
    "mov [esi],edx"\
    "mov [4+esi],eax"\
    parm [edi] [esi]\
    modify [eax ecx edx edi esi];

extern void ZERO_MSR_0x12_0x13(void);
#pragma aux ZERO_MSR_0x12_0x13 =\
    "xor edx,edx"\
    "xor eax,eax"\
    "mov ecx,0x12"\
    "db 0x0F,0x30"\
    "mov ecx,0x13"\
    "db 0x0F,0x30"\
    modify [eax ecx edx];

#endif

enum
{
    DataRead,
    DataWrite,
    DataTLBMiss,
    DataReadMiss,
    DataWriteMiss,
    WriteHitEM,
    DataCacheLinesWritten,
    DataCacheSnoops,
    DataCacheSnoopHit,
    MemAccessBothPipes,
    BankConflict,
    MisalignedDataRef,
    CodeRead,
    CodeTLBMiss,
    CodeCacheMiss,
    SegRegLoad,
    RESERVED0,
    RESERVED1,
    Branch,
    BTBHit,
    TakenBranchOrBTBHit,
    PipelineFlush,
    InstructionsExeced,
    InstructionsExecedVPipe,
    BusUtilizationClocks,
    PipelineStalledWriteBackup,
    PipelineStalledDateMemRead,
    PipeLineStalledWriteEM,
    LockedBusCycle,
    IOReadOrWriteCycle,
    NonCacheableMemRef,
    AGI,
    RESERVED2,
    RESERVED3,
    FPOperation,
    Breakpoint0Match,
    Breakpoint1Match,
    Breakpoint2Match,
    Breakpoint3Match,
    HWInterrupt,
    DataReadOrWrite,
    DataReadOrWriteMiss
};

#define PROF_CYCLES (0x100)
#define PROF_EVENTS (0x000)
#define RING_012    (0x40)
#define RING_3      (0x80)
#define RING_0123   (RING_012 | RING_3)

/*void ProfSetProfiles(unsigned int msr12, unsigned int msr13);*/
#define ProfSetProfiles(_msr12, _msr13)\
{\
    unsigned int prof;\
\
    prof = (_msr12) | ((_msr13) << 16);\
    WRMSR(0x11, prof);\
}

/*void ProfBeginProfiles(void);*/
#define ProfBeginProfiles()\
    ZERO_MSR_0x12_0x13();

/*void ProfGetProfiles(unsigned int msr12[2], unsigned int msr13[2]);*/
#define ProfGetProfiles(_msr12, _msr13)\
    RDMSR_0x12_0x13(_msr12, _msr13);

/*void ProfZeroTimer(void);*/
#define ProfZeroTimer()\
    WRMSR(0x10, 0);

/*void ProfReadTimer(unsigned int timer[2]);*/
#define ProfReadTimer(_timer)\
    RDMSR(0x10, _timer);

/*EOF*/
--------------------------cut here----------------------------------

This is the example C program:

--------------------------cut here----------------------------------
#include <stdio.h>
#include "p5prof.h"

void main(void)
{
    unsigned int prof0[2];
    unsigned int prof1[2];
    unsigned int timer[2];
    int i;
    double d;

    ProfSetProfiles(DataRead | PROF_EVENTS | RING_0123,
                    DataRead | PROF_CYCLES | RING_0123);

    ProfBeginProfiles();
    ProfZeroTimer();

    for(d = 1.0, i = 0; i < 100; i++)
        d *= 2.0;

    ProfGetProfiles(prof0, prof1);
    ProfReadTimer(timer);

    /* Element 0 holds the high dword and element 1 the low dword,
     * so both halves are zero-padded to 8 hex digits. */
    printf("0x%08x%08x\n", timer[0], timer[1]);
    printf("0x%08x%08x\n", prof0[0], prof0[1]);
    printf("0x%08x%08x\n",
           prof1[0], prof1[1]);
}

/*EOF*/
--------------------------cut here---------------------------------

This is the batch file for compiling the example:

--------------------------cut here---------------------------------
REM
REM FILE: MAKEDJ.BAT
REM
gcc -o tst.out -O3 tst.c
strip tst.out
call coff2exe tst.out
del tst.out
stubedit tst.exe dpmi=cwsdpr0.exe
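
For anyone who wants to check the MSR 0x11 bit layout without running at ring 0, here is a minimal sketch in plain C. The helper names prof_control_word and prof_join are hypothetical and are not part of p5prof.h. The sketch assumes a 32-bit unsigned int, a compiler that supports unsigned long long, and the {high, low} storage order that the macros in the header use. The 16-bit masking is an addition for safety; the ProfSetProfiles macro itself does not mask.

```c
/* Hypothetical helpers, not part of p5prof.h. */

/* Pack the selector for MSR 0x12 (bits 0-15) and the selector for
 * MSR 0x13 (bits 16-31) into the 32-bit value written to MSR 0x11.
 * Each selector is masked to 16 bits, which the original macro
 * does not do. */
static unsigned int prof_control_word(unsigned int sel12, unsigned int sel13)
{
    return (sel12 & 0xFFFFu) | ((sel13 & 0xFFFFu) << 16);
}

/* Join a counter stored as { high dword, low dword }, the order in
 * which the header's macros store edx:eax, into one 64-bit value. */
static unsigned long long prof_join(const unsigned int v[2])
{
    return ((unsigned long long)v[0] << 32) | v[1];
}
```

With the selectors from the example program, DataRead | PROF_EVENTS | RING_0123 is 0xC0 and DataRead | PROF_CYCLES | RING_0123 is 0x1C0, so prof_control_word(0xC0, 0x1C0) evaluates to 0x01C000C0. Joining the two halves this way also avoids having to pad the low dword by hand when printing a counter.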