12 November 2021

Code Performance Timing

This post discusses performance timing of a piece of code. Basically, there are two methods of timing: the _RDTSC function or the Timer function. _RDTSC timing uses the CPU’s internal time-stamp counter, while Timer (QueryPerformanceCounter) uses the timing method Windows itself selects as the most reliable and portable. Note that code using _RDTSC isn’t portable; never use it for timing in production code, always use Timer (QueryPerformanceCounter) instead.

Setting the scene
Recently I converted a piece of GFA-BASIC 32 code to assembler and wondered how much faster it was. I used the _RDTSC function to time the execution speed of the code. The structure of the program was this:

Dim t As Large
$StepOff
. db 0x0F, 0xAE, 0xE8       // LFENCE opcode
t = _RDTSC
' execute GB32 code
. db 0x0F, 0xAE, 0xE8       // LFENCE opcode
t = _RDTSC - t : Debug "GB32-code:"; t
. db 0x0F, 0xAE, 0xE8       // LFENCE opcode
t = _RDTSC
' execute assembler code
. db 0x0F, 0xAE, 0xE8       // LFENCE opcode
t = _RDTSC - t : Debug "Asm-code:";t
$Step

The $StepOff command disables the inclusion of debugging code; the GB32 code following $StepOff is compiled without the instructions that call a Tron proc. It also disables the possibility to break (or stop) the program. The compiler switch $Step (or $StepOn) re-enables the inclusion of Tron calls (and the possibility to break). Usually, these compiler switches are only used temporarily, for a small piece of code that needs to run at optimal speed in the IDE. A program compiled to an EXE, LG32, or GLL does not include the $Step functionality anyway.

The test program gave me the following results: the GB32 code executed in ~45,000 ticks and the assembler in ~36,000 ticks. But what do these values mean? Are these the real CPU clock cycles necessary to execute the code?

The _RDTSC function
The answer to this question is no, the returned timing values do not represent the real number of assembler clock-cycles, as is sometimes advertised. The test values simply give the time that passed between both _RDTSC calls. _RDTSC reads the time-stamp counter (TSC) of the CPU core it executes on; the value gives the number of elapsed ticks of the CPU’s internal clock. Maybe in the past, in the era of single-core processors that executed one app at a time (MS-DOS), the value might have represented the actual number of clock-cycles spent on the code. But not anymore. On a multi-core CPU the Windows scheduler may move the thread from one core to another while it runs, so the two _RDTSC instructions could be executed on different cores and read different counters. Only when the cores synchronize their time-stamp counters (TSC) are the _RDTSC values reliable. You can test whether the CPU supports the _RDTSC instruction by testing bit 4 of the value in _CPUIDD:

Debug Btst(_CPUIDD, 4)  ' must give -1 (True)
Debug InvariantTSC()    ' must return True

To check if the TSC is invariant use:

Function InvariantTSC() As Bool Naked
  _EAX = 0
  . push ebx            ; always save ebx
  . mov eax, $80000007  ; Advanced Power Management Information
  . CPuid               ; invoke cpuid
  GetRegs               ' store the register contents
  . pop ebx             ; restore ebx to access local vars
  InvariantTSC = Btst(_EDX, 8)
EndFunc

If all is well you can use _RDTSC for performance timing. Only, what are we measuring exactly? The timing values I measured were always different and sometimes ten times more than average… The reasons for these irregular values are, for instance, multitasking, memory latency, and CPU power-saving. The test code will probably be interrupted by the OS to process other applications, writing and reading memory costs time, and due to power-saving the CPU won’t execute the code at its highest clock frequency. These issues remain regardless of the timing method you choose. I take a closer look at them at the end of this post.

Flush the CPU before using _RDTSC
There is one other thing related to the use of _RDTSC. The CPU does not guarantee that the machine code is executed in the order it is compiled. In other words, the rdtsc instruction could be executed when the code to test has already started executing. To overcome this problem the CPU’s pipeline must be flushed before invoking the rdtsc assembler instruction. This is done with either the cpuid instruction or the lfence assembler instruction. We’ll use the lfence instruction, because cpuid destroys the registers eax, ebx, ecx, and edx; ebx is essential to GB32 and destroying it might give access-violation errors.

For a proper working of TSC timing insert the lfence instruction before each rdtsc instruction:

. db 0x0F, 0xAE, 0xE8       // LFENCE opcode
t = _RDTSC

The lfence instruction is not part of the GB32 built-in assembler, so we must insert the opcode bytes of lfence into the code stream manually. (You could use cpuid instead, but then first save ebx (push ebx) and restore it afterwards (pop ebx); see the InvariantTSC() code above for an example.)
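
If you prefer the cpuid approach, a minimal sketch could look like this (it uses the same save/restore pattern as InvariantTSC above):

. push ebx            ; save ebx, GB32 needs it to address local variables
. mov eax, 0          ; cpuid level 0
. CPuid               ; serializes execution; destroys eax, ebx, ecx, edx
. pop ebx             ; restore ebx before executing GB32 code again
t = _RDTSC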

Convert the TSC values to micro-seconds
To be able to compare timing results among different PCs we need to convert the timing values to a standard unit: seconds or micro-seconds. To convert a number of clock ticks to micro-seconds, the number must be divided by the CPU’s clock frequency. On my PC the CPU’s internal clock has a base frequency of 1.51 GHz, as reported by the Windows Settings. However, I can obtain the frequency (in MHz) from the CPU as well:

Function GetCPUMhz() As Long Naked
  _EAX = 0              ' clear global variable
  . push ebx            ' save, ebx is used for accessing local vars
  . mov eax, 0          ' get CPUID level
  . CPuid
  . cmp eax, $16        ' support level $16?
  . jb .end             ' no (max level below $16), return 0
  . mov eax, $16        ' get CPUID level $16
  . CPuid
  GetRegs               ' copy registers (_EAX, ..)
  .end:
  . pop ebx             ' restore ebx to access local vars
  GetCPUMhz = _EAX      ' set the return value
EndFunc

The GetCPUMhz function returned 1500 MHz (1.5 GHz). So, my ~45,000 clock ticks take ~45,000 / 1,500,000,000 seconds, or, by multiplying the result with 1E6 to convert to micro-seconds, ~30 µs. The assembler version of my code took ~24 µs. (The values are an average after running the test code multiple times.)
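
Putting this together, the conversion could be wrapped in a small helper. TicksToMicroSec is not a built-in function, just a sketch built on GetCPUMhz:

Function TicksToMicroSec(ticks As Large) As Double
  ' dividing ticks by the frequency in MHz yields micro-seconds directly
  Dim mhz As Long
  mhz = GetCPUMhz()
  If mhz > 0 Then TicksToMicroSec = ticks / mhz ' returns 0 if level $16 is unsupported
EndFunc

Debug TicksToMicroSec(45000)    ' prints 30 on a 1500 MHz CPU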

Should you use _RDTSC?
Microsoft strongly discourages using the RDTSC processor instruction, because “you won’t get reliable results on some versions of Windows”; see Acquiring high-resolution time stamps - Win32 apps | Microsoft Docs. Instead, Microsoft encourages you to use the QueryPerformanceCounter() API. However, these warnings apply to using RDTSC in production code, not to a temporary test situation on one PC.

Using QueryPerformanceCounter
The alternative to RDTSC is the portable QueryPerformanceCounter() API together with the QueryPerformanceFrequency() API. Well, that’s easy in GFA-BASIC 32, because these APIs are implemented by the _TimerCount and _TimerFreq functions respectively. Instead of using _RDTSC you could use _TimerCount, for instance:

Dim t As Large
$StepOff
t = _TimerCount
' execute code
t = _TimerCount - t : Debug "Ticks:"; t
$Step

For my test code the intervals with _TimerCount were ~30 and ~24 ticks, which are the same values as the reported _RDTSC intervals in micro-seconds. So, let’s see what these timer-count values mean. The values tell us that the clock used by QPC was incremented by 30 and 24 ticks respectively. As it turns out, QPC uses a different clock than RDTSC. The frequency of the QPC clock is returned by _TimerFreq and is, on my Surface 4, 1,000,000 Hz (1 MHz). So, the 30 timer-count ticks correspond to 30 µs (interval / _TimerFreq * 1E6), and the 24 ticks to 24 µs. So, as might be expected, both methods return (approximately) the same timings.

At start-up Windows decides how to implement QPC. It might use the CPU’s TSC for this purpose, but it might also use another clock provided by the PC’s hardware. Windows selects the most reliable method to obtain timing intervals. As you can see, on my PC Windows selected an alternative time source for QPC: a 1 MHz clock.
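
You can check which frequency, and hence which time source, Windows selected on your PC by simply printing the QPC frequency:

Debug "QPC frequency:"; _TimerFreq; " Hz"   ' 1000000 (1 MHz) on my Surface 4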

GB32 has the right function for you: Timer
Now that we know how to time intervals, we can use the GFA-BASIC 32 Timer function rather than dividing the _TimerCount interval by _TimerFreq ourselves. The Timer function returns the current time in seconds as a Double:

Timer <=> _TimerCount / _TimerFreq 

To convert to micro-seconds, multiply the interval between two Timer calls by 1E6.

t# = Timer
' execute code
t# = Timer - t# : Debug "Time in µs:"; Round(t# * 1E6)

After changing my code to use Timer the results were again 30 µs and 24 µs.

How fast are my test results, or how fast is 1 µs? On my PC 1 µs takes 1500 TSC (CPU) ticks and 1 QPC tick. Code that executes in less than 1500 TSC ticks (1 µs) cannot be measured with QPC (at least on my computer). You can calculate your Timer’s resolution as follows:

Debug "Timer-resolution:"; 1 / _TimerFreq * 10e6;" µs"

Other timing issues
Whether you use _RDTSC (with lfence) or Timer (QPC), the timing results simply return the elapsed time as measured by one of the internal hardware clocks. The interval is not the exact time necessary to accomplish the task. Because Windows is a multi-tasking OS, the thread may be moved to another core and may be interrupted by the Windows scheduler to execute other running processes. These issues can be addressed by restricting the test code to one core using the SetThreadAffinityMask() API and by reducing the chance that this core is interrupted using the SetThreadPriority() API. The structure of such a program could look like this:

Dim Mask As Long = SetThreadAffinityMask(GetCurrentThread(), 1)
~SetThreadPriority(GetCurrentThread(), 2)   ' THREAD_PRIORITY_HIGHEST
Try
  $StepOff
  Dim t# = Timer
  ' execute code
  t# = Timer - t# : Debug "Time in µs:"; Round(t# * 1E6)
  $StepOn
Catch
  MsgBox ErrStr("Timing code")
EndCatch
~SetThreadAffinityMask(GetCurrentThread(), Mask)
~SetThreadPriority(GetCurrentThread(), 0)   ' THREAD_PRIORITY_NORMAL

This might give more reliable results if you have multiple cores. (Again, you won’t do this in production code. It will affect your application's performance by restricting processing to one core or by creating a bottleneck on a single core if multiple threads set their affinity to the same core when calling QueryPerformanceCounter.)

Making the test code run on one core still doesn’t return exact timing values. The executed code must be loaded into the CPU, and reading the code from RAM might be slow(er) or fail (page faults), so that the code has to be re-read from memory, taking extra time. In addition, the tested code might load and store values in memory and suffer from memory latency as well. Finally, on a laptop the actual CPU frequency might be lower than its base frequency due to CPU power-saving. Unfortunately, timing is not an exact science.
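
One practical mitigation is to repeat the measurement a number of times and keep the fastest run, since interruptions, page faults, and power-saving only ever add time. A minimal sketch (the comment stands in for the code to test):

Dim i As Long, t#, best#
best# = 1E30                      ' start with a huge value
$StepOff
For i = 1 To 100                  ' repeat the measurement
  t# = Timer
  ' execute code to test
  t# = Timer - t#
  If t# < best# Then best# = t#   ' keep the fastest run
Next i
$Step
Debug "Best time in µs:"; Round(best# * 1E6)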

See also: Acquiring high-resolution time stamps - Win32 apps | Microsoft Docs.

See also: CPUID on Wikipedia
