program to use a set of test data instead of user input in order to make profiling feasible.
• Interference from other processes. The profiler measures not only the time spent in the
program under test but also the time used by all other processes running on the same
computer, including the profiler itself.
• Function addresses are obscured in optimized programs. The profiler identifies any hot
spots in the program by their address and attempts to translate these addresses to
function names. But a highly optimized program is often reorganized in such a way that
there is no clear correspondence between function names and code addresses. The
names of inlined functions may not be visible to the profiler at all. The result is a
misleading report of which functions take the most time.
• Uses debug version of the code. Some profilers require that the code you are testing
contains debug information in order to identify individual functions or code lines. The
debug version of the code is not optimized, so its timings may not be representative of
the optimized version.
• Jumps between CPU cores. A process or thread does not necessarily stay in the same
processor core on multi-core CPUs, but event-counters do. This results in meaningless
event counts for threads that jump between multiple CPU cores. You may need to lock a
thread to a specific CPU core by setting a thread affinity mask.
• Poor reproducibility. Delays in program execution may be caused by random events that
are not reproducible. Events such as task switches and garbage collection can occur at
random times and make parts of the program appear to take longer than normal.
There are various alternatives to using a profiler. A simple alternative is to run the program
in a debugger and press break while the program is running. If there is a hot spot that uses
90% of the CPU time then there is a 90% chance that the break will occur in this hot spot.
Repeating the break a few times may be enough to identify a hot spot. Use the call stack in
the debugger to identify the circumstances around the hot spot.
Sometimes, the best way to identify performance bottlenecks is to put measurement
instruments directly into the code rather than using a ready-made profiler. This does not
solve all the problems associated with profiling, but it often gives more reliable results. You
may add counter variables that count how many times each part of the program is executed.
Furthermore, you may read the time before and after each of the most important or critical
parts of the program to measure how much time each part takes. See page 167 for further
discussion of this method.
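The counter and timing instruments described above can be sketched as follows. This is a minimal illustration rather than a prescribed design: the names RegionStats, ScopedTimer, sumOfSquares and report are hypothetical, and std::chrono::steady_clock is assumed to have sufficient resolution for the intervals being measured.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>

// Hypothetical record holding the measurements for one part of the program.
struct RegionStats {
    std::uint64_t calls = 0;                      // number of times the part was executed
    std::chrono::steady_clock::duration total{};  // accumulated elapsed time
};

// Reads the clock when constructed and again when destroyed,
// and adds the difference to the given RegionStats.
class ScopedTimer {
public:
    explicit ScopedTimer(RegionStats& stats)
        : stats_(stats), start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        stats_.total += std::chrono::steady_clock::now() - start_;
        ++stats_.calls;
    }
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;
private:
    RegionStats& stats_;
    std::chrono::steady_clock::time_point start_;
};

// Placeholder for a critical part of the program.
double sumOfSquares(int n, RegionStats& stats) {
    assert(n >= 0);
    ScopedTimer timer(stats);  // measures the whole function body
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += double(i) * double(i);
    return sum;
}

// Prints the execution count and accumulated time for one part.
void report(const char* name, const RegionStats& stats) {
    double us = std::chrono::duration<double, std::micro>(stats.total).count();
    std::printf("%s: %llu calls, %.1f us total\n",
                name, (unsigned long long)stats.calls, us);
}
```

A caller would keep one RegionStats object per part of interest, pass it to the instrumented function, and call report at the end of the test run.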
Your measurement code should have #if directives around it so that it can be disabled in
the final version of the code. Inserting your own profiling instruments in the code itself is a
very useful way to keep track of the performance during the development of a program.
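One possible way to wrap measurement code in #if directives is shown below. The switch PROFILE_ENABLED, the macro PROFILE_COUNT and the counter g_innerLoopCount are hypothetical names chosen for this sketch. It assumes that the final version is built with PROFILE_ENABLED defined as 0, for example with the compiler option -DPROFILE_ENABLED=0.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical switch: measurement is on unless the build defines it as 0.
#ifndef PROFILE_ENABLED
#define PROFILE_ENABLED 1
#endif

#if PROFILE_ENABLED
static std::uint64_t g_innerLoopCount = 0;       // counts loop iterations
#define PROFILE_COUNT(counter) (++(counter))
#else
#define PROFILE_COUNT(counter) ((void)0)         // expands to nothing measurable
#endif

// Placeholder for a part of the program under test.
int countEven(const int* data, int n) {
    assert(data != nullptr || n == 0);
    int evens = 0;
    for (int i = 0; i < n; ++i) {
        PROFILE_COUNT(g_innerLoopCount);         // removed when profiling is disabled
        if ((data[i] & 1) == 0) ++evens;
    }
    return evens;
}

// Prints the collected counts; does nothing in the final version.
void printProfile() {
#if PROFILE_ENABLED
    std::printf("inner loop executed %llu times\n",
                (unsigned long long)g_innerLoopCount);
#endif
}
```

Because the disabled macro discards its argument, the counter variable does not need to exist in the final version, and the instrumented code compiles unchanged in both configurations.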
The time measurements may require a very high resolution if time intervals are short. In
Windows, you can use the GetTickCount function, which has a resolution of several
milliseconds, or the QueryPerformanceCounter function, which has a resolution better than
one microsecond. A still higher resolution can be obtained with the time stamp
counter in the CPU, which counts at the CPU clock frequency (_rdtsc() or __rdtsc()).
The time stamp counter becomes invalid if a thread jumps between different CPU cores.
You may have to fix the thread to a specific CPU core during time measurements to avoid
this (SetThreadAffinityMask in Windows, sched_setaffinity in Linux).
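The following is a minimal sketch of fixing a thread to one core and reading the time stamp counter around a piece of code. It assumes Linux with glibc and a GCC or Clang compiler, which define _GNU_SOURCE for C++ so that sched_getcpu and the CPU_SET macros are available. The function names pinToCurrentCore, readCounter, measureTicks and spin are hypothetical. On non-x86 targets the sketch falls back to std::chrono, which is an assumption of this example and not part of the method described above.

```cpp
#include <cassert>
#include <cstdint>
#include <sched.h>   // sched_setaffinity, sched_getcpu, cpu_set_t (Linux, glibc)

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   // __rdtsc (GCC and Clang on x86)
// Reads the time stamp counter. The instruction is not serializing, so
// out-of-order execution may blur measurements of very short intervals.
static inline std::uint64_t readCounter() { return __rdtsc(); }
#else
#include <chrono>
// Fallback for other CPUs: nanoseconds from a monotonic clock.
static inline std::uint64_t readCounter() {
    return (std::uint64_t)std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::steady_clock::now().time_since_epoch()).count();
}
#endif

// Fixes the calling thread to the core it is currently running on.
// Returns the core number, or -1 if the affinity could not be set.
int pinToCurrentCore() {
    int cpu = sched_getcpu();
    if (cpu < 0) return -1;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // A pid of 0 refers to the calling thread.
    if (sched_setaffinity(0, sizeof(set), &set) != 0) return -1;
    return cpu;
}

// Returns the number of counter ticks spent in fn.
std::uint64_t measureTicks(void (*fn)()) {
    assert(fn != nullptr);
    std::uint64_t before = readCounter();
    fn();
    std::uint64_t after = readCounter();
    return after - before;
}

// Placeholder workload; volatile keeps the loop from being optimized away.
void spin() {
    volatile int x = 0;
    for (int i = 0; i < 100000; ++i) x = x + i;
}
```

A typical use would call pinToCurrentCore() once before the measurements, then call measureTicks(spin) several times and compare the results, since the first run often takes longer than the following ones.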
The program should be tested with a realistic set of test data. The test data should contain a
typical degree of randomness in order to get a realistic number of cache misses and branch
mispredictions.
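Test data with a controlled degree of randomness can be generated with a seeded pseudo-random generator so that every test run sees the same data. The sketch below is only an illustration: the function names makeTestData and countBelow, the value range, and the choice of std::mt19937 are assumptions of this example. How much randomness is typical depends on the application.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Generates n pseudo-random integers in the range [lo, hi].
// The same seed always produces the same data, which makes runs repeatable.
std::vector<int> makeTestData(std::size_t n, std::uint32_t seed, int lo, int hi) {
    assert(lo <= hi);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist(lo, hi);
    std::vector<int> data(n);
    for (int& x : data) x = dist(rng);
    return data;
}

// Placeholder for code under test: the branch depends on the data,
// so random input makes it hard to predict.
std::size_t countBelow(const std::vector<int>& data, int limit) {
    std::size_t count = 0;
    for (int x : data) {
        if (x < limit) ++count;
    }
    return count;
}
```

Running countBelow on sorted data and on the random data from makeTestData is one way to see how much the input order affects the measured time.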