there is no clear correspondence between function names and code addresses. The
names of inlined functions may not be visible at all to the profiler. The result will be
misleading reports of which functions take the most time.
• Uses debug version of the code. Some profilers require that the code you are testing
contains debug information in order to identify individual functions or code lines. The
debug version of the code is not optimized, so its timings may differ considerably from
those of the optimized release version.
• Jumps between CPU cores. A process or thread does not necessarily stay in the same
processor core on multi-core CPUs, but event-counters do. This results in meaningless
event counts for threads that jump between multiple CPU cores. You may need to lock a
thread to a specific CPU core by setting a thread affinity mask.
• Poor reproducibility. Delays in program execution may be caused by random events that
are not reproducible. Events such as task switches and garbage collection can occur at
random times and make parts of the program appear to take longer than they normally do.
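One common way to reduce the effect of such random delays is to repeat a measurement several times and look at the minimum or the median rather than a single run. The following is a minimal sketch of that idea. The helper name measure_repeated and the choice of std::chrono::steady_clock are assumptions made for illustration, not something prescribed by the text.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical helper: runs `work` `runs` times and returns the elapsed
// times in nanoseconds, sorted in ascending order.
template <typename Work>
std::vector<long long> measure_repeated(Work work, int runs) {
    assert(runs > 0);
    std::vector<long long> ns;
    ns.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        ns.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count());
    }
    // Sorted so that front() is the minimum and the middle element is the median.
    std::sort(ns.begin(), ns.end());
    return ns;
}
```

With the sorted result t, t.front() is the fastest run and t[t.size() / 2] is the median. A large gap between the two suggests that some runs were disturbed by events outside the program.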
There are various alternatives to using a profiler. A simple alternative is to run the program
in a debugger and press break while the program is running. If there is a hot spot that uses
90% of the CPU time then there is a 90% chance that the break will occur in this hot spot.
Repeating the break a few times may be enough to identify a hot spot. Use the call stack in
the debugger to identify the circumstances around the hot spot.
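The reasoning behind repeating the break can be made explicit. Assume (this is not stated above) that the break instants are independent and fall at uniformly random points in the running time. If the hot spot takes a fraction p of the CPU time, the probability that at least one of n breaks lands in it is:

```latex
P(\text{at least one hit in } n \text{ breaks}) = 1 - (1 - p)^n
% p = 0.9, n = 3:   1 - 0.1^3    = 0.999
% p = 0.3, n = 10:  1 - 0.7^{10} \approx 0.972
```

Under this assumption, a dominant hot spot shows up after very few breaks, while a part that takes 30% of the time needs about ten breaks to be seen with high probability.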
Sometimes, the best way to identify performance bottlenecks is to put measurement
instruments directly into the code rather than using a ready-made profiler. This does not
solve all the problems associated with profiling, but it often gives more reliable results.
You may add counter variables that count how many
times each part of the program is executed. Furthermore, you may read the time before and
after each of the most important or critical parts of the program to measure how much time
each part takes. See page 164 for further discussion of this method.
Your measurement code should have #if directives around it so that it can be disabled in
the final version of the code. Inserting your own profiling instruments in the code itself is a
very useful way to keep track of the performance during the development of a program.
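A minimal sketch of such instrumentation might look as follows. The macro name PROFILING, the variables g_calls and g_time, and the function critical_part are hypothetical names chosen for this example. std::chrono::steady_clock is used here as a portable timer, which is an assumption rather than the method recommended in the text.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>

#ifndef PROFILING
#define PROFILING 1   // set to 0 (e.g. -DPROFILING=0) to disable in the final version
#endif

#if PROFILING
// Hypothetical instruments: a call counter and an accumulated elapsed time.
static long long g_calls = 0;
static std::chrono::steady_clock::duration g_time{};
#endif

// Stand-in for a critical part of the program: sum of squares below n.
long long critical_part(int n) {
    assert(n >= 0);
#if PROFILING
    ++g_calls;
    auto start = std::chrono::steady_clock::now();
#endif
    long long sum = 0;
    for (int i = 0; i < n; ++i) sum += static_cast<long long>(i) * i;
#if PROFILING
    g_time += std::chrono::steady_clock::now() - start;
#endif
    return sum;
}

// Prints the collected numbers. Does nothing when PROFILING is 0.
void report_profile() {
#if PROFILING
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(g_time).count();
    std::printf("critical_part: %lld calls, %lld us total\n",
                g_calls, static_cast<long long>(us));
#endif
}
```

Calling report_profile() at the end of the program prints the counts. With PROFILING set to 0, the instruments compile away and add no overhead to the final version.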
The time measurements may require a very high resolution if time intervals are short. In
Windows, GetTickCount has a resolution that is typically no better than 10 to 16
milliseconds, while QueryPerformanceCounter typically resolves less than a microsecond.
A still higher resolution can be obtained with the time stamp
counter in the CPU, which counts at the CPU clock frequency (in Windows: __rdtsc()).
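The following sketch shows one way to read the time stamp counter around a piece of code. The names read_ticks and min_ticks_for are hypothetical. The sketch assumes an x86 compiler that provides the __rdtsc() intrinsic (in <intrin.h> for Microsoft compilers, in <x86intrin.h> for GCC and Clang). On other platforms it falls back to std::chrono::steady_clock, so the unit of the result then differs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#define HAVE_TSC 1
#elif defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>
#define HAVE_TSC 1
#else
#include <chrono>
#define HAVE_TSC 0
#endif

// Reads the time stamp counter on x86, or steady_clock ticks elsewhere.
inline std::uint64_t read_ticks() {
#if HAVE_TSC
    // Note: __rdtsc() is not a serializing instruction, so out-of-order
    // execution can blur measurements of very short code sequences.
    return __rdtsc();
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Measures `work` `repeats` times and returns the smallest tick count seen.
template <typename Work>
std::uint64_t min_ticks_for(Work work, int repeats) {
    assert(repeats > 0);
    std::uint64_t best = std::numeric_limits<std::uint64_t>::max();
    for (int i = 0; i < repeats; ++i) {
        std::uint64_t before = read_ticks();
        work();
        std::uint64_t after = read_ticks();
        best = std::min(best, after - before);
    }
    return best;
}
```

Taking the minimum over a few repetitions is a design choice made here to filter out runs disturbed by interrupts or task switches. The first run is often slower because code and data are not yet in the cache.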
The time stamp counter becomes invalid if a thread jumps between different CPU cores.
You may have to fix the thread to a specific CPU core during time measurements to avoid
this (SetThreadAffinityMask in Windows, sched_setaffinity in Linux).
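As an illustration, the sketch below fixes the calling thread to one core in Linux with sched_setaffinity. It is Linux-specific and assumes glibc, where g++ makes the CPU_* macros available by default. The function names first_allowed_cpu and pin_to_cpu are hypothetical. The Windows equivalent with SetThreadAffinityMask is not shown.

```cpp
#include <cassert>
#include <sched.h>   // sched_getaffinity, sched_setaffinity, cpu_set_t (Linux, glibc)

// Returns the lowest-numbered CPU the calling thread may run on, or -1 on error.
int first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) return cpu;
    }
    return -1;
}

// Restricts the calling thread (pid 0 means "this thread") to a single CPU.
// Returns true on success.
bool pin_to_cpu(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t one;
    CPU_ZERO(&one);
    CPU_SET(cpu, &one);
    return sched_setaffinity(0, sizeof(one), &one) == 0;
}
```

Choosing the first allowed CPU rather than a fixed core number is a precaution for environments such as containers, where the process may not be permitted to run on core 0.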
The program should be tested with a realistic set of test data. The test data should contain a
typical degree of randomness in order to get a realistic number of cache misses and branch
mispredictions.
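When no real data set is available, randomized test data can be generated in a reproducible way, for example as in the sketch below. The function name make_test_data and the uniform distribution are assumptions made for illustration. Real data are often skewed, and the distribution should be chosen to resemble the data the program will actually meet.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical generator: n pseudo-random integers in [lo, hi].
// A fixed seed gives the same sequence on every run with the same
// standard library, so measurements can be compared between runs.
std::vector<int> make_test_data(std::size_t n, int lo, int hi, unsigned seed) {
    assert(lo <= hi);
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(lo, hi);
    std::vector<int> data(n);
    for (int& x : data) x = dist(gen);
    return data;
}
```

A branch such as if (x < 50) applied to data from make_test_data(n, 0, 99, seed) is taken unpredictably about half of the time, whereas the same branch on sorted or constant data is predicted almost perfectly and would give unrealistically good timings.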
Once the most time-consuming parts of the program have been found, it is important
to focus the optimization efforts on those parts only. Critical pieces of code
can be further tested and investigated by the methods described on page 164.
A profiler is most useful for finding problems that relate to CPU-intensive code. However,
many programs spend more time loading files or accessing databases, networks and other
resources than doing arithmetic operations. The most common time-consumers are discussed in the
following sections.