C++ Performance Optimization Guide: Platforms, Language, and Tuning Strategies
Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms is Agner Fog's detailed performance optimization manual for C++ programmers. The book covers not only language-level optimization techniques but also explores compiler choice, hardware platforms, operating systems, programming language versions, the choice of function libraries and user interface frameworks, and ways to overcome the limitations of the C++ language.
In chapter 1, "Introduction", the author stresses the necessity and challenges of optimizing software, pointing out that optimization is not always a high-payoff activity and that the effort invested must be weighed against the result. The chapter explains the different levels of optimization, including hardware costs, microprocessor characteristics, the influence of the operating system, and choosing the programming language and compiler that best fit the requirements.
Chapter 2 discusses how to choose a suitable platform: the hardware platform (processor speed and architecture), the operating system's effect on performance, the programming language (the latest version of C++), compiler optimization options, the choice of function libraries, and the choice of user interface framework, since all of these directly affect how efficiently a program runs.
The third part focuses on identifying and resolving performance bottlenecks. The author guides the reader through the time cost of each operation (e.g. in clock cycles), the use of profiling tools to locate hot spots, and optimization strategies for program installation, updates, and loading. Dynamic linking and position-independent code, file access, system database queries, graphics, system resource usage, network communication, and memory access patterns are each examined in turn.
Chapter 4 connects performance optimization with the user experience, discussing how to keep a program usable and responsive while pursuing speed; a balance must be found between performance and functionality.
Finally, chapter 5 discusses the importance of choosing the best algorithm, since the choice of algorithm directly determines how efficiently a program executes. Different problems may call for different algorithms, and efficient algorithm design is a key part of optimization.
The book gives C++ developers a comprehensive optimization framework, helping them understand how to raise a program's performance across platforms while keeping the user experience in mind. It is an indispensable reference for any programmer who strives for top performance in C++.
Integer overflow is another security problem. The official C standard says that the behavior
of signed integers in case of overflow is "undefined". This allows the compiler to ignore
overflow or assume that it doesn't occur. In the case of the Gnu compiler, the assumption
that signed integer overflow doesn't occur has the unfortunate consequence that it allows
the compiler to optimize away an overflow check. There are a number of possible remedies
against this problem: (1) check for overflow before it occurs, (2) use unsigned integers -
they are guaranteed to wrap around, (3) trap integer overflow with the option -ftrapv, but
this is extremely inefficient, (4) get a compiler warning for such optimizations with option
-Wstrict-overflow=2, or (5) make the overflow behavior well-defined with option
-fwrapv or -fno-strict-overflow.
You may deviate from the above security advice in critical parts of the code where speed is important. This can be permissible if the unsafe code is limited to well-tested functions, classes, templates, or modules with a well-defined interface to the rest of the program.
3 Finding the biggest time consumers
3.1 How much is a clock cycle?
In this manual, I am using CPU clock cycles rather than seconds or microseconds as a time
measure. This is because computers have very different speeds. If I write that something
takes 10 μs today, then it may take only 5 μs on the next generation of computers and my
manual will soon be obsolete. But if I write that something takes 10 clock cycles then it will
still take 10 clock cycles even if the CPU clock frequency is doubled.
The length of a clock cycle is the reciprocal of the clock frequency. For example, if the clock frequency is 2 GHz, then the length of a clock cycle is 1 / (2 GHz) = 0.5 ns.
A clock cycle on one computer is not always comparable to a clock cycle on another
computer. The Pentium 4 (NetBurst) CPU is designed for a higher clock frequency than
other CPUs, but it uses more clock cycles than other CPUs for executing the same piece of
code in general.
Assume that a loop in a program repeats 1000 times and that there are 100 floating point
operations (addition, multiplication, etc.) inside the loop. If each floating point operation
takes 5 clock cycles, then we can roughly estimate that the loop will take 1000 * 100 * 5 *
0.5 ns = 250 μs on a 2 GHz CPU. Should we try to optimize this loop? Certainly not! 250 μs
is less than 1/50 of the time it takes to refresh the screen. There is no way the user can see
the delay. But if the loop is inside another loop that also repeats 1000 times then we have
an estimated calculation time of 250 ms. This delay is just long enough to be noticeable but
not long enough to be annoying. We may decide to do some measurements to see if our
estimate is correct or if the calculation time is actually more than 250 ms. If the response
time is so long that the user actually has to wait for a result then we will consider if there is
something that can be improved.
3.2 Use a profiler to find hot spots
Before you start to optimize anything, you have to identify the critical parts of the program.
In some programs, more than 99% of the time is spent in the innermost loop doing
mathematical calculations. In other programs, 99% of the time is spent on reading and
writing data files while less than 1% goes to actually doing something with these data. It is very important to optimize the parts of the code that matter rather than the parts of the code that use only a small fraction of the total time. Optimizing less critical parts of the code
will not only be a waste of time, it also makes the code less clear and more difficult to debug
and maintain.
Most compiler packages include a profiler that can tell how many times each function is
called and how much time it uses. There are also third-party profilers such as AQtime, Intel
VTune and AMD CodeAnalyst.
There are several different profiling methods:
Instrumentation: The compiler inserts extra code at each function call to count how
many times the function is called and how much time it takes.
Debugging: The profiler inserts temporary debug breakpoints at every function or every
code line.
Time-based sampling: The profiler tells the operating system to generate an interrupt,
e.g. every millisecond. The profiler counts how many times an interrupt occurs in each
part of the program. This requires no modification of the program under test, but is less
reliable.
Event-based sampling: The profiler tells the CPU to generate interrupts at certain
events, for example every time a thousand cache misses have occurred. This makes it
possible to see which part of the program has most cache misses, branch
mispredictions, floating point exceptions, etc. Event-based sampling requires a CPU-
specific profiler. For Intel CPUs use Intel VTune, for AMD CPUs use AMD CodeAnalyst.
Unfortunately, profilers are often unreliable. They sometimes give misleading results or fail
completely because of technical problems.
Some common problems with profilers are:
Coarse time measurement. If time is measured with millisecond resolution and the
critical functions take microseconds to execute then measurements can become
imprecise or simply zero.
Execution time too small or too long. If the program under test finishes in a short time
then the sampling generates too little data for analysis. If the program takes too long
time to execute then the profiler may sample more data than it can handle.
Waiting for user input. Many programs spend most of their time waiting for user input or
network resources. This time is included in the profile. It may be necessary to modify the
program to use a set of test data instead of user input in order to make profiling feasible.
Interference from other processes. The profiler measures not only the time spent in the
program under test but also the time used by all other processes running on the same
computer, including the profiler itself.
Function addresses are obscured in optimized programs. The profiler identifies any hot
spots in the program by their address and attempts to translate these addresses to
function names. But a highly optimized program is often reorganized in such a way that
there is no clear correspondence between function names and code addresses. The
names of inlined functions may not be visible at all to the profiler. The result will be
misleading reports of which functions take most time.
Uses debug version of the code. Some profilers require that the code you are testing
contains debug information in order to identify individual functions or code lines. The
debug version of the code is not optimized.
Jumps between CPU cores. A process or thread does not necessarily stay in the same
processor core on multi-core CPUs, but event-counters do. This results in meaningless
event counts for threads that jump between multiple CPU cores. You may need to lock a
thread to a specific CPU core by setting a thread affinity mask.
Poor reproducibility. Delays in program execution may be caused by random events that
are not reproducible. Events such as task switches and garbage collection can occur at
random times and make parts of the program appear to take longer than they normally would.
There are various alternatives to using a profiler. A simple alternative is to run the program
in a debugger and press break while the program is running. If there is a hot spot that uses
90% of the CPU time then there is a 90% chance that the break will occur in this hot spot.
Repeating the break a few times may be enough to identify a hot spot. Use the call stack in
the debugger to identify the circumstances around the hot spot.
Sometimes, the best way to identify performance bottlenecks is to put measurement
instruments directly into the code rather than using a ready-made profiler. This does not
solve all the problems associated with profiling, but it often gives more reliable results. If you
are not satisfied with the way a profiler works then you may put the desired measurement
instruments into the program itself. You may add counter variables that count how many
times each part of the program is executed. Furthermore, you may read the time before and
after each of the most important or critical parts of the program to measure how much time
each part takes. See page 157 for further discussion of this method.
Your measurement code should have #if directives around it so that it can be disabled in
the final version of the code. Inserting your own profiling instruments in the code itself is a
very useful way to keep track of the performance during the development of a program.
The time measurements may require a very high resolution if time intervals are short. In
Windows, you can use the GetTickCount or QueryPerformanceCounter functions for
millisecond resolution. A much higher resolution can be obtained with the time stamp
counter in the CPU, which counts at the CPU clock frequency (in Windows: __rdtsc()).
The time stamp counter becomes invalid if a thread jumps between different CPU cores.
You may have to fix the thread to a specific CPU core during time measurements to avoid
this. (In Windows, SetThreadAffinityMask, in Linux, sched_setaffinity).
The program should be tested with a realistic set of test data. The test data should contain a
typical degree of randomness in order to get a realistic number of cache misses and branch
mispredictions.
When the most time-consuming parts of the program have been found, then it is important
to focus the optimization efforts on the time consuming parts only. Critical pieces of code
can be further tested and investigated by the methods described on page 157.
A profiler is most useful for finding problems that relate to CPU-intensive code. But many
programs use more time loading files or accessing databases, network and other resources
than doing arithmetic operations. The most common time-consumers are discussed in the
following sections.
3.3 Program installation
The time it takes to install a program package is not traditionally considered a software
optimization issue. But it is certainly something that can steal the user's time. The time it
takes to install a software package and make it work cannot be ignored if the goal of
software optimization is to save time for the user. With the high complexity of modern
software, it is not unusual for the installation process to take more than an hour. Neither is it
unusual that a user has to reinstall a software package several times in order to find and
resolve compatibility problems.
Software developers should take installation time and compatibility problems into account
when deciding whether to base a software package on a complex framework requiring
many files to be installed.
The installation process should always use standardized installation tools. It should be
possible to select all installation options at the start so that the rest of the installation
process can proceed unattended. Uninstallation should also proceed in a standardized
manner.
3.4 Automatic updates
Many software programs automatically download updates through the Internet at regular
time intervals. Some programs search for updates every time the computer starts up, even if
the program is never used. A computer with many such programs installed can take several
minutes to start up, which is a total waste of the user's time. Other programs spend time
searching for updates each time the program starts. The user may not need the updates if
the current version satisfies the user's needs. The search for updates should be optional
and off by default unless there is a compelling security reason for updating. The update
process should run in a low priority thread, and only if the program is actually used. No
program should leave a background process running when it is not in use. The installation
of downloaded program updates should be postponed until the program is shut down and
restarted anyway.
Updates to the operating system can be particularly time consuming. Sometimes it takes
hours to install automatic updates to the operating system. This is very problematic because
these time consuming updates may come unpredictably at inconvenient times. This can be
a very big problem if the user has to turn off or log off the computer for security reasons
before leaving their workplace and the system forbids the user to turn off the computer
during the update process.
3.5 Program loading
Often, it takes more time to load a program than to execute it. The load time can be
annoyingly high for programs that are based on big runtime frameworks, intermediate code,
interpreters, just-in-time compilers, etc., as is commonly the case with programs written in
Java, C#, Visual Basic, etc.
But program loading can be a time-consumer even for programs implemented in compiled
C++. This typically happens if the program uses a lot of runtime DLL's (dynamically linked
libraries or shared objects), resource files, configuration files, help files and databases. The
operating system may not load all the modules of a big program when the program starts
up. Some modules may be loaded only when they are needed, or they may be swapped to
the hard disk if the RAM size is insufficient.
The user expects immediate responses to simple actions like a key press or mouse move. It
is unacceptable to the user if such a response is delayed for several seconds because it
requires the loading of modules or resource files from disk. Memory-hungry applications
force the operating system to swap memory to disk. Memory swapping is a frequent cause
of unacceptably long response times to simple things like a mouse move or key press.
Avoid an excessive number of DLLs, configuration files, resource files, help files, etc.
scattered around on the hard disk. A few files, preferably in the same directory as the .exe
file, are acceptable.
3.6 Dynamic linking and position-independent code
Function libraries can be implemented either as static link libraries (*.lib, *.a) or dynamic
link libraries, also called shared objects (*.dll, *.so). There are several factors that can
make dynamic link libraries slower than static link libraries. These factors are explained in
detail on page 149 below.
Position-independent code is used in shared objects in Unix-like systems. Mac systems
often use position-independent code everywhere by default. Position-independent code is
inefficient, especially in 32-bit mode, for reasons explained on page 149 below.
3.7 File access
Reading or writing a file on a hard disk often takes much more time than processing the
data in the file, especially if the user has a virus scanner that scans all files on access.
Sequential forward access to a file is faster than random access. Reading or writing big
blocks is faster than reading or writing a small bit at a time. Do not read or write less than a
few kilobytes at a time.
You may mirror the entire file in a memory buffer and read or write it in one operation rather
than reading or writing small bits in a non-sequential manner.
It is usually much faster to access a file that has been accessed recently than to access it
the first time. This is because the file has been copied to the disk cache.
Files on remote or removable media such as floppy disks and USB sticks may not be
cached. This can have quite dramatic consequences. I once made a Windows program that
created a file by calling WritePrivateProfileString, which opens and closes the file
for each line written. This worked sufficiently fast on a hard disk because of disk caching,
but it took several minutes to write the file to a floppy disk.
A big file containing numerical data is more compact and efficient if the data are stored in
binary form than if the data are stored in ASCII form. A disadvantage of binary data storage
is that it is not human readable and not easily ported to systems with big-endian storage.
Optimizing file access is more important than optimizing CPU use in programs that have
many file input/output operations. It can be advantageous to put file access in a separate
thread if there is other work that the processor can do while waiting for disk operations to
finish.
3.8 System database
It can take several seconds to access the system database in Windows. It is more efficient
to store application-specific information in a separate file than in the big registration
database (the registry) in the Windows system. Note that the system may store the information in the
database anyway if you are using functions such as GetPrivateProfileString and
WritePrivateProfileString to read and write configuration files (*.ini files).