C++ Software Optimization Guide: Key Factors and Performance Improvement Strategies

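"Optimizing software in C++" is a technical manual by Agner Fog that presents software optimization strategies for Windows, Linux and Mac platforms. The copyright runs from 2004 to 2024, and the author kept updating the text, most recently on March 15, 2024. The manual discusses why software performance optimization matters and which key decisions it involves. Chapter 1 explains why optimization is needed, noting that software is often inefficient for reasons such as design flaws, hardware limits or programming language characteristics. The author stresses cost-benefit analysis: whether the time and resources invested will yield a significant performance gain.

The second part treats the choice of platform as a key to optimization. The author recommends choosing the hardware platform according to the target market and application requirements, including the processor architecture (such as x86 or ARM), the operating system (Windows, Linux or MacOS) and the programming language (the strengths and limits of C++). The choice of compiler should take its optimization support and compatibility into account. The choice of function libraries also affects performance, and the choice of user interface framework should not be overlooked, especially when efficient interaction is a goal. The C++ language has challenges of its own, such as the complexity of memory management and runtime overhead, but understanding and overcoming them leads to more efficient code. The manual also discusses how to identify bottlenecks in a program, for example by using profiling tools (such as gprof or the Visual Studio profiler) to locate time-consuming hot spots.

The author then examines a series of specific sources of performance bottlenecks: program installation and automatic updates, dynamic linking and position-independent code, file access, system database queries, graphics, network access, memory access patterns, context switches and dependency chains. Targeted optimization strategies are given for each. Finally, performance optimization is not only about speed; user experience matters as well. Chapter 4 therefore discusses how to improve performance while keeping the software usable and stable, for example by avoiding unnecessary resource use and reducing unnecessary interruptions and delays. The manual is a practical reference that covers everything from basic theory to hands-on practice and helps developers build efficient, stable software on different platforms.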

μs is less than 1% of the time it takes to refresh the screen. There is no way the user can

see the delay. But if the loop is inside another loop that also repeats 1000 times then we

have an estimated execution time of 125 ms. This delay is just long enough to be noticeable

but not long enough to be annoying. We may decide to do some measurements to see if our

estimate is correct or if the calculation time is actually more than 125 ms. If the response

time is so long that the user actually has to wait for a result then we will consider if there is

something that can be improved.

3.2 Use a profiler to find hot spots

Before you start to optimize anything, you have to identify the critical parts of the program.

In some programs, more than 99% of the time is spent in the innermost loop doing

mathematical calculations. In other programs, 99% of the time is spent on reading and

writing data files while less than 1% goes to actually doing something on these data. It is

very important to optimize the parts of the code that matter rather than the parts of the

code that use only a small fraction of the total time. Optimizing less critical parts of the code

will not only be a waste of time, it also makes the code less clear and more difficult to debug

and maintain.

Most compiler packages include a profiler that can tell how many times each function is

called and how much time it uses. There are also third-party profilers such as AQtime, Intel

VTune and AMD CodeAnalyst.

There are several different profiling methods:

• Instrumentation: The compiler inserts extra code at each function call to count how

many times the function is called and how much time it takes.

• Debugging: The profiler inserts temporary debug breakpoints at every function or every

code line.

• Time-based sampling: The profiler tells the operating system to generate an interrupt,

e.g. every millisecond. The profiler counts how many times an interrupt occurs in each

part of the program. This requires no modification of the program under test, but is less

reliable.

• Event-based sampling: The profiler tells the CPU to generate interrupts at certain

events, for example every time a thousand cache misses have occurred. This makes it

possible to see which part of the program has most cache misses, branch

mispredictions, floating point exceptions, etc. Event-based sampling requires a CPU-

specific profiler. For Intel CPUs use Intel VTune, for AMD CPUs use AMD CodeAnalyst.

Unfortunately, profilers are often unreliable. They sometimes give misleading results or fail

completely because of technical problems.

Some common problems with profilers are:

• Coarse time measurement. If time is measured with millisecond resolution and the

critical functions take microseconds to execute then measurements can become

imprecise or simply zero.

• Execution time too small or too long. If the program under test finishes in a short time

then the sampling generates too little data for analysis. If the program takes too long

to execute then the profiler may sample more data than it can handle.

• Waiting for user input. Many programs spend most of their time waiting for user input or

network resources. This time is included in the profile. It may be necessary to modify the

program to use a set of test data instead of user input in order to make profiling feasible.

• Interference from other processes. The profiler measures not only the time spent in the

program under test but also the time used by all other processes running on the same

computer, including the profiler itself.

• Function addresses are obscured in optimized programs. The profiler identifies any hot

spots in the program by their address and attempts to translate these addresses to

function names. But a highly optimized program is often reorganized in such a way that

there is no clear correspondence between function names and code addresses. The

names of inlined functions may not be visible at all to the profiler. The result will be

misleading reports of which functions take most time.

• Uses debug version of the code. Some profilers require that the code you are testing

contains debug information in order to identify individual functions or code lines. The

debug version of the code is not optimized.

• Jumps between CPU cores. A process or thread does not necessarily stay in the same

processor core on multi-core CPUs, but event-counters do. This results in meaningless

event counts for threads that jump between multiple CPU cores. You may need to lock a

thread to a specific CPU core by setting a thread affinity mask.

• Poor reproducibility. Delays in program execution may be caused by random events that

are not reproducible. Such events as task switches and garbage collection can occur at

random times and make parts of the program appear to take longer than normal.

There are various alternatives to using a profiler. A simple alternative is to run the program

in a debugger and press break while the program is running. If there is a hot spot that uses

90% of the CPU time then there is a 90% chance that the break will occur in this hot spot.

Repeating the break a few times may be enough to identify a hot spot. Use the call stack in

the debugger to identify the circumstances around the hot spot.
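
The reasoning behind "a few breaks may be enough" can be checked with a little arithmetic. The sketch below assumes that each break is an independent random sample of where the CPU time goes, which is a simplification and not something the text guarantees. The function name prob_at_least is hypothetical.

```cpp
#include <cmath>

// Probability that at least k out of n breaks land in a hot spot that uses
// fraction p of the CPU time. Assumption: the breaks are independent random
// samples, so the number of hits follows a binomial distribution.
double prob_at_least(int n, int k, double p) {
    double total = 0.0;
    for (int i = k; i <= n; ++i) {
        // Binomial coefficient C(n, i), built up as a product of ratios.
        double c = 1.0;
        for (int j = 1; j <= i; ++j) {
            c = c * (n - i + j) / j;
        }
        total += c * std::pow(p, i) * std::pow(1.0 - p, n - i);
    }
    return total;
}
```

Under these assumptions, prob_at_least(5, 4, 0.9) is about 0.92: with a hot spot that uses 90% of the time, five breaks will land in it at least four times in roughly nine cases out of ten.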

Sometimes, the best way to identify performance bottlenecks is to put measurement

instruments directly into the code rather than using a ready-made profiler. This does not

solve all the problems associated with profiling, but it often gives more reliable results. If you

are not satisfied with the way a profiler works then you may put the desired measurement

instruments into the program itself. You may add counter variables that count how many

times each part of the program is executed. Furthermore, you may read the time before and

after each of the most important or critical parts of the program to measure how much time

each part takes. See page 167 for further discussion of this method.

Your measurement code should have #if directives around it so that it can be disabled in

the final version of the code. Inserting your own profiling instruments in the code itself is a

very useful way to keep track of the performance during the development of a program.
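
A minimal sketch of such measurement instruments is shown here. It is one possible arrangement, not a prescribed one: the switch PROFILE_ENABLED, the types ProfileRegion and ScopedTimer, and the example function sum_of_squares are all hypothetical names chosen for illustration. The counter and the clock reads sit inside #if directives so that they vanish when the switch is set to 0.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical switch: build with -DPROFILE_ENABLED=0 for the final version.
#ifndef PROFILE_ENABLED
#define PROFILE_ENABLED 1
#endif

#if PROFILE_ENABLED
// Counter variable and accumulated time for one part of the program.
struct ProfileRegion {
    const char* name;
    std::uint64_t calls;
    std::chrono::nanoseconds total;
    explicit ProfileRegion(const char* n) : name(n), calls(0), total(0) {}
    void report() const {
        std::printf("%s: %llu calls, %lld ns total\n", name,
                    static_cast<unsigned long long>(calls),
                    static_cast<long long>(total.count()));
    }
};

// Reads the time on construction and again on destruction, then adds the
// difference to the region and increments its call counter.
class ScopedTimer {
public:
    explicit ScopedTimer(ProfileRegion& region)
        : region_(region), start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        region_.total += std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now() - start_);
        ++region_.calls;
    }
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;
private:
    ProfileRegion& region_;
    std::chrono::steady_clock::time_point start_;
};

static ProfileRegion g_sum_region("sum_of_squares");
#endif

// Example of a critical function with instrumentation around its body.
double sum_of_squares(const double* data, std::size_t n) {
#if PROFILE_ENABLED
    ScopedTimer timer(g_sum_region);
#endif
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += data[i] * data[i];
    }
    return sum;
}
```

Calling g_sum_region.report() at program exit prints the call count and the accumulated time. std::chrono::steady_clock is used here only because it is portable; its resolution may be too coarse for very short intervals.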

The time measurements may require a very high resolution if time intervals are short. In

Windows, you can use the GetTickCount function for millisecond-scale resolution or the

QueryPerformanceCounter function for finer resolution. A still higher resolution can be obtained with the time stamp

counter in the CPU, which counts at the CPU clock frequency (_rdtsc() or __rdtsc()).

The time stamp counter becomes invalid if a thread jumps between different CPU cores.

You may have to fix the thread to a specific CPU core during time measurements to avoid

this. (In Windows, SetThreadAffinityMask, in Linux, sched_setaffinity).
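
The two steps above, fixing the thread to a core and reading the time stamp counter, might be combined roughly as follows. This is a sketch under stated assumptions: the helper names pin_current_thread_to_core, read_ticks and measure_ticks are hypothetical, the time stamp counter is read only on x86 builds, and other targets fall back to std::chrono::steady_clock, whose units are not clock cycles.

```cpp
#include <chrono>
#include <cstdint>

#if defined(_WIN32)
#include <windows.h>
#elif defined(__linux__)
#include <sched.h>
#endif

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
#define HAVE_TSC 1
#else
#define HAVE_TSC 0
#endif

// Pins the calling thread to one CPU core. Returns true on success.
// The call can fail, for example when the process is not allowed to
// use the requested core.
bool pin_current_thread_to_core(unsigned core) {
#if defined(_WIN32)
    return SetThreadAffinityMask(GetCurrentThread(),
                                 static_cast<DWORD_PTR>(1) << core) != 0;
#elif defined(__linux__)
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;  // 0 = calling thread
#else
    (void)core;
    return false;
#endif
}

// Reads the time stamp counter on x86; elsewhere returns steady_clock ticks.
std::uint64_t read_ticks() {
#if HAVE_TSC
    return __rdtsc();
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Returns the tick difference across one call of work().
template <typename F>
std::uint64_t measure_ticks(F&& work) {
    const std::uint64_t start = read_ticks();
    work();
    return read_ticks() - start;
}
```

A single measurement is rarely meaningful on its own. It is usually better to repeat the measurement several times and compare the counts, since the first run tends to include cache misses that later runs do not.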

The program should be tested with a realistic set of test data. The test data should contain a

typical degree of randomness in order to get a realistic number of cache misses and branch

mispredictions.
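
One way to get repeatable test data with a controlled degree of randomness is a seeded pseudo-random generator. The sketch below is only an illustration: the function name make_test_data, the value range 0 to 999 and the uniform distribution are assumptions, and real test data should mimic the distribution that the program will meet in practice.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Generates n pseudo-random integers in the range 0..999.
// A fixed seed gives the same sequence on every run, so that timing
// results from different runs can be compared.
std::vector<int> make_test_data(std::size_t n, std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist(0, 999);
    std::vector<int> data(n);
    for (int& value : data) {
        value = dist(rng);
    }
    return data;
}
```

Feeding a program sorted or constant data instead would make its branches easy to predict and could make the measured time look better than it will be in real use.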

When the most time-consuming parts of the program have been found, then it is important

to focus the optimization efforts on the time consuming parts only. Critical pieces of code

can be further tested and investigated by the methods described on page 167.

A profiler is most useful for finding problems that relate to CPU-intensive code. But many

programs use more time loading files or accessing databases, network and other resources

than doing arithmetic operations. The most common time-consumers are discussed in the

following sections.

3.3 Program installation

The time it takes to install a program package is not traditionally considered a software

optimization issue. But it is certainly something that can steal the user's time. The time it

takes to install a software package and make it work cannot be ignored if the goal of

software optimization is to save time for the user. With the high complexity of modern

software, it is not unusual for the installation process to take more than an hour. Neither is it

unusual that a user has to reinstall a software package several times in order to find and

resolve compatibility problems.

Software developers should take installation time and compatibility problems into account

when deciding whether to base a software package on a complex framework requiring

many files to be installed.

The installation process should always use standardized installation tools. It should be

possible to select all installation options at the start so that the rest of the installation

process can proceed unattended. Uninstallation should also proceed in a standardized

manner.

3.4 Automatic updates

Many software programs automatically download updates through the Internet at regular

time intervals. Some programs search for updates every time the computer starts up, even if

the program is never used. A computer with many such programs installed can take several

minutes to start up, which is a total waste of the user's time. Other programs spend time

searching for updates each time the program starts. The user may not need the updates if

the current version satisfies the user's needs. The search for updates should be optional

and off by default unless there is a compelling security reason for updating. The update

process should run in a low priority thread, and only if the program is actually used. No

program should leave a background process running when it is not in use. The installation

of downloaded program updates should be postponed until the program is shut down and

restarted anyway.

Updates to the operating system can be particularly time consuming. Sometimes it takes

hours to install automatic updates to the operating system. This is very problematic because

these time-consuming updates may come unpredictably at inconvenient times. This can be

a very big problem if the user has to turn off or log off the computer for security reasons

before leaving their workplace and the system forbids the user to turn off the computer

during the update process.

3.5 Program loading

Often, it takes more time to load a program than to execute it. The load time can be

annoyingly high for programs that are based on big runtime frameworks, intermediate code,

interpreters, just-in-time compilers, etc., as is commonly the case with programs written in

Java, C#, Visual Basic, etc.

But program loading can be a time-consumer even for programs implemented in compiled

C++. This typically happens if the program uses a lot of runtime DLLs (dynamically linked

libraries or shared objects), resource files, configuration files, help files and databases. The

operating system may not load all the modules of a big program when the program starts

up. Some modules may be loaded only when they are needed, or they may be swapped to

the hard disk if the RAM size is insufficient.

The user expects immediate responses to simple actions like a key press or mouse move. It

is unacceptable to the user if such a response is delayed for several seconds because it

requires the loading of modules or resource files from disk. Memory-hungry applications

force the operating system to swap memory to disk. Memory swapping is a frequent cause

of unacceptably long response times to simple things like a mouse move or key press.

Avoid an excessive number of DLLs, configuration files, resource files, help files etc.

scattered around on the hard disk. A few files, preferably in the same directory as the

executable file, are acceptable.

3.6 Dynamic linking and position-independent code

Function libraries can be implemented either as static link libraries (*.lib, *.a) or dynamic

link libraries, also called shared objects (*.dll, *.so). There are several factors that can

make dynamic link libraries slower than static link libraries. These factors are explained in

detail on page 158 below.

Position-independent code is used in shared objects in Unix-like systems. Mac systems

often use position-independent code everywhere by default. Position-independent code is

inefficient, especially in 32-bit mode, for reasons explained on page 158 below.

3.7 File access

Reading or writing a file on a hard disk often takes much more time than processing the

data in the file, especially if the user has a virus scanner that scans all files on access.

Sequential forward access to a file is faster than random access. Reading or writing big

blocks is faster than reading or writing a small bit at a time. Do not read or write less than a

few kilobytes at a time.

You may mirror the entire file in a memory buffer and read or write it in one operation rather

than reading or writing small bits in a non-sequential manner.
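
Mirroring a file in a memory buffer can be sketched with the standard stream classes as follows. The function names read_whole_file and write_whole_file are hypothetical, and the sketch assumes that the file fits comfortably in memory.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Reads the entire file into a buffer with a single read operation.
// Returns an empty buffer if the file is missing, empty or unreadable.
std::vector<char> read_whole_file(const std::string& path) {
    // Open at the end of the file so that tellg() gives the file size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return {};
    const std::streamsize size = in.tellg();
    if (size <= 0) return {};
    std::vector<char> buffer(static_cast<std::size_t>(size));
    in.seekg(0, std::ios::beg);
    if (!in.read(buffer.data(), size)) return {};
    return buffer;
}

// Writes the entire buffer with a single write operation.
bool write_whole_file(const std::string& path, const std::vector<char>& buffer) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    return static_cast<bool>(out.flush());
}
```

The program can then make its small, non-sequential reads and writes on the buffer in memory and touch the disk only twice.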

It is usually much faster to access a file that has been accessed recently than to access it

the first time. This is because the file has been copied to the disk cache.

Files on remote or removable media such as USB sticks may not be cached. This can have

quite dramatic consequences. I once made a Windows program that created a file by calling

WritePrivateProfileString, which opens and closes the file for each line written. This

worked sufficiently fast on a hard disk because of disk caching, but it took several minutes

to write the file to a floppy disk.

A big file containing numerical data is more compact and efficient if the data are stored in

binary form than if the data are stored in ASCII form. A disadvantage of binary data storage

is that it is not human readable and not easily ported to systems with big-endian storage.
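
The difference between the two storage forms can be illustrated with a small sketch. The function names write_binary, write_text and read_binary are hypothetical. The binary form uses the byte order and floating point format of the machine that runs the code, which is exactly the portability limitation mentioned above.

```cpp
#include <cstddef>
#include <iomanip>
#include <istream>
#include <limits>
#include <ostream>
#include <vector>

// Binary form: the raw bytes of each double, 8 bytes per value on common
// platforms, in the native byte order of the machine.
void write_binary(std::ostream& out, const std::vector<double>& values) {
    out.write(reinterpret_cast<const char*>(values.data()),
              static_cast<std::streamsize>(values.size() * sizeof(double)));
}

// ASCII form: one value per line, with enough digits to read the value
// back without loss of precision.
void write_text(std::ostream& out, const std::vector<double>& values) {
    out << std::setprecision(std::numeric_limits<double>::max_digits10);
    for (double v : values) {
        out << v << '\n';
    }
}

// Reads count doubles stored by write_binary on a machine with the same
// byte order. Returns an empty vector if the stream holds too little data.
std::vector<double> read_binary(std::istream& in, std::size_t count) {
    std::vector<double> values(count);
    in.read(reinterpret_cast<char*>(values.data()),
            static_cast<std::streamsize>(count * sizeof(double)));
    if (!in) values.clear();
    return values;
}
```

For values that need many significant digits, such as 1.0/3.0, the text form takes roughly twenty characters per value against eight bytes in binary form. Short values such as 1 are smaller as text, so the saving depends on the data.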

Optimizing file access is more important than optimizing CPU use in programs that have

many file input/output operations. It can be advantageous to put file access in a separate

thread if there is other work that the processor can do while waiting for disk operations to

finish.
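
Putting file access in a separate thread can be sketched with std::async from the standard library. The names load_file and load_file_async are hypothetical, and this is only one of several ways to do it. On some systems the program must be linked with the thread library, for example with the -pthread option.

```cpp
#include <fstream>
#include <future>
#include <iterator>
#include <string>

// Reads the whole file into a string. Returns an empty string if the file
// cannot be opened.
std::string load_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Starts the read in another thread and returns immediately.
// The caller collects the contents later with get() on the returned future.
std::future<std::string> load_file_async(const std::string& path) {
    return std::async(std::launch::async, load_file, path);
}
```

The calling thread can continue with other work after load_file_async returns, and it waits for the disk only at the point where it calls get() on the future.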

3.8 System database

It can take several seconds to access the system database in Windows. It is more efficient

to store application-specific information in a separate file than in the big registry

database in the Windows system. Note that the system may store the information in the

database anyway if you are using functions such as GetPrivateProfileString and

WritePrivateProfileString to read and write configuration files (*.ini files).

3.9 Other databases

Many software applications use a database for storing user data. A database can consume

a lot of CPU time, RAM and disk space. It may be possible to replace a database by a plain

old data file in simple cases. Database queries can often be optimized by using indexes,

working with sets rather than loops, etc. Optimizing database queries is beyond the scope

of this manual, but you should be aware that there is often a lot to gain by optimizing

database access.

3.10 Graphics

A graphical user interface can use a lot of computing resources. Typically, a specific

graphics framework is used. The operating system may supply such a framework in its API.

In some cases, there is an extra layer of a third-party graphics framework between the

operating system API and the application software. Such an extra framework can consume

a lot of extra resources.

Each graphics operation in the application software is implemented as a function call to a

graphics library or API function which then calls a device driver. A call to a graphics function

is time consuming because it may go through multiple layers and it needs to switch to

kernel mode and back again. Obviously, it is more efficient to make a single call to a

graphics function that draws a whole polygon or bitmap than to draw each pixel or line

separately through multiple function calls.

The calculation of graphics objects in computer games and animations is of course also

time consuming, especially if there is no graphics processing unit.

Various graphics function libraries and drivers differ a lot in performance. I have no specific

recommendation of which one is best.

3.11 Other system resources

Writes to a printer or other device should preferably be done in big blocks rather than a

small piece at a time because each call to a driver involves the overhead of switching to

kernel mode and back again.

Accessing system devices and using advanced facilities of the operating system can be

time consuming because it may involve the loading of several drivers, configuration files and

system modules.

3.12 Network access

Some application programs use the internet or an intranet for automatic updates, remote help files,

database access, etc. The problem here is that access times cannot be controlled. The

network access may be fast in a simple test setup but slow or completely absent in a use

situation where the network is overloaded or the user is far from the server.
