C++ Software Optimization Guide: Platform Selection and Performance Improvement


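"Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms", by Agner Fog. Copyright Technical University of Denmark, 2004-2020. Last updated 2020-01-17.

In "Optimizing software in C++", Agner Fog discusses how to get the best performance from C++ programs on several platforms, including Windows, Linux and Mac. Some of the key points are:

1. **The cost of optimizing**: Before starting to optimize, the author stresses that optimization has a cost. It can take more development time and add complexity, so the gain has to be weighed against the effort.
2. **Choosing the optimal platform**:
 - **Hardware platform**: The choice of hardware has a large influence on performance. Processor speed, memory size and disk type all matter.
 - **Microprocessor**: Different microprocessors have different architectures and instruction sets. Knowing their characteristics helps in writing more efficient code.
 - **Operating system**: Operating systems differ in scheduling and resource management, which affects how efficiently software runs.
 - **Programming language**: C++ is chosen for its low-level control, but it also brings some performance challenges.
 - **Compiler**: Compilers differ in how they optimize. A suitable compiler can improve the performance of the code.
 - **Function libraries**: Efficient, well-optimized function libraries can improve performance considerably.
 - **User interface framework**: The choice of UI framework affects the response time of the program and the user experience.
 - **Overcoming the drawbacks of the C++ language**: C++ offers many features, but some of them add overhead, such as memory management and runtime type checking.
3. **Finding the biggest time consumers**:
 - **How much is a clock cycle**: A clock cycle is one period of the CPU clock, the basic unit in which execution time is counted. Understanding it is essential for measuring performance.
 - **Using a profiler**: A profiler locates the bottlenecks, or hot spots, of a program.
 - **Installation**: The time it takes to install a program is part of the user experience. A faster installation saves the user's time.
 - **Automatic updates**: The update mechanism should not take up resources at critical moments.
 - **Program loading**: A shorter load time makes the program start faster.
 - **Dynamic linking and position-independent code**: Dynamic linking can add overhead, and position-independent code can reduce performance.
 - **File access**: Frequent disk I/O is a performance killer, so optimizing file operations is essential.
 - **System database** and **other databases**: Optimizing database queries can speed up data processing significantly.
 - **Graphics**: Optimizing graphics rendering improves the performance of games and visualization applications.
 - **System resources**: Resources such as memory and network access have to be used sensibly.
 - **Context switches**: Too many context switches reduce CPU efficiency.
 - **Dependency chains**: Fewer dependencies mean less waiting.
 - **Execution unit throughput**: Knowing the capacity of the execution units in the processor makes it possible to use the hardware fully.
4. **Performance versus usability**: Optimization must consider not only performance but also ease of use and stability. A good user experience is equally important.
5. **Choosing the optimal algorithm**: Choosing the right algorithm is the key to optimization. An algorithm that fits the nature of the problem and the size of the data can improve efficiency significantly.

The guide covers C++ software optimization at several levels, including platform choice, performance analysis, resource management and algorithm choice, and gives developers a broad set of optimization strategies and practical guidance.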

3.2 Use a profiler to find hot spots

Before you start to optimize anything, you have to identify the critical parts of the program.

In some programs, more than 99% of the time is spent in the innermost loop doing

mathematical calculations. In other programs, 99% of the time is spent on reading and

writing data files while less than 1% goes to actually doing something on these data. It is

very important to optimize the parts of the code that matter rather than the parts of the

code that use only a small fraction of the total time. Optimizing less critical parts of the code

will not only be a waste of time, it also makes the code less clear and more difficult to debug

and maintain.

Most compiler packages include a profiler that can tell how many times each function is

called and how much time it uses. There are also third-party profilers such as AQtime, Intel

VTune and AMD CodeAnalyst.

There are several different profiling methods:

• Instrumentation: The compiler inserts extra code at each function call to count how

many times the function is called and how much time it takes.

• Debugging: The profiler inserts temporary debug breakpoints at every function or every

code line.

• Time-based sampling: The profiler tells the operating system to generate an interrupt,

e.g. every millisecond. The profiler counts how many times an interrupt occurs in each

part of the program. This requires no modification of the program under test, but is less

reliable.

• Event-based sampling: The profiler tells the CPU to generate interrupts at certain

events, for example every time a thousand cache misses have occurred. This makes it

possible to see which part of the program has most cache misses, branch

mispredictions, floating point exceptions, etc. Event-based sampling requires a CPU-

specific profiler. For Intel CPUs use Intel VTune, for AMD CPUs use AMD CodeAnalyst.

Unfortunately, profilers are often unreliable. They sometimes give misleading results or fail

completely because of technical problems.

Some common problems with profilers are:

• Coarse time measurement. If time is measured with millisecond resolution and the

critical functions take microseconds to execute then measurements can become

imprecise or simply zero.

• Execution time too small or too long. If the program under test finishes in a short time

then the sampling generates too little data for analysis. If the program takes too long

to execute then the profiler may sample more data than it can handle.

• Waiting for user input. Many programs spend most of their time waiting for user input or

network resources. This time is included in the profile. It may be necessary to modify the

program to use a set of test data instead of user input in order to make profiling feasible.

• Interference from other processes. The profiler measures not only the time spent in the

program under test but also the time used by all other processes running on the same

computer, including the profiler itself.

• Function addresses are obscured in optimized programs. The profiler identifies any hot

spots in the program by their address and attempts to translate these addresses to

function names. But a highly optimized program is often reorganized in such a way that

there is no clear correspondence between function names and code addresses. The

names of inlined functions may not be visible at all to the profiler. The result will be

misleading reports of which functions take most time.

• Uses debug version of the code. Some profilers require that the code you are testing

contains debug information in order to identify individual functions or code lines. The

debug version of the code is not optimized.

• Jumps between CPU cores. A process or thread does not necessarily stay in the same

processor core on multi-core CPUs, but event-counters do. This results in meaningless

event counts for threads that jump between multiple CPU cores. You may need to lock a

thread to a specific CPU core by setting a thread affinity mask.

• Poor reproducibility. Delays in program execution may be caused by random events that

are not reproducible. Such events as task switches and garbage collection can occur at

random times and make parts of the program appear to take longer than normal.
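The coarse time measurement problem in the list above can often be worked around by timing many repetitions of a short function and dividing by the repetition count. The following is a minimal sketch of that idea, not a method taken from the manual. The function `short_function` is a hypothetical stand-in for a critical function, and `std::chrono::steady_clock` is assumed to be an acceptable clock.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for a short critical function.
double short_function(double x) {
    return std::sqrt(x) + 1.0;
}

// Average time per call in nanoseconds, measured over `repetitions` calls
// so that the total interval is long compared with the clock resolution.
double average_ns_per_call(std::size_t repetitions) {
    using clock = std::chrono::steady_clock;
    volatile double sink = 0.0;  // keeps the compiler from removing the loop
    const auto start = clock::now();
    for (std::size_t i = 0; i < repetitions; ++i) {
        sink = sink + short_function(static_cast<double>(i));
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(repetitions);
}
```

The average includes the loop overhead and the volatile store, and repeated calls run with warm caches, so the figure is likely to be optimistic compared with the same function called once in a real program.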

There are various alternatives to using a profiler. A simple alternative is to run the program

in a debugger and press break while the program is running. If there is a hot spot that uses

90% of the CPU time then there is a 90% chance that the break will occur in this hot spot.

Repeating the break a few times may be enough to identify a hot spot. Use the call stack in

the debugger to identify the circumstances around the hot spot.

Sometimes, the best way to identify performance bottlenecks is to put measurement

instruments directly into the code rather than using a ready-made profiler. This does not

solve all the problems associated with profiling, but it often gives more reliable results. If you

are not satisfied with the way a profiler works then you may put the desired measurement

instruments into the program itself. You may add counter variables that count how many

times each part of the program is executed. Furthermore, you may read the time before and

after each of the most important or critical parts of the program to measure how much time

each part takes. See page 164 for further discussion of this method.

Your measurement code should have #if directives around it so that it can be disabled in

the final version of the code. Inserting your own profiling instruments in the code itself is a

very useful way to keep track of the performance during the development of a program.
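As an illustration of measurement instruments that can be disabled with #if, here is a minimal sketch. The macro names `ENABLE_PROFILING` and `PROFILE_SCOPE`, the types `ProfileSlot` and `ScopedTimer`, and the function `sum_of_squares` are all hypothetical names chosen for this example. It counts how many times a part of the program runs and accumulates the time spent in it.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

#ifndef ENABLE_PROFILING
#define ENABLE_PROFILING 1   // set to 0 to compile the instruments out
#endif

#if ENABLE_PROFILING
struct ProfileSlot {
    std::uint64_t calls = 0;                          // number of executions
    std::chrono::steady_clock::duration total{};      // accumulated time
};

// Adds one call and the elapsed time to a slot when it goes out of scope.
class ScopedTimer {
public:
    explicit ScopedTimer(ProfileSlot& slot)
        : slot_(slot), start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        slot_.total += std::chrono::steady_clock::now() - start_;
        ++slot_.calls;
    }
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;
private:
    ProfileSlot& slot_;
    std::chrono::steady_clock::time_point start_;
};

ProfileSlot g_inner_loop_slot;   // one slot per measured region
#define PROFILE_SCOPE(slot) ScopedTimer profile_scope_timer(slot)
#else
#define PROFILE_SCOPE(slot) ((void)0)
#endif

// Hypothetical critical part of the program.
long long sum_of_squares(int n) {
    PROFILE_SCOPE(g_inner_loop_slot);
    long long sum = 0;
    for (int i = 1; i <= n; ++i) sum += static_cast<long long>(i) * i;
    return sum;
}

// Prints the collected numbers; does nothing when profiling is disabled.
void print_profile() {
#if ENABLE_PROFILING
    const auto us = std::chrono::duration_cast<std::chrono::microseconds>(
        g_inner_loop_slot.total).count();
    std::printf("sum_of_squares: %llu calls, %lld us total\n",
                static_cast<unsigned long long>(g_inner_loop_slot.calls),
                static_cast<long long>(us));
#endif
}
```

Compiling with `ENABLE_PROFILING` defined as 0 removes the timer objects entirely, so the final version of the code carries no measurement overhead.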

The time measurements may require a very high resolution if time intervals are short. In

Windows, you can use the GetTickCount function for millisecond resolution or the

QueryPerformanceCounter function for a finer resolution. An even higher resolution can be obtained with the time stamp

counter in the CPU, which counts at the CPU clock frequency (in Windows: __rdtsc()).

The time stamp counter becomes invalid if a thread jumps between different CPU cores.

You may have to fix the thread to a specific CPU core during time measurements to avoid

this. (In Windows, SetThreadAffinityMask, in Linux, sched_setaffinity).
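The combination of fixing the thread to one core and reading the time stamp counter might look like the sketch below. The helper names `pin_current_thread_to_one_core` and `read_tick_counter` are hypothetical. The Windows branch assumes that core 0 is available to the process, and the fallback to `std::chrono::steady_clock` on non-x86 CPUs is an assumption added for portability, not something the text prescribes.

```cpp
#include <chrono>
#include <cstdint>

#if defined(_WIN32)
  #include <windows.h>
  #include <intrin.h>
#elif defined(__linux__)
  #include <sched.h>
#endif
#if !defined(_WIN32) && (defined(__x86_64__) || defined(__i386__))
  #include <x86intrin.h>
#endif

// Pins the calling thread to one CPU core that it is allowed to run on.
// Returns false if pinning failed or is not supported on this platform.
bool pin_current_thread_to_one_core() {
#if defined(_WIN32)
    // Mask 1 selects core 0; the call returns 0 on failure.
    return SetThreadAffinityMask(GetCurrentThread(), 1) != 0;
#elif defined(__linux__)
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return false;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);   // first core in the allowed set
            return sched_setaffinity(0, sizeof(one), &one) == 0;
        }
    }
    return false;
#else
    return false;
#endif
}

// Reads the time stamp counter on x86 CPUs; falls back to a steady clock
// tick count elsewhere.
std::uint64_t read_tick_counter() {
#if defined(_M_IX86) || defined(_M_X64) || defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}
```

A typical use is to call `pin_current_thread_to_one_core()` once, read the counter before and after the code under test, and take the difference. The difference is in counter ticks, not in seconds.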

The program should be tested with a realistic set of test data. The test data should contain a

typical degree of randomness in order to get a realistic number of cache misses and branch

mispredictions.

When the most time-consuming parts of the program have been found, then it is important

to focus the optimization efforts on the time-consuming parts only. Critical pieces of code

can be further tested and investigated by the methods described on page 164.

A profiler is most useful for finding problems that relate to CPU-intensive code. But many

programs use more time loading files or accessing databases, network and other resources

than doing arithmetic operations. The most common time-consumers are discussed in the

following sections.

www.dbooks.org

3.3 Program installation

The time it takes to install a program package is not traditionally considered a software

optimization issue. But it is certainly something that can steal the user's time. The time it

takes to install a software package and make it work cannot be ignored if the goal of

software optimization is to save time for the user. With the high complexity of modern

software, it is not unusual for the installation process to take more than an hour. Neither is it

unusual that a user has to reinstall a software package several times in order to find and

resolve compatibility problems.

Software developers should take installation time and compatibility problems into account

when deciding whether to base a software package on a complex framework requiring

many files to be installed.

The installation process should always use standardized installation tools. It should be

possible to select all installation options at the start so that the rest of the installation

process can proceed unattended. Uninstallation should also proceed in a standardized

manner.

3.4 Automatic updates

Many software programs automatically download updates through the Internet at regular

time intervals. Some programs search for updates every time the computer starts up, even if

the program is never used. A computer with many such programs installed can take several

minutes to start up, which is a total waste of the user's time. Other programs use time

searching for updates each time the program starts. The user may not need the updates if

the current version satisfies the user's needs. The search for updates should be optional

and off by default unless there is a compelling security reason for updating. The update

process should run in a low priority thread, and only if the program is actually used. No

program should leave a background process running when it is not in use. The installation

of downloaded program updates should be postponed until the program is shut down and

restarted anyway.

Updates to the operating system can be particularly time consuming. Sometimes it takes

hours to install automatic updates to the operating system. This is very problematic because

these time consuming updates may come unpredictably at inconvenient times. This can be

a very big problem if the user has to turn off or log off the computer for security reasons

before leaving their workplace and the system forbids the user to turn off the computer

during the update process.

3.5 Program loading

Often, it takes more time to load a program than to execute it. The load time can be

annoyingly high for programs that are based on big runtime frameworks, intermediate code,

interpreters, just-in-time compilers, etc., as is commonly the case with programs written in

Java, C#, Visual Basic, etc.

But program loading can be a time-consumer even for programs implemented in compiled

C++. This typically happens if the program uses a lot of runtime DLLs (dynamically linked

libraries or shared objects), resource files, configuration files, help files and databases. The

operating system may not load all the modules of a big program when the program starts

up. Some modules may be loaded only when they are needed, or they may be swapped to

the hard disk if the RAM size is insufficient.

The user expects immediate responses to simple actions like a key press or mouse move. It

is unacceptable to the user if such a response is delayed for several seconds because it

requires the loading of modules or resource files from disk. Memory-hungry applications

force the operating system to swap memory to disk. Memory swapping is a frequent cause

of unacceptably long response times to simple things like a mouse move or key press.

Avoid an excessive number of DLLs, configuration files, resource files, help files etc.

scattered around on the hard disk. A few files, preferably in the same directory as the .exe

file, are acceptable.

3.6 Dynamic linking and position-independent code

Function libraries can be implemented either as static link libraries (*.lib, *.a) or dynamic

link libraries, also called shared objects (*.dll, *.so). There are several factors that can

make dynamic link libraries slower than static link libraries. These factors are explained in

detail on page 155 below.

Position-independent code is used in shared objects in Unix-like systems. Mac systems

often use position-independent code everywhere by default. Position-independent code is

inefficient, especially in 32-bit mode, for reasons explained on page 155 below.

3.7 File access

Reading or writing a file on a hard disk often takes much more time than processing the

data in the file, especially if the user has a virus scanner that scans all files on access.

Sequential forward access to a file is faster than random access. Reading or writing big

blocks is faster than reading or writing a small bit at a time. Do not read or write less than a

few kilobytes at a time.

You may mirror the entire file in a memory buffer and read or write it in one operation rather

than reading or writing small bits in a non-sequential manner.
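Mirroring a file in a memory buffer could be sketched as follows, using standard C++ streams. The function names `read_whole_file` and `write_whole_file` are hypothetical. Each function issues a single read or write call at the C++ level; how the operating system splits the transfer internally is outside the control of this code.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Reads the whole file into memory with one read call.
// Returns an empty vector if the file is empty or cannot be read.
std::vector<char> read_whole_file(const std::string& path) {
    // Open at the end so that tellg() gives the file size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return {};
    const std::streamsize size = in.tellg();
    if (size <= 0) return {};
    std::vector<char> buffer(static_cast<std::size_t>(size));
    in.seekg(0, std::ios::beg);
    if (!in.read(buffer.data(), size)) return {};
    return buffer;
}

// Writes the whole buffer with one write call, replacing the old contents.
bool write_whole_file(const std::string& path, const std::vector<char>& data) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
    return static_cast<bool>(out.flush());
}
```

The program can then make any number of small, non-sequential changes to the buffer in memory and write the result back once.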

It is usually much faster to access a file that has been accessed recently than to access it

the first time. This is because the file has been copied to the disk cache.

Files on remote or removable media such as floppy disks and USB sticks may not be

cached. This can have quite dramatic consequences. I once made a Windows program that

created a file by calling WritePrivateProfileString, which opens and closes the file

for each line written. This worked sufficiently fast on a hard disk because of disk caching,

but it took several minutes to write the file to a floppy disk.

A big file containing numerical data is more compact and efficient if the data are stored in

binary form than if the data are stored in ASCII form. A disadvantage of binary data storage

is that it is not human readable and not easily ported to systems with big-endian storage.
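To make the comparison between binary and ASCII storage concrete, the sketch below writes the same numbers in both forms so that the file sizes can be compared. All function names are hypothetical. The binary form uses the byte order and `double` format of the machine it runs on, which is exactly the portability limitation mentioned above.

```cpp
#include <cstddef>
#include <fstream>
#include <iomanip>
#include <limits>
#include <string>
#include <vector>

// Stores the values as raw bytes in the byte order of this machine.
bool save_binary(const std::string& path, const std::vector<double>& values) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out.write(reinterpret_cast<const char*>(values.data()),
              static_cast<std::streamsize>(values.size() * sizeof(double)));
    return static_cast<bool>(out.flush());
}

// Loads values written by save_binary on a machine with the same format.
std::vector<double> load_binary(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return {};
    const std::streamsize bytes = in.tellg();
    if (bytes <= 0) return {};
    std::vector<double> values(static_cast<std::size_t>(bytes) / sizeof(double));
    in.seekg(0, std::ios::beg);
    in.read(reinterpret_cast<char*>(values.data()),
            static_cast<std::streamsize>(values.size() * sizeof(double)));
    if (!in) return {};
    return values;
}

// Stores the values as text, one per line, with enough digits to round-trip.
bool save_text(const std::string& path, const std::vector<double>& values) {
    std::ofstream out(path, std::ios::trunc);
    out << std::setprecision(std::numeric_limits<double>::max_digits10);
    for (double v : values) out << v << '\n';
    return static_cast<bool>(out.flush());
}

// Size of a file in bytes, or -1 if it cannot be opened.
long long file_size(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    return in ? static_cast<long long>(in.tellg()) : -1;
}
```

For values with many significant digits the text file is usually more than twice as large as the binary file, and reading it back also requires parsing every number.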

Optimizing file access is more important than optimizing CPU use in programs that have

many file input/output operations. It can be advantageous to put file access in a separate

thread if there is other work that the processor can do while waiting for disk operations to

finish.
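Putting file access in a separate thread might be sketched with `std::async` as below. This is one possible approach, not necessarily the one the author has in mind. The functions `write_file`, `compute_checksum` and `save_and_compute` are hypothetical, and `compute_checksum` merely stands in for whatever useful work the processor can do in the meantime. On some toolchains the program must be linked with thread support (for example `-pthread`).

```cpp
#include <fstream>
#include <future>
#include <numeric>
#include <string>
#include <vector>

// Writes the data to a file; intended to run on a background thread.
bool write_file(const std::string& path, const std::vector<char>& data) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
    return static_cast<bool>(out.flush());
}

// Hypothetical CPU work that does not depend on the file.
long long compute_checksum(const std::vector<int>& numbers) {
    return std::accumulate(numbers.begin(), numbers.end(), 0LL);
}

// Starts the write in a separate thread, does the computation meanwhile,
// then waits for the write to finish before returning.
long long save_and_compute(const std::string& path,
                           const std::vector<char>& data,
                           const std::vector<int>& numbers,
                           bool& write_ok) {
    // Capturing by reference is safe because get() is called before return.
    std::future<bool> pending = std::async(
        std::launch::async, [&path, &data] { return write_file(path, data); });
    const long long checksum = compute_checksum(numbers);
    write_ok = pending.get();
    return checksum;
}
```

The gain depends on how much independent work is available. If the computation needs the file contents, the thread only moves the waiting to another place.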

3.8 System database

It can take several seconds to access the system database in Windows. It is more efficient

to store application-specific information in a separate file than in the big registration

database in the Windows system. Note that the system may store the information in the

database anyway if you are using functions such as GetPrivateProfileString and

WritePrivateProfileString to read and write configuration files (*.ini files).
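Storing application-specific information in a separate file can be as simple as the following sketch, which reads and writes `key=value` lines with standard C++ streams instead of the Windows profile functions. The file format and the names `Settings`, `save_settings` and `load_settings` are assumptions made for this example. Keys must not contain `=` and neither keys nor values may contain line breaks.

```cpp
#include <cstddef>
#include <fstream>
#include <map>
#include <string>

using Settings = std::map<std::string, std::string>;

// Writes all settings in one pass, one "key=value" pair per line.
bool save_settings(const std::string& path, const Settings& settings) {
    std::ofstream out(path, std::ios::trunc);
    for (const auto& entry : settings) {
        out << entry.first << '=' << entry.second << '\n';
    }
    return static_cast<bool>(out.flush());
}

// Reads the settings back; lines without '=' are ignored.
Settings load_settings(const std::string& path) {
    Settings settings;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t eq = line.find('=');
        if (eq == std::string::npos) continue;
        settings[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return settings;
}
```

The file is opened once for all entries, which avoids opening and closing the file for each line written.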


3.9 Other databases

Many software applications use a database for storing user data. A database can consume

a lot of CPU time, RAM and disk space. It may be possible to replace a database by a plain

old data file in simple cases. Database queries can often be optimized by using indexes,

working with sets rather than loops, etc. Optimizing database queries is beyond the scope

of this manual, but you should be aware that there is often a lot to gain by optimizing

database access.

3.10 Graphics

A graphical user interface can use a lot of computing resources. Typically, a specific

graphics framework is used. The operating system may supply such a framework in its API.

In some cases, there is an extra layer of a third-party graphics framework between the

operating system API and the application software. Such an extra framework can consume

a lot of extra resources.

Each graphics operation in the application software is implemented as a function call to a

graphics library or API function which then calls a device driver. A call to a graphics function

is time consuming because it may go through multiple layers and it needs to switch to

protected mode and back again. Obviously, it is more efficient to make a single call to a

graphics function that draws a whole polygon or bitmap than to draw each pixel or line

separately through multiple function calls.

The calculation of graphics objects in computer games and animations is of course also

time consuming, especially if there is no graphics processing unit.

Various graphics function libraries and drivers differ a lot in performance. I have no specific

recommendation of which one is best.

3.11 Other system resources

Writes to a printer or other device should preferably be done in big blocks rather than a

small piece at a time because each call to a driver involves the overhead of switching to

protected mode and back again.

Accessing system devices and using advanced facilities of the operating system can be

time consuming because it may involve the loading of several drivers, configuration files and

system modules.

3.12 Network access

Some application programs use internet or intranet for automatic updates, remote help files,

database access, etc. The problem here is that access times cannot be controlled. The

network access may be fast in a simple test setup but slow or completely absent in a use

situation where the network is overloaded or the user is far from the server.

These problems should be taken into account when deciding whether to store help files and

other resources locally or remotely. If frequent updates are necessary then it may be

optimal to mirror the remote data locally.

Access to remote databases usually requires log on with a password. The log on process is

known to be an annoying time consumer to many hard working software users. In some

cases, the log on process may take more than a minute if the network or database is heavily

loaded.
