C++ Optimization Guide: Platform Selection and Performance Improvement Strategies

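"Agner Fog - Optimizing Software in C++" is a guide by Agner Fog of the Technical University of Denmark to optimizing C++ software on Windows, Linux and Mac platforms. The book covers material from 2004 to 2014 and was last updated on August 7, 2014. It discusses in detail the cost of optimization, choosing the optimal platform, identifying performance bottlenecks, and balancing performance against usability.

In "Optimizing Software in C++", Agner Fog first introduces the costs that optimizing software can incur. He stresses that optimization may demand extra time and effort, including greater complexity in code maintenance, testing and debugging, so an optimization strategy must be chosen with care.

The chapter on choosing the optimal platform discusses several factors: the choice of hardware platform and how to pick a suitable microprocessor for the application; the effect of the operating system on performance, including the differences among Windows, Linux and Mac; the role of the programming language and the strengths and limitations of C++; the importance of the compiler, since different compilers can produce different optimization results; the choice of function libraries, where a high-quality library can improve code efficiency; considerations for user interface frameworks and strategies for optimizing UI interaction; and overcoming some limitations of the C++ language itself, such as the use of templates and runtime type information.

Finding the biggest time consumers is a key step in optimization. Fog recommends using analysis tools such as a profiler to locate the hot spots in a program. He also points out that program installation, automatic updates, program loading, dynamic linking, position-independent code, file access, the system database, other databases, graphics, network access, memory access, context switches, dependency chains and execution unit throughput can all be performance bottlenecks.

The book also explores the balance between performance and usability. The pursuit of speed must not neglect the user experience, because excessive optimization can make software hard to use or unstable. Finally, choosing the optimal algorithm is at the core of the optimization process. Choosing and implementing an algorithm correctly can greatly improve program efficiency, and Agner Fog likely discusses in this part the strengths and weaknesses of various data structures and algorithms and how to make the best decision in a specific scenario.

This book is an important reference for C++ developers and system optimization engineers. It offers a wealth of practical experience and technical detail that helps readers understand and improve the performance of their software.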

Integer overflow is another security problem. The official C standard says that the behavior

of signed integers in case of overflow is "undefined". This allows the compiler to ignore

overflow or assume that it doesn't occur. In the case of the Gnu compiler, the assumption

that signed integer overflow doesn't occur has the unfortunate consequence that it allows

the compiler to optimize away an overflow check. There are a number of possible remedies

against this problem: (1) check for overflow before it occurs, (2) use unsigned integers -

they are guaranteed to wrap around, (3) trap integer overflow with the option -ftrapv, but

this is extremely inefficient, (4) get a compiler warning for such optimizations with option

-Wstrict-overflow=2, or (5) make the overflow behavior well-defined with option

-fwrapv or -fno-strict-overflow.
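
Remedy (1), checking for overflow before it occurs, can be sketched as follows. This is only an illustration under my own assumptions: the function name checked_add and its signature are hypothetical and are not taken from the manual. The comparison is made on the operands before the addition, so the signed addition itself never overflows and the compiler has no undefined behavior to reason about.

```cpp
#include <limits>

// Hypothetical helper (not from the manual): adds two ints only if the
// result is representable. Returns true and stores a + b in result on
// success; returns false and leaves result untouched otherwise.
bool checked_add(int a, int b, int& result) {
    // The tests use only subtractions that cannot overflow themselves.
    if (b > 0 && a > std::numeric_limits<int>::max() - b) {
        return false;  // a + b would exceed the largest int
    }
    if (b < 0 && a < std::numeric_limits<int>::min() - b) {
        return false;  // a + b would fall below the smallest int
    }
    result = a + b;    // safe: the sum is known to be in range
    return true;
}
```

A caller tests the return value and handles the failure case explicitly instead of inspecting the sum after the fact.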

You may deviate from the above security advice in critical parts of the code where speed is

important. This can be permissible if the unsafe code is limited to well-tested functions,

classes, templates or modules with a well-defined interface to the rest of the program.

3 Finding the biggest time consumers

3.1 How much is a clock cycle?

In this manual, I am using CPU clock cycles rather than seconds or microseconds as a time

measure. This is because computers have very different speeds. If I write that something

takes 10 μs today, then it may take only 5 μs on the next generation of computers and my

manual will soon be obsolete. But if I write that something takes 10 clock cycles then it will

still take 10 clock cycles even if the CPU clock frequency is doubled.

The length of a clock cycle is the reciprocal of the clock frequency. For example, if the clock

frequency is 2 GHz then the length of a clock cycle is

1 / (2 GHz) = 0.5 ns.


A clock cycle on one computer is not always comparable to a clock cycle on another

computer. The Pentium 4 (NetBurst) CPU is designed for a higher clock frequency than

other CPUs, but it uses more clock cycles than other CPUs for executing the same piece of

code in general.

Assume that a loop in a program repeats 1000 times and that there are 100 floating point

operations (addition, multiplication, etc.) inside the loop. If each floating point operation

takes 5 clock cycles, then we can roughly estimate that the loop will take 1000 * 100 * 5 *

0.5 ns = 250 μs on a 2 GHz CPU. Should we try to optimize this loop? Certainly not! 250 μs

is less than 1/50 of the time it takes to refresh the screen. There is no way the user can see

the delay. But if the loop is inside another loop that also repeats 1000 times then we have

an estimated calculation time of 250 ms. This delay is just long enough to be noticeable but

not long enough to be annoying. We may decide to do some measurements to see if our

estimate is correct or if the calculation time is actually more than 250 ms. If the response

time is so long that the user actually has to wait for a result then we will consider if there is

something that can be improved.
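
The rough estimate above is a plain product of four numbers, which can be written down as a small helper. This is a sketch for illustration only; the name estimate_seconds and its parameters are my own and do not appear in the manual.

```cpp
// Hypothetical helper: rough run time in seconds of a loop, computed as
// iterations * operations per iteration * clock cycles per operation
// * length of one clock cycle (the reciprocal of the clock frequency).
double estimate_seconds(double iterations, double ops_per_iteration,
                        double cycles_per_op, double clock_hz) {
    return iterations * ops_per_iteration * cycles_per_op / clock_hz;
}
```

With 1000 iterations, 100 operations, 5 cycles per operation and a 2 GHz clock this gives about 250 μs, and wrapping the loop in another loop of 1000 iterations gives about 250 ms, matching the numbers in the text. Such an estimate ignores cache misses, branch mispredictions and overlapping execution, so it only indicates the order of magnitude.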

3.2 Use a profiler to find hot spots

Before you start to optimize anything, you have to identify the critical parts of the program.

In some programs, more than 99% of the time is spent in the innermost loop doing

mathematical calculations. In other programs, 99% of the time is spent on reading and

writing data files while less than 1% goes to actually doing something on these data. It is

very important to optimize the parts of the code that matter rather than the parts of the

code that use only a small fraction of the total time. Optimizing less critical parts of the code

will not only be a waste of time, it also makes the code less clear and more difficult to debug

and maintain.

Most compiler packages include a profiler that can tell how many times each function is

called and how much time it uses. There are also third-party profilers such as AQtime, Intel

VTune and AMD CodeAnalyst.

There are several different profiling methods:

• Instrumentation: The compiler inserts extra code at each function call to count how

many times the function is called and how much time it takes.

• Debugging: The profiler inserts temporary debug breakpoints at every function or every

code line.

• Time-based sampling: The profiler tells the operating system to generate an interrupt,

e.g. every millisecond. The profiler counts how many times an interrupt occurs in each

part of the program. This requires no modification of the program under test, but is less

reliable.

• Event-based sampling: The profiler tells the CPU to generate interrupts at certain

events, for example every time a thousand cache misses have occurred. This makes it

possible to see which part of the program has most cache misses, branch

mispredictions, floating point exceptions, etc. Event-based sampling requires a

CPU-specific profiler. For Intel CPUs use Intel VTune, for AMD CPUs use AMD CodeAnalyst.

Unfortunately, profilers are often unreliable. They sometimes give misleading results or fail

completely because of technical problems.

Some common problems with profilers are:

• Coarse time measurement. If time is measured with millisecond resolution and the

critical functions take microseconds to execute then measurements can become

imprecise or simply zero.

• Execution time too short or too long. If the program under test finishes in a short time

then the sampling generates too little data for analysis. If the program takes too long

to execute then the profiler may sample more data than it can handle.

• Waiting for user input. Many programs spend most of their time waiting for user input or

network resources. This time is included in the profile. It may be necessary to modify the

program to use a set of test data instead of user input in order to make profiling feasible.

• Interference from other processes. The profiler measures not only the time spent in the

program under test but also the time used by all other processes running on the same

computer, including the profiler itself.

• Function addresses are obscured in optimized programs. The profiler identifies any hot

spots in the program by their address and attempts to translate these addresses to

function names. But a highly optimized program is often reorganized in such a way that

there is no clear correspondence between function names and code addresses. The

names of inlined functions may not be visible at all to the profiler. The result will be

misleading reports of which functions take most time.

• Uses debug version of the code. Some profilers require that the code you are testing

contains debug information in order to identify individual functions or code lines. The

debug version of the code is not optimized.

• Jumps between CPU cores. A process or thread does not necessarily stay in the same

processor core on multi-core CPUs, but event-counters do. This results in meaningless

event counts for threads that jump between multiple CPU cores. You may need to lock a

thread to a specific CPU core by setting a thread affinity mask.

• Poor reproducibility. Delays in program execution may be caused by random events that

are not reproducible. Such events as task switches and garbage collection can occur at

random times and make parts of the program appear to take longer than normal.

There are various alternatives to using a profiler. A simple alternative is to run the program

in a debugger and press break while the program is running. If there is a hot spot that uses

90% of the CPU time then there is a 90% chance that the break will occur in this hot spot.

Repeating the break a few times may be enough to identify a hot spot. Use the call stack in

the debugger to identify the circumstances around the hot spot.

Sometimes, the best way to identify performance bottlenecks is to put measurement

instruments directly into the code rather than using a ready-made profiler. This does not

solve all the problems associated with profiling, but it often gives more reliable results. If you

are not satisfied with the way a profiler works then you may put the desired measurement

instruments into the program itself. You may add counter variables that count how many

times each part of the program is executed. Furthermore, you may read the time before and

after each of the most important or critical parts of the program to measure how much time

each part takes. See page 153 for further discussion of this method.

Your measurement code should have #if directives around it so that it can be disabled in

the final version of the code. Inserting your own profiling instruments in the code itself is a

very useful way to keep track of the performance during the development of a program.
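
A minimal sketch of such home-made instrumentation is shown here. It assumes a macro named PROFILING, a struct named ProfilePoint and an example function sum_to, all of which are hypothetical names chosen for this illustration. It uses std::chrono::steady_clock, which is portable but typically coarser than the time stamp counter.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical switch: compile with -DPROFILING=0 to remove all
// measurement code from the final version.
#ifndef PROFILING
#define PROFILING 1
#endif

#if PROFILING
struct ProfilePoint {
    std::uint64_t calls = 0;                        // how many times the part ran
    std::chrono::steady_clock::duration total{0};   // accumulated elapsed time
};
ProfilePoint g_sum_profile;                         // one instance per measured part
#endif

// Example of a measured part: sums the integers 0 .. n-1.
std::uint64_t sum_to(std::uint64_t n) {
#if PROFILING
    auto start = std::chrono::steady_clock::now();  // read the time before
#endif
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        sum += i;
    }
#if PROFILING
    g_sum_profile.total += std::chrono::steady_clock::now() - start;  // and after
    ++g_sum_profile.calls;
#endif
    return sum;
}
```

At the end of a test run the program can print calls and total for each measured part. With PROFILING set to 0 the function compiles to the plain loop.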

The time measurements may require a very high resolution if time intervals are short. In

Windows, you can use GetTickCount for a resolution of several milliseconds or

QueryPerformanceCounter for a resolution below one microsecond. A still finer resolution can be obtained with the time stamp

counter in the CPU, which counts at the CPU clock frequency (in Windows: __rdtsc()).

The time stamp counter becomes invalid if a thread jumps between different CPU cores.

You may have to fix the thread to a specific CPU core during time measurements to avoid

this. (In Windows, SetThreadAffinityMask, in Linux, sched_setaffinity).
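
The two steps, fixing the thread to a core and reading the time stamp counter, might be combined as in the following sketch. It assumes an x86 or x86-64 CPU and either Windows or Linux with glibc. The helper names pin_to_current_core and tsc_ticks are hypothetical; the system calls are the ones named in the text plus GetCurrentProcessorNumber and sched_getcpu, which are used here only to pick the core the thread is already running on.

```cpp
#include <cstdint>
#if defined(_WIN32)
#include <windows.h>
#include <intrin.h>      // __rdtsc on Microsoft compilers
#else
#include <sched.h>       // sched_setaffinity, sched_getcpu (glibc)
#include <x86intrin.h>   // __rdtsc on GCC and Clang, x86 only
#endif

// Hypothetical helper: fixes the calling thread to the core it is
// currently running on. Returns false if the request is refused.
bool pin_to_current_core() {
#if defined(_WIN32)
    DWORD_PTR mask = DWORD_PTR(1) << GetCurrentProcessorNumber();
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
#else
    int cpu = sched_getcpu();
    if (cpu < 0) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;  // 0 = calling thread
#endif
}

// Hypothetical helper: number of time stamp counter ticks that elapse
// while work() runs.
template <typename F>
std::uint64_t tsc_ticks(F&& work) {
    std::uint64_t start = __rdtsc();
    work();
    return __rdtsc() - start;
}
```

Call pin_to_current_core() once before the measurements and then wrap each critical piece of code in tsc_ticks. The count includes the overhead of reading the counter, so very short pieces of code should be repeated and measured several times.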

The program should be tested with a realistic set of test data. The test data should contain a

typical degree of randomness in order to get a realistic number of cache misses and branch

mispredictions.

When the most time-consuming parts of the program have been found, then it is important

to focus the optimization efforts on the time-consuming parts only. Critical pieces of code

can be further tested and investigated by the methods described on page 153.

A profiler is most useful for finding problems that relate to CPU-intensive code. But many

programs use more time loading files or accessing databases, network and other resources

than doing arithmetic operations. The most common time-consumers are discussed in the

following sections.

3.3 Program installation

The time it takes to install a program package is not traditionally considered a software

optimization issue. But it is certainly something that can steal the user's time. The time it

takes to install a software package and make it work cannot be ignored if the goal of

software optimization is to save time for the user. With the high complexity of modern

software, it is not unusual for the installation process to take more than an hour. Neither is it

unusual that a user has to reinstall a software package several times in order to find and

resolve compatibility problems.

Software developers should take installation time and compatibility problems into account

when deciding whether to base a software package on a complex framework requiring

many files to be installed.

The installation process should always use standardized installation tools. It should be

possible to select all installation options at the start so that the rest of the installation

process can proceed unattended. Uninstallation should also proceed in a standardized

manner.

3.4 Automatic updates

Many software programs automatically download updates through the Internet at regular

time intervals. Some programs search for updates every time the computer starts up, even if

the program is never used. A computer with many such programs installed can take several

minutes to start up, which is a total waste of the user's time. Other programs use time

searching for updates each time the program starts. The user may not need the updates if

the current version satisfies the user's needs. The search for updates should be optional

and off by default unless there is a compelling security reason for updating. The update

process should run in a low priority thread, and only if the program is actually used. No

program should leave a background process running when it is not in use. The installation

of downloaded program updates should be postponed until the program is shut down and

restarted anyway.

Updates to the operating system can be particularly time consuming. Sometimes it takes

hours to install automatic updates to the operating system. This is very problematic because

these time consuming updates may come unpredictably at inconvenient times. This can be

a very big problem if the user has to turn off or log off the computer for security reasons

before leaving their workplace and the system forbids the user to turn off the computer

during the update process.

3.5 Program loading

Often, it takes more time to load a program than to execute it. The load time can be

annoyingly high for programs that are based on big runtime frameworks, intermediate code,

interpreters, just-in-time compilers, etc., as is commonly the case with programs written in

Java, C#, Visual Basic, etc.

But program loading can be a time-consumer even for programs implemented in compiled

C++. This typically happens if the program uses a lot of runtime DLLs (dynamically linked

libraries or shared objects), resource files, configuration files, help files and databases. The

operating system may not load all the modules of a big program when the program starts

up. Some modules may be loaded only when they are needed, or they may be swapped to

the hard disk if the RAM size is insufficient.

The user expects immediate responses to simple actions like a key press or mouse move. It

is unacceptable to the user if such a response is delayed for several seconds because it

requires the loading of modules or resource files from disk. Memory-hungry applications

force the operating system to swap memory to disk. Memory swapping is a frequent cause

of unacceptably long response times to simple things like a mouse move or key press.

Avoid an excessive number of DLLs, configuration files, resource files, help files etc.

scattered around on the hard disk. A few files, preferably in the same directory as the .exe

file, are acceptable.

3.6 Dynamic linking and position-independent code

Function libraries can be implemented either as static link libraries (*.lib, *.a) or dynamic

link libraries, also called shared objects (*.dll, *.so). There are several factors that can

make dynamic link libraries slower than static link libraries. These factors are explained in

detail on page 146 below.

Position-independent code is used in shared objects in Unix-like systems. Mac systems

often use position-independent code everywhere by default. Position-independent code is

inefficient, especially in 32-bit mode, for reasons explained on page 146 below.

3.7 File access

Reading or writing a file on a hard disk often takes much more time than processing the

data in the file, especially if the user has a virus scanner that scans all files on access.

Sequential forward access to a file is faster than random access. Reading or writing big

blocks is faster than reading or writing a small bit at a time. Do not read or write less than a

few kilobytes at a time.

You may mirror the entire file in a memory buffer and read or write it in one operation rather

than reading or writing small bits in a non-sequential manner.
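
Mirroring a file in a memory buffer can look like the following sketch, which uses only the standard library. The function name read_whole_file is hypothetical. The file is opened positioned at its end to learn its size and is then fetched with a single read call.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: reads an entire file into memory in one operation.
// Returns an empty vector if the file is empty or cannot be read.
std::vector<char> read_whole_file(const std::string& path) {
    // std::ios::ate positions the stream at the end, so tellg() is the size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return {};
    std::streamoff size = in.tellg();
    if (size <= 0) return {};
    std::vector<char> buffer(static_cast<std::size_t>(size));
    in.seekg(0, std::ios::beg);
    if (!in.read(buffer.data(), static_cast<std::streamsize>(size))) return {};
    return buffer;
}
```

After this call, all non-sequential access happens in the buffer rather than on the disk. This only makes sense when the file fits comfortably in RAM.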

It is usually much faster to access a file that has been accessed recently than to access it

the first time. This is because the file has been copied to the disk cache.

Files on remote or removable media such as floppy disks and USB sticks may not be

cached. This can have quite dramatic consequences. I once made a Windows program that

created a file by calling WritePrivateProfileString, which opens and closes the file

for each line written. This worked sufficiently fast on a hard disk because of disk caching,

but it took several minutes to write the file to a floppy disk.
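
One way to avoid opening and closing the file for each line is to assemble all lines in memory and write them in one operation. The sketch below illustrates this with the standard library only; write_lines_once is a hypothetical name, and the code is not a replacement for WritePrivateProfileString, which does more than append lines.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: writes all lines with a single open, a single
// write and a single close, instead of reopening the file per line.
bool write_lines_once(const std::string& path,
                      const std::vector<std::string>& lines) {
    std::string buffer;
    for (const std::string& line : lines) {
        buffer += line;
        buffer += '\n';
    }
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    out.write(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    return static_cast<bool>(out.flush());  // false if the write failed
}
```

On an uncached medium the cost then depends mainly on the amount of data rather than on the number of lines.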

A big file containing numerical data is more compact and efficient if the data are stored in

binary form than if the data are stored in ASCII form. A disadvantage of binary data storage

is that it is not human readable and not easily ported to systems with big-endian storage.
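
The difference between the two storage forms can be illustrated with a small sketch. All function names are hypothetical. The binary form writes the raw bytes of each double in the byte order of the machine, which is exactly why such a file is not portable to a system with a different byte order.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Size of a file in bytes, or -1 if it cannot be opened.
long long file_size(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    return in ? static_cast<long long>(std::streamoff(in.tellg())) : -1;
}

// Binary form: sizeof(double) bytes per value, host byte order.
bool save_binary(const std::string& path, const std::vector<double>& values) {
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(values.data()),
              static_cast<std::streamsize>(values.size() * sizeof(double)));
    return static_cast<bool>(out.flush());
}

std::vector<double> load_binary(const std::string& path) {
    long long bytes = file_size(path);
    if (bytes <= 0) return {};
    std::vector<double> values(static_cast<std::size_t>(bytes) / sizeof(double));
    std::ifstream in(path, std::ios::binary);
    in.read(reinterpret_cast<char*>(values.data()),
            static_cast<std::streamsize>(values.size() * sizeof(double)));
    if (!in) return {};
    return values;
}

// ASCII form: one number per line with 17 significant digits, which is
// enough to reproduce an IEEE double exactly but takes more space.
bool save_text(const std::string& path, const std::vector<double>& values) {
    std::ofstream out(path);
    out.precision(17);
    for (double v : values) out << v << '\n';
    return static_cast<bool>(out.flush());
}
```

Comparing file_size for the two files shows the space saving for values that need many digits. The binary form also avoids the conversion between decimal text and floating point on every read and write.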

Optimizing file access is more important than optimizing CPU use in programs that have

many file input/output operations. It can be advantageous to put file access in a separate

thread if there is other work that the processor can do while waiting for disk operations to

finish.
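
Putting file access in a separate thread can be done in several ways. The sketch below uses std::async from the C++11 standard library, which is my choice for brevity and not something the manual prescribes; load_file and load_file_async are hypothetical names. On some systems the program must be linked with the thread library (for example with -pthread).

```cpp
#include <fstream>
#include <future>
#include <iterator>
#include <string>
#include <utility>

// Reads a whole file into a string; returns an empty string on failure.
std::string load_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: starts reading the file in another thread and
// returns immediately. The caller does other work and calls get() on the
// returned future when the data are actually needed.
std::future<std::string> load_file_async(std::string path) {
    return std::async(std::launch::async, load_file, std::move(path));
}
```

The call to get() blocks only if the disk operation has not finished yet, so the benefit depends on how much useful work the calling thread has in the meantime.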

3.8 System database

It can take several seconds to access the system database in Windows. It is more efficient

to store application-specific information in a separate file than in the big registration

database in the Windows system. Note that the system may store the information in the

database anyway if you are using functions such as GetPrivateProfileString and

WritePrivateProfileString to read and write configuration files (*.ini files).
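
Storing application-specific information in a separate file needs no system functions at all. The following sketch keeps settings as key=value lines in a plain text file and uses only the standard library. The names Settings, save_settings and load_settings and the file format are assumptions made for this illustration.

```cpp
#include <fstream>
#include <map>
#include <string>

using Settings = std::map<std::string, std::string>;

// Writes each setting as one "key=value" line.
bool save_settings(const std::string& path, const Settings& settings) {
    std::ofstream out(path);
    for (const auto& entry : settings) {
        out << entry.first << '=' << entry.second << '\n';
    }
    return static_cast<bool>(out.flush());
}

// Reads the file back. Lines without '=' are ignored. A missing file
// gives an empty map so that the caller can fall back to defaults.
Settings load_settings(const std::string& path) {
    Settings settings;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        std::string::size_type eq = line.find('=');
        if (eq != std::string::npos) {
            settings[line.substr(0, eq)] = line.substr(eq + 1);
        }
    }
    return settings;
}
```

Keys must not contain '=' and values must not contain line breaks in this simple format. The file is read once at program start and written once when the settings change.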
