C++ Performance Optimization Guide: Platforms, Language, and Tuning Strategies
Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms is Agner Fog's detailed performance optimization manual for C++ programmers. The book covers not only language-level optimization techniques but also explores compiler choice, hardware platforms, operating systems, programming language versions, the choice of function libraries and user interface frameworks, and ways to overcome the limitations of the C++ language.
In chapter 1, "Introduction", the author stresses the necessity and challenges of optimizing software, pointing out that optimization is not always a high-payoff activity and that the effort invested must be weighed against the result. The chapter explains the different levels of optimization, including hardware costs, microprocessor characteristics, the influence of the operating system, and choosing the programming language and compiler that best fit the requirements.
Chapter 2 discusses how to choose a suitable platform: the hardware platform (processor speed and architecture), the operating system's effect on performance, the programming language (the latest version of C++), compiler optimization options, the choice of function libraries, and the choice of user interface framework, since all of these directly affect how efficiently a program runs.
The third part focuses on identifying and resolving performance bottlenecks. The author guides the reader through the time cost of each operation (e.g. in clock cycles), the use of profiling tools to locate hot spots, and optimization strategies for program installation, updates, and loading. Dynamic linking and position-independent code, file access, system database queries, graphics, system resource usage, network communication, and memory access patterns are each examined in turn.
Chapter 4 connects performance optimization with the user experience, discussing how to keep a program usable and responsive while pursuing speed; a balance must be found between performance and functionality.
Finally, chapter 5 discusses the importance of choosing the best algorithm, since the choice of algorithm directly determines how efficiently a program executes. Different problems may call for different algorithms, and efficient algorithm design is a key part of optimization.
The book gives C++ developers a comprehensive optimization framework, helping them understand how to raise a program's performance across platforms while keeping the user experience in mind. It is an indispensable reference for any programmer who strives for top performance in C++.
Integer overflow is another security problem. The official C standard says that the behavior
of signed integers in case of overflow is "undefined". This allows the compiler to ignore
overflow or assume that it doesn't occur. In the case of the Gnu compiler, the assumption
that signed integer overflow doesn't occur has the unfortunate consequence that it allows
the compiler to optimize away an overflow check. There are a number of possible remedies
against this problem: (1) check for overflow before it occurs, (2) use unsigned integers -
they are guaranteed to wrap around, (3) trap integer overflow with the option -ftrapv, but
this is extremely inefficient, (4) get a compiler warning for such optimizations with option
-Wstrict-overflow=2, or (5) make the overflow behavior well-defined with option
-fwrapv or -fno-strict-overflow.
You may deviate from the above security advice in critical parts of the code where speed is important. This can be permissible if the unsafe code is limited to well-tested functions, classes, templates, or modules with a well-defined interface to the rest of the program.
3 Finding the biggest time consumers
3.1 How much is a clock cycle?
In this manual, I am using CPU clock cycles rather than seconds or microseconds as a time
measure. This is because computers have very different speeds. If I write that something
takes 10 μs today, then it may take only 5 μs on the next generation of computers and my
manual will soon be obsolete. But if I write that something takes 10 clock cycles then it will
still take 10 clock cycles even if the CPU clock frequency is doubled.
The length of a clock cycle is the reciprocal of the clock frequency. For example, if the clock frequency is 2 GHz, then the length of a clock cycle is 1 / (2 GHz) = 0.5 ns.
A clock cycle on one computer is not always comparable to a clock cycle on another
computer. The Pentium 4 (NetBurst) CPU is designed for a higher clock frequency than
other CPUs, but it uses more clock cycles than other CPUs for executing the same piece of
code in general.
Assume that a loop in a program repeats 1000 times and that there are 100 floating point
operations (addition, multiplication, etc.) inside the loop. If each floating point operation
takes 5 clock cycles, then we can roughly estimate that the loop will take 1000 * 100 * 5 *
0.5 ns = 250 μs on a 2 GHz CPU. Should we try to optimize this loop? Certainly not! 250 μs
is less than 1/50 of the time it takes to refresh the screen. There is no way the user can see
the delay. But if the loop is inside another loop that also repeats 1000 times then we have
an estimated calculation time of 250 ms. This delay is just long enough to be noticeable but
not long enough to be annoying. We may decide to do some measurements to see if our
estimate is correct or if the calculation time is actually more than 250 ms. If the response
time is so long that the user actually has to wait for a result then we will consider if there is
something that can be improved.
3.2 Use a profiler to find hot spots
Before you start to optimize anything, you have to identify the critical parts of the program.
In some programs, more than 99% of the time is spent in the innermost loop doing
mathematical calculations. In other programs, 99% of the time is spent on reading and
writing data files while less than 1% goes to actually doing something with these data. It is very important to optimize the parts of the code that matter rather than the parts of the code that use only a small fraction of the total time. Optimizing less critical parts of the code
will not only be a waste of time, it also makes the code less clear and more difficult to debug
and maintain.
Most compiler packages include a profiler that can tell how many times each function is
called and how much time it uses. There are also third-party profilers such as AQtime, Intel
VTune and AMD CodeAnalyst.
There are several different profiling methods:
Instrumentation: The compiler inserts extra code at each function call to count how
many times the function is called and how much time it takes.
Debugging: The profiler inserts temporary debug breakpoints at every function or every
code line.
Time-based sampling: The profiler tells the operating system to generate an interrupt,
e.g. every millisecond. The profiler counts how many times an interrupt occurs in each
part of the program. This requires no modification of the program under test, but is less
reliable.
Event-based sampling: The profiler tells the CPU to generate interrupts at certain
events, for example every time a thousand cache misses have occurred. This makes it
possible to see which part of the program has most cache misses, branch
mispredictions, floating point exceptions, etc. Event-based sampling requires a CPU-
specific profiler. For Intel CPUs use Intel VTune, for AMD CPUs use AMD CodeAnalyst.
Unfortunately, profilers are often unreliable. They sometimes give misleading results or fail
completely because of technical problems.
Some common problems with profilers are:
Coarse time measurement. If time is measured with millisecond resolution and the
critical functions take microseconds to execute then measurements can become
imprecise or simply zero.
Execution time too small or too long. If the program under test finishes in a short time
then the sampling generates too little data for analysis. If the program takes too long
time to execute then the profiler may sample more data than it can handle.
Waiting for user input. Many programs spend most of their time waiting for user input or
network resources. This time is included in the profile. It may be necessary to modify the
program to use a set of test data instead of user input in order to make profiling feasible.
Interference from other processes. The profiler measures not only the time spent in the
program under test but also the time used by all other processes running on the same
computer, including the profiler itself.
Function addresses are obscured in optimized programs. The profiler identifies any hot
spots in the program by their address and attempts to translate these addresses to
function names. But a highly optimized program is often reorganized in such a way that
there is no clear correspondence between function names and code addresses. The
names of inlined functions may not be visible at all to the profiler. The result will be
misleading reports of which functions take most time.
Uses debug version of the code. Some profilers require that the code you are testing
contains debug information in order to identify individual functions or code lines. The
debug version of the code is not optimized.
Jumps between CPU cores. A process or thread does not necessarily stay in the same
processor core on multi-core CPUs, but event-counters do. This results in meaningless
event counts for threads that jump between multiple CPU cores. You may need to lock a
thread to a specific CPU core by setting a thread affinity mask.
Poor reproducibility. Delays in program execution may be caused by random events that
are not reproducible. Events such as task switches and garbage collection can occur at
random times and make parts of the program appear to take longer than they normally would.
There are various alternatives to using a profiler. A simple alternative is to run the program
in a debugger and press break while the program is running. If there is a hot spot that uses
90% of the CPU time then there is a 90% chance that the break will occur in this hot spot.
Repeating the break a few times may be enough to identify a hot spot. Use the call stack in
the debugger to identify the circumstances around the hot spot.
Sometimes, the best way to identify performance bottlenecks is to put measurement
instruments directly into the code rather than using a ready-made profiler. This does not
solve all the problems associated with profiling, but it often gives more reliable results. If you
are not satisfied with the way a profiler works then you may put the desired measurement
instruments into the program itself. You may add counter variables that count how many
times each part of the program is executed. Furthermore, you may read the time before and
after each of the most important or critical parts of the program to measure how much time
each part takes. See page 157 for further discussion of this method.
Your measurement code should have #if directives around it so that it can be disabled in
the final version of the code. Inserting your own profiling instruments in the code itself is a
very useful way to keep track of the performance during the development of a program.
The time measurements may require a very high resolution if time intervals are short. In
Windows, you can use the GetTickCount or QueryPerformanceCounter functions for
millisecond resolution. A much higher resolution can be obtained with the time stamp
counter in the CPU, which counts at the CPU clock frequency (in Windows: __rdtsc()).
The time stamp counter becomes invalid if a thread jumps between different CPU cores.
You may have to fix the thread to a specific CPU core during time measurements to avoid
this. (In Windows, SetThreadAffinityMask, in Linux, sched_setaffinity).
The program should be tested with a realistic set of test data. The test data should contain a
typical degree of randomness in order to get a realistic number of cache misses and branch
mispredictions.
When the most time-consuming parts of the program have been found, then it is important
to focus the optimization efforts on the time consuming parts only. Critical pieces of code
can be further tested and investigated by the methods described on page 157.
A profiler is most useful for finding problems that relate to CPU-intensive code. But many
programs use more time loading files or accessing databases, network and other resources
than doing arithmetic operations. The most common time-consumers are discussed in the
following sections.
3.3 Program installation
The time it takes to install a program package is not traditionally considered a software
optimization issue. But it is certainly something that can steal the user's time. The time it
takes to install a software package and make it work cannot be ignored if the goal of
software optimization is to save time for the user. With the high complexity of modern
software, it is not unusual for the installation process to take more than an hour. Neither is it
unusual that a user has to reinstall a software package several times in order to find and
resolve compatibility problems.
Software developers should take installation time and compatibility problems into account
when deciding whether to base a software package on a complex framework requiring
many files to be installed.
The installation process should always use standardized installation tools. It should be
possible to select all installation options at the start so that the rest of the installation
process can proceed unattended. Uninstallation should also proceed in a standardized
manner.
3.4 Automatic updates
Many software programs automatically download updates through the Internet at regular
time intervals. Some programs search for updates every time the computer starts up, even if
the program is never used. A computer with many such programs installed can take several
minutes to start up, which is a total waste of the user's time. Other programs spend time
searching for updates each time the program starts. The user may not need the updates if
the current version satisfies the user's needs. The search for updates should be optional
and off by default unless there is a compelling security reason for updating. The update
process should run in a low priority thread, and only if the program is actually used. No
program should leave a background process running when it is not in use. The installation
of downloaded program updates should be postponed until the program is shut down and
restarted anyway.
Updates to the operating system can be particularly time consuming. Sometimes it takes
hours to install automatic updates to the operating system. This is very problematic because
these time consuming updates may come unpredictably at inconvenient times. This can be
a very big problem if the user has to turn off or log off the computer for security reasons
before leaving their workplace and the system forbids the user to turn off the computer
during the update process.
3.5 Program loading
Often, it takes more time to load a program than to execute it. The load time can be
annoyingly high for programs that are based on big runtime frameworks, intermediate code,
interpreters, just-in-time compilers, etc., as is commonly the case with programs written in
Java, C#, Visual Basic, etc.
But program loading can be a time-consumer even for programs implemented in compiled
C++. This typically happens if the program uses a lot of runtime DLL's (dynamically linked
libraries or shared objects), resource files, configuration files, help files and databases. The
operating system may not load all the modules of a big program when the program starts
up. Some modules may be loaded only when they are needed, or they may be swapped to
the hard disk if the RAM size is insufficient.
The user expects immediate responses to simple actions like a key press or mouse move. It
is unacceptable to the user if such a response is delayed for several seconds because it
requires the loading of modules or resource files from disk. Memory-hungry applications
force the operating system to swap memory to disk. Memory swapping is a frequent cause
of unacceptably long response times to simple things like a mouse move or key press.
Avoid an excessive number of DLLs, configuration files, resource files, help files, etc.
scattered around on the hard disk. A few files, preferably in the same directory as the .exe
file, are acceptable.
3.6 Dynamic linking and position-independent code
Function libraries can be implemented either as static link libraries (*.lib, *.a) or dynamic
link libraries, also called shared objects (*.dll, *.so). There are several factors that can
make dynamic link libraries slower than static link libraries. These factors are explained in
detail on page 149 below.
Position-independent code is used in shared objects in Unix-like systems. Mac systems
often use position-independent code everywhere by default. Position-independent code is
inefficient, especially in 32-bit mode, for reasons explained on page 149 below.
3.7 File access
Reading or writing a file on a hard disk often takes much more time than processing the
data in the file, especially if the user has a virus scanner that scans all files on access.
Sequential forward access to a file is faster than random access. Reading or writing big
blocks is faster than reading or writing a small bit at a time. Do not read or write less than a
few kilobytes at a time.
You may mirror the entire file in a memory buffer and read or write it in one operation rather
than reading or writing small bits in a non-sequential manner.
It is usually much faster to access a file that has been accessed recently than to access it
the first time. This is because the file has been copied to the disk cache.
Files on remote or removable media such as floppy disks and USB sticks may not be
cached. This can have quite dramatic consequences. I once made a Windows program that
created a file by calling WritePrivateProfileString, which opens and closes the file
for each line written. This worked sufficiently fast on a hard disk because of disk caching,
but it took several minutes to write the file to a floppy disk.
A big file containing numerical data is more compact and efficient if the data are stored in
binary form than if the data are stored in ASCII form. A disadvantage of binary data storage
is that it is not human readable and not easily ported to systems with big-endian storage.
Optimizing file access is more important than optimizing CPU use in programs that have
many file input/output operations. It can be advantageous to put file access in a separate
thread if there is other work that the processor can do while waiting for disk operations to
finish.
3.8 System database
It can take several seconds to access the system database in Windows. It is more efficient
to store application-specific information in a separate file than in the big registration
database (the registry) in the Windows system. Note that the system may store the information in the
database anyway if you are using functions such as GetPrivateProfileString and
WritePrivateProfileString to read and write configuration files (*.ini files).