Intel 64-bit Assembly Optimization Manual: Accelerating Program Performance
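This resource is the Intel 64 and IA-32 Architectures Optimization Reference Manual, order number 248966-019, published in October 2009. It is intended to help developers understand and optimize assembly-level code on the Intel 64 architecture in order to improve program efficiency. It covers advanced optimization techniques for Intel processors, including but not limited to instruction sets, memory management, data type handling, vectorization, and SIMD (Single Instruction, Multiple Data) techniques.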
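The document states that Intel grants no license to any intellectual property rights in connection with its products, except as expressly provided in its terms of sale. Intel assumes no liability relating to the provision of its products, and disclaims warranties including fitness for a particular purpose, product quality, and non-infringement of any patent, copyright, or other intellectual property right. Intel also notes that its products are not designed for applications that could lead to personal injury or death, so safety must be weighed carefully before using them in such settings.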
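On the technical side, the manual warns against relying on instructions or features marked as undefined or reserved, because Intel may update specifications and product descriptions at any time. Developers should therefore follow the official documentation closely so that their optimizations remain compatible with future processors.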
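The optimization guide may cover the following core topics:
1. **Instruction set optimization**: introduces the different classes of Intel 64 instructions, such as SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions), and how to use them for efficient computation.
2. **Memory management optimization**: explains how to manage the 64-bit address space effectively, including cache optimization, memory layout, and data structure design, to reduce memory access latency.
3. **Data types and vectorization**: describes how to use wide data types and SIMD instructions to process large amounts of data, improving parallelism and performance.
4. **Control flow optimization**: discusses loop unrolling, branch prediction, and the use of unconditional jumps to reduce the cost of branches.
5. **Error and exception handling**: covers how to write robust code that reduces the performance loss caused by interrupts and exceptions.
6. **Compiler and assembler features**: explains how to use the optimization options offered by the Intel compiler and assembler, such as instruction set selection and loop vectorization.
7. **Platform dependence**: warns developers about processor-specific features and limitations so that code stays compatible across platforms.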
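To make the control-flow item concrete, here is a minimal sketch in C of replacing a data-dependent branch with a mask-based select. It is not taken from the manual. The function names `select_branchy` and `select_branchless` are hypothetical, and whether a compiler turns either version into a conditional move (CMOV) or a jump depends on the compiler and its options.

```c
#include <stdint.h>

/* Branchy version: may compile to a conditional jump, which can be
 * expensive when the comparison outcome is hard to predict. */
int32_t select_branchy(int32_t a, int32_t b, int32_t x, int32_t y)
{
    if (a < b)
        return x;
    return y;
}

/* Branch-free version: turn the comparison result (0 or 1) into an
 * all-zeros or all-ones mask, then blend x and y with bitwise operations.
 * Assumes a two's-complement target for the final conversion back to
 * int32_t (implementation-defined in C, but the usual behavior). */
int32_t select_branchless(int32_t a, int32_t b, int32_t x, int32_t y)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(a < b); /* 0xFFFFFFFF if a < b, else 0 */
    return (int32_t)(((uint32_t)x & mask) | ((uint32_t)y & ~mask));
}
```

Both functions return the same value for every input. The second form trades a possible misprediction for a few extra ALU operations, so it tends to pay off only when the branch is genuinely unpredictable. Measuring on the target processor is the only reliable way to decide.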
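For any developer aiming at high-performance assembly programming on Intel 64 systems, this document is an essential reference. It combines in-depth guidance with practical examples that can help improve program speed and efficiency.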
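The memory-layout and vectorization topics listed above can be illustrated with a small sketch contrasting an array-of-structures (AoS) layout with a structure-of-arrays (SoA) layout. This is an illustrative example under stated assumptions, not code from the manual. The type names `vertex_aos` and `vertices_soa`, the functions `scale_x_aos` and `scale_x_soa`, and the size `N` are all hypothetical.

```c
#include <stddef.h>

#define N 8 /* hypothetical element count, chosen only for the sketch */

/* AoS: the x, y, z of one vertex sit next to each other in memory. */
struct vertex_aos { float x, y, z; };

/* SoA: all x values are contiguous, then all y values, then all z values. */
struct vertices_soa { float x[N]; float y[N]; float z[N]; };

/* Scaling x in AoS touches one float out of every three (strided access). */
void scale_x_aos(struct vertex_aos *v, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        v[i].x *= k;
}

/* Scaling x in SoA walks one contiguous array (unit-stride access),
 * which is the access pattern packed SIMD loads and stores work on. */
void scale_x_soa(struct vertices_soa *v, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        v->x[i] *= k;
}
```

Both loops compute the same result. The SoA version keeps the data a loop needs contiguous, which generally makes better use of each cache line and is easier for a compiler to vectorize. Whether that happens in practice depends on the compiler, its flags, and the target processor.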
CONTENTS
PAGE
EXAMPLES
Example 3-1. Assembly Code with an Unpredictable Branch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8
Example 3-2. Code Optimization to Eliminate Branches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8
Example 3-3. Eliminating Branch with CMOV Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-9
Example 3-4. Use of PAUSE Instruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-9
Example 3-5. Pentium 4 Processor Static Branch Prediction Algorithm . . . . . . . . . . . . . . . . . . . . . . 3-10
Example 3-6. Static Taken Prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-11
Example 3-7. Static Not-Taken Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-11
Example 3-8. Indirect Branch With Two Favored Targets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-14
Example 3-9. A Peeling Technique to Reduce Indirect Branch Misprediction . . . . . . . . . . . . . . . . . 3-15
Example 3-10. Loop Unrolling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-16
Example 3-11. Macro-fusion, Unsigned Iteration Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-19
Example 3-12. Macro-fusion, If Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-20
Example 3-13. Macro-fusion, Signed Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-21
Example 3-14. Macro-fusion, Signed Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-21
Example 3-15. Avoiding False LCP Delays with 0xF7 Group Instructions . . . . . . . . . . . . . . . . . . . . . . 3-23
Example 3-16. Clearing Register to Break Dependency While Negating Array Elements. . . . . . . . 3-29
Example 3-17. Spill Scheduling Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-32
Example 3-18. Dependencies Caused by Referencing Partial Registers . . . . . . . . . . . . . . . . . . . . . . . 3-35
Example 3-19. Avoiding Partial Register Stalls in Integer Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-35
Example 3-20. Avoiding Partial Register Stalls in SIMD Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-36
Example 3-21. Avoiding Partial Flag Register Stalls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-37
Example 3-22. Reference Code Template for Partially Vectorizable Program . . . . . . . . . . . . . . . . . 3-41
Example 3-23. Three Alternate Packing Methods for Avoiding Store Forwarding Difficulty . . . . 3-42
Example 3-24. Using Four Registers to Reduce Memory Spills and Simplify Result Passing. . . . . 3-43
Example 3-25. Stack Optimization Technique to Simplify Parameter Passing. . . . . . . . . . . . . . . . . . 3-43
Example 3-26. Base Line Code Sequence to Estimate Loop Overhead . . . . . . . . . . . . . . . . . . . . . . . . 3-45
Example 3-27. Loads Blocked by Stores of Unknown Address. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-47
Example 3-28. Code That Causes Cache Line Split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-49
Example 3-29. Situations Showing Small Loads After Large Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-52
Example 3-30. Non-forwarding Example of Large Load After Small Store . . . . . . . . . . . . . . . . . . . . . 3-53
Example 3-31. A Non-forwarding Situation in Compiler Generated Code . . . . . . . . . . . . . . . . . . . . . . 3-53
Example 3-32. Two Ways to Avoid Non-forwarding Situation in Example 3-31. . . . . . . . . . . . . . . . 3-53
Example 3-33. Large and Small Load Stalls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-54
Example 3-34. Loop-carried Dependence Chain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-56
Example 3-35. Rearranging a Data Structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-57
Example 3-36. Decomposing an Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-57
Example 3-37. Dynamic Stack Alignment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-59
Example 3-38. Aliasing Between Loads and Stores Across Loop Iterations. . . . . . . . . . . . . . . . . . . . 3-63
Example 3-39. Using Non-temporal Stores and 64-byte Bus Write Transactions . . . . . . . . . . . . . . 3-68
Example 3-40. Non-temporal Stores and Partial Bus Write Transactions . . . . . . . . . . . . . . . . . . . . . . . 3-68
Example 3-41. Using DCU Hardware Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-71
Example 3-42. Avoid Causing DCU Hardware Prefetch to Fetch Un-needed Lines . . . . . . . . . . . . . 3-72
Example 3-43. Technique For Using L1 Hardware Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-73
Example 3-44. REP STOSD with Arbitrary Count Size and 4-Byte-Aligned Destination . . . . . . . . . 3-76
Example 3-45. Algorithm to Avoid Changing Rounding Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-82
Example 4-1. Identification of MMX Technology with CPUID. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
Example 4-2. Identification of SSE with CPUID. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
Example 4-3. Identification of SSE2 with CPUID . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
Example 4-4. Identification of SSE3 with CPUID . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
Example 4-5. Identification of SSSE3 with CPUID . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-4
Example 4-6. Identification of SSE4.1 with CPUID . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-4
Example 4-7. Simple Four-Iteration Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-10
Example 4-8. Streaming SIMD Extensions Using Inlined Assembly Encoding . . . . . . . . . . . . . . . . . .4-11
Example 4-9. Simple Four-Iteration Loop Coded with Intrinsics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-12
Example 4-10. C++ Code Using the Vector Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-13
Example 4-11. Automatic Vectorization for a Simple Loop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-14
Example 4-12. C Algorithm for 64-bit Data Alignment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-17
Example 4-13. AoS Data Structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-20
Example 4-14. SoA Data Structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-20
Example 4-15. AoS and SoA Code Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-20
Example 4-16. Hybrid SoA Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-22
Example 4-17. Pseudo-code Before Strip Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-23
Example 4-18. Strip Mined Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-24
Example 4-19. Loop Blocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-25
Example 4-20. Emulation of Conditional Moves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-27
Example 5-1. Resetting Register Between __m64 and FP Data Types Code. . . . . . . . . . . . . . . . . . . 5-4
Example 5-2. FIR Processing Example in C language Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
Example 5-3. SSE2 and SSSE3 Implementation of FIR Processing Code . . . . . . . . . . . . . . . . . . . . . . . 5-5
Example 5-4. Zero Extend 16-bit Values into 32 Bits Using Unsigned Unpack Instructions Code. 5-7
Example 5-5. Signed Unpack Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-7
Example 5-6. Interleaved Pack with Saturation Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-9
Example 5-7. Interleaved Pack without Saturation Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-10
Example 5-8. Unpacking Two Packed-word Sources in Non-interleaved Way Code. . . . . . . . . . . .5-12
Example 5-9. PEXTRW Instruction Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-13
Example 5-10. PINSRW Instruction Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-14
Example 5-11. Repeated PINSRW Instruction Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-14
Example 5-12. Non-Unit Stride Load/Store Using SSE4.1 Instructions . . . . . . . . . . . . . . . . . . . . . . . . .5-15
Example 5-13. Scatter and Gather Operations Using SSE4.1 Instructions . . . . . . . . . . . . . . . . . . . . . .5-15
Example 5-14. PMOVMSKB Instruction Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-16
Example 5-15. Broadcast a Word Across XMM, Using 2 SSE2 Instructions . . . . . . . . . . . . . . . . . . . . .5-17
Example 5-16. Swap/Reverse words in an XMM, Using 3 SSE2 Instructions . . . . . . . . . . . . . . . . . . .5-18
Example 5-17. Generating Constants. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-19
Example 5-18. Absolute Difference of Two Unsigned Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-21
Example 5-19. Absolute Difference of Signed Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-21
Example 5-20. Computing Absolute Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-22
Example 5-21. Basic C Implementation of RGBA to BGRA Conversion . . . . . . . . . . . . . . . . . . . . . . . . .5-22
Example 5-22. Color Pixel Format Conversion Using SSE2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-23
Example 5-23. Color Pixel Format Conversion Using SSSE3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-24
Example 5-24. Big-Endian to Little-Endian Conversion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-25
Example 5-25. Clipping to a Signed Range of Words [High, Low] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-26
Example 5-26. Clipping to an Arbitrary Signed Range [High, Low] . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-26
Example 5-27. Simplified Clipping to an Arbitrary Signed Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-27
Example 5-28. Clipping to an Arbitrary Unsigned Range [High, Low] . . . . . . . . . . . . . . . . . . . . . . . . . . 5-27
Example 5-29. Complex Multiply by a Constant. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-30
Example 5-30. Using PTEST to Separate Vectorizable and non-Vectorizable Loop Iterations . . . 5-31
Example 5-31. Using PTEST and Variable BLEND to Vectorize Heterogeneous Loops. . . . . . . . . . 5-32
Example 5-32. Baseline C Code for Mandelbrot Set Map Evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . 5-33
Example 5-33. Vectorized Mandelbrot Set Map Evaluation Using SSE4.1 Intrinsics . . . . . . . . . . . . 5-34
Example 5-34. A Large Load after a Series of Small Stores (Penalty) . . . . . . . . . . . . . . . . . . . . . . . . . 5-36
Example 5-35. Accessing Data Without Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-37
Example 5-36. A Series of Small Loads After a Large Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-37
Example 5-37. Eliminating Delay for a Series of Small Loads after a Large Store . . . . . . . . . . . . . . 5-37
Example 5-38. An Example of Video Processing with Cache Line Splits . . . . . . . . . . . . . . . . . . . . . . . 5-38
Example 5-39. Video Processing Using LDDQU to Avoid Cache Line Splits. . . . . . . . . . . . . . . . . . . . . 5-39
Example 5-40. Un-optimized Reverse Memory Copy in C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-41
Example 5-41. Using PSHUFB to Reverse Byte Ordering 16 Bytes at a Time. . . . . . . . . . . . . . . . . . 5-42
Example 5-42. PMOVSX/PMOVZX Work-around to Avoid False Dependency . . . . . . . . . . . . . . . . . . 5-45
Example 5-43. Table Look-up Operations in C Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-46
Example 5-44. Shift Techniques on Non-Vectorizable Table Look-up . . . . . . . . . . . . . . . . . . . . . . . . . 5-47
Example 5-45. PEXTRD Techniques on Non-Vectorizable Table Look-up . . . . . . . . . . . . . . . . . . . . . . 5-48
Example 6-1. Pseudocode for Horizontal (xyz, AoS) Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-6
Example 6-2. Pseudocode for Vertical (xxxx, yyyy, zzzz, SoA) Computation . . . . . . . . . . . . . . . . . . 6-6
Example 6-3. Swizzling Data Using SHUFPS, MOVLHPS, MOVHLPS . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-7
Example 6-4. Swizzling Data Using UNPCKxxx Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-8
Example 6-5. Deswizzling Single-Precision SIMD Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-9
Example 6-6. Deswizzling Data Using SIMD Integer Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
Example 6-7. Horizontal Add Using MOVHLPS/MOVLHPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-12
Example 6-8. Horizontal Add Using Intrinsics with MOVHLPS/MOVLHPS . . . . . . . . . . . . . . . . . . . . . 6-12
Example 6-9. Multiplication of Two Pair of Single-precision Complex Number . . . . . . . . . . . . . . . . 6-15
Example 6-10. Division of Two Pair of Single-precision Complex Numbers . . . . . . . . . . . . . . . . . . . . 6-16
Example 6-11. Double-Precision Complex Multiplication of Two Pairs . . . . . . . . . . . . . . . . . . . . . . . . . 6-17
Example 6-12. Double-Precision Complex Multiplication Using Scalar SSE2 . . . . . . . . . . . . . . . . . . . . 6-17
Example 6-13. Dot Product of Vector Length 4 Using SSE/SSE2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-19
Example 6-14. Dot Product of Vector Length 4 Using SSE3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-19
Example 6-15. Dot Product of Vector Length 4 Using SSE4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-19
Example 6-16. Unrolled Implementation of Four Dot Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-20
Example 6-17. Normalization of an Array of Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-21
Example 6-18. Normalize (x, y, z) Components of an Array of Vectors Using SSE2. . . . . . . . . . . . . 6-22
Example 6-19. Normalize (x, y, z) Components of an Array of Vectors Using SSE4.1 . . . . . . . . . . . 6-23
Example 6-20. Data Organization in Memory for AOS Vector-Matrix Multiplication . . . . . . . . . . . . 6-24
Example 6-21. AOS Vector-Matrix Multiplication with HADDPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-24
Example 6-22. AOS Vector-Matrix Multiplication with DPPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-25
Example 6-23. Data Organization in Memory for SOA Vector-Matrix Multiplication . . . . . . . . . . . . 6-26
Example 6-24. Vector-Matrix Multiplication with Native SOA Data Layout . . . . . . . . . . . . . . . . . . . . 6-27
Example 7-1. Pseudo-code Using CLFLUSH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-13
Example 7-2. Populating an Array for Circular Pointer Chasing with Constant Stride . . . . . . . . . 7-15
Example 7-3. Prefetch Scheduling Distance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-18
Example 7-4. Using Prefetch Concatenation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-20
Example 7-5. Concatenation and Unrolling the Last Iteration of Inner Loop . . . . . . . . . . . . . . . . . .7-20
Example 7-6. Data Access of a 3D Geometry Engine without Strip-mining . . . . . . . . . . . . . . . . . . .7-26
Example 7-7. Data Access of a 3D Geometry Engine with Strip-mining. . . . . . . . . . . . . . . . . . . . . . .7-26
Example 7-8. Using HW Prefetch to Improve Read-Once Memory Traffic. . . . . . . . . . . . . . . . . . . . .7-28
Example 7-9. Basic Algorithm of a Simple Memory Copy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-33
Example 7-10. A Memory Copy Routine Using Software Prefetch. . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-34
Example 7-11. Memory Copy Using Hardware Prefetch and Bus Segmentation . . . . . . . . . . . . . . . .7-36
Example 8-1. Serial Execution of Producer and Consumer Work Items . . . . . . . . . . . . . . . . . . . . . . . . 8-6
Example 8-2. Basic Structure of Implementing Producer Consumer Threads . . . . . . . . . . . . . . . . . . 8-7
Example 8-3. Thread Function for an Interlaced Producer Consumer Model . . . . . . . . . . . . . . . . . . . 8-9
Example 8-4. Spin-wait Loop and PAUSE Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-17
Example 8-5. Coding Pitfall using Spin Wait Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-20
Example 8-6. Placement of Synchronization and Regular Variables . . . . . . . . . . . . . . . . . . . . . . . . . .8-22
Example 8-7. Declaring Synchronization Variables without Sharing a Cache Line . . . . . . . . . . . . .8-22
Example 8-8. Batched Implementation of the Producer Consumer Threads . . . . . . . . . . . . . . . . . .8-29
Example 8-9. Parallel Memory Initialization Technique Using OpenMP and NUMA . . . . . . . . . . . . .8-34
Example 10-1. A Hash Function Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-5
Example 10-2. Hash Function Using CRC32. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-6
Example 10-3. Strlen() Using General-Purpose Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-9
Example 10-4. Sub-optimal PCMPISTRI Implementation of EOS handling . . . . . . . . . . . . . . . . . . . . 10-11
Example 10-5. Strlen() Using PCMPISTRI without Loop-Carry Dependency. . . . . . . . . . . . . . . . . . . 10-12
Example 10-6. WordCnt() Using C and Byte-Scanning Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-13
Example 10-7. WordCnt() Using PCMPISTRM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-15
Example 10-8. KMP Substring Search in C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-17
Example 10-9. Brute-Force Substring Search Using PCMPISTRI Intrinsic . . . . . . . . . . . . . . . . . . . . . 10-19
Example 10-10. Substring Search Using PCMPISTRI and KMP Overlap Table . . . . . . . . . . . . . . . . . . 10-22
Example 10-11. I Equivalent Strtok_s() Using PCMPISTRI Intrinsic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-26
Example 10-12. I Equivalent Strupr() Using PCMPISTRM Intrinsic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-29
Example 10-13. UTF16 VerStrlen() Using C and Table Lookup Technique . . . . . . . . . . . . . . . . . . . . . 10-31
Example 10-14. Assembly Listings of UTF16 VerStrlen() Using PCMPISTRI . . . . . . . . . . . . . . . . . . . 10-32
Example 10-15. Intrinsic Listings of UTF16 VerStrlen() Using PCMPISTRI . . . . . . . . . . . . . . . . . . . . . 10-35
Example 10-16. Replacement String Library Strcmp Using SSE4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-38
Example 10-17. High-level flow of Character Subset Validation for String Conversion. . . . . . . . . 10-40
Example 10-18. Intrinsic Listings of atol() Replacement Using PCMPISTRI. . . . . . . . . . . . . . . . . . . . . 10-41
Example 10-19. Auxiliary Routines and Data Constants Used in sse4i_atol() listing. . . . . . . . . . . . 10-44
Example 12-1. Instruction Pairing and Alignment to Optimize Decode Throughput on Intel® Atom™ Microarchitecture . . . 12-5
Example 12-2. Alternative to Prevent AGU and Execution Unit Dependency . . . . . . . . . . . . . . . . . .12-8
Example 12-3. Pipelining Instruction Execution in Integer Computation . . . . . . . . . . . . . . . . . . . . . . . . .12-9
Example 12-4. Memory Copy of 64-byte . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-14
Example 12-5. Examples of Dependent Multiply and Add Computation . . . . . . . . . . . . . . . . . . . . . . 12-16
Example 12-6. Instruction Pointer Query Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-17
Example 12-7. Storing Absolute Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-8
Example 12-8. Auto-Generated Code of Storing Absolutes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-8
Example 12-9. Changes Signs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-8
Example 12-10. Auto-Generated Code of Sign Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-9
Example 12-11. Data Conversion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-9
Example 12-12. Auto-Generated Code of Data Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-10
Example 12-13. Un-aligned Data Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-10
Example 12-14. Auto-Generated Code to Avoid Unaligned Loads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-11
Example D-1. Aligned esp-Based Stack Frame. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-3
Example D-2. Aligned ebp-based Stack Frames. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-5