CONTENTS
xviii
PAGE
Example 5-28. Clipping to an Arbitrary Unsigned Range [High, Low] . . . . . . . . . . . . . . . . . . . . . . . . . . 5-27
Example 5-27. Simplified Clipping to an Arbitrary Signed Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-27
Example 5-29. Complex Multiply by a Constant. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-30
Example 5-30. Using PTEST to Separate Vectorizable and non-Vectorizable Loop Iterations . . . 5-31
Example 5-31. Using PTEST and Variable BLEND to Vectorize Heterogeneous Loops. . . . . . . . . . 5-32
Example 5-32. Baseline C Code for Mandelbrot Set Map Evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . 5-33
Example 5-33. Vectorized Mandelbrot Set Map Evaluation Using SSE4.1 Intrinsics . . . . . . . . . . . . 5-34
Example 5-34. A Large Load after a Series of Small Stores (Penalty) . . . . . . . . . . . . . . . . . . . . . . . . . 5-36
Example 5-36. A Series of Small Loads After a Large Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-37
Example 5-37. Eliminating Delay for a Series of Small Loads after a Large Store . . . . . . . . . . . . . . 5-37
Example 5-35. Accessing Data Without Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-37
Example 5-38. An Example of Video Processing with Cache Line Splits . . . . . . . . . . . . . . . . . . . . . . . 5-38
Example 5-39. Video Processing Using LDDQU to Avoid Cache Line Splits. . . . . . . . . . . . . . . . . . . . . 5-39
Example 5-40. Un-optimized Reverse Memory Copy in C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-41
Example 5-41. Using PSHUFB to Reverse Byte Ordering 16 Bytes at a Time. . . . . . . . . . . . . . . . . . 5-42
Example 5-42. PMOVSX/PMOVZX Work-around to Avoid False Dependency . . . . . . . . . . . . . . . . . . 5-45
Example 5-43. Table Look-up Operations in C Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-46
Example 5-44. Shift Techniques on Non-Vectorizable Table Look-up . . . . . . . . . . . . . . . . . . . . . . . . . 5-47
Example 5-45. PEXTRD Techniques on Non-Vectorizable Table Look-up . . . . . . . . . . . . . . . . . . . . . . 5-48
Example 6-1. Pseudocode for Horizontal (xyz, AoS) Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-6
Example 6-2. Pseudocode for Vertical (xxxx, yyyy, zzzz, SoA) Computation . . . . . . . . . . . . . . . . . . 6-6
Example 6-3. Swizzling Data Using SHUFPS, MOVLHPS, MOVHLPS . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-7
Example 6-4. Swizzling Data Using UNPCKxxx Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-8
Example 6-5. Deswizzling Single-Precision SIMD Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-9
Example 6-6. Deswizzling Data Using SIMD Integer Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
Example 6-7. Horizontal Add Using MOVHLPS/MOVLHPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-12
Example 6-8. Horizontal Add Using Intrinsics with MOVHLPS/MOVLHPS . . . . . . . . . . . . . . . . . . . . . 6-12
Example 6-9. Multiplication of Two Pair of Single-precision Complex Number . . . . . . . . . . . . . . . . 6-15
Example 6-10. Division of Two Pair of Single-precision Complex Numbers . . . . . . . . . . . . . . . . . . . . 6-16
Example 6-11. Double-Precision Complex Multiplication of Two Pairs . . . . . . . . . . . . . . . . . . . . . . . . . 6-17
Example 6-12. Double-Precision Complex Multiplication Using Scalar SSE2 . . . . . . . . . . . . . . . . . . . . 6-17
Example 6-13. Dot Product of Vector Length 4 Using SSE/SSE2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-19
Example 6-14. Dot Product of Vector Length 4 Using SSE3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-19
Example 6-15. Dot Product of Vector Length 4 Using SSE4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-19
Example 6-16. Unrolled Implementation of Four Dot Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-20
Example 6-17. Normalization of an Array of Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-21
Example 6-18. Normalize (x, y, z) Components of an Array of Vectors Using SSE2. . . . . . . . . . . . . 6-22
Example 6-19. Normalize (x, y, z) Components of an Array of Vectors Using SSE4.1 . . . . . . . . . . . 6-23
Example 6-20. Data Organization in Memory for AOS Vector-Matrix Multiplication . . . . . . . . . . . . 6-24
Example 6-21. AOS Vector-Matrix Multiplication with HADDPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-24
Example 6-22. AOS Vector-Matrix Multiplication with DPPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-25
Example 6-23. Data Organization in Memory for SOA Vector-Matrix Multiplication . . . . . . . . . . . . 6-26
Example 6-24. Vector-Matrix Multiplication with Native SOA Data Layout . . . . . . . . . . . . . . . . . . . . 6-27
Example 7-1. Pseudo-code Using CLFLUSH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-13
Example 7-2. Populating an Array for Circular Pointer Chasing with Constant Stride . . . . . . . . . 7-15
Example 7-3. Prefetch Scheduling Distance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-18
Example 7-4. Using Prefetch Concatenation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-20