Intel and AMD CPU Microarchitecture: An Optimization Guide

; Example 3.3a. P1 consecutive branches
        call  P
        test  eax,eax
        jz    L2
L1:     mov   [edi],ebx
        add   edi,4
        dec   eax
        jnz   L1
L2:     call  P

First, we may note that the function P is called alternately from two different locations. This

means that the target for the return from P will be changing all the time. Consequently, the

return from P will always be mispredicted.

Assume, now, that EAX is zero. The jump to L2 will not have its target loaded because the

mispredicted return caused a pipeline flush. Next, the second CALL P will also fail to have

its target loaded because JZ L2 caused a pipeline flush. Here we have the situation where

a chain of consecutive jumps makes the pipeline flush repeatedly because the first jump

was mispredicted. The BTB entry for JZ L2 is stored at the address of P's return

instruction. This BTB entry will now be misapplied to whatever comes after the second CALL

P, but that doesn't give a penalty because the pipeline is flushed by the mispredicted

second return.

Now, let's see what happens if EAX has a nonzero value the next time: JZ L2 is always

predicted to fall through because of the flush. The second CALL P has a BTB entry at the

address of TEST EAX,EAX. This entry will be misapplied to the MOV/ADD pair, predicting it

to jump to P. This causes a flush which prevents JNZ L1 from loading its target. If we have

been here before, then the second CALL P will have another BTB entry at the address of

DEC EAX. On the second and third iteration of the loop, this entry will also be misapplied to

the MOV/ADD pair, until it has had its state decremented to 1 or 0. This will not cause a

penalty on the second iteration because the flush from JNZ L1 prevents it from loading its

false target, but on the third iteration it will. The subsequent iterations of the loop have no

penalties, but when it exits, JNZ L1 is mispredicted. The flush would now prevent CALL P

from loading its target, were it not for the fact that the BTB entry for CALL P has already

been destroyed by being misapplied several times. We can improve this code by putting in

some NOPs to separate all consecutive jumps:

; Example 3.3b. P1 consecutive branches
        call  P
        test  eax,eax
        nop
        jz    L2
L1:     mov   [edi],ebx
        add   edi,4
        dec   eax
        jnz   L1
L2:     nop
        nop
        call  P

The extra NOPs cost 2 clock cycles, but they save much more. Furthermore, JZ L2 is now

moved to the U-pipe which reduces its penalty from 4 to 3 when mispredicted. The only

problem that remains is that the returns from P are always mispredicted. This problem can

only be solved by replacing the call to P by an inline macro.

3.3 Branch prediction in PMMX, PPro, P2, and P3

BTB organization

The branch target buffer (BTB) of the PMMX has 256 entries organized as 16 ways * 16

sets. Each entry is identified by bits 2-31 of the address of the last byte of the control

transfer instruction it belongs to. Bits 2-5 define the set, and bits 6-31 are stored in the BTB

as a tag. Control transfer instructions which are spaced 64 bytes apart have the same set-

value and may therefore occasionally push each other out of the BTB. Since there are 16

ways per set, this won't happen too often.

The branch target buffer (BTB) of the PPro, P2 and P3 has 512 entries organized as 16

ways * 32 sets. Each entry is identified by bits 4-31 of the address of the last byte of the

control transfer instruction it belongs to. Bits 4-8 define the set, and all bits are stored in the

BTB as a tag. Control transfer instructions which are spaced 512 bytes apart have the same

set-value and may therefore occasionally push each other out of the BTB. Since there are

16 ways per set, this won't happen too often.
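
The set selection described above can be illustrated with a few lines of Python. This is only a sketch of the bit arithmetic stated in the text (bits 2-5 for the PMMX, bits 4-8 for the PPro, P2 and P3); the function names are my own and are not part of any tool or manual.

```python
def btb_set_index(address, low_bit, set_bits):
    """Extract the BTB set number from the address of the last byte of a branch."""
    return (address >> low_bit) & ((1 << set_bits) - 1)


def pmmx_set(address):
    # PMMX: bits 2-5 of the address select one of 16 sets
    return btb_set_index(address, low_bit=2, set_bits=4)


def ppro_set(address):
    # PPro, P2 and P3: bits 4-8 of the address select one of 32 sets
    return btb_set_index(address, low_bit=4, set_bits=5)
```

With these definitions, two branches whose last bytes are 64 bytes apart get the same PMMX set number, and two branches 512 bytes apart get the same set number on the PPro, P2 and P3, so they compete for the same 16 ways.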

The PPro, P2 and P3 allocate a BTB entry to any control transfer instruction the first time it

is executed. The PMMX allocates it the first time it jumps. A branch instruction that never

jumps will stay out of the BTB on the PMMX. As soon as it has jumped once, it will stay in

the BTB, even if it never jumps again. An entry may be pushed out of the BTB when another

control transfer instruction with the same set-value needs a BTB entry.

Misprediction penalty

In the PMMX, the penalty for misprediction of a conditional jump is 4 clocks in the U-pipe,

and 5 clocks if it is executed in the V-pipe. For all other control transfer instructions it is 4

clocks.

In the PPro, P2 and P3, the misprediction penalty is higher due to the long pipeline. A

misprediction usually costs between 10 and 20 clock cycles.

Pattern recognition for conditional jumps

The PMMX, PPro, P2 and P3 all use a two-level adaptive branch predictor with a local 4-bit

history, as explained on page 8. Simple repetitive patterns are predicted well by this

mechanism. For example, a branch which is alternately taken twice and not taken twice will

be predicted correctly every time after a short learning period. The rule on page 9 tells which

repetitive branch patterns can be predicted perfectly. All patterns with a period of five or less

are predicted perfectly. This means that a loop which always repeats five times will have no

mispredictions, but a loop that repeats six or more times will not be predicted perfectly.
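
The effect of a 4-bit local history can be tried out with a small simulation. The following Python sketch models a generic two-level adaptive predictor: a 4-bit history selects one of 16 two-bit saturating counters. It is a textbook model, not the exact circuit of any of these processors, and the function name, the initial counter state and the warm-up length are assumptions made for the illustration.

```python
def steady_state_mispredictions(pattern, history_bits=4, periods=50, warmup_periods=10):
    """Count mispredictions of a repeating pattern after a warm-up phase.

    pattern is a string such as "11110", where '1' means taken and '0' not taken.
    """
    mask = (1 << history_bits) - 1
    counters = [1] * (1 << history_bits)   # 2-bit counters, all start at "weakly not taken"
    history = 0                            # outcomes of the last history_bits executions
    mispredictions = 0
    for period in range(periods):
        for symbol in pattern:
            taken = symbol == "1"
            predicted_taken = counters[history] >= 2
            if period >= warmup_periods and predicted_taken != taken:
                mispredictions += 1
            # Update the counter selected by the current history, saturating at 0 and 3
            if taken:
                counters[history] = min(3, counters[history] + 1)
            else:
                counters[history] = max(0, counters[history] - 1)
            # Shift the new outcome into the history
            history = ((history << 1) | int(taken)) & mask
    return mispredictions
```

In this model the pattern "11110" (a loop that repeats five times) and the pattern "1100" give no mispredictions after warm-up, because every 4-bit history is always followed by the same outcome. The pattern "111110" (a loop that repeats six times) keeps being mispredicted at least once per period, because the history 1111 is followed sometimes by a taken and sometimes by a not taken branch.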

The branch prediction mechanism is also good at handling 'almost regular' patterns, or

deviations from the regular pattern. Not only does it learn what the regular pattern looks like;

it also learns what deviations from the regular pattern look like. If deviations are always of

the same type, then it will remember what comes after the irregular event, and the deviation

will cost only one misprediction. Likewise, a branch which switches back and forth between

two different regular patterns is predicted well.

Tight loops (PMMX)

Branch prediction in the PMMX is not reliable in tiny loops where the pattern recognition

mechanism doesn't have time to update its data before the next branch is met. This means

that simple patterns, which would normally be predicted perfectly, are not recognized.

Incidentally, some patterns which normally would not be recognized, are predicted perfectly

in tight loops. For example, a loop which always repeats 6 times would have the branch

pattern 111110 for the branch instruction at the bottom of the loop. This pattern would

normally have one or two mispredictions per iteration, but in a tight loop it has none. The

same applies to a loop which repeats 7 times. Most other repeat counts are predicted less

reliably in tight loops than in normal loops.

To find out whether a loop will behave as 'tight' on the PMMX you may use the following

rule of thumb: Count the number of instructions in the loop. If the number is 6 or less, then

the loop will behave as tight. If you have more than 7 instructions, then you can be

reasonably sure that the pattern recognition functions normally. Strangely enough, it doesn't

matter how many clock cycles each instruction takes, whether it has stalls, or whether it is

paired or not. Complex integer instructions do not count. A loop can have lots of complex

integer instructions and still behave as a tight loop. A complex integer instruction is a non-

pairable integer instruction that always takes more than one clock cycle. Complex floating

point instructions and MMX instructions still count as one. Note, that this rule of thumb is

heuristic and not completely reliable.

Tight loops on PPro, P2 and P3 are predicted normally, and take a minimum of two clock cycles

per iteration.

Indirect jumps and calls (PMMX, PPro, P2 and P3)

There is no pattern recognition for indirect jumps and calls, and the BTB can remember no

more than one target for an indirect jump. It is simply predicted to go to the same target as it

did last time.
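
The consequence of remembering only one target can be seen in a minimal Python sketch of a "same target as last time" predictor. The function name is hypothetical, and counting the very first execution as a misprediction (no BTB entry yet) is an assumption of the sketch.

```python
def last_target_mispredictions(targets):
    """Count mispredictions when a jump is predicted to repeat its previous target."""
    predicted = None            # no BTB entry before the first execution
    misses = 0
    for target in targets:
        if target != predicted:
            misses += 1
        predicted = target      # only the most recent target is remembered
    return misses
```

An indirect jump that always goes to the same target is mispredicted only the first time in this model, while one that alternates between two targets, for example ["A", "B", "A", "B"], is mispredicted every time.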

JECXZ and LOOP (PMMX)

There is no pattern recognition for these two instructions in the PMMX. They are simply

predicted to go the same way as last time they were executed. These two instructions

should be avoided in time-critical code for PMMX. In PPro, P2 and P3 they are predicted

using pattern recognition, but the LOOP instruction is still inferior to DEC ECX / JNZ.

3.4 Branch prediction in P4 and P4E

The organization of the branch target buffer (BTB) in the P4 and P4E is not known in detail.

It has 4096 entries, probably organized as 8 ways * 512 sets. It is indexed by addresses in

the trace cache which do not necessarily have a simple correspondence to addresses in the

original code. Consequently, it is difficult for the programmer to predict or avoid BTB

contentions. Far jumps, calls and returns are not predicted in the P4 and P4E.

The processor allocates a BTB entry to any near control transfer instruction the first time it

jumps. A branch instruction which never jumps will stay out of the BTB, but not out of the

branch history register. As soon as it has jumped once, it will stay in the BTB, even if it

never jumps again. An entry may be pushed out of the BTB when another control transfer

instruction with the same set-value needs a BTB entry. All conditional jumps, including

JECXZ and LOOP, contribute to the branch history register. Unconditional and indirect

jumps, calls and returns do not contribute to the branch history.

Branch mispredictions are much more expensive on the P4 and P4E than on previous

generations of microprocessors. The time it takes to recover from a misprediction is rarely

less than 24 clock cycles, and typically around 45 uops. Apparently, the microprocessor

cannot cancel a bogus uop before it has reached the retirement stage. This means that if

you have a lot of uops with long latency or poor throughput, then the penalty for a

misprediction may be as high as 100 clock cycles or more. It is therefore very important to

organize code so that the number of mispredictions is minimized.

Pattern recognition for conditional jumps in P4

The P4 uses an "agree" predictor with a 16-bit global history, as explained on page 11. The

branch history table has 4096 entries, according to an article in Ars Technica (J. Stokes:

The Pentium 4 and the G4e: an Architectural Comparison: Part I. arstechnica.com, Mar.

2001). The prediction rule on page 9 tells us that the P4 can predict any repetitive pattern

with a period of 17 or less, as well as some patterns with longer periods. However, this

applies to the global history, not the local history. You therefore have to look at the

preceding branches in order to determine whether a branch is likely to be well predicted. I

will explain this with the following example:

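; Example 3.4. P4 loops and branches
        mov   eax, 100            ; A loop repeats 100 times
A:      ...
        ...
        mov   ebx, 16             ; B loop repeats 16 times
B:      ...
        sub   ebx, 1
        jnz   B
        test  eax, 1
        jnz   X1                  ; taken when EAX is odd
        call  EAX_IS_EVEN
        jmp   X2
X1:     call  EAX_IS_ODD
X2:     ...
        mov   ecx, 0              ; C loop counts ECX from 0 to 9
C1:     cmp   ecx, 10
        jnb   C2                  ; taken when ECX reaches 10
        ...
        add   ecx, 1
        jmp   C1
C2:     ...
        sub   eax, 1
        jnz   A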

The A loop repeats 100 times. The JNZ A instruction is taken 99 times and falls through 1

time. It will be mispredicted when it falls through. The B and C loops are inside the A loop.

The B loop repeats 16 times, so without considering the prehistory, we would expect it to be

predictable. But we have to consider the prehistory. With the exception of the first time, the

prehistory for JNZ B will look like this: JNB C2: not taken 10 times, taken 1 time (JMP C1

does not count because it is unconditional); JNZ A taken; JNZ B taken 15 times, not taken

1 time. This totals 17 consecutive taken branches in the global history before JNZ B is not

taken. It will therefore be mispredicted once or twice for each iteration of the A loop. There is a way to avoid

this misprediction. If you insert a dummy branch that always falls through anywhere

between the A: and B: labels, then JNZ B is likely to be predicted perfectly, because the

prehistory now has a not taken before the 15 times taken. The time saved by predicting JNZ

B well is far more than the cost of an extra dummy branch. The dummy branch may, for

example, be TEST ESP,ESP / JC B.

JNZ X1 is taken every second time and is not correlated with any of the preceding 16

conditional jump events, so it will not be predicted well.

Assuming that the called procedures do not contain any conditional jumps, the prehistory for

JNB C2 is the following: JNZ B taken 15 times, not taken 1 time; JNZ X1 taken or not

taken; JNB C2: not taken 10 times, taken 1 time. The prehistory of JNB C2 is thus always

unique. In fact, it has 22 different and unique prehistories, and it will be predicted well. If

there were another conditional jump inside the C loop, for example if the JMP C1 instruction

were conditional, then JNB C2 would not be predicted well, because there would

be 20 conditional jump events between each time JNB C2 is taken.

In general, a loop cannot be predicted well on the P4 if the repeat count multiplied by the

number of conditional jumps inside the loop exceeds 17.
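
The prehistory argument for example 3.4 can be checked mechanically. The Python sketch below generates the sequence of conditional jump outcomes for the example and reports which branches have a 16-event global prehistory that is followed sometimes by taken and sometimes by not taken. This is an idealized test of whether the history determines the outcome; it does not model table size, aliasing, the agree mechanism or warm-up, and all function names are assumptions made for the illustration.

```python
def example_3_4_trace(dummy_branch=False, a_count=100):
    """Return the conditional jump events of example 3.4 as (name, taken) pairs."""
    events = []
    for eax in range(a_count, 0, -1):
        if dummy_branch:
            events.append(("DUMMY", False))          # e.g. TEST ESP,ESP / JC: never taken
        for ebx in range(16, 0, -1):
            events.append(("JNZ B", ebx - 1 != 0))   # taken 15 times, then falls through
        events.append(("JNZ X1", eax % 2 == 1))      # taken when EAX is odd
        for ecx in range(11):
            events.append(("JNB C2", ecx >= 10))     # not taken 10 times, then taken
        events.append(("JNZ A", eax - 1 != 0))       # taken until EAX reaches zero
    return events


def outcomes_by_history(events, history_bits=16):
    """Map (branch, global prehistory) to the set of outcomes that followed it."""
    seen = {}
    history = []
    for name, taken in events:
        if len(history) >= history_bits:             # skip events without a full prehistory
            key = (name, tuple(history[-history_bits:]))
            seen.setdefault(key, set()).add(taken)
        history.append(taken)                        # unconditional jumps are never recorded
    return seen


def ambiguous_branches(events, history_bits=16):
    """Branches for which some prehistory is followed by both outcomes."""
    seen = outcomes_by_history(events, history_bits)
    return {name for (name, _), outcomes in seen.items() if len(outcomes) == 2}
```

Without the dummy branch, this check flags JNZ B, JNZ X1 and JNZ A. With dummy_branch=True, JNZ B is no longer flagged, while JNZ X1 (alternating) and JNZ A (the single loop exit) remain. JNB C2 is never flagged, and it is seen with 22 distinct prehistories, in agreement with the discussion above.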

Alternating branches

While the C loop in the above example is predictable, and the B loop can be made

predictable by inserting a dummy branch, we still have a big problem with the JNZ X1

branch. This branch is alternately taken and not taken, and it is not correlated with any of

the preceding 16 branch events. Let's study the behavior of the predictors in this case. If the

local predictor starts in state "weakly not taken", then it will alternate between "weakly not

taken" and "strongly not taken" (see figure 3.1). If the entry in the global pattern history table

starts in an agree state, then the branch will be predicted to fall through every time, and we

will have 50% mispredictions (see figure 3.3). If the global predictor happens to start in state

"strongly disagree", then it will be predicted to be taken every time, and we still have 50%

mispredictions. The worst case is if the global predictor starts in state "weakly disagree". It

will then alternate between "weakly agree" and "weakly disagree", and we will have 100%

mispredictions. There is no way to control the starting state of the global predictor, but we

can control the starting state of the local predictor. The local predictor starts in state "weakly

not taken" or "weakly taken", according to the rules of static prediction, explained on page

26 below. If we swap the two branches and replace JNZ with JZ, so that the branch is taken

the first time, then the local predictor will alternate between state "weakly not taken" and

"weakly taken". The global predictor will soon go to state "strongly disagree", and the branch

will be predicted correctly all the time. A backward branch that alternates would have to be

organized so that it is not taken the first time, to obtain the same effect. Instead of swapping

the two branches, we may insert a 3EH prediction hint prefix immediately before the JNZ

X1 to change the static prediction to "taken" (see p. 26). This will have the same effect.
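
The two behaviors of the local predictor described above can be reproduced with a minimal Python sketch of a 2-bit saturating counter. It assumes that the local predictor behaves as an ordinary saturating counter with the four states named in the text; the global agree predictor is not modeled, and the names used here are my own.

```python
STATE_NAMES = ["strongly not taken", "weakly not taken", "weakly taken", "strongly taken"]


def local_counter_states(start_state, outcomes):
    """Return the name of the counter state after each branch outcome (True = taken)."""
    state = start_state
    states = []
    for taken in outcomes:
        # Move one step towards "strongly taken" or "strongly not taken", saturating
        state = min(3, state + 1) if taken else max(0, state - 1)
        states.append(STATE_NAMES[state])
    return states
```

Starting in state 1 ("weakly not taken"), the outcome sequence not taken, taken, not taken, taken keeps the counter moving between "strongly not taken" and "weakly not taken", so the local prediction never changes. If the branch is taken the first time, the same counter moves between "weakly taken" and "weakly not taken", which is the situation obtained by swapping the two branches.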

While this method of controlling the initial state of the local predictor solves the problem in

most cases, it is not completely reliable. It may not work if the first time the branch is seen is

after a mispredicted preceding branch. Furthermore, the sequence may be broken by a task

switch or other event that pushes the branch out of the BTB. We have no way of predicting

whether the branch will be taken or not taken the first time it is seen after such an event.

Fortunately, it appears that the designers have been aware of this problem and

implemented a way to solve it. While researching these mechanisms, I discovered an

undocumented prefix, 64H, which does the trick on the P4. This prefix doesn't change the

static prediction, but it controls the state of the local predictor after the first event so that it

will toggle between state "weakly not taken" and "weakly taken", regardless of whether the

branch is taken or not taken the first time. This trick can be summarized in the following rule:

A branch which is taken exactly every second time, and which doesn't correlate with any of

the preceding 16 branch events, can be predicted well on the P4 if it is preceded by a 64H

prefix. This prefix is coded in the following way:

; Example 3.5. P4 alternating branch hint
        DB    64H                 ; Hint prefix for alternating branch
        jnz   X1                  ; Branch instruction

No prefix is needed if the branch can see a previous instance of itself in the 16-bit

prehistory.

The 64H prefix has no effect and causes no harm on any previous microprocessor. It is an

FS segment prefix. The 64H prefix cannot be used together with the 2EH and 3EH static

prediction prefixes.

Pattern recognition for conditional jumps in P4E

Branch prediction in the P4E is simpler than in the P4. There is no agree predictor, but only

a 16-bit global history and a global pattern history table. This means that a loop can be

predicted well on the P4E if the repeat count multiplied by the number of conditional jumps

inside the loop does not exceed 17.
