A Dynamically Adaptive Approach for Speculative
Loop Execution in SMT Architectures
Meirong Li and Yinliang Zhao
Department of Computer Science and Technology
Xi’an Jiaotong University
Xi’an, China
Email: meirongli.xjtu@gmail.com, zhaoy@mail.xjtu.edu.cn
Abstract—Simultaneous multithreading allows thread-level speculation to be exploited on a single processor. Due to contention for shared processor resources, the performance of speculative threads often suffers from inter-thread interference, which is hard for the compiler to estimate statically. We therefore propose an approach that defers the decision of which speculative threads to extract from parallel regions until runtime. It relies on a cycle counter architecture to collect performance profiles for each parallelized loop and to uncover the potential loop-level parallelism. These profiles are obtained by predicting the relative single-threaded execution time of speculative threads from a breakdown of their execution cycles. The performance of different loop levels is evaluated dynamically with this prediction, and only the best loop level is chosen for parallelization. Several performance tuning policies are also examined. The best policy achieves an average speedup of 1.45 on SPEC CPU2000 benchmarks, outperforming static loop selection by 33%.
Keywords—Simultaneous multithreading, Performance prediction, Loop-level parallelism, Thread-level speculation
I. INTRODUCTION
Simultaneous Multithreading (SMT) [1] is a multithreaded architecture that exploits high degrees of both instruction-level parallelism (ILP) and thread-level parallelism (TLP) more effectively than traditional single-threaded processors. Thread-level speculation (TLS), which attempts to extract possibly dependent threads from irregular sequential programs, has been studied extensively on SMT processors [2]–[4]. One of the most attractive TLS techniques is loop-level speculation [5]–[7]. A large body of compiler-based research has addressed loop selection under the assumption that each iteration, spawned as a thread, can execute independently, relying on extensive profiling or re-execution [7]–[9]. In practice, not all such spawned threads prove to be safe, due either to the inaccuracy of static performance estimation or to the effects of the underlying processor resources and varying program behaviors. To make speculation more efficient, several issues must be addressed for SMT processors:
All threads spawned on the same processor suffer from the imprecision of static performance estimation: Static performance estimation is based on a wide range of criteria, such as heuristic rules [10], cost models [9], [11], and cost-benefit estimation [6], [8]. Although all of these have proved effective, it is hard to sustain the best performance for every parallelized loop, while discarded loops are serialized outright even when some of them are better suited to parallel execution. Meanwhile, thread progress on an SMT processor not only depends on the fetch policy but also suffers from unexpected interference from other threads. This additional overhead is difficult to estimate accurately before speculation. Dynamically determining the best loop level, by contrast, can find more loop candidates and thus avoid such performance losses.
The underlying processor resources affect thread behaviors: The progress of one thread is tied to that of others because they contend for the same processor, so multithreaded execution does not behave like single-threaded execution. In particular, cycles stalled on branch mispredictions and on data cache accesses are heavily affected, and both behaviors are usually neglected by the compiler. Similarly, the number of outstanding long-latency loads rises and falls dynamically and becomes unpredictable as the number of spawned threads grows. Speculation performance is also closely tied to hardware resources such as the reorder buffer, issue queue, and cache sizes, and different hardware configurations affect thread progress. To better understand the actual benefit of each parallelized loop, these thread behaviors need to be adjusted for.
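As a minimal sketch of this adjustment, assuming a per-thread cycle breakdown with hypothetical counter fields (the structure, function name, and scaling factors below are illustrative and not taken from this paper), the contention-inflated stall components can be discounted when estimating how long a thread would have run alone:

/* Hypothetical per-thread cycle breakdown; field names are illustrative,
 * assuming the SMT core can attribute stall cycles to these categories. */
struct cycle_breakdown {
    unsigned long busy;          /* cycles spent retiring instructions      */
    unsigned long branch_stall;  /* cycles stalled on branch mispredictions */
    unsigned long dcache_stall;  /* cycles stalled on data cache misses     */
    unsigned long other_stall;   /* remaining stall cycles                  */
};

/* Estimate how long the thread would have run alone by discounting the
 * stall components inflated by inter-thread contention. The scaling
 * factors are placeholders, not values taken from this paper.            */
static unsigned long estimate_alone_cycles(const struct cycle_breakdown *b,
                                           double branch_scale,
                                           double dcache_scale)
{
    return b->busy
         + (unsigned long)((double)b->branch_stall * branch_scale)
         + (unsigned long)((double)b->dcache_stall * dcache_scale)
         + b->other_stall;
}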
The performance of speculation depends on program behaviors: For a given loop, the execution time of the same iteration can vary dramatically with different input data sets, and even the same input data set can yield different results across invocations. Such variation is difficult to capture with a few profiling runs. Compiler-based loop selection therefore achieves only suboptimal results, or even degrades performance when the parallel execution cost of a loop exceeds its sequential cost. Moreover, different program phases affect the same region differently. When threads assumed to be dependent turn out to be independent of each other, it is desirable to exploit the TLP between them; otherwise, serialization is needed to exploit more ILP. Therefore, we need to dynamically determine where and how to extract speculative threads, so as to maximize the coverage and benefit across the whole program.
This paper proposes an adaptive approach for loop-level speculation guided by runtime performance profiles. The profiles are obtained from a cycle counter architecture, in which the performance impact of speculative threads is dynamically identified and adjusted to predict their relative single-threaded (alone) execution time from a breakdown of thread execution cycles.
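To illustrate how such a prediction could drive the loop-level decision, the following minimal sketch assumes each candidate loop level records its measured speculative cycles together with a predicted alone-execution time; the structure and function names are illustrative assumptions, not the mechanism defined later in this paper:

/* Per-loop-level runtime profile; names are illustrative only.           */
struct loop_profile {
    int           loop_id;           /* identifier of the loop level        */
    unsigned long parallel_cycles;   /* measured speculative execution time */
    unsigned long predicted_seq;     /* predicted single-threaded time      */
};

/* Keep only the loop level with the largest predicted speedup; a level
 * whose predicted speedup does not exceed 1.0 would be serialized.        */
static int select_best_loop_level(const struct loop_profile *levels, int n)
{
    int    best = -1;                /* -1 means serialize every level      */
    double best_speedup = 1.0;       /* must beat sequential execution      */

    for (int i = 0; i < n; i++) {
        double speedup = (double)levels[i].predicted_seq /
                         (double)levels[i].parallel_cycles;
        if (speedup > best_speedup) {
            best_speedup = speedup;
            best = i;
        }
    }
    return best;
}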