A Dynamically Adaptive Approach for Speculative
Loop Execution in SMT Architectures
Meirong Li and Yinliang Zhao
Department of Computer Science and Technology
Xi’an Jiaotong University
Xi’an, China
Email: meirongli.xjtu@gmail.com, zhaoy@mail.xjtu.edu.cn
Abstract—Simultaneous multithreading allows thread-level speculation to be exploited on a single processor. Due to contention for shared processor resources, the performance of speculative threads often suffers from inter-thread interference, which is hard for the compiler to estimate statically. We therefore propose an approach that defers the decision of which speculative threads to extract from parallel regions until runtime. It relies on a cycle counter architecture to collect performance profiles for each parallelized loop and to uncover the potential loop-level parallelism. These profiles are obtained by predicting the relative single-threaded execution time of speculative threads from a breakdown of their execution cycles. The performance of different loop levels is evaluated dynamically with this prediction, and only the best loop level is chosen for parallelization. Several performance tuning policies are also examined. The best policy achieves an average speedup of 1.45 on SPEC CPU2000 benchmarks, outperforming static loop selection by 33%.
Keywords—Simultaneous multithreading, Performance prediction, Loop-level parallelism, Thread-level speculation
I. INTRODUCTION
Simultaneous Multithreading (SMT) [1] is a multithreaded architecture that exploits high degrees of both instruction-level parallelism (ILP) and thread-level parallelism (TLP) more effectively than traditional single-threaded processors. Thread-level speculation (TLS), which attempts to extract possibly dependent threads from irregular sequential programs, has been studied extensively on SMT processors [2]–[4]. One of the most attractive TLS techniques is loop-level speculation [5]–[7]. A large body of compiler-based research has addressed loop selection under the assumption that each iteration, spawned as a thread, can execute independently, relying on extensive profiling or re-execution [7]–[9]. In practice, not all such spawned threads prove to be safe, due either to the inaccuracy of static performance estimation or to the effects of the underlying processor resources and varying program behaviors. To make speculation more efficient, several issues must be addressed for SMT processors:
All threads spawned on the same processor suffer from the imprecision of static performance estimation: Static performance estimation is based on a wide range of criteria, such as heuristic rules [10], cost models [9], [11], and cost-benefit estimation [6], [8]. Although all of these have proved effective, it is hard to sustain the best performance for every parallelized loop, while discarded loops are serialized outright even when some of them are better suited to parallel execution. Meanwhile, thread progress on an SMT processor not only depends on the fetch policy but also suffers from unexpected interference from other threads. This additional overhead is difficult to estimate accurately before speculation. Dynamically determining the best loop level, by contrast, can find more loop candidates and thus avoid such performance losses.
The underlying processor resources affect thread behaviors: The progress of one thread is tied to that of others because they contend for the same processor, so multithreaded execution does not behave like single-threaded execution. In particular, cycles stalled on branch mispredictions and on data cache accesses are heavily affected, and both behaviors are usually neglected by the compiler. Similarly, the number of outstanding long-latency loads rises and falls dynamically and becomes unpredictable as the number of spawned threads grows. Speculation performance is also closely tied to hardware resources such as the reorder buffer, issue queue, and cache sizes, and different hardware configurations affect thread progress. To better understand the actual benefit of each parallelized loop, these thread behaviors need to be adjusted for.
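As a minimal sketch of this adjustment, assuming a per-thread cycle breakdown with hypothetical counter fields (the structure, function name, and scaling factors below are illustrative and not taken from this paper), the contention-inflated stall components can be discounted when estimating how long a thread would have run alone:

/* Hypothetical per-thread cycle breakdown; field names are illustrative,
 * assuming the SMT core can attribute stall cycles to these categories. */
struct cycle_breakdown {
    unsigned long busy;          /* cycles spent retiring instructions      */
    unsigned long branch_stall;  /* cycles stalled on branch mispredictions */
    unsigned long dcache_stall;  /* cycles stalled on data cache misses     */
    unsigned long other_stall;   /* remaining stall cycles                  */
};

/* Estimate how long the thread would have run alone by discounting the
 * stall components inflated by inter-thread contention. The scaling
 * factors are placeholders, not values taken from this paper.            */
static unsigned long estimate_alone_cycles(const struct cycle_breakdown *b,
                                           double branch_scale,
                                           double dcache_scale)
{
    return b->busy
         + (unsigned long)((double)b->branch_stall * branch_scale)
         + (unsigned long)((double)b->dcache_stall * dcache_scale)
         + b->other_stall;
}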
The performance of speculation depends on program behaviors: For a given loop, the execution time of the same iteration can vary dramatically with different input data sets, and even the same input data set can yield different results across invocations. Such variation is difficult to capture with a few profiling runs. Compiler-based loop selection therefore achieves only suboptimal results, or even degrades performance when the parallel execution cost of a loop exceeds its sequential cost. Moreover, different program phases affect the same region differently. When threads assumed to be dependent turn out to be independent of each other, it is desirable to exploit the TLP between them; otherwise, serialization is needed to exploit more ILP. Therefore, we need to dynamically determine where and how to extract speculative threads, so as to maximize the coverage and benefit across the whole program.
This paper proposes an adaptive approach for loop-level speculation guided by runtime performance profiles. The profiles are obtained from a cycle counter architecture, in which the performance impact of speculative threads is dynamically identified and adjusted to predict their relative single-threaded (alone) execution time from a breakdown of thread execution cycles.
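To illustrate how such a prediction could drive the loop-level decision, the following minimal sketch assumes each candidate loop level records its measured speculative cycles together with a predicted alone-execution time; the structure and function names are illustrative assumptions, not the mechanism defined later in this paper:

/* Per-loop-level runtime profile; names are illustrative only.           */
struct loop_profile {
    int           loop_id;           /* identifier of the loop level        */
    unsigned long parallel_cycles;   /* measured speculative execution time */
    unsigned long predicted_seq;     /* predicted single-threaded time      */
};

/* Keep only the loop level with the largest predicted speedup; a level
 * whose predicted speedup does not exceed 1.0 would be serialized.        */
static int select_best_loop_level(const struct loop_profile *levels, int n)
{
    int    best = -1;                /* -1 means serialize every level      */
    double best_speedup = 1.0;       /* must beat sequential execution      */

    for (int i = 0; i < n; i++) {
        double speedup = (double)levels[i].predicted_seq /
                         (double)levels[i].parallel_cycles;
        if (speedup > best_speedup) {
            best_speedup = speedup;
            best = i;
        }
    }
    return best;
}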