Figure 2: The MIC-Meter Overview.
a highly accurate, non-intrusive timing method. Alternatively, we can measure a long enough sequence of operations with an accurate timer, and estimate the latency per operation by dividing the measured time by the number of operations. In this paper, latency measurements are done with a single thread (for individual operations) or two threads (for transfer operations) with Pthreads. All latency benchmarks are written in C (with inline assembly).
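As an illustration of this divide-by-the-number-of-operations estimate, the following C sketch times a long chain of dependent multiplications and divides the elapsed time by the chain length; the operation, chain length, and constants are illustrative choices, not the actual MIC-Meter code.

#include <stdio.h>
#include <time.h>

#define CHAIN 100000000UL   /* long enough to amortize timer overhead */

int main(void) {
    struct timespec t0, t1;
    double x = 1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < CHAIN; i++)
        x = x * 1.0000001;          /* each multiply depends on the previous result */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* estimated latency per operation = total time / number of dependent operations */
    printf("x = %g, latency per multiply: %.2f ns\n", x, sec * 1e9 / CHAIN);
    return 0;
}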
Throughput is the number of operations (of a given type) executed in a unit of time. As higher throughput means better performance, microbenchmarking focuses on measuring the maximum achievable throughput for different operations, under different loads; typically, the benchmarked throughput values are slightly lower than the theoretical ones. Thus, to measure maximum throughput, the main challenge is to build the workload such that the resource being evaluated is fully utilized. For example, when measuring computational throughput, enough threads should be used to fully utilize the cores, while when measuring memory bandwidth, the workload needs enough threads to generate sufficient memory requests. For all the throughput measurements in this paper, our multi-threaded workloads are written in C and OpenMP.
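The sketch below shows the general shape of such an OpenMP throughput harness, here for memory bandwidth; the triad-like kernel, array sizes, and GB/s accounting are illustrative choices rather than the actual MIC-Meter benchmark.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (64L * 1024 * 1024)   /* elements per array; large enough to exceed the caches */

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for            /* enough threads to keep the memory system busy */
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];       /* triad-like kernel: two reads and one write per element */
    double t1 = omp_get_wtime();

    /* bytes moved = 3 arrays * N elements * sizeof(double) */
    printf("bandwidth: %.2f GB/s\n", 3.0 * N * sizeof(double) / (t1 - t0) / 1e9);
    free(a); free(b); free(c);
    return 0;
}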
We note that the similarities between Xeon Phi and a regular multi-core CPU allow us to adapt existing CPU benchmarks to the requirements of Xeon Phi. In most cases, we use such "refurbished" solutions, which prove to serve our purposes.
3. EMPIRICAL EVALUATION
In the following sections, we present in detail the MIC-Meter and the results for each of the components: (1) the vector processing cores, (2) the on-chip and off-chip memory, (3) the ring interconnect, and (4) the PCIe connection.
3.1 Vector Processing Cores
We evaluate the vector processing cores in terms of both instruction latency and throughput. For latency, we use a method similar to the ones proposed by Agner Fog [7] and Torbjorn Granlund [9]: we measure instruction latency by running a (long enough) sequence of dependent instructions (i.e., a list of instructions that, being dependent on each other, are forced to execute sequentially - an instruction stream).
The same papers propose a similar approach to measure throughput in terms of instructions per cycle (IPC). However, we argue that a measurement that uses all processing cores together, and not in isolation, is more realistic for programmers. Thus, we develop a flops microbenchmark to explore the factors required to reach the theoretical maximum throughput on Xeon Phi (Section 3.1.2).
3.1.1 Vector Instruction Latency
Xeon Phi introduces 177 vector instructions [11]. We roughly divide these instructions into five classes³: mask instructions, arithmetic (logic) instructions, conversion instructions, permutation instructions, and extended mathematical instructions.
To measure the latency of vector instructions, our benchmark times the execution of a sequence of 100 vector operations of the same form: zmm1 = op(zmm1, zmm2), where zmm1 and zmm2 represent two vectors and op is the instruction being measured. By making zmm1 both a source operand and the destination operand, we ensure the instruction dependency - i.e., the current operation depends on the result of the previous one.
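A minimal sketch of this dependent-chain measurement is shown below for vaddpd, assuming a KNC toolchain (e.g., icc -mmic) with a GNU-style assembler and a TSC-based timer; warm-up runs, timer-overhead correction, and careful register handling are omitted for brevity.

#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void) {
    uint64_t start, end;

    /* Zero the chain registers, then execute 100 dependent vaddpd instructions:
     * zmm1 is both a source and the destination, so each instruction must wait
     * for the result of the previous one. */
    asm volatile("vpxord %%zmm1, %%zmm1, %%zmm1\n\t"
                 "vpxord %%zmm2, %%zmm2, %%zmm2" ::: "memory");
    start = rdtsc();
    asm volatile(".rept 100\n\t"
                 "vaddpd %%zmm2, %%zmm1, %%zmm1\n\t"
                 ".endr" ::: "memory");
    end = rdtsc();

    /* A longer chain, warm-up runs, and subtracting the timer overhead
     * would improve the accuracy of the estimate. */
    printf("vaddpd latency: ~%.1f cycles\n", (end - start) / 100.0);
    return 0;
}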
For special classes of instructions - such as the conversion instructions vcvtps2pd and vcvtpd2ps - we have to measure the latency of the conversion pair (zmm2 = op12(zmm1); zmm1 = op21(zmm2)) in order to guarantee the dependency between contiguous instructions (i.e., it is not possible to write the result of the conversion in the same source operand, due to type incompatibility). Similarly, we measure the latency of extended mathematical instructions such as vexp223ps and vlog2ps in pairs, to avoid overflow (e.g., when using 100 successive exp()'s).
The most interesting results for vector instruction latency are presented in Table 1. With these latency numbers, we can determine how many threads or instruction streams are needed to hide the latency on one processing core: for example, an arithmetic instruction with a 4-cycle latency requires four independent instruction streams to keep the vector unit busy every cycle.
Table 1: The vector instruction latency (in cycles).

Instruction                                           | Category                            | Latency
kand, kor, knot, kxor                                 | mask instructions                   | 2
vaddpd, vfmadd213pd, vmulpd, vsubpd                   | arithmetic instructions             | 4
vcvtdq2pd, vcvtfxpntdq2ps, vcvtfxpntps2dq, vcvtps2pd  | convert instructions                | 5
vpermd, vpermf32x4                                    | permutation instructions            | 6
vexp223ps, vlog2ps, vrcp23ps, vrsqrt23ps              | extended mathematical instructions  | 6
3.1.2 Vector Instruction Throughput
The Xeon Phi 5100 has 60 cores working at 1.05 GHz, and each core can process 8 double-precision data elements at a time, with a maximum of 2 operations (a multiply-add, or mad) per cycle in each lane (i.e., per vector element). Therefore, the theoretical instruction throughput is 1008 GFlops (approximately 1 TFlop). But is this 1 TFlop performance actually achievable? To measure the instruction throughput, we run 1, 2, or 4 threads per core (60, 120, and 240 threads in total). During the measurement, each thread performs one or two instruction streams for a fixed number of iterations: b_{i+1} = b_i op a, where i represents the iteration, a is a constant, and b serves as both an operand and the destination. The loop is fully unrolled to avoid branch overheads. The microbenchmark is vectorized using explicit intrinsics, to ensure 100% vector usage.
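A condensed sketch of such a flops microbenchmark is shown below, assuming the 512-bit fused multiply-add intrinsics exposed through immintrin.h (available both for Xeon Phi and for AVX-512 hardware); the two instruction streams, unroll factor, iteration count, and constants are illustrative choices, and the number of threads per core would be controlled through the OpenMP runtime's thread count and affinity settings.

#include <stdio.h>
#include <immintrin.h>
#include <omp.h>

#define ITERS 1000000L

int main(void) {
    double sink[8] __attribute__((aligned(64)));
    int nthreads = 1;

    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        __m512d a  = _mm512_set1_pd(1.000001);
        __m512d b0 = _mm512_set1_pd(0.5);    /* stream 1 */
        __m512d b1 = _mm512_set1_pd(0.25);   /* stream 2: independent chain, helps hide FMA latency */

        for (long i = 0; i < ITERS; i++) {
            /* b_{i+1} = b_i * a + a, unrolled twice per stream to reduce loop overhead */
            b0 = _mm512_fmadd_pd(b0, a, a);
            b1 = _mm512_fmadd_pd(b1, a, a);
            b0 = _mm512_fmadd_pd(b0, a, a);
            b1 = _mm512_fmadd_pd(b1, a, a);
        }

        #pragma omp critical
        _mm512_store_pd(sink, _mm512_add_pd(b0, b1));   /* keep the results alive */

        #pragma omp single
        nthreads = omp_get_num_threads();
    }
    double t1 = omp_get_wtime();

    /* per thread and iteration: 4 FMA instructions * 8 lanes * 2 flops (multiply + add) */
    double gflops = (double)nthreads * ITERS * 4 * 8 * 2 / (t1 - t0) / 1e9;
    printf("sink[0] = %g, achieved throughput: %.1f GFlops\n", sink[0], gflops);
    return 0;
}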
³ Note that we choose not to measure the latency of memory access instructions because the latency results are highly dependent on the data location(s).