Multicore Processors and Systems: Technology Trends and Innovations

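“Multicore Processors and Systems” provides a comprehensive overview of multicore processors and systems. It covers the trends shaping multicore technology, innovations in multicore architecture, innovations in multicore software, and case studies of state-of-the-art commercial multicore systems. The book emphasizes the challenges faced by multicore systems that scale to hundreds of cores. Written by contributors from both academia and industry, it is the first book to focus on multicore processors and systems, and in particular on their distinctive technological implications, architectures, and implementations.

As semiconductor technology has advanced, single-core processors have become unable to meet growing computing demands. Multicore processors integrate two or more processing cores on one chip, which enables parallel processing and significantly improves system performance. The “technology trends” part of the book probably discusses the slowing of Moore’s Law and multicore as a response to that challenge.

The “multicore architecture innovations” chapters may cover different types of multiprocessor architecture, such as SMP (symmetric multiprocessing), MPP (massively parallel processing), and AMP (asymmetric multiprocessing). Each has its own strengths and weaknesses and suits different application needs. For example, SMP suits shared-memory applications, MPP suits distributed-memory environments, and AMP suits systems whose cores are assigned different roles.

The “multicore software innovations” part may explore how to optimize software to exploit multiple cores, including parallel programming models (such as OpenMP and MPI), task scheduling strategies, thread management, and memory management. It may also discuss programming challenges such as data consistency, synchronization, and communication latency.

The “infrastructure” chapters may describe the memory systems and on-chip interconnects of multicore processors in depth. The memory subsystem is critical to multicore performance because it determines the speed and efficiency of data access. On-chip network design (NoC, Network-on-Chip) is the key to efficient communication among cores.

The “case studies” part provides practical examples of multicore implementations in different domains, such as general-purpose computing, servers, media/broadband, network processing, and signal processing. These case studies may analyze the design decisions, performance optimizations, and measured results of multicore architectures in specific applications.

“Multicore Processors and Systems” is a valuable resource for understanding multicore processor technology. It covers the theory and also uses examples to show how multicore technology is applied in practice and what challenges it faces. For IT professionals who want to study multicore computing in depth, the book offers a comprehensive learning framework.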

Chapter 1

Tiled Multicore Processors

Michael B. Taylor, Walter Lee, Jason E. Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul R. Johnson, Jason S. Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matthew I. Frank, Saman Amarasinghe, and Anant Agarwal

Abstract For the last few decades Moore’s Law has continually provided exponential growth in the number of transistors on a single chip. This chapter describes a class of architectures, called tiled multicore architectures, that are designed to exploit massive quantities of on-chip resources in an efficient, scalable manner. Tiled multicore architectures combine each processor core with a switch to create a modular element called a tile. Tiles are replicated on a chip as needed to create multicores with any number of tiles. The Raw processor, a pioneering example of a tiled multicore processor, is examined in detail to explain the philosophy, design, and strengths of such architectures. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance. Central to achieving this goal is Raw’s ability to exploit all forms of parallelism, including ILP, DLP, TLP, and Stream parallelism. Raw approaches this challenge by implementing plenty of on-chip resources – including logic, wires, and pins – in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Compared to a traditional superscalar processor, Raw performs within a factor of 2x for sequential applications with a very low degree of ILP, about 2x–9x better for higher levels of ILP, and 10x–100x better when highly parallel applications are coded in a stream language or optimized by hand.

J.E. Miller

MIT CSAIL, 32 Vassar St, Cambridge, MA 02139, USA

e-mail: jasonm@alum.mit.edu

Based on “Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams”, by M.B. Taylor, W. Lee, J.E. Miller, et al. which appeared in The 31st

S.W. Keckler et al. (eds.), Multicore Processors and Systems, Integrated Circuits and Systems, DOI 10.1007/978-1-4419-0263-4_1, Springer Science+Business Media, LLC 2009

1.1 Introduction

For the last few decades, Moore’s Law has continually provided exponential growth in the number of transistors on a single chip. The challenge for computer architects is to find a way to use these additional transistors to increase application performance. This chapter describes a class of architectures, called tiled multicore architectures, that are designed to exploit massive quantities of on-chip resources in an efficient and scalable manner. Tiled multicore architectures combine each processor core with a switch to create a modular element called a tile. Tiles are replicated on a chip as needed to create multicores with a few or large numbers of tiles.

The Raw processor, a pioneering example of a tiled multicore processor, will be examined in detail to explain the philosophy, design, and strengths of such architectures. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance. Central to achieving this goal is Raw’s ability to exploit all forms of parallelism, including instruction-level parallelism (ILP), data-level parallelism (DLP), task or thread-level parallelism (TLP), and stream parallelism.

1.1.1 The End of Monolithic Processors

Over the past few decades, general-purpose processor designs have evolved by attempting to automatically find and exploit increasing amounts of parallelism in sequential programs: first came pipelined single-issue processors, then in-order superscalars, and finally out-of-order superscalars. Each generation has employed larger and more complex circuits (e.g., highly ported register files, huge bypass networks, reorder buffers, complex cache hierarchies, and load/store queues) to extract additional parallelism from a simple single-threaded program.

As clock frequencies have increased, wire delay within these large centralized structures has begun to limit scalability [9, 28, 2]. With a higher clock frequency, the fraction of a chip that a signal can reach in a single clock period becomes smaller. This makes it very difficult and costly to scale centralized structures to the sizes needed by large monolithic processors. As an example, the Itanium II processor [32] spends over half of its critical path in the bypass paths of the ALUs (which form a large centralized interconnect). Techniques like resource partitioning and super-pipelining attempt to hide the realities of wire delay from the programmer, but create other inefficiencies that result in diminishing performance returns.
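To make the wire-delay argument concrete, the following minimal Python sketch estimates what fraction of a die edge a signal can cross in one clock period. The die edge length and the per-millimeter wire delay are illustrative assumptions, not figures from this chapter, and `reachable_fraction` is a hypothetical helper introduced only for this example.

```python
def reachable_fraction(clock_ghz: float,
                       die_edge_mm: float = 20.0,            # assumed die edge length
                       wire_delay_ps_per_mm: float = 100.0   # assumed wire delay
                       ) -> float:
    """Fraction of the die edge a signal can traverse in one clock period."""
    period_ps = 1000.0 / clock_ghz                 # 1 GHz corresponds to a 1000 ps period
    reach_mm = period_ps / wire_delay_ps_per_mm    # distance covered in one period
    return min(1.0, reach_mm / die_edge_mm)


if __name__ == "__main__":
    for ghz in (0.5, 1.0, 2.0, 4.0):
        print(f"{ghz:>4} GHz -> {reachable_fraction(ghz):.3f} of the die edge")
```

Under these assumed numbers, each doubling of the clock frequency halves the reachable distance, which is why a structure that must span the whole die becomes harder to fit into a single cycle.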

Besides the performance implications, many of these large centralized structures are power-hungry and very costly to design and verify. Structures such as bypass networks and multiported register files grow with the square or cube of the issue-width in monolithic superscalars. Because power relates to both area and frequency in CMOS VLSI design, the power consumed by these complex monolithic processors is approaching VLSI limits. Increased complexity also results in increased design and verification costs. For large superscalar processors, verification can account for as much as 70% of total development cost [10].
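The quadratic or cubic growth mentioned above can be compared with replicating small cores. The sketch below is a rough illustration in arbitrary cost units, assuming a structure whose cost is exactly `issue_width ** exponent`; the function names are hypothetical, and the tiled estimate deliberately ignores the cost of the on-chip network.

```python
def monolithic_structure_cost(issue_width: int, exponent: int = 2) -> int:
    """Cost of one centralized structure (e.g., a bypass network) that
    grows with the square (exponent=2) or cube (exponent=3) of issue width."""
    return issue_width ** exponent


def tiled_structure_cost(num_tiles: int, tile_issue_width: int = 1,
                         exponent: int = 2) -> int:
    """Total cost when the same structure is kept small and replicated per tile.
    Interconnect cost is not modeled (a simplifying assumption)."""
    return num_tiles * tile_issue_width ** exponent


if __name__ == "__main__":
    print("units  square  cube  tiled")
    for n in (1, 2, 4, 8, 16):
        print(n, monolithic_structure_cost(n, 2),
              monolithic_structure_cost(n, 3), tiled_structure_cost(n))
```

With the same number of functional units, the replicated design grows linearly while the centralized structure grows polynomially, which is the scaling gap the paragraph describes.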

Due to these limited performance improvements, skyrocketing energy consumption, and run-away design costs, it has become clear that large monolithic processor architectures will not scale into the billion-transistor era [35, 2, 16, 36, 44].

1.1.2 Tiled Multicore Architectures

Tiled multicore architectures avoid the inefficiencies of large monolithic sequential processors and provide unlimited scalability as Moore’s law provides additional transistors per chip. Like all multicore processors, tiled multicores (such as Raw, TRIPS [37], Tilera’s TILE64 [50, 7], and Intel’s Tera-Scale Research Processor [6]) contain several small computational cores rather than a single large one. Since simpler cores are more area- and energy-efficient than larger ones, more functional units can be supported within a single chip’s area and power budget [1]. Specifically, in the absence of other bottlenecks, multicores increase throughput in proportion to the number of cores for parallel workloads without the need to increase clock frequency. The multicore approach is power efficient because increasing the clock frequency requires operating at proportionally higher voltages in optimized processor designs, which can increase power by the cube of the increase in frequency. However, the key concept in tiled multicores is the way in which the processing cores are interconnected. In a tiled multicore, each core is combined with a communication network router, as shown in Fig. 1.1, to form an independent modular “tile.” By replicating tiles across the area of a chip and connecting neighboring routers together, a complete on-chip communication network is created.

Fig. 1.1 A tiled multicore processor is composed of an array of tiles. Each tile contains an independent compute core and a communication switch to connect it to its neighbors. Because cores are only connected through the routers, this design can easily be scaled by adding additional tiles. (Diagram labels: Tile, Compute Core, Pipeline, I-Cache, D-Cache, Switch.)
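The cubic power claim above follows from the standard first-order model of CMOS dynamic power. The symbols below (activity factor $\alpha$, switched capacitance $C$, supply voltage $V$, clock frequency $f$, scaling factor $k$) are not defined in the chapter and are introduced here as assumptions for a short worked derivation:

```latex
% First-order CMOS dynamic power model (assumed, not stated in the chapter)
P_{\mathrm{dyn}} \approx \alpha\, C\, V^{2} f

% If the supply voltage must rise in proportion to frequency, V \propto f:
P_{\mathrm{dyn}} \propto f^{2} \cdot f = f^{3}

% Scaling frequency by k versus scaling the core count by k
% (parallel workload, no other bottlenecks):
\text{frequency} \times k:\quad \text{throughput} \times k,\quad \text{power} \times k^{3}
\qquad
\text{cores} \times k:\quad \text{throughput} \times k,\quad \text{power} \times k
```

Under this model, doubling throughput by doubling frequency costs roughly eight times the power, whereas doubling the number of cores costs roughly twice the power.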

The use of a general network router in each tile distinguishes tiled multicores from other mainstream multicores such as Intel’s Core processors, Sun’s Niagara [21], and the Cell Broadband Engine [18]. Most of these multicores have distributed processing elements but still connect cores together using non-scalable centralized structures such as bus interconnects, crossbars, and shared caches. The Cell processor uses ring networks that are physically scalable but can suffer from significant performance degradation due to congestion as the number of cores increases. Although these designs are adequate for small numbers of cores, they will not scale to the thousand-core chips we will see within the next decade.

Tiled multicores distribute both computation and communication structures, providing advantages in efficiency, scalability, design costs, and versatility. As mentioned previously, smaller simpler cores are faster and more efficient due to the scaling properties of certain internal processor structures. In addition, they provide fast, cheap access to local resources (such as caches) and incur extra cost only when additional distant resources are required. Centralized designs, on the other hand, force every access to incur the costs of using a single large, distant resource. This is true to a lesser extent even for other multicore designs with centralized interconnects. Every access that leaves a core must use the single large interconnect. In a tiled multicore, an external access is routed through the on-chip network and uses only the network segments between the source and destination.
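The locality property described above can be sketched with a simple routing model. The example assumes a 2D mesh with dimension-ordered (X first, then Y) routing, which is a common choice for mesh networks but is an assumption here, not something this section specifies; `xy_route` is a hypothetical helper.

```python
from typing import List, Tuple

Tile = Tuple[int, int]  # (x, y) coordinates of a tile in the mesh


def xy_route(src: Tile, dst: Tile) -> List[Tile]:
    """Tiles visited from src to dst, moving along X first and then along Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def hop_count(src: Tile, dst: Tile) -> int:
    """Number of network segments the message occupies."""
    return len(xy_route(src, dst)) - 1


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    print(hop_count((3, 3), (3, 4)))   # neighboring tiles
    print(hop_count((0, 0), (7, 7)))   # opposite corners of an 8x8 mesh
```

A message between neighboring tiles occupies a single segment, while only messages between distant tiles pay for a long path. In a design with one centralized interconnect, every message would use the same shared resource regardless of distance.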

Tiled multicore architectures are specifically designed to scale easily as improvements in process technology provide more transistors on each chip. Because tiled multicores use distributed communication structures as well as distributed computation, processors of any size can be built by simply laying down additional tiles. Moving to a new process generation does not require any redesign or re-verification of the tile design. Besides future scalability, this property has enormous advantages for design costs today. To design a huge billion-transistor chip, one only needs to design, layout, and verify a small, relatively simple tile and then replicate it as needed to fill the die area. Multicores with centralized interconnect allow much of the core design to be re-used, but still require some customized layout for each core. In addition, the interconnect may need to be completely redesigned to add additional cores.
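The idea of building a processor by replicating one tile and wiring neighbors together can be sketched as follows. This is a minimal model, assuming a rectangular mesh in which each tile links only to its north, south, east, and west neighbors; `build_mesh` and `link_count` are hypothetical names used for illustration.

```python
from typing import Dict, List, Tuple

Tile = Tuple[int, int]  # (row, column)


def build_mesh(rows: int, cols: int) -> Dict[Tile, List[Tile]]:
    """Replicate one tile rows*cols times and connect each to its mesh neighbors."""
    links: Dict[Tile, List[Tile]] = {}
    for r in range(rows):
        for c in range(cols):
            neighbors = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    neighbors.append((nr, nc))
            links[(r, c)] = neighbors
    return links


def link_count(mesh: Dict[Tile, List[Tile]]) -> int:
    """Number of bidirectional neighbor links in the mesh."""
    return sum(len(n) for n in mesh.values()) // 2


if __name__ == "__main__":
    for size in (4, 8, 16):
        mesh = build_mesh(size, size)
        max_ports = max(len(n) for n in mesh.values())
        print(size * size, "tiles,", link_count(mesh), "links,",
              max_ports, "ports per tile at most")
```

The number of ports on any tile stays at four no matter how large the array grows, so in this model the tile itself does not change when more tiles are added; only the count of replicated tiles does.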

As we will see in Sect. 1.5, tiled multicores are also much more versatile than traditional general-purpose processors. This versatility stems from the fact that, much like FPGAs, tiled multicores provide large quantities of general processing resources and allow the application to decide how best to use them. This is in contrast to large monolithic processors where the majority of die area is consumed by special-purpose structures that may not be needed by all applications. If an application does need a complex function, it can dedicate some of the resources to emulating it in software. Thus, tiled multicores are, in a sense, more general than general-purpose processors. They can provide competitive performance on single-threaded ILP (instruction-level parallelism) applications as well as applications that are traditionally the domain of DSPs, FPGAs, and ASICs. As demonstrated in Sect. 1.4, they do so by supporting multiple models of computation such as ILP, streaming, TLP (task or thread-level parallelism), and DLP (data-level parallelism).

1.1.3 Raw: A Prototype Tiled Multicore Processor

The Raw processor is a prototype tiled multicore. Developed in the Computer Architecture Group at MIT from 1997 to 2002, it is one of the first multicore processors. The design was initially motivated by the increasing importance of managing wire delay and the desire to expand the domain of “general-purpose” processors into the realm of applications traditionally implemented in ASICs. To obtain some intuition on how to approach this challenge, we conducted an early study [5, 48] on the factors responsible for the significantly better performance of application-specific VLSI chips. We identified four main factors: specialization; exploitation of parallel resources (gates, wires, and pins); management of wires and wire delay; and management of pins.

1. Specialization: ASICs specialize each “operation” at the gate level. In both the VLSI circuit and microprocessor context, an operation roughly corresponds to the unit of work that can be done in one cycle. A VLSI circuit forms operations by combinational logic paths, or “operators,” between flip-flops. A microprocessor, on the other hand, has an instruction set that defines the operations that can be performed. Specialized operators, for example, for implementing an incompatible floating-point operation, or implementing a linear feedback shift register, can yield an order of magnitude performance improvement over an extant general-purpose processor that may require many instructions to perform the same one-cycle operation as the VLSI hardware.

2. Exploitation of Parallel Resources: ASICs further exploit plentiful silicon area to implement enough operators and communications channels to sustain a tremendous number of parallel operations in each clock cycle. Applications that merit direct digital VLSI circuit implementations typically exhibit massive, operation-level parallelism. While an aggressive VLIW implementation like Intel’s Itanium II [32] executes six instructions per cycle, graphics accelerators may perform hundreds or thousands of word-level operations per cycle. Because they operate on very small word operands, logic emulation circuits such as Xilinx Virtex-II Pro FPGAs can perform hundreds of thousands of operations each cycle. Clearly the presence of many physical execution units is a minimum prerequisite to the exploitation of the same massive parallelism that ASICs are able to exploit.
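The linear feedback shift register mentioned under Specialization illustrates the instruction-count gap. The sketch below implements one step of a 16-bit Fibonacci LFSR in software; the register width, tap positions (16, 14, 13, 11), and seed are a commonly used textbook configuration chosen here as an assumption, not taken from this chapter.

```python
def lfsr16_step(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one step.

    In software this takes about ten word-level operations (shifts, XORs,
    a mask, and an OR). In hardware the same step is a few XOR gates
    between flip-flops and completes in a single cycle.
    """
    feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (feedback << 15)


def lfsr16_period(seed: int = 0xACE1) -> int:
    """Count the steps until the register returns to its seed value."""
    state = lfsr16_step(seed)
    steps = 1
    while state != seed:
        state = lfsr16_step(state)
        steps += 1
    return steps


if __name__ == "__main__":
    print(hex(lfsr16_step(0xACE1)))
    print(lfsr16_period())
```

Each call to `lfsr16_step` stands in for what a specialized VLSI operator would finish in one clock cycle, which is the kind of gap the list item attributes to gate-level specialization.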

However, the term “multicore” was not coined until more recently when commercial processors with multiple processing cores began to appear in the marketplace.
