Achievements and Challenges of Europe's ELSA Large-Scale Integration Project: Exploring a 64x64 Array Processor


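This article examines Wafer Scale Integration (WSI), and in particular the European project ELSA (European Large SIMD Array). Funded by the European Community as part of Esprit project 824, the project set out to build a 64x64 array processor on a single 4-inch wafer, with a computing capability of more than 10 billion operations per second. The aim was to push the limits of computing performance and to demonstrate the technology's potential for industrial use.

The paper's introduction briefly reviews the project background and presents WSI as a key technology for future semiconductor manufacturing, intended to raise chip density and performance through large-scale integration. ELSA was a large and complex effort that brought together eight companies, subcontractors, and research institutions from four European countries. The paper describes the main technologies attempted during development, including lithography, packaging, and interconnect. The successful parts involved precise microelectronic processes such as multiple exposure and complex through-silicon via (TSV) structures, which allow data to move efficiently between levels and strengthen the chip's parallel processing capability. The paper also reports the difficulties the project encountered, such as yield, temperature control, and heat dissipation, all of which must be overcome in large-scale integration.

The design centers on the 64x64 array processor, in which the processing elements operate in parallel under global commands (SIMD), with local conditional execution in each element. To validate the concept, the researchers also produced a test chip and a demonstration system intended to show this level of computing performance in practice. Despite the technical difficulties, these results demonstrate the potential of WSI for improving chip performance.

Beyond the technical details of ELSA, the paper shows the challenges WSI may face in practical use and how they can be addressed. This is useful for understanding innovation trends in the semiconductor industry and for assessing future directions in chip design and manufacturing. The paper also suggests that these results may influence other packaging and large-area device projects and help move the electronics industry toward higher integration and performance.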

628 IEEE TRANSACTIONS ON COMPONENTS, HYBRIDS, AND MANUFACTURING TECHNOLOGY, VOL. 16, NO. […], NOVEMBER 1993

[…] and the destination of data in either RAM A or RAM B. The read-modify-write cycle is limited to single sources and destinations in RAM A and RAM B, and the RAM address may not be changed during an operation. If a different destination address is required, or if source addresses in the same RAM are required, additional clock cycles are necessary. However, this is not a significant performance limitation since it is rare in practice to require this type of operation.

Multiplexers: The multiplexers control the selection of input data to the adder/subtractor, RAM's, and […] vertical communications register (CM) and a no-op signal (FG). Command bit combinations which are out of range will result in no input selection, and the latches following the multiplexer will then retain the previous data.

Latches: The latches store the output data from five of the multiplexers: three of them (NS, EW, C) hold inputs to the adder/subtractor, one (CM) forms part of a communication […]. […] latches are full clock cycle delay latches. However, an output is fed from the NS, EW, and C latches to the ALU in the first half of the clock cycle so that the ALU output is available to be written to the RAM's during the second half of the cycle. The output from the FLG latch is used to produce a conditional flag signal FG […] not FLG, where […] a global command input. When FLG is low, all the latches, except CM, are inhibited from receiving new data and retain their previous state. This has the effect of halting operation of the PE so that local conditionals can be executed. The FLG […] or by a simple memory fetch. By initializing the memories of the PE's in groups, isolated areas of the array can execute different programs. This feature can be used to form arbitrary arrays of processors under software control.
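The local-conditional behavior described in the Latches paragraph can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not the ELSA design: the `PE` class, its `step` method, and the data values are hypothetical, and only the behavior stated in the text is modeled (when FLG is low, every latch except CM holds its previous value). How FLG itself is reloaded is not modeled, because that part of the text is incomplete.

```python
from dataclasses import dataclass


@dataclass
class PE:
    """Hypothetical, simplified model of one processing element's latches."""
    ns: int = 0
    ew: int = 0
    c: int = 0
    cm: int = 0
    flg: int = 1  # 1 = PE active, 0 = PE halted

    def step(self, ns: int, ew: int, c: int, cm: int) -> None:
        """Clock new multiplexer outputs into the latches for one cycle."""
        # CM is never inhibited, so communication continues through halted PEs.
        self.cm = cm
        if self.flg:
            # Active PE: NS, EW, and C latch the new data.
            self.ns, self.ew, self.c = ns, ew, c
        # Halted PE (FLG low): NS, EW, and C retain their previous state.


# Every PE sees the same broadcast inputs; only the active PEs change state.
pes = [PE(flg=1), PE(flg=0), PE(flg=1)]
for pe in pes:
    pe.step(ns=1, ew=1, c=0, cm=1)

states = [(pe.ns, pe.ew, pe.c, pe.cm) for pe in pes]
print(states)
```

In this model the middle PE keeps its old NS, EW, and C values while its CM latch still updates, which is the sense in which a halted PE remains transparent to communication.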

Adder/Subtractor: The adder/subtractor has inputs A, B, CIN, and outputs SM, CY, where CIN and CY are the carry/borrow signals. The adder/subtractor can perform the usual arithmetic and logical functions by appropriate setting of the inputs and can be programmed to perform (A-B-CIN) and (B-A-CIN).
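As an illustration of the interface just described, here is a minimal Python sketch of a one-bit adder/subtractor with inputs A, B, CIN and outputs SM, CY. The function names, the `mode` argument, and the assumption that multi-bit words are processed one bit per cycle with CY fed back into CIN are ours; the text gives neither the gate-level design nor the word width.

```python
def add_sub(a: int, b: int, cin: int, mode: str = "add") -> tuple[int, int]:
    """Return (SM, CY) for one-bit operands.

    mode "add":  A + B + CIN   (CY is the carry out)
    mode "a-b":  A - B - CIN   (CY is the borrow out)
    mode "b-a":  B - A - CIN   (CY is the borrow out)
    """
    if mode == "b-a":
        a, b = b, a
        mode = "a-b"
    sm = a ^ b ^ cin  # the sum/difference bit is the same XOR in every mode
    if mode == "add":
        cy = (a & b) | (cin & (a ^ b))
    elif mode == "a-b":
        cy = ((1 - a) & b) | (cin & (1 - (a ^ b)))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sm, cy


def serial_op(x: int, y: int, width: int, mode: str) -> tuple[int, int]:
    """Apply add_sub to two width-bit words, least significant bit first."""
    result, cy = 0, 0
    for i in range(width):
        sm, cy = add_sub((x >> i) & 1, (y >> i) & 1, cy, mode)
        result |= sm << i
    return result, cy


print(serial_op(5, 3, 4, "add"))  # 5 + 3 on 4-bit words
print(serial_op(5, 3, 4, "a-b"))  # 5 - 3
print(serial_op(5, 3, 4, "b-a"))  # 3 - 5, modulo 16 with a final borrow
```

The returned pair is the word-level result and the final carry or borrow, so a nonzero second element on a subtraction indicates that the true result is negative.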

The Global Output (GO): The global output signal is a wired-OR output that is pulled low by any PE for which FG is high and […] is low.

The Array Structure: Each PE has an input and an output line on each edge, which is connected to the north, south, east, and west neighbors (Fig. 3). The four input lines are connected to the input multiplexers of the registers EW and NS. The north and south outputs and the east and west outputs are derived from the NS and EW registers, respectively. These lines allow nearest neighbor communication between PE's during computation. Additional communication facilities are supplied by the CM registers, which allow global data movement and distribution "on the fly" during computation.
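Nearest neighbor communication of this kind can be sketched as a whole-array shift, in which every PE loads the value held by one of its four neighbors. The Python below is a minimal illustration under stated assumptions: the `shift` function and its `edge` parameter are hypothetical, row 0 is taken to be the north edge, and PEs on the incoming edge simply receive a constant, whereas on the real wafer edge data arrive through the chip and reticle buffers.

```python
def shift(grid: list[list[int]], direction: str, edge: int = 0) -> list[list[int]]:
    """Return a new grid in which every PE holds its neighbor's old value.

    direction names the way the data moves: "north", "south", "east", "west".
    PEs on the incoming edge receive `edge` (an assumed constant edge input).
    """
    rows, cols = len(grid), len(grid[0])
    # Offset of the source PE relative to the destination PE.
    dr, dc = {"south": (-1, 0), "north": (1, 0),
              "east": (0, -1), "west": (0, 1)}[direction]
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            sr, sc = r + dr, c + dc
            inside = 0 <= sr < rows and 0 <= sc < cols
            row.append(grid[sr][sc] if inside else edge)
        out.append(row)
    return out


ew = [[1, 2, 3],
      [4, 5, 6]]
print(shift(ew, "east"))   # each PE takes the value of its west neighbor
print(shift(ew, "south"))  # each PE takes the value of its north neighbor
```

Because every PE performs the same move in the same cycle, one command shifts the entire data plane by one position, which is the usual way a SIMD array brings operands from neighboring PEs together.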

Chip Level Architecture: The basic ELSA chip consists of a […] by 12 array of PE's, used to produce a final 6 by 12 array. In addition there are decoders for the RAM addresses and command signals and a clock generator. Around the edges of the chip there are bidirectional buffers for the array edge data and unidirectional buffers for the communication bus.

Fig. […]: […] communication.

To reduce the number of connectors around the chip edge, the two signals in each of the horizontal and vertical directions are combined into bidirectional buses. This combining is performed using bidirectional buffers whose mode is determined by decoding the current command during each cycle.

Reticle Architecture: The top level building block for the wafer is the reticle, which comprises four ELSA chips together with reconfiguration switches, command bus buffers, and pad drivers. In addition, alignment marks and process test structures are also included, as in conventional IC design.

Two blocks of fourteen bidirectional I/O pad drivers are provided on each reticle edge, each block being connected to the center switch block on the chip edge. Twelve of the drivers are used for the data and two for the switch control signals. During normal operation, the drivers on the east and west edges are controlled by decoding the current command. On the north edge, six of the drivers are outputs (for the communication signals) and six are controlled by command decoding. On the south edge, six communication signals are inputs and six are command controlled.

The final 6 by 12 array size of the chip results in twelve bidirectional signals from the east and west edges and six bidirectional and six unidirectional signals from the north and south edges. Thus a 12-b wide bidirectional bus is necessary to interconnect the chips. The data from each chip edge feeds into a 12-b reconfiguration switch. Four more switches are located at the corners of the chip, one at each corner, forming a box of eight switches surrounding each chip. Each of these switch blocks is interconnected to its immediate neighbors horizontally and vertically (Fig. 4). Each switch has four edges, giving four possible switch configurations which are selected by a 2-b code. The code can be set by either laser fuses or software control.
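The signal counts in the preceding paragraphs follow from the array dimensions, and the arithmetic can be checked with a few lines of Python. This is only a bookkeeping sketch: the variable names are ours, the orientation (12 rows, 6 columns) is inferred from the edge counts rather than stated in the text, and linking the two switch control drivers to the 2-b configuration code is an assumption.

```python
ROWS, COLS = 12, 6  # assumed orientation of the final 6 by 12 array

# East/west: each row's input and output lines are merged into one
# bidirectional line at the chip edge.
east_west = ROWS

# North/south: each column has one merged bidirectional NS line plus one
# unidirectional communication (CM) line.
ns_bidirectional = COLS
ns_unidirectional = COLS
north_south = ns_bidirectional + ns_unidirectional

SWITCH_CODE_BITS = 2
configurations = 2 ** SWITCH_CODE_BITS

# Each pad driver block carries the 12 data signals plus the switch control
# signals (assumed here to be the 2-b configuration code).
pad_drivers_per_block = east_west + SWITCH_CODE_BITS

print(east_west, north_south, configurations, pad_drivers_per_block)
```

Under these assumptions every chip edge presents twelve signals, which is consistent with a single 12-b switch design serving all four edges.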

In order to reduce the loading on the global command signals and RAM addresses, these signals are buffered and retimed at each reticle before distribution to the chips within the reticle. This results in a one clock pipeline delay between application of the control word to the wafer and execution of a command.
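The effect of retiming at the reticle can be sketched as a single register stage between the wafer input and the chips. The Python below is a minimal model under stated assumptions: the `run` function and the command names are hypothetical, and the reticle buffer is reduced to one register clocked once per cycle.

```python
def run(commands: list[str]) -> list[tuple[int, str]]:
    """Return (cycle, command) pairs showing when each command executes.

    Models a single retiming register between the wafer input and the chips:
    a control word applied in cycle t is executed in cycle t + 1.
    """
    executed = []
    reticle_reg = None  # retiming register, empty before the first clock
    # One extra cycle with no new input drains the last command.
    for cycle, applied in enumerate(commands + [None]):
        if reticle_reg is not None:
            executed.append((cycle, reticle_reg))
        reticle_reg = applied  # clock edge: latch the word applied this cycle
    return executed


schedule = run(["LOAD", "ADD", "STORE"])
print(schedule)
```

The delay costs one cycle of latency at the start of a command sequence but does not reduce throughput, since a new control word can still be applied on every clock.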
