1. Single-processor Computing
• The size of a floating point number. If the arithmetic unit of a CPU is designed to multiply 8-byte
numbers efficiently (‘double precision’; see section 3.2.2) then numbers half that size (‘single
precision’) can sometimes be processed at higher efficiency, and for larger numbers (‘quadruple
precision’) some complicated scheme is needed. For instance, a quad precision number could
be emulated by two double precision numbers with a fixed difference between the exponents.
These measurements are not necessarily identical. For instance, the original Pentium processor had 64-bit data buses, but was a 32-bit processor. On the other hand, the Motorola 68000 processor (of the original Apple Macintosh) had a 32-bit CPU, but 16-bit data buses.
The first Intel microprocessor, the 4004, was a 4-bit processor in the sense that it processed 4-bit chunks. These days, 64-bit processors are becoming the norm.
1.2.3 Caches: on-chip memory
The bulk of computer memory is in chips that are separate from the processor. However, there is usually a
small amount (typically a few megabytes) of on-chip memory, called the cache. This will be explained in
detail in section 1.3.4.
1.2.4 Graphics, controllers, special purpose hardware
One difference between ‘consumer’ and ‘server’ type processors is that the consumer chips devote considerable real estate on the processor chip to graphics. Processors for cell phones and tablets can even have dedicated circuitry for security or mp3 playback. Other parts of the processor are dedicated to communicating with memory or the I/O subsystem. We will not discuss those aspects in this book.
1.2.5 Superscalar processing and instruction-level parallelism
In the von Neumann model processors operate through control flow: instructions follow each other linearly
or with branches without regard for what data they involve. As processors became more powerful and
capable of executing more than one instruction at a time, it became necessary to switch to the data flow
model. Such superscalar processors analyze several instructions to find data dependencies, and execute
instructions in parallel that do not depend on each other.
This concept is also known as Instruction Level Parallelism (ILP), and it is facilitated by various mechanisms:
• multiple-issue: instructions that are independent can be started at the same time;
• pipelining: as already mentioned, arithmetic units can deal with multiple operations in various stages of completion;
• branch prediction and speculative execution: a processor can ‘guess’ whether a conditional instruction will evaluate to true, and execute those instructions accordingly;
• out-of-order execution: instructions can be rearranged if they are not dependent on each other,
and if the resulting execution will be more efficient;
• prefetching: data can be speculatively requested before any instruction needing it is actually
encountered (this is discussed further in section 1.3.5).
20 Introduction to High Performance Scientific Computing