Parallel systems. Today parallelism is available in all computer systems, and at many different scales, starting with parallelism in the nano-circuits that implement individual instructions and working its way up to parallel systems that occupy large data centers. Since the early 2000s, hardware manufacturers have been placing multiple processing units, often called “cores”, onto a single chip. These cores can be general-purpose processors or more specialized processors, such as those found in Graphics Processing Units (GPUs). Each core can run in parallel with the others. At a larger scale, many such chips can be connected by a network and used together to solve large problems. For example, when you perform a simple search on the Internet, you engage a data center with thousands of computers in some part of the world, likely near your geographic location. Many of these computers (perhaps as many as hundreds, if not thousands) take your query and sift through data to give you an accurate response as quickly as possible.
There are several reasons why such parallel systems, and thus parallelism, have become so important. First, parallelism is simply more powerful than sequential computing, where only one computation can run at a time, because it enables solving more complex problems in a shorter time. For example, an Internet search is not as effective if it cannot be completed at “interactive speeds”, i.e., within several milliseconds. Similarly, a weather-forecast simulation is essentially useless if it cannot be completed in time.
The second reason is efficiency in terms of energy usage. As it turns out, performing a computation twice as fast sequentially requires eight times as much energy; precisely speaking, energy consumption is a cubic function of clock frequency (speed). With parallelism, we don’t need more energy to speed up a computation, at least in principle. For example, to perform a computation in half the time, we can divide it into two parallel sub-computations, perform them in parallel, and combine their results. This can require as little as half the time of the sequential computation while consuming the same amount of energy. In reality, there are some overheads and we will need somewhat more energy, for example, to divide the computation and combine the results. Such overheads are usually small, e.g., a constant fraction of the sequential computation, but can be larger. These two factors, time and energy, have become increasingly important in the last decade, catapulting parallelism to the forefront of computing.
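To make the arithmetic concrete, suppose, as an idealization of the cubic relationship stated above, that the energy needed to run a computation sequentially at clock frequency f is E(f) = c f^3 for some constant c. Running the whole computation at twice the frequency then costs

E(2f) = c (2f)^3 = 8 c f^3 = 8 E(f),

whereas splitting the computation into two halves, each run on its own core at frequency f, costs 2 · (1/2) · c f^3 = E(f): the same energy as the original sequential run, in half the time, ignoring the overheads of dividing the computation and combining the results.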
Example 1.1. Following a long tradition in explaining algorithms, we can draw an analogy between parallel algorithms and cooking. Just as multiple cooks in a kitchen can work at the same time, a parallel algorithm can do many things in parallel for a faster turnaround time. For example, if you want to prepare 3 dishes with a team of cooks, you can do so by asking each cook to prepare one. Doing so will often be faster than using one cook. But there are some overheads; for example, the work has to be divided as evenly as possible. Obviously, you also need more resources, e.g., each cook might need their own kitchen utensils.
Parallel software. The key advantage of using a parallel rather than a sequential algorithm is the ability to perform sophisticated computations quickly enough to make them practical or relevant, without consuming large amounts of energy.