Parallel systems. Over the past several decades, computer scientists have developed many different forms of parallel systems. On the one hand, hardware manufacturers have developed multicore chips that contain multiple processors on the same chip, allowing the processors to communicate quickly and efficiently. Today such chips contain anywhere from a couple of processors to a couple of dozen. We refer to the form of parallelism that arises in such multicore chips as small-scale parallelism, as it involves a small number of processors packed into a relatively small area. While multicore chips were initially used only in laptops, desktops, and servers, they are now used in almost all smaller mobile devices such as phones (many mobile phones today have 4- or 8-core chips).
There has also been much interest in developing domain-specific hardware for particular classes of applications. For example, Graphics Processing Units (GPUs) can pack as many as 1000 small cores (processors) onto a single chip.
On the other hand, we have developed larger-scale parallel systems by connecting many computers, sometimes thousands of them, via a network. For example, when you perform a simple search on the Internet, you engage a data center with thousands of computers in some part of the world, likely near your geographic location. Many of these computers (perhaps hundreds, if not thousands) take your query and sift through data to give you an accurate response as quickly as possible. We refer to this form of parallelism as large-scale parallelism, because it involves a large number of computers spread over a relatively large area such as a building or a data center.
There are several reasons why such parallel hardware, and thus parallelism, has become so important. First, parallelism is simply more powerful than sequential computing, where only one computation can be run at a time, because it enables solving more complex problems in less time. For example, an Internet search is not as effective if it cannot be completed at “interactive speeds”, i.e., within several milliseconds. Similarly, a weather-forecast simulation is essentially useless if it cannot be completed in time.
The second reason is efficiency in terms of energy usage. As it turns out, performing a computation twice as fast sequentially requires eight times as much energy; precisely speaking, energy consumption is a cubic function of clock frequency (speed). With parallelism, we do not need more energy to speed up a computation, at least in principle. For example, to perform a computation in half the time, we can divide it into two parallel sub-computations, perform them in parallel, and combine their results. This can require as little as half the time of the sequential computation while consuming the same amount of energy. In reality, there are some overheads and we will need more energy, for example, to divide the computation and combine the results. Such overheads are usually small, e.g., a constant fraction over the sequential computation, but they can be larger. These two factors, time and energy, have become increasingly important in the last decade, catapulting parallelism to the forefront of computing.
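To make the energy argument concrete, here is a short back-of-the-envelope calculation under the assumptions stated above: energy is a cubic function of clock frequency, and, at a fixed frequency, energy scales linearly with the amount of work performed (overheads of dividing and combining are ignored). Write the energy of the sequential computation at frequency $f$ as $E(f) = c f^3$ for some constant $c$. Doubling the frequency to halve the running time costs
\[
E(2f) = c\,(2f)^3 = 8\,c f^3 = 8\,E(f),
\]
that is, eight times the energy. If instead the computation is split into two equal halves that run in parallel, each on its own processor at the original frequency $f$, then each half does half the work in half the time and consumes $\tfrac{1}{2} E(f)$, for a total of
\[
2 \cdot \tfrac{1}{2}\,E(f) = E(f),
\]
the same energy as the sequential computation, completed (ideally) in half the time.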