A Guide to Building Beowulf Cluster Computing Systems with Linux

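Beowulf Cluster Computing with Linux, Second Edition is a guide to building Beowulf cluster computing systems with Linux. Its authors are William Gropp, Ewing Lusk, and Thomas Sterling. The book explains in detail how to build a Beowulf-class computer cluster, from the hardware up through the software, and offers insight into how to organize code to take advantage of parallel computing.

This summary covers two parts of the book: Enabling Technologies and Parallel Programming. In Enabling Technologies, Chapter 1, "So You Want to Use a Cluster," introduces the motivation and application scenarios for building a cluster. Chapter 2, "Node Hardware," covers the basic hardware a cluster needs, including how to select, assemble, and tune ordinary PCs and how to weigh hardware compatibility and expandability. Chapter 3, "Linux," discusses Linux as the operating system underlying the cluster, including its installation, configuration, and management in a cluster environment. Chapter 4, "System Area Networks," introduces the low-latency, high-bandwidth network technologies used for cluster communication. Chapter 5, "Configuring and Tuning Cluster Networks," details network configuration, performance tuning, and troubleshooting.

In Parallel Programming, Chapter 7, "An Introduction to Writing Parallel Programs," gives beginners the basic concepts and methods of parallel programming. Chapter 8, "Parallel Programming with MPI," describes the widely used Message Passing Interface (MPI), including its basic principles, API, and programming techniques. Chapter 9, "Advanced Topics in MPI Programming," explores the more complex and advanced features of MPI. Chapter 10, "Parallel Virtual Machine," introduces PVM (Parallel Virtual Machine), another parallel computing framework, which offers fault-tolerant and adaptive capabilities. Chapter 11, "Fault-Tolerant and Adaptive Programs with PVM," explains how to use PVM to build parallel programs that tolerate failures and adapt dynamically to changes in their environment. Finally, Chapter 12, "Numerical and Scientific Software for Clusters," discusses scientific computing applications and tools that run on clusters.

The book suits readers interested in building high-performance computing clusters and is a valuable resource for researchers, engineers, and enthusiasts who work with parallel computing and distributed systems. It teaches the basic skills for building and managing Linux clusters, along with strategies and methods for writing efficient parallel programs.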

Chapter 1: So You Want to Use a Cluster

Overview

William Gropp

What is a "Beowulf Cluster" and what is it good for? Simply put, a Beowulf Cluster is a supercomputer that anyone can build and use. More specifically, a Beowulf Cluster is a parallel computer built from commodity components. This approach takes advantage of the astounding performance now available in commodity personal computers. By many measures, including computational speed, size of main memory, available disk space and bandwidth, a single PC of today is more powerful than the supercomputers of the past. By harnessing the power of tens to thousands of such low-cost but powerful processing elements, you can create a powerful supercomputer. In fact, the number 5 machine on the "Top500" list of the world's most powerful supercomputers is a Beowulf Cluster.

A Beowulf cluster is a form of parallel computer, which is nothing more than a computer that uses more than one processor. There are many different kinds of parallel computer, distinguished by the kinds of processors they use and the way in which those processors exchange data. A Beowulf cluster takes advantage of two commodity components: fast CPUs designed primarily for the personal computer market and networks designed to connect personal computers together (in what is called a local area network or LAN). Because these are commodity components, their cost is relatively low. As we will see later in this chapter, there are some performance consequences, and Beowulf clusters are not suitable for all problems. However, for the many problems for which they do work well, Beowulf clusters provide an effective and low-cost solution for delivering enormous computational power to applications and are now used virtually everywhere. This raises the following question: If Beowulf clusters are so great, why didn't they appear earlier?

Many early efforts used clusters of smaller machines, typically workstations, as building blocks in creating low-cost parallel computers. In addition, many software projects developed the basic software for programming parallel machines. Some of these made their software available for all users, and emphasized portability of the code, making these tools easily portable to new machines. But the project that truly launched clusters was the Beowulf project at the NASA Goddard Space Flight Center. In 1994, Thomas Sterling, Donald Becker, and others took an early version of the Linux operating system, developed Ethernet driver software for Linux, and installed PVM (a software package for programming parallel computers) on sixteen 100MHz Intel 80486-based PCs. This cluster used dual 10-Mbit Ethernet to provide improved bandwidth in communications between processors, but was otherwise very simple, and very low cost.

Why did the Beowulf project succeed? Part of the answer is that it was the right solution at the right time. PCs were beginning to become competent computational platforms (a 100MHz 80486 has a faster clock than the original Cray 1, a machine considered one of the most important early supercomputers). The explosion in the size of the PC market was reducing the cost of the hardware through economies of scale. Equally important, however, was a commitment by the Beowulf project to deliver a working solution, not just a research testbed. The Beowulf project worked hard to "dot the i's and cross the t's," addressing many of the real issues standing in the way of widespread adoption of cluster technology for commodity components. This was a critical contribution; making a cluster solid and reliable often requires solving new and even harder problems; it isn't just hacking. The contribution of the community to this effort, through contributions of software and general help to others building clusters, made Beowulf clustering exciting.

Since the early Beowulf clusters, the use of commodity-off-the-shelf (COTS) components for building clusters has mushroomed. Clusters are found everywhere, from schools to dorm rooms to the largest machine rooms. Large clusters are an increasing percentage of the Top500 list. You can still build your own cluster by buying individual components, but you can also buy a preassembled and tested cluster from many vendors, including both large and well-established computer companies and companies formed just to sell clusters.

This book will give you an understanding of what Beowulfs are, where they can be used (and where they can't), and how they work. To illustrate the issues, specific operations, such as installation of a software package, are described. However, this book is not a cookbook; software and even hardware change too fast for that to be practical. The best use of this book is to read it for understanding; to build a cluster, then go out and find the most up-to-date information


1.2 Why Use a Cluster?

Why use a cluster instead of a single computer? There are really two reasons: performance and fault tolerance. The original reason for the development of Beowulf clusters was to provide cost-effective computing power for scientific applications, that is, to address the needs of applications that required greater performance than was available from single (commodity) processors or affordable multiprocessors. An application may desire more computational power for many reasons, but the following three are the most common:

● Real-time constraints, that is, a requirement that the computation finish within a certain period of time. Weather forecasting is an example. Another is processing data produced by an experiment; the data must be processed (or stored) at least as fast as it is produced.

● Throughput. A scientific or engineering simulation may require many computations. A cluster can provide the resources to process many related simulations. On the other hand, some single simulations require so much computing power that a single processor would require days or even years to complete the calculation. An example of using a Linux Beowulf cluster for throughput is Google [13], which uses over 15,000 commodity PCs with fault-tolerant software to provide a high-performance Web search service.

● Memory. Some of the most challenging applications require huge amounts of data as part of the simulation. A cluster provides an effective way to provide even terabytes ($10^{12}$ bytes) of program memory for an application.

Clusters provide the computational power through the use of parallel programming, a technique for coordinating the use of many processors for a single problem. Part II (Parallel Programming) discusses this approach in detail. What clusters are not good for is accelerating calculations that are neither memory intensive nor processing-power intensive or (in a way that will be made precise below) that require frequent communication between the processors in the cluster.

Another reason for using clusters is to provide fault tolerance, that is, to ensure that computational power is always available. Because clusters are assembled from many copies of the same or similar components, the failure of a single part only reduces the cluster's power. Thus, clusters are particularly good choices for environments that require guarantees of available processing power, such as Web servers and systems used for data collection.

We note that fault tolerance can be interpreted in several ways. For a Web server or data handling, the cluster can be considered up as long as enough processors and network capacity are available to meet the demand. A well-designed cluster can provide a virtual guarantee of availability, short of a disaster such as a fire that strikes the whole cluster. Such a cluster will have virtually 100% uptime. For scientific applications, the interpretation of uptime is often different. For clusters used for such applications, particularly ones used to provide adequate memory, uptime is measured relative to the minimum size of cluster (e.g., number of nodes) that allows the applications to run. In many cases, all or nearly all of the nodes in the cluster must be available to run these applications.
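The difference between the two interpretations of uptime can be made concrete with a small calculation. The sketch below is not from the book; it assumes, purely for illustration, that each node is up independently with the same probability, and the cluster size and per-node availability are hypothetical values.

```python
from math import comb


def prob_at_least_k_up(n: int, k: int, p_up: float) -> float:
    """Probability that at least k of n nodes are up.

    Illustrative assumption: every node is up independently with
    probability p_up (a simple binomial model).
    """
    return sum(
        comb(n, i) * p_up**i * (1.0 - p_up) ** (n - i)
        for i in range(k, n + 1)
    )


if __name__ == "__main__":
    n, p_up = 64, 0.99  # hypothetical cluster size and per-node availability
    # Web-server style: the cluster counts as "up" if at least 56 nodes work.
    print(f"at least 56 of {n} up: {prob_at_least_k_up(n, 56, p_up):.6f}")
    # Scientific style: the application needs every node.
    print(f"all {n} up:            {prob_at_least_k_up(n, n, p_up):.6f}")
```

Under these assumed numbers, the first case is very close to 1, while requiring all 64 nodes gives an availability of only roughly 0.53, which is why the minimum usable cluster size matters so much for scientific workloads.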

Of course, many uses of clusters are a blend of these two approaches. Part III describes tools for sharing a cluster among users and, in many cases, providing support for both performance-oriented and fault-tolerant computing.


1.3 Understanding Application Requirements

In order to know what applications are suitable for cluster computing and what tradeoffs are involved in designing a cluster, one needs to understand the requirements of applications.

1.3.1 Computational Requirements

The most obvious requirement (at least in scientific and technical applications) is the number of floating-point operations needed to perform the calculation. For simple calculations, estimating this number is relatively easy; even in more complex cases, a rough estimate is usually possible. Most communities have a large body of literature on the floating-point requirements of applications, and these results should be consulted first. Most textbooks on numerical analysis will give formulas for the number of floating-point operations required for many common operations. For example, the solution of a system of n linear equations, solved with the most common algorithms, takes $2n^3/3$ floating-point operations. Similar formulas hold for many common problems.
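As a rough worked example, the operation count for solving n linear equations by Gaussian elimination, $2n^3/3$, can be evaluated for two problem sizes. The sizes below are illustrative assumptions, not values from the text.

```latex
\[
\mathrm{flops}(n) = \frac{2n^{3}}{3}, \qquad
\mathrm{flops}(10^{3}) = \frac{2 \times 10^{9}}{3} \approx 6.7 \times 10^{8}, \qquad
\mathrm{flops}(10^{4}) = \frac{2 \times 10^{12}}{3} \approx 6.7 \times 10^{11}.
\]
```

Because the count is cubic in n, a tenfold increase in the number of equations multiplies the work by a factor of 1000.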

You might expect that by comparing the number of floating-point operations with the performance of the processor (in terms of peak operations per second), you can make a good estimate of the time to perform a computation. For example, on a 2 GHz processor, capable of $2 \times 10^9$ floating-point operations per second (2 GFLOPS), a computation that required 1 billion floating-point operations would take only half a second. However, this estimate ignores the large role that the performance of the memory system plays in the performance of the overall system. In many cases, the rate at which data can be delivered to the processor is a better measure of the achievable performance of an application (see [45, 60] for examples).
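The gap between the peak-rate estimate and a memory-limited estimate can be sketched as follows. This is a simplified bound of my own construction, not a method from the book: it takes the larger of the compute time and the data-movement time. The bytes moved per operation and the memory bandwidth are hypothetical values chosen only to illustrate the effect.

```python
def peak_time_seconds(flops: float, peak_flops_per_sec: float) -> float:
    """Run time if the processor could sustain its peak floating-point rate."""
    return flops / peak_flops_per_sec


def memory_time_seconds(bytes_moved: float, bandwidth_bytes_per_sec: float) -> float:
    """Run time needed just to move the data between memory and processor."""
    return bytes_moved / bandwidth_bytes_per_sec


def estimated_time_seconds(
    flops: float,
    peak_flops_per_sec: float,
    bytes_moved: float,
    bandwidth_bytes_per_sec: float,
) -> float:
    """Simple bound: the slower of computation and data movement dominates."""
    return max(
        peak_time_seconds(flops, peak_flops_per_sec),
        memory_time_seconds(bytes_moved, bandwidth_bytes_per_sec),
    )


if __name__ == "__main__":
    flops = 1e9            # 1 billion operations, as in the text
    peak = 2e9             # 2 GFLOPS peak, as in the text
    bytes_moved = 8 * 1e9  # assumption: one 8-byte operand fetched per operation
    bandwidth = 4e9        # assumption: 4 GB/s sustained memory bandwidth
    print("peak-rate estimate:   ", peak_time_seconds(flops, peak), "s")
    print("memory-bound estimate:", estimated_time_seconds(flops, peak, bytes_moved, bandwidth), "s")
```

With these assumed figures the peak-rate estimate is the half second quoted in the text, while the memory-limited estimate is four times longer, which illustrates why data delivery rate is often the better predictor.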

Thus, when considering the computational requirements, it is imperative to know what the expected achievable performance will be. In some cases this may be estimated by using standard benchmarks such as LINPACK [34] and STREAM [71], but it is often best to run a representative sample of the application (or application mix) on a candidate processor. After all, one of the advantages of cluster computing is that the individual components, such as the processor nodes, are relatively inexpensive.

1.3.2 Memory

The memory needs of an application strongly affect both the performance of the application and the cost of the cluster. As described in Section 2.1, the memory on a compute node is divided into several major types. Main memory holds the entire problem and should be chosen to be large enough to contain all of the data needed by an application (distributed, of course, across all the nodes in the cluster). Cache memory is smaller but faster memory that is used to improve the performance of applications. Some applications will benefit more from cache memory than others; in some cases, application performance can be very sensitive to the size of cache memory. Virtual memory is memory that appears to be available to the application but is actually mapped so that some of it can be stored on disk; this greatly enlarges the available memory for an application for low monetary cost (disk space is cheap). Because disks are electromechanical devices, access to memory that is stored on disk is very slow. Hence, some high-performance clusters do not use virtual memory.
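Sizing main memory so that the whole problem fits across the nodes, without falling back on virtual memory, comes down to simple arithmetic. The sketch below assumes the data is divided evenly among the nodes; the `usable_fraction` parameter (the share of RAM left for the application after the operating system and buffers) and the node sizes are hypothetical.

```python
def bytes_per_node(total_bytes: int, nodes: int) -> int:
    """Data each node must hold if the problem is split evenly (rounded up)."""
    return -(-total_bytes // nodes)  # ceiling division on integers


def fits_in_main_memory(
    total_bytes: int,
    nodes: int,
    node_ram_bytes: int,
    usable_fraction: float = 0.8,  # assumption: 80% of RAM is available to the application
) -> bool:
    """True if each node's share fits in the usable part of its main memory."""
    return bytes_per_node(total_bytes, nodes) <= usable_fraction * node_ram_bytes


if __name__ == "__main__":
    problem = 10**12  # one terabyte of application data
    nodes = 256       # hypothetical cluster size
    gib = 2**30
    print("bytes per node:", bytes_per_node(problem, nodes))
    print("fits with 4 GiB nodes:", fits_in_main_memory(problem, nodes, 4 * gib))
    print("fits with 8 GiB nodes:", fits_in_main_memory(problem, nodes, 8 * gib))
```

If the per-node share does not fit, the options are more nodes, more memory per node, or accepting the slow disk-backed accesses described above.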

1.3.3 I/O

Results of computations must be placed into nonvolatile storage, such as a disk file. Parallel computing makes it possible to perform computations very quickly, leading to commensurate demands on the I/O system. Other applications, such as Web servers or data analysis clusters, need to serve up data previously stored on a file system. Section 5.3.4 describes the use of the network file system (NFS) to allow any node in a cluster to access any file. However, NFS provides neither high performance nor correct semantics for concurrent access to the same file (see Section 19.3.2 for details). Fortunately, a number of high-performance parallel file systems exist for Linux; the most

