[Practical Exercise] Data Storage and Analysis: Storing Scraped Data to Hadoop HDFS and Processing Big Data
# 2. Hadoop HDFS Data Storage Practices
### 2.1 HDFS Architecture and Principles
The Hadoop Distributed File System (HDFS) is a distributed file system specifically designed for storing and processing large data sets. It utilizes a master-slave architecture, with a single NameNode managing the file system metadata and multiple DataNodes responsible for storing the actual data.
#### HDFS Architecture
The HDFS architecture includes the following components:
* **NameNode:** Manages the file system metadata, including the names, locations, and permissions of files and directories.
* **DataNode:** Stores the actual data blocks and responds to requests from the NameNode and clients.
* **Client:** Interacts with the NameNode to access the file system and with DataNodes to read and write data.
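To show the client's side of this architecture in code, here is a minimal sketch of obtaining a FileSystem handle that talks to the NameNode. The address `hdfs://namenode-host:9000` and the class name are placeholders for illustration, not values from this article.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsClientConnect {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; use your cluster's fs.defaultFS value.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        // The client contacts the NameNode for metadata; block data flows to and from DataNodes.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
        fs.close();
    }
}
```
With the handle in place, every subsequent read or write goes through the NameNode for metadata and through the DataNodes for the actual block data.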
#### HDFS Principles
HDFS uses the following principles to implement distributed storage and processing:
* **Block Storage:** Files are divided into fixed-size blocks (typically 128 MB) and stored on DataNodes.
* **Data Redundancy:** Each block is replicated on multiple DataNodes to enhance data reliability.
* **Fault Tolerance:** If a DataNode fails, the NameNode automatically replicates the affected data blocks to other DataNodes.
* **Load Balancing:** The NameNode is responsible for evenly distributing data blocks across DataNodes to optimize performance.
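As a hedged sketch of how block size and replication show up in practice, the example below sets both when creating a file and then asks the NameNode where the file's blocks are stored. The property values, the path `/user/hadoop/sample.txt`, and the class name are assumptions for illustration.
```java
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.blocksize", "134217728"); // 128 MB blocks
        conf.set("dfs.replication", "3");       // three replicas per block
        FileSystem fs = FileSystem.get(conf);

        // Example path; adjust to your cluster.
        Path path = new Path("/user/hadoop/sample.txt");
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("sample data");
        }

        // Ask the NameNode which DataNodes hold each block of the file.
        FileStatus status = fs.getFileStatus(path);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println(Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}
```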
### 2.2 HDFS Data Writing and Reading
#### Data Writing
To write a file, the client first asks the NameNode, which allocates blocks and returns the DataNode locations for each one. The client then writes the data blocks directly to those DataNodes and confirms completion with the NameNode, which updates its metadata to reflect the new file.
#### Data Reading
To read a file, the client asks the NameNode, which returns the locations of the file's blocks. The client then reads the blocks directly from the DataNodes and assembles them into the complete file.
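Tying this back to the article's use case of storing scraped data, the sketch below copies a locally scraped file into HDFS and reads the first line back. Both paths and the class name are placeholders.
```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StoreScrapedData {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Placeholder paths: a scraped file on local disk and its target location in HDFS.
        Path localFile = new Path("/tmp/scraped_pages.json");
        Path hdfsFile = new Path("/user/hadoop/scraped/scraped_pages.json");

        // Write path: the NameNode allocates blocks, the client streams them to DataNodes.
        fs.copyFromLocalFile(localFile, hdfsFile);

        // Read path: fetch block locations from the NameNode, then stream from DataNodes.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(hdfsFile)))) {
            System.out.println(reader.readLine());
        }
        fs.close();
    }
}
```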
### 2.3 HDFS Data Management and Maintenance
#### Data Management
HDFS provides the following data management functions:
* **File and Directory Management:** Creating, deleting, renaming, and moving files and directories.
* **Access Control:** Setting file and directory access permissions.
* **Quota Management:** Limiting the amount of data a user or group can store.
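The sketch below exercises the file, directory, and permission operations through the Java API (quota management is typically handled with the `hdfs dfsadmin` command-line tool rather than this API). The paths and class name are examples only.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsManagement {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Create a directory (example path) and restrict access to the owner (mode 700).
        Path dir = new Path("/user/hadoop/scraped");
        fs.mkdirs(dir);
        fs.setPermission(dir, new FsPermission((short) 0700));

        // Rename the directory, then delete it (second argument: recursive delete).
        fs.rename(dir, new Path("/user/hadoop/scraped_archive"));
        fs.delete(new Path("/user/hadoop/scraped_archive"), true);

        fs.close();
    }
}
```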
#### Data Maintenance
HDFS provides the following data maintenance functions:
* **Data Block Reporting:** DataNodes periodically report their stored data blocks to the NameNode.
* **Block Replication:** The NameNode monitors the number of block replicas and replicates blocks as needed to maintain redundancy.
* **Block Reclamation:** When data blocks are no longer needed, the NameNode deletes them from DataNodes.
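For instance, the replication factor of an existing file can be inspected and changed through the API; the NameNode then re-replicates or removes block replicas in the background. The path below is a placeholder.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/hadoop/test.txt"); // example path

        // Current replication factor as recorded by the NameNode.
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication: " + current);

        // Request a new replication factor; the NameNode adjusts replicas asynchronously.
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}
```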
#### Code Example
The following code example demonstrates how to write to and read from HDFS using the HDFS API:
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileSystem;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Load the Hadoop configuration (core-site.xml, hdfs-site.xml on the classpath)
        Configuration conf = new Configuration();
        // Obtain a handle to the configured file system
        FileSystem fs = FileSystem.get(conf);

        // Write data
        Path path = new Path("/user/hadoop/test.txt");
        FSDataOutputStream out = fs.create(path);
        out.writeUTF("Hello, HDFS!");
        out.close();

        // Read data
        FSDataInputStream in = fs.open(path);
        String data = in.readUTF();
        in.close();

        // Output the data that was read back
        System.out.println(data);

        // Release the file system handle
        fs.close();
    }
}
```
#### Logical Analysis
This code example demonstrates how to use the HDFS API to write to and read from the file system.
* **Writing Data:**
  * Create a Configuration object.
  * Obtain a FileSystem object from the configuration.
  * Create a Path object specifying the file path to write to.
  * Call create() on the FileSystem to obtain an FSDataOutputStream for the path.
  * Use the writeUTF() method to write data.
  * Close the FSDataOutputStream.
* **Reading Data:**
  * Call open() on the FileSystem to obtain an FSDataInputStream for the file.
  * Use the readUTF() method to read the data back.
  * Close the FSDataInputStream.
# 3.1 Hadoop MapReduce Programming Model
**Introduction**
Hadoop MapReduce is a distributed programming model for processing large data sets. It breaks down data processing tasks into two stages: Map and Reduce. The Map stage maps data to intermediate key-value pairs, while the Reduce stage aggregates intermediate values with the same key.
**MapReduce Workflow**
The MapReduce workflow is as follows:
1. **Input Data:** The input data set is split into independent chunks (input splits), each handled by one map task.
2. **Map:** Each map task processes its split and emits intermediate key-value pairs.
3. **Shuffle and Sort:** The intermediate pairs are grouped by key and routed to the reduce tasks.
4. **Reduce:** Each reduce task aggregates the values that share the same key.
5. **Output:** The final results are written back to HDFS.
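To make the model concrete, the classic word-count job below is a minimal sketch (not code from this article): the Mapper emits intermediate (word, 1) pairs, and the Reducer sums the counts for each distinct word.
```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map stage: split each input line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce stage: sum the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```
A driver class would wire these two classes into a Job, set the HDFS input and output paths, and submit the job to the cluster.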