[Advanced] Web Scraper Data Cleaning and Preprocessing Techniques: Data Cleaning and Transformation Using Pandas
Published: 2024-09-15 12:38:20
# Advanced: Web Scraping Data Cleaning and Preprocessing Techniques: Using Pandas for Data Cleaning and Transformation
## 2.1 Introduction to Pandas Data Structures and Operations
### 2.1.1 An Overview of DataFrame and Series
**DataFrame:**
- A two-dimensional, tabular data structure similar to an Excel spreadsheet.
- Composed of rows (index) and columns (columns), with each cell containing a value.
- Can be created using `pd.DataFrame()`.
**Series:**
- A one-dimensional, array-like data structure similar to a Python list.
- Comprised of a sequence of values and a sequence of indices.
- Can be created using `pd.Series()`.
### 2.1.2 Importing and Exporting Data
**Importing Data:**
- From CSV files: `pd.read_csv()`
- From Excel files: `pd.read_excel()`
- From JSON files: `pd.read_json()`
**Exporting Data:**
- To CSV files: `df.to_csv()`
- To Excel files: `df.to_excel()`
- To JSON files: `df.to_json()`
# 2. Pandas Data Cleaning Techniques
### 2.1 Pandas Data Structures and Operations
#### 2.1.1 An Overview of DataFrame and Series
The two core data structures in the Pandas library are the DataFrame and Series. A DataFrame is a two-dimensional tabular structure with rows and columns, akin to a table in SQL. A Series is a one-dimensional array, similar to a list in Python.
**DataFrame**
```python
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({
    "name": ["John", "Mary", "Bob"],
    "age": [20, 25, 30],
    "city": ["New York", "London", "Paris"]
})
# Viewing the DataFrame
print(df)
```
**Output:**
```
   name  age      city
0  John   20  New York
1  Mary   25    London
2   Bob   30     Paris
```
**Series**
```python
# Creating a Series
series = pd.Series([20, 25, 30])
# Viewing the Series
print(series)
```
**Output:**
```
0    20
1    25
2    30
dtype: int64
```
#### 2.1.2 Importing and Exporting Data
Pandas offers various methods for importing and exporting data, including:
**Importing Data**
- **Importing from CSV files:** `pd.read_csv("file.csv")`
- **Importing from Excel files:** `pd.read_excel("file.xlsx")`
- **Importing from JSON files:** `pd.read_json("file.json")`
**Exporting Data**
- **Exporting to CSV files:** `df.to_csv("file.csv")`
- **Exporting to Excel files:** `df.to_excel("file.xlsx")`
- **Exporting to JSON files:** `df.to_json("file.json")`
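
As a minimal sketch of an export and re-import round trip, the example below writes a small DataFrame to CSV text and reads it back. The records are made up for illustration, and an in-memory `io.StringIO` buffer stands in for a file on disk so the snippet runs without touching the filesystem.

```python
import io

import pandas as pd

# Hypothetical scraped records, used only for illustration
df = pd.DataFrame({
    "name": ["John", "Mary", "Bob"],
    "age": [20, 25, 30],
})

# With no path argument, to_csv returns the CSV text as a string.
# index=False keeps the row index out of the output, so no extra
# "Unnamed: 0" column appears when the data is read back.
csv_text = df.to_csv(index=False)

# read_csv accepts a path or any file-like object
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

Passing `index=False` is usually what you want for scraped data, where the default integer index carries no meaning.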
### 2.2 Data Cleaning Methods
#### 2.2.1 Handling Missing Values
Missing values are a common challenge in data cleaning. Pandas provides several methods for dealing with missing values:
- **Deleting rows with missing values:** `df.dropna()`
- **Filling missing values with a specific value:** `df.fillna(value)`
- **Filling missing values with the column mean:** `df.fillna(df.mean(numeric_only=True))` (`numeric_only=True` skips text columns, which have no mean)
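
The three approaches can be combined on a single dataset. The following is a sketch under assumed data: the column names and values are hypothetical, and whether to drop or fill a given column depends on what the scraped field means.

```python
import numpy as np
import pandas as pd

# Hypothetical scraped data with gaps
df = pd.DataFrame({
    "name": ["John", "Mary", None, "Bob"],
    "age": [20, np.nan, 25, 30],
})

# Count missing values per column before deciding how to handle them
missing_counts = df.isna().sum()

# Drop only the rows that lack a name; other gaps are left alone
named = df.dropna(subset=["name"])

# Fill missing ages with the mean of the observed ages
filled = df.assign(age=df["age"].fillna(df["age"].mean()))

print(missing_counts)
print(filled)
```

Filling per column, as done for `age` here, avoids applying one fill value to columns of different types.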
#### 2.2.2 Handling Duplicate Values
Duplicate values are another issue that needs to be addressed during data cleaning. Pandas offers the following methods:
- **Deleting duplicate rows:** `df.drop_duplicates()` (keeps the first occurrence of each duplicated row by default)
- **Keeping the first occurrence explicitly:** `df.drop_duplicates(keep="first")`
- **Keeping the last occurrence:** `df.drop_duplicates(keep="last")`
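
In scraped data, duplicates are often defined by a key column rather than by the whole row. The sketch below assumes a hypothetical `url` column as that key and uses the `subset` parameter to deduplicate on it.

```python
import pandas as pd

# Hypothetical: the same page was scraped twice, so one URL repeats
df = pd.DataFrame({
    "url": ["/a", "/b", "/a", "/c"],
    "price": [10, 20, 12, 30],
})

# duplicated() flags repeats without removing them
dup_mask = df.duplicated(subset=["url"])

# Keep the last row seen for each URL and renumber the index
latest = df.drop_duplicates(subset=["url"], keep="last").reset_index(drop=True)

print(dup_mask.tolist())
print(latest)
```

Inspecting `duplicated()` first is a cheap way to check how many rows a deduplication step would remove.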
#### 2.2.3 Data Type Conversion
Sometimes a column must be converted from one data type to another. Pandas provides the `astype()` method for this:
```python
# Converting the "age" column to floats
df["age"] = df["age"].astype(float)
```
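Scraped values usually arrive as strings, and `astype(float)` raises an error when a string cannot be parsed. As a hedged alternative, the sketch below uses `pd.to_numeric` with `errors="coerce"` and `pd.to_datetime`; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical raw strings as they might come out of an HTML table
raw = pd.DataFrame({
    "price": ["19.99", "5", "N/A"],
    "scraped_at": ["2024-09-01", "2024-09-02", "2024-09-03"],
})

# errors="coerce" turns unparseable entries into NaN instead of raising
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")

# Parse date strings into datetime values
raw["scraped_at"] = pd.to_datetime(raw["scraped_at"])

print(raw.dtypes)
```

The coerced `NaN` values can then be handled with the missing-value methods from section 2.2.1.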
### 2.3 Data Transformation Methods
#### 2.3.1 Data Merging and Joining
Pandas provides `merge()` and `join()` methods to merge and join DataFrames:
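- **Merging:** `df1.merge(df2, on="column_name")` matches rows on a column present in both DataFrames (an inner join by default).
- **Joining:** `df1.join(df2, on="column_name")` matches the `column_name` column of `df1` against the index of `df2` (a left join by default), so `df2` must be indexed by that key.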
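
The difference between the two methods is easiest to see side by side. This is a minimal sketch with made-up `people` and `cities` tables; `how="left"` is passed to `merge()` explicitly so that both calls keep every row of the left table.

```python
import pandas as pd

# Hypothetical tables sharing a "city_id" key
people = pd.DataFrame({"name": ["John", "Mary", "Bob"], "city_id": [1, 2, 3]})
cities = pd.DataFrame({"city_id": [1, 2], "city": ["New York", "London"]})

# merge matches on a shared column; how="left" keeps every row of `people`
merged = people.merge(cities, on="city_id", how="left")

# join matches the caller's column against the other frame's index
joined = people.join(cities.set_index("city_id"), on="city_id")

print(merged)
print(joined)
```

Bob has no matching `city_id`, so his `city` is missing in both results.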
#### 2.3.2 Data Grouping and Aggregation
Pandas provides `groupby()` and `agg()` methods for grouping and aggregating data:
```python
# Grouping by the "city" column and counting the number of people in each city
df.groupby("city").agg({"age": "count"})
```
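When several statistics are needed per group, named aggregation gives each output column an explicit name. The data and the output column names (`people`, `mean_age`) below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical records with repeated cities
df = pd.DataFrame({
    "city": ["Paris", "London", "Paris", "London", "Paris"],
    "age": [30, 25, 40, 35, 20],
})

# Named aggregation: each keyword becomes an output column,
# defined as (source column, aggregation function)
summary = df.groupby("city").agg(
    people=("age", "count"),
    mean_age=("age", "mean"),
)

print(summary)
```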
#### 2.3.3 Data Sorting and Filtering
Pandas provides `sort_values()` and `query()` methods for sorting and filtering data:
```python
# Sorting by the "age" column in descending order
df.sort_values("age", ascending=False)
# Keeping only the rows where age is greater than 25
df.query("age > 25")
```
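Both `sort_values()` and `query()` return new DataFrames rather than modifying `df`, so the result has to be assigned to be kept. The sketch below shows this, along with a Python variable referenced inside `query()` and the equivalent boolean-mask form; the variable name `min_age` is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Mary", "Bob"],
    "age": [20, 25, 30],
    "city": ["New York", "London", "Paris"],
})

# Assign the result; df itself is left unchanged
oldest_first = df.sort_values("age", ascending=False)

# Inside query(), @ refers to a Python variable in the enclosing scope
min_age = 25
selected = df.query("age >= @min_age and city != 'Paris'")

# Equivalent filter written as a boolean mask
selected_mask = df[(df["age"] >= min_age) & (df["city"] != "Paris")]

print(oldest_first)
print(selected)
```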
# 3.1 Feature Engineering
Feature engineering is a crucial step in data preprocessing, which helps extract valuable features from raw data, thereby enhancing the performance of machine learning models. Feature engineering mainly includes the following three aspects:
#### 3.1.1 Feature Selection
Feature selection involves choosing features from the raw data that are highly correlated with the target variable to reduce data dimensionality and improve the model's performance. Common feature selection methods include:
- **Filter methods:** Feature selection based on the statistical information of the features themselves (such as variance, information gain).
- **Wrapper methods:** Integrating the feature selection process with the model training process to choose the features that contribute most to the model's performance.
- **Embedded methods:** Automatically selecting features during the model training process using regularization or other techniques.
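
As a minimal sketch of a filter method, the example below drops features whose variance falls below a threshold, using only pandas. The feature names, values, and threshold are hypothetical; a real pipeline would choose the threshold from the data.

```python
import pandas as pd

# Hypothetical numeric features; "constant" never varies, so it carries no information
features = pd.DataFrame({
    "height": [1.60, 1.75, 1.82, 1.68],
    "constant": [1.0, 1.0, 1.0, 1.0],
    "weight": [55.0, 72.0, 90.0, 63.0],
})

# Threshold chosen arbitrarily for this sketch
threshold = 1e-3

# Variance of each column, then keep only the columns above the threshold
variances = features.var()
selected = features.loc[:, variances > threshold]

print(list(selected.columns))
```

Variance depends on the units of each feature, so a single threshold is only meaningful when the features are on comparable scales.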
#### 3.1.2 Feature Scaling
Feature scaling refers to scaling feature values so that features measured in different units become comparable. Common feature scaling methods include:
- **Standardization:** Subtracting the mean and dividing by the standard deviation to distribute feature values
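
Standardization as described above, subtracting the mean and dividing by the standard deviation, can be sketched directly in pandas. The feature values are hypothetical, and using the population standard deviation (`ddof=0`) is an assumption of this example; pandas defaults to the sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Hypothetical features on very different scales
features = pd.DataFrame({
    "height": [1.60, 1.75, 1.82, 1.68],
    "weight": [55.0, 72.0, 90.0, 63.0],
})

# z = (x - mean) / std, computed column by column.
# ddof=0 selects the population standard deviation (an assumption here).
standardized = (features - features.mean()) / features.std(ddof=0)

print(standardized)
```

After this step each column has a mean of approximately 0 and a standard deviation of approximately 1.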