Line-by-line analysis of the code in this document
Date: 2024-12-20 16:17:52 Views: 5
A line-by-line analysis of the code in 《数据处理.txt》 ("Data Processing"):
### Importing the required libraries
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import signal
from scipy import stats
from pathlib import Path
from functools import reduce
from collections import OrderedDict
import re
from toolz.itertoolz import partial
```
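- **Imports the usual data-processing and scientific-computing libraries**: `pandas`, `numpy`, `matplotlib`, `seaborn`, and `scipy` (the `signal` and `stats` submodules).
- `pathlib.Path` handles file-system paths.
- `functools.reduce` folds a sequence into a single value by applying a two-argument function cumulatively.
- `collections.OrderedDict` is a dictionary that preserves insertion order.
- `re` provides regular-expression operations.
- `partial` is used for partial function application. It is imported here from `toolz.itertoolz`, where it appears to be simply the standard `functools.partial` that the module imports for its own use, so `from functools import partial` would be the more direct import.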
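As a quick illustration of the two functional helpers imported above, here is a minimal sketch using only the standard library. The `scale` function is a hypothetical example and does not come from the analysed script.

```python
from functools import partial, reduce

# reduce applies a two-argument function cumulatively, left to right.
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4])

# partial pre-fills some arguments and returns a new callable.
def scale(values, factor):
    """Hypothetical helper: multiply every value by `factor`."""
    return [v * factor for v in values]

double = partial(scale, factor=2)
doubled = double([1, 2, 3])

print(total)    # 10
print(doubled)  # [2, 4, 6]
```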
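### Setting the plotting style
```python
sns.set(style='ticks')
plt.rcParams['figure.figsize'] = (8, 6)
sns.set_palette("Paired")
```
- `sns.set(style='ticks')` **applies Seaborn's `ticks` style** (a white background with tick marks on the axes) together with Seaborn's other default theme settings.
- `plt.rcParams['figure.figsize'] = (8, 6)` **sets Matplotlib's default figure size** to 8 × 6 inches. `rcParams` is reached through `plt` because only `matplotlib.pyplot` was imported above.
- `sns.set_palette("Paired")` **sets Seaborn's default colour palette**.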
### Importing the machine-learning libraries
```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    mean_absolute_percentage_error,
    median_absolute_error,
    r2_score,
    explained_variance_score
)
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import cross_validate, RepeatedKFold
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler
```
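- **Imports a range of Scikit-Learn modules**: train/test splitting, a random-forest regressor, regression metrics, the estimator and transformer base classes, pipelines, cross-validation, grid and randomized hyper-parameter search, and min-max scaling.
### Comment: Scikit-Learn documentation references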
```python
"""
See https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html
and https://github.com/scikit-learn/blob/main/sklearn/ensemble/_stacking.py
for example of scikit-learn style of documentation.
Interesting to see the option "hide/show prompts and output" in
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html
"""
```
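- **Links to the Scikit-Learn documentation**, specifically the `StackingClassifier` reference page and its source file, as examples of the Scikit-Learn documentation style. The GitHub URL as written appears to be missing the repository segment (`scikit-learn/scikit-learn`), so it is unlikely to resolve as given.
- Points out the "hide/show prompts and output" option on that documentation page.
### Comment: key signal-processing and feature-extraction functions, and the to-do list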
```python
"""
For hyper-parameter search, some candidates are:
- `method` in `aggregate_spectra()` (e.g., 'mean')
- `smoother` in `convolve_spectrum()` (e.g., signal.windows.gaussian(51, std=7))
- `get_peaks()` has `base_level` and `max_no_peaks`
Key functions are (see _the_whole_pp_pipeline_example()):
- read_spectra_dataset()
- get_freq_bands_cut_points()
- extract_features_from_spectrum()
TO DO:
- [x] make `extract_features_from_spectrum` a key method that generalises and possible uses get_freq_vel_per_band
- [x] could have switches for groups of features to extract
- [ ] I'll initially have separate functions for extracting the groups of features from a spectrum
- [x] Hopefully, all that the Pumpflow Feature Extraction Transformer does with `.transform` is to apply `extract_features_from_spectrum` to each row in `X`
- [x] For feature engineering, look also at shape of distribution in each band and extract moments; computing the integral of the curve (whole and within each band)
- [ ] I would love to be able to label the frequency bands in the plot (tiny font, no-frills implementation would do)
- [ ] More flexibility in hypp search
"""
```
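- **Lists candidates for hyper-parameter search**: the `method` argument of `aggregate_spectra()`, the `smoother` argument of `convolve_spectrum()`, and the `base_level` and `max_no_peaks` arguments of `get_peaks()`.
- **Names the key functions**: `read_spectra_dataset()` reads the spectra dataset, `get_freq_bands_cut_points()` returns the cut points of the frequency bands, and `extract_features_from_spectrum()` extracts features from a spectrum.
- **Keeps a to-do list.** Items marked done: making `extract_features_from_spectrum` the general entry point, adding switches for groups of features, having the transformer's `.transform` apply it to each row of `X`, and adding distribution-shape moments and curve integrals per band. Items still open: separate functions per feature group, labelling the frequency bands in the plot, and a more flexible hyper-parameter search.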
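To make the hyper-parameter candidates above concrete, the following sketch enumerates every combination of a small search space. The parameter names `method` and `max_no_peaks` come from the docstring; `window_size` and all candidate values are assumptions chosen purely for illustration.

```python
from itertools import product

# Hypothetical candidate values; the source only names the parameters.
search_space = {
    "method": ["mean", "max"],
    "window_size": [25, 51],
    "max_no_peaks": [3, 5],
}

def iter_candidates(space):
    """Yield one dict per combination of candidate values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))

candidates = list(iter_candidates(search_space))
print(len(candidates))  # 8
print(candidates[0])    # {'method': 'mean', 'window_size': 25, 'max_no_peaks': 3}
```

Scikit-Learn's `ParameterGrid` performs a similar expansion, and `GridSearchCV` (imported earlier) builds on the same idea once the preprocessing steps are wrapped in a pipeline.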
### Defining the function that reads velocity-spectrum data
```python
def read_vel_spectrum(p):
    """
    Returns a Series for the velocity spectrum data specified by `p`.
    p is a path-like object (here, a PosixPath relative to the current directory is the default one).
    Example:
        local_base_dir = Path('../shared-dropbox/Test Data/')
        p = local_base_dir / 'Oil/Oil Run 1 - 0-25m3 - 17.05.22/Accelerometer Data - 17.05.22/10.5 m3hr/VXP Machine Spectrum -l-600 rpm - Vel/Spectrum Velocity 1.csv'
        df = read_vel_spectrum(p)
        >>> df.head()
        freq
        0.00    0.007059
        0.25    0.018643
        0.50    0.007059
        0.75    0.003258
        1.00    0.001267
        Name: vel, dtype: float64
    """
    df = pd.read_csv(p, skiprows=6, index_col=False)
    df.columns = ['freq', 'vel']
    return df.set_index('freq').squeeze()
```
- **`read_vel_spectrum`** reads the CSV file at the given path and returns a Series whose index is frequency (`freq`) and whose values are velocity (`vel`).
- `skiprows=6` **skips the first six lines of the file**. The next line is read as the header row, and the two columns are then renamed to `freq` and `vel`.
- `set_index('freq')` **makes frequency the index**, and `.squeeze()` collapses the remaining single-column DataFrame into a Series.
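The same three steps can be tried on an in-memory file. The sketch below assumes a layout of six metadata lines followed by a header row and two numeric columns; the real VXP export format is not shown in the source, so this layout is hypothetical.

```python
import io

import pandas as pd

# Hypothetical file content: six metadata lines, then a header row and data rows.
raw = "\n".join(
    [f"metadata line {i}" for i in range(1, 7)]
    + ["Frequency,Velocity", "0.00,0.007059", "0.25,0.018643", "0.50,0.007059"]
)

df = pd.read_csv(io.StringIO(raw), skiprows=6, index_col=False)
df.columns = ["freq", "vel"]                 # overwrite whatever header the file had
spectrum = df.set_index("freq").squeeze()    # one column left, so this becomes a Series

print(type(spectrum).__name__)  # Series
print(spectrum.loc[0.25])       # 0.018643
```

Note that `.squeeze()` returns a scalar rather than a Series if the file contains only a single data row.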
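### Defining the function that extracts the flow rate
```python
def extract_flow_rate(p):
    """
    p is a path-like object (here, a PosixPath relative to the current directory is the default one).
    Returns a float (converted from the substring (e.g., '10.5'))
    Example:
        p = local_base_dir / 'Oil/Oil Run 1 - 0-25m3 - 17.05.22/Accelerometer Data - 17.05.22/10.5 m3hr/VXP Machine Spectrum -l-600 rpm - Vel/Spectrum Velocity 1.csv'
        >>> extract_flow_rate(p)
        10.5
    """
    return float(re.findall(r'([0-9\.]+?) m3hr', str(p))[0])
```
- **`extract_flow_rate`** pulls the flow rate out of the path string.
- **The regular expression** `r'([0-9\.]+?) m3hr'` captures the run of digits and dots immediately before ` m3hr` (for example `10.5`), and the first match is converted to a float. If the path contains no such segment, `re.findall` returns an empty list and the `[0]` lookup raises `IndexError`.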
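The pattern can be checked in isolation with the standard library. The path below is the one from the docstring; the second, non-matching file name is a made-up example.

```python
import re

path = ("Oil/Oil Run 1 - 0-25m3 - 17.05.22/Accelerometer Data - 17.05.22/"
        "10.5 m3hr/VXP Machine Spectrum -l-600 rpm - Vel/Spectrum Velocity 1.csv")

pattern = r"([0-9\.]+?) m3hr"
matches = re.findall(pattern, path)
flow_rate = float(matches[0])

print(matches)    # ['10.5']
print(flow_rate)  # 10.5

# Without the ' m3hr' marker there is nothing to index, so [0] would raise IndexError.
no_match = re.findall(pattern, "no-flow-rate-here.csv")
print(no_match)   # []
```

The dates (`17.05.22`) and the `0-25m3` range in the path are not picked up because none of them is followed by the literal ` m3hr`.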
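### Defining the function that reads all velocity spectra
```python
def read_all_vel_spectra(p):
    """
    p is where all flow rates subdirectories are placed (see preamble)
    (e.g., `../shared-dropbox/Test Data/Oil/Oil Run 1 - 0-25m3/Accelerometer Data - 17.05.22/`)
    returns -> dict(target: str, df: DataFrame)
    Example:
        local_base_dir = Path('../shared-dropbox/Test Data/')
        local_exp_base_dir = local_base_dir / 'Oil/Oil Run 1 - 0-25m3 - 17.05.22/Accelerometer Data - 17.05.22'
        dfs = read_all_vel_spectra(local_exp_base_dir)
        >>> dfs[5.0].head()
        freq
        0.00    0.006878
        0.25    0.019187
        0.50    0.007602
        0.75    0.002896
        1.00    0.001810
        Name: vel, dtype: float64
    """
    paths_all_spectrum_vel_files = list(p.glob('**/*Spectrum*Vel*.csv'))
    dfs = OrderedDict([(extract_flow_rate(p), read_vel_spectrum(p)) for p in paths_all_spectrum_vel_files])
    return dfs
```
- **`read_all_vel_spectra`** reads every velocity-spectrum file under the given directory and returns an ordered dictionary keyed by flow rate (a float). Each value is the Series returned by `read_vel_spectrum`, not a DataFrame as the docstring's return annotation suggests.
- `p.glob('**/*Spectrum*Vel*.csv')` **searches recursively** for all matching file paths.
- **The list comprehension walks each path**, extracting the flow rate and reading the spectrum. Because the flow rate is the dictionary key, only the last file read is kept when several files share a flow rate.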
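The overwrite behaviour for repeated flow rates can be seen without touching the file system. In this sketch the paths and the helper name `flow_rate_of` are hypothetical, and file names stand in for the spectra. Whether the real dataset actually contains several files per flow rate is not stated in the source.

```python
import re
from collections import OrderedDict, defaultdict
from pathlib import PurePosixPath

def flow_rate_of(p):
    """Same regular expression as extract_flow_rate()."""
    return float(re.findall(r"([0-9\.]+?) m3hr", str(p))[0])

# Hypothetical paths: two files recorded at the same flow rate.
paths = [
    PurePosixPath("5.0 m3hr/Spectrum Velocity 1.csv"),
    PurePosixPath("5.0 m3hr/Spectrum Velocity 2.csv"),
    PurePosixPath("10.5 m3hr/Spectrum Velocity 1.csv"),
]

# Same construction as read_all_vel_spectra(): a later duplicate key overwrites the earlier one.
by_rate = OrderedDict((flow_rate_of(p), p.name) for p in paths)
print(by_rate[5.0])   # Spectrum Velocity 2.csv

# One possible alternative: keep every file by grouping into lists.
grouped = defaultdict(list)
for p in paths:
    grouped[flow_rate_of(p)].append(p.name)
print(len(grouped[5.0]))  # 2
```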
### Defining the function that combines the spectra
```python
def combine_spectra(dfs):
    """
    concat_spectra has been deprecated in favour of `combine_spectra()` for flow rate samples as rows (easier to sample for machine learning purposes).
    `dfs` is an output from read_all_vel_spectra()
    returns a DataFrame with the combined spectra.
    Makes the assumption that they share the exact same structure; data is merged based on Series index.
    Example:
        local_base_dir = Path('../shared-dropbox/Test Data/')
        local_exp_base_dir = local_base_dir / 'Oil/Oil Run 1 - 0-25m3 - 17.05.22/Accelerometer Data - 17.05.22'
        dfs = read_all_vel_spectra(local_exp_base_dir)
        cmb_spectra = combine_spectra(dfs)
        >>> cmb_spectra.iloc[:5, :5]
        freq      0.00      0.25      0.50      0.75      1.00
        0.0   0.007059  0.019368  0.007602  0.003439  0.002172
        0.5   0.006697  0.019730  0.009050  0.005611  0.006335
        1.0   0.006878  0.019549  0.007964  0.003258  0.001810
        1.5   0.007240  0.019368  0.007421  0.002896  0.001629
        2.0   0.005792  0.018462  0.007421  0.002896  0.000543
    """
    cmb_spectra_w = pd.concat(dfs.values(), axis='columns')
    cmb_spectra_w.columns = dfs.keys()
    cmb_spectra_w = cmb_spectra_w.reindex(columns=cmb_spectra_w.columns.sort_values())
    cmb_spectra_w.index.name = 'freq'
    cmb_spectra_w.columns.name = 'flow_rate'
    cmb_spectra = cmb_spectra_w.T
    return cmb_spectra
```
- **`combine_spectra`** merges the individual spectra into a single DataFrame.
- **It assumes all spectra share the same structure**: `pd.concat(..., axis='columns')` aligns the Series on their frequency index, one column per flow rate.
- **The columns are sorted by flow rate** and the frame is then transposed, so that each row is a flow rate and each column a frequency.
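A toy run of the same steps shows the resulting shape and ordering. The two spectra and their values are invented for illustration, and the dictionary views are converted to lists here only to keep the sketch explicit.

```python
from collections import OrderedDict

import pandas as pd

freq = pd.Index([0.0, 0.25, 0.5], name="freq")
# Toy spectra keyed by flow rate, deliberately inserted out of order.
dfs = OrderedDict([
    (1.0, pd.Series([0.3, 0.2, 0.1], index=freq, name="vel")),
    (0.5, pd.Series([0.6, 0.5, 0.4], index=freq, name="vel")),
])

wide = pd.concat(list(dfs.values()), axis="columns")     # one column per flow rate
wide.columns = list(dfs.keys())
wide = wide.reindex(columns=wide.columns.sort_values())  # sort columns by flow rate
wide.index.name = "freq"
wide.columns.name = "flow_rate"
combined = wide.T                                        # rows: flow rate, columns: frequency

print(combined.shape)        # (2, 3)
print(list(combined.index))  # [0.5, 1.0]
```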
### Defining the function that reads the spectra dataset
```python
def read_spectra_dataset(p):
    """
    From `p`, the path-like object specifying the base directory for the recorded experiments, returns a flow_rate-freq velocity DataFrame.
    Example:
        local_base_dir = Path('../shared-dropbox/Test Data/')
        p = local_base_dir / 'Oil/Oil Run 1 - 0-25m3 - 17.05.22/Accelerometer Data - 17.05.22'
        df = read_spectra_dataset(p)
        df.iloc[:3, :3]
    """
    dfs = read_all_vel_spectra(p)
    return combine_spectra(dfs)
```
- **`read_spectra_dataset`** is the entry point for loading data: it calls `read_all_vel_spectra` on the given directory and passes the result to `combine_spectra`, returning one flow-rate-by-frequency DataFrame.
### Defining the function that converts the combined spectra to long format
```python
def melt_combined_spectra(df):
    """
    Working with a long format can be sometimes more convenient than a tabulated one.
    `combine_spectra` will produce something typically in the shape (n, m), where `n` is number of flow rates experimented with and `m` is the number of frequencies in the spectrum.
    That is, a flow_rate x frequency matrix with velocities as values.
    Example:
        >>> melt_combined_spectra(cmb_spectra.iloc[:3,:3])
        freq   vel       flow_rate
        0.00   0.007059  0.0
        0.00   0.006697  0.5
        0.00   0.006878  1.0
        0.25   0.019368  0.0
        0.25   0.019730  0.5
        0.25   0.019549  1.0
        0.50   0.007602  0.0
        0.50   0.009050  0.5
        0.50   0.007964  1.0
    """
    return (df
            .rename_axis('index', axis=0)
            .reset_index()
            .rename(columns={'index': 'flow_rate'})
            .melt(id_vars='flow_rate')
            .rename(columns={'value': 'vel'})
            .set_index('flow_rate')
            )
```
- **`melt_combined_spectra`** converts the wide flow-rate-by-frequency matrix into long format, with one row per (flow rate, frequency) pair. The result is indexed by `flow_rate` and has the columns `freq` and `vel`, which is more convenient for group-by operations.
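The chain can be followed on a 2 × 2 toy matrix. The values below are invented; the method chain is the one from the function.

```python
import pandas as pd

# Toy wide matrix: rows are flow rates, columns are frequencies.
combined = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4]],
    index=pd.Index([0.0, 0.5], name="flow_rate"),
    columns=pd.Index([0.0, 0.25], name="freq"),
)

long_df = (combined
           .rename_axis("index", axis=0)            # give the row index a known name
           .reset_index()                           # turn it into a column
           .rename(columns={"index": "flow_rate"})
           .melt(id_vars="flow_rate")               # variable column is named after columns.name
           .rename(columns={"value": "vel"})
           .set_index("flow_rate"))

print(long_df)
```

The melted variable column is called `freq` only because the wide frame's column axis carries that name. Without it, pandas would fall back to `variable`.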
### Defining the function that aggregates the spectra
```python
def aggregate_spectra(cmb_spectra, method='mean'):
    """
    Aggregate spectrum (for all flow rates) by frequency.
    cmb_spectra: output from combine_spectra()
    method: anything that group-by's `agg` can accept as `func`: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html
    Example:
        >>> cmb_spectra.iloc[:3, :3]
        freq      0.00      0.25      0.50
        0.0   0.007059  0.019368  0.007602
        0.5   0.006697  0.019730  0.009050
        1.0   0.006878  0.019549  0.007964
        >>> aggregate_spectra(cmb_spectra.iloc[:3, :3])
                   vel
        freq
        0.00  0.006878
        0.25  0.019549
        0.50  0.008206
    """
    cmb_spectra_melt = melt_combined_spectra(cmb_spectra)
    agg_spectrum = (cmb_spectra_melt
                    .reset_index()
                    .groupby('freq')
                    .agg({'vel': method})
                    .squeeze()
                    )
    return agg_spectrum
```
- **`aggregate_spectra`** collapses the spectra of all flow rates into a single spectrum by grouping the long-format data on `freq` and aggregating `vel`.
- **`method` can be anything `agg` accepts**: a name such as `'mean'`, `'sum'`, or `'max'`, or a callable. The result is a Series indexed by frequency.
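The effect of different `method` values can be compared on a small long-format frame. The data and the helper name `aggregate` are assumptions for illustration; the group-by chain mirrors the function's.

```python
from functools import partial

import numpy as np
import pandas as pd

# Toy long-format data: three flow rates at each of two frequencies.
long_df = pd.DataFrame({
    "flow_rate": [0.0, 0.5, 1.0, 0.0, 0.5, 1.0],
    "freq":      [0.0, 0.0, 0.0, 0.25, 0.25, 0.25],
    "vel":       [1.0, 2.0, 3.0, 10.0, 20.0, 60.0],
})

def aggregate(long_df, method="mean"):
    """Hypothetical stand-in for the group-by step of aggregate_spectra()."""
    return long_df.groupby("freq").agg({"vel": method}).squeeze()

mean_spectrum = aggregate(long_df)
max_spectrum = aggregate(long_df, method="max")
median_spectrum = aggregate(long_df, method=partial(np.quantile, q=0.5))

print(mean_spectrum.tolist())    # [2.0, 30.0]
print(max_spectrum.tolist())     # [3.0, 60.0]
print(median_spectrum.tolist())  # [2.0, 20.0]
```

A quantile-based aggregate is less sensitive to a single outlying flow rate than the mean, as the second frequency shows.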
### Defining the function that plots a spectrum
```python
def plot_spectrum(spectrum, ax=None, style_kws=None, xlabel='Frequency (Hz)', ylabel='Power (mm/s)'):
    """
    A convenience method for plotting a spectrum.
    The latter is expected to be a Series with frequency as index and velocity as value.
    TO DO:
    - [ ] add style_kws for the signal's line
    Example:
        fig, axs = plt.subplots(2, 2, constrained_layout=True)
        titles = [ 'avg', 'sum', 'max', 'top_decile']
        my_plot_funcs = [
            partial(plot_spectrum, aggregate_spectra(cmb_spectra)),
            partial(plot_spectrum, aggregate_spectra(cmb_spectra, method='sum')),
            partial(plot_spectrum, aggregate_spectra(cmb_spectra, method='max')),
            partial(plot_spectrum, aggregate_spectra(cmb_spectra, method=partial(np.quantile, q=0.9)))
        ]
        for ax, func, title in zip(axs.ravel(), my_plot_funcs, titles):
            func(ax=ax)
            ax.set_title(title)
    """
    if ax is None:
        _, ax = plt.subplots()
    style = dict(color='C1')
    if isinstance(style_kws, dict):
        style = { **style, **style_kws }
    ax.plot(spectrum.index, spectrum, **style)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    return ax
```
- **`plot_spectrum`** draws a spectrum (a Series with frequency as index) on the given axes, creating a new figure when `ax` is `None`, and returns the axes.
- **The line style and axis labels are customisable**: entries in `style_kws` are merged over the default `color='C1'`, and `xlabel` and `ylabel` can be overridden.
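The style handling relies on dictionary unpacking, where later keys win. The sketch below isolates that step in a hypothetical helper, `resolve_style`, which does not exist in the analysed script.

```python
def resolve_style(style_kws=None):
    """Hypothetical helper reproducing the style merge inside plot_spectrum()."""
    style = dict(color="C1")                # default line colour
    if isinstance(style_kws, dict):
        style = {**style, **style_kws}      # caller's entries override the defaults
    return style

print(resolve_style())                    # {'color': 'C1'}
print(resolve_style({"linewidth": 0.8}))  # {'color': 'C1', 'linewidth': 0.8}
print(resolve_style({"color": "k"}))      # {'color': 'k'}
```

Anything that is not a `dict` (including `None`) is silently ignored, so the default style is used in that case.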
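### Defining the default Hann and Gaussian smoothers
```python
DEFAULT_WINDOW_SIZE = 50
DEFAULT_STD = 7

def get_default_hann_smoother():
    return signal.windows.hann(DEFAULT_WINDOW_SIZE * 2 + 1)

def get_default_gaussian_smoother():
    return signal.windows.gaussian(DEFAULT_WINDOW_SIZE, DEFAULT_STD)
```
- **Defines two default smoothing windows** for the spectrum: a Hann window of `2 * 50 + 1 = 101` samples and a Gaussian window of 50 samples with a standard deviation of 7.
- The two defaults differ in length, and the Gaussian default (50 samples) also differs from the `signal.windows.gaussian(51, std=7)` example given in the earlier hyper-parameter note.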
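How such a window smooths a signal can be sketched with a normalised convolution. This is an assumption about typical usage, not the author's `convolve_spectrum`, whose body is cut off in the source. The helper `smooth` and the synthetic noisy signal are hypothetical.

```python
import numpy as np
from scipy import signal

hann = signal.windows.hann(50 * 2 + 1)   # 101 samples
gauss = signal.windows.gaussian(50, 7)   # 50 samples, std = 7

def smooth(values, window):
    """Weighted moving average: normalise the window so its weights sum to 1."""
    kernel = window / window.sum()
    return np.convolve(values, kernel, mode="same")

# Synthetic signal: a constant level plus random noise.
rng = np.random.default_rng(0)
noisy = 1.0 + 0.1 * rng.standard_normal(400)
smoothed = smooth(noisy, hann)

print(len(hann), len(gauss))                  # 101 50
print(smoothed.shape == noisy.shape)          # True
print(smoothed[60:-60].std() < noisy.std())   # True
```

With `mode="same"` the output keeps the input length, but values within about half a window of either end are biased towards zero, which is why the comparison above looks only at the interior.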
### Defining the function that convolves the spectrum
```python
def convolve_spectrum(spectrum,
```