Feature Selection with a Genetic Algorithm in Python
Date: 2023-07-12 14:15:56
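Feature selection is an important step in machine learning: removing irrelevant or redundant features can improve a model's accuracy and generalization. A genetic algorithm is an optimization method modeled on natural selection and genetic inheritance, and it can be applied to feature selection by treating each candidate feature subset as an individual.
Below is a Python example that uses a genetic algorithm for feature selection: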
```python
import random

import numpy as np
from deap import algorithms, base, creator, tools
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Seed Python's RNG (used by DEAP's operators) so runs are repeatable
random.seed(42)

# Load the dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


# Define the fitness function
def fitness_function(individual, X_train, X_test, y_train, y_test):
    # Convert the individual (a list of 0/1 genes) into a boolean feature mask
    feature_mask = np.array(individual, dtype=bool)
    # An empty feature subset cannot be trained on, so give it the worst fitness
    if not feature_mask.any():
        return 0.0,
    # Keep only the selected features
    X_train_selected = X_train[:, feature_mask]
    X_test_selected = X_test[:, feature_mask]
    # Train a random forest model
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train_selected, y_train)
    # Compute accuracy on the test set
    # (the test set guides the search here, so this score is optimistic)
    y_pred = model.predict(X_test_selected)
    accuracy = accuracy_score(y_test, y_pred)
    # DEAP expects fitness values as a tuple
    return accuracy,


# Genetic algorithm parameters
POPULATION_SIZE = 100
P_CROSSOVER = 0.9
P_MUTATION = 0.1
MAX_GENERATIONS = 50
HALL_OF_FAME_SIZE = 10

# Set up the genetic algorithm toolbox
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
# Individuals are plain lists: cxTwoPoint swaps slices, which misbehaves on
# NumPy arrays (slices are views), and HallOfFame compares individuals with ==
creator.create("Individual", list, fitness=creator.FitnessMax)
toolbox = base.Toolbox()
toolbox.register("attr_bool", random.randint, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_bool, n=X.shape[1])
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", fitness_function, X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutFlipBit, indpb=1.0 / X.shape[1])
toolbox.register("select", tools.selTournament, tournsize=3)

# Run the genetic algorithm
population = toolbox.population(n=POPULATION_SIZE)
hof = tools.HallOfFame(HALL_OF_FAME_SIZE)
stats = tools.Statistics(lambda ind: ind.fitness.values)
stats.register("avg", np.mean)
stats.register("min", np.min)
stats.register("max", np.max)

best = None
for gen in range(MAX_GENERATIONS):
    # Apply crossover and mutation to clones of the current population
    offspring = algorithms.varAnd(population, toolbox, P_CROSSOVER, P_MUTATION)
    # Evaluate only the offspring whose fitness is missing or was invalidated
    invalid = [ind for ind in offspring if not ind.fitness.valid]
    fits = toolbox.map(toolbox.evaluate, invalid)
    for fit, ind in zip(fits, invalid):
        ind.fitness.values = fit
    # Tournament selection builds the next generation
    population = toolbox.select(offspring, k=len(population))
    hof.update(population)
    record = stats.compile(population)
    print("Generation {}: {}".format(gen, record))
    if best is None or best.fitness < hof[0].fitness:
        best = hof[0]
    # Stop early once the best individual is accurate enough
    if hof[0].fitness.values[0] >= 0.99:
        break

# Print the result
feature_mask = np.array(best, dtype=bool)
selected_features = X_train[:, feature_mask]
print("Selected features:", selected_features.shape[1])
```
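The code above uses the `deap` library to implement the genetic algorithm. First, we define a fitness function, `fitness_function`, which turns an individual (that is, a feature mask) into the corresponding subset of features, trains a random forest model on that subset, and computes its accuracy on the test set. Because the test set is used to guide the search, the accuracy reported for the best individual is an optimistic estimate.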
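One way to keep the test set out of the search is to score each mask with cross-validation on the training data only. The following is a minimal sketch of that idea, not part of the original code: the helper name `cv_fitness`, the 3-fold setting, and the smaller 20-tree forest are all assumptions chosen to keep the example quick.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split


def cv_fitness(individual, X_train, y_train, cv=3):
    """Hypothetical alternative fitness: mean CV accuracy on the training data."""
    feature_mask = np.asarray(individual, dtype=bool)
    # An empty subset cannot be trained on, so it gets the worst score
    if not feature_mask.any():
        return (0.0,)
    # Assumed model settings; fewer trees than the main example for speed
    model = RandomForestClassifier(n_estimators=20, random_state=42)
    scores = cross_val_score(
        model, X_train[:, feature_mask], y_train, cv=cv, scoring="accuracy"
    )
    # Return a one-element tuple, the shape DEAP expects for a fitness
    return (float(scores.mean()),)


data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)
all_features = [1] * X_train.shape[1]
no_features = [0] * X_train.shape[1]
score_all = cv_fitness(all_features, X_train, y_train)
score_none = cv_fitness(no_features, X_train, y_train)
print(score_all, score_none)
```

With this variant, the evaluation could be registered as `toolbox.register("evaluate", cv_fitness, X_train=X_train, y_train=y_train)`, and the test set would be used once at the end to score the final mask.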
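Next, we define the genetic algorithm's parameters and register the operators in the toolbox: two-point crossover (`cxTwoPoint`), bit-flip mutation (`mutFlipBit`), and tournament selection (`selTournament`). We then initialize the population, run the algorithm for multiple generations, and record the statistics for each generation.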
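To make the registered operators concrete, here is a small standalone sketch of what two-point crossover, bit-flip mutation, and tournament selection do to 0/1 masks. The functions `two_point_crossover`, `flip_bit_mutation`, and `tournament_select` are illustrative re-implementations written for this example, not DEAP's own code; DEAP's versions differ in details (for instance, they modify individuals in place).

```python
import random


def two_point_crossover(parent1, parent2, rng):
    """Swap the segment between two cut points (returns new lists)."""
    size = min(len(parent1), len(parent2))
    cut1, cut2 = sorted(rng.sample(range(1, size), 2))
    child1 = parent1[:cut1] + parent2[cut1:cut2] + parent1[cut2:]
    child2 = parent2[:cut1] + parent1[cut1:cut2] + parent2[cut2:]
    return child1, child2


def flip_bit_mutation(individual, indpb, rng):
    """Flip each gene independently with probability indpb."""
    return [1 - gene if rng.random() < indpb else gene for gene in individual]


def tournament_select(population, fitnesses, k, tournsize, rng):
    """Pick k individuals; each is the fittest of tournsize random entrants."""
    chosen = []
    for _ in range(k):
        entrants = [rng.randrange(len(population)) for _ in range(tournsize)]
        winner = max(entrants, key=lambda i: fitnesses[i])
        chosen.append(population[winner])
    return chosen


rng = random.Random(0)
zeros, ones = [0] * 8, [1] * 8
child1, child2 = two_point_crossover(zeros, ones, rng)
mutated = flip_bit_mutation(zeros, indpb=1.0, rng=rng)
selected = tournament_select([zeros, ones], [0.1, 0.9], k=4, tournsize=2, rng=rng)
print(child1, child2, mutated, selected)
```

In the main example, `indpb=1.0 / X.shape[1]` means each mutated individual has on average one flipped bit, while `tournsize=3` controls how strongly selection favors fitter masks.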
Finally, we print the number of features selected by the best individual.
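The count alone does not say which features were kept. A minimal sketch for mapping a mask back to feature names follows; `selected_feature_names` is a hypothetical helper, and the four-name list and mask are made-up inputs for illustration rather than results from an actual run.

```python
import numpy as np


def selected_feature_names(individual, feature_names):
    """Return the names whose positions are set to 1 in the mask."""
    mask = np.asarray(individual, dtype=bool)
    names = np.asarray(feature_names)
    if mask.shape[0] != names.shape[0]:
        raise ValueError("mask and feature_names must have the same length")
    return names[mask].tolist()


# Illustrative inputs only: a short name list and a hand-written mask
example_names = ["mean radius", "mean texture", "mean perimeter", "mean area"]
example_mask = [1, 0, 0, 1]
chosen = selected_feature_names(example_mask, example_names)
print(chosen)
```

In the script above, the same helper could be called as `selected_feature_names(best, data.feature_names)`, since the dataset object returned by `load_breast_cancer()` carries a `feature_names` array.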