Exercise #1: Decision Tree and Support Vector Machine. Goal: implement the decision tree algorithm and the support vector machine, respectively, and compare their performance. Dataset (in folder 'data1'): we will use the Adult dataset, whose description is available here. The training set (adult.data) and evaluation set (adult.test) can be downloaded here. The prediction task is to determine whether a person makes over 50K a year. Preprocessing: remove all records containing '?' (i.e., missing values), and remove the attribute "native-country". You may not use pandas, DecisionTreeClassifier, or SVC.
Okay, let's get started with implementing the decision tree and support vector machine algorithms for the given dataset.
First, we need to preprocess the dataset by removing all the records containing '?' and the attribute "native-country". Here's the code for that:
```python
def preprocess(filename):
    data = []
    with open(filename, 'r') as f:
        for line in f:
            line = line.strip()
            # skip blank lines and the header line in adult.test
            if not line or line.startswith('|'):
                continue
            if '?' in line:
                continue  # drop records with missing values
            fields = [field.strip() for field in line.split(',')]
            del fields[13]  # remove "native-country" (the 14th attribute; index 14 is the label)
            data.append(fields)
    return data

train_data = preprocess('data1/adult.data')
test_data = preprocess('data1/adult.test')
```
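As a quick sanity check, the same filtering and column-removal logic can be exercised on a couple of in-memory lines (the two records below are made-up illustrations of the adult.data layout, not real rows):

```python
# Hypothetical sample lines in the adult.data layout: 15 comma-separated
# fields, with "native-country" at index 13 and the income label last.
sample = [
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K",
    "50, ?, 83311, Bachelors, 13, Married-civ-spouse, ?, Husband, White, "
    "Male, 0, 0, 13, United-States, <=50K",
]
data = []
for line in sample:
    if '?' in line:
        continue  # drop records with missing values
    fields = [f.strip() for f in line.split(',')]
    del fields[13]  # remove "native-country"
    data.append(fields)

print(len(data))      # 1: only the record without '?' survives
print(len(data[0]))   # 14: 13 attributes plus the label
```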
Now that we have preprocessed the dataset, we can move on to implementing the decision tree algorithm. Here's the code for that:
```python
import math
def entropy(data):
    """Shannon entropy of the class labels (last column)."""
    counts = {}
    for row in data:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    result = 0.0
    for label in counts:
        p = counts[label] / len(data)
        result -= p * math.log2(p)
    return result

def split_data(data, attribute):
    """Partition rows by their value of the given attribute."""
    splits = {}
    for row in data:
        splits.setdefault(row[attribute], []).append(row)
    return splits

def choose_attribute(data):
    """Attribute with the highest information gain, or None if no gain."""
    best_gain = 0.0
    best_attribute = None
    base_entropy = entropy(data)
    for attribute in range(len(data[0]) - 1):
        splits = split_data(data, attribute)
        entropy_sum = sum(len(subset) / len(data) * entropy(subset)
                          for subset in splits.values())
        gain = base_entropy - entropy_sum
        if gain > best_gain:
            best_gain = gain
            best_attribute = attribute
    return best_attribute

def majority_label(data):
    """Most frequent class label among the rows."""
    counts = {}
    for row in data:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    return max(counts, key=counts.get)

def decision_tree(data):
    """ID3-style tree: a leaf is a label, an internal node is
    {attribute_index: {value: subtree}}."""
    if len(data) == 0:
        return None
    if len(set(row[-1] for row in data)) == 1:
        return data[0][-1]  # pure node
    attribute = choose_attribute(data)
    if attribute is None:
        return majority_label(data)  # no attribute improves purity
    tree = {attribute: {}}
    for value, subset in split_data(data, attribute).items():
        tree[attribute][value] = decision_tree(subset)
    return tree
```
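The comparison step further down calls an `evaluate` function that is never defined in this answer. Here is a minimal sketch, assuming the nested-dict tree shape produced by `decision_tree` above (leaves are plain labels, internal nodes map an attribute index to per-value subtrees):

```python
def evaluate(tree, row):
    """Walk the tree until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))      # the single attribute at this node
        branches = tree[attribute]
        if row[attribute] not in branches:
            return None                   # unseen value: count it as a miss
        tree = branches[row[attribute]]
    return tree

# Hand-built toy tree: split on attribute 0, then on attribute 1 under value 'b'.
toy_tree = {0: {'a': 'yes', 'b': {1: {'x': 'no', 'y': 'yes'}}}}
print(evaluate(toy_tree, ['a', 'x']))   # yes
print(evaluate(toy_tree, ['b', 'x']))   # no
```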
Now that we have implemented the decision tree algorithm, we can move on to implementing the support vector machine algorithm. Here's the code for that:
```python
import random

REG = 0.1  # L2 regularization strength

def dot_product(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def svm_train(data, epochs, learning_rate):
    """Linear SVM trained by SGD on the regularized hinge loss.
    Each row is [feature_1, ..., feature_n, label] with label in {-1, +1}."""
    w = [0.0] * (len(data[0]) - 1)
    b = 0.0
    for epoch in range(epochs):
        random.shuffle(data)
        for row in data:
            x, y = row[:-1], row[-1]
            if y * (dot_product(w, x) + b) <= 1:
                # inside the margin: hinge-loss gradient step plus shrinkage
                w = [wi + learning_rate * (y * xi - 2 * REG * wi)
                     for wi, xi in zip(w, x)]
                b += learning_rate * y
            else:
                # correct with margin: regularization shrinkage only
                w = [(1 - 2 * REG * learning_rate) * wi for wi in w]
    return w, b

def svm_predict(data, w, b):
    """Sign of the decision function for each row."""
    return [1 if dot_product(w, row[:-1]) + b > 0 else -1 for row in data]
```
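To check that this update rule actually learns, the same SGD scheme can be run on a tiny linearly separable toy set (the loop is restated inline so the snippet is self-contained; the 0.1 regularization constant matches the one hard-coded above):

```python
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

random.seed(0)  # deterministic shuffling for reproducibility
data = [([2.0, 2.0], 1), ([1.5, 2.5], 1), ([-2.0, -1.0], -1), ([-1.0, -2.0], -1)]
w, b, lr, reg = [0.0, 0.0], 0.0, 0.01, 0.1
for _ in range(200):
    random.shuffle(data)
    for x, y in data:
        if y * (dot(w, x) + b) <= 1:
            # margin violated: hinge-loss gradient step plus shrinkage
            w = [wi + lr * (y * xi - 2 * reg * wi) for wi, xi in zip(w, x)]
            b += lr * y
        else:
            w = [(1 - 2 * reg * lr) * wi for wi in w]

predictions = [1 if dot(w, x) + b > 0 else -1 for x, _ in data]
print(predictions == [y for _, y in data])  # True: the toy set is fit perfectly
```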
Finally, we can compare the performances of the decision tree and support vector machine algorithms on the given dataset. Here's the code for that:
```python
# Labels differ between files: '>50K' in adult.data vs '>50K.' in adult.test.
train_rows = [row[:-1] + [1 if row[-1].strip() == '>50K' else -1] for row in train_data]
test_rows = [row[:-1] + [1 if row[-1].strip() == '>50K.' else -1] for row in test_data]

# The decision tree handles categorical attributes directly.
tree = decision_tree(train_rows)
tree_predictions = [evaluate(tree, row[:-1]) for row in test_rows]
tree_accuracy = sum(p == r[-1] for p, r in zip(tree_predictions, test_rows)) / len(test_rows)

# The SVM needs numeric features: map each categorical value to an integer code.
codes = [{} for _ in range(len(train_rows[0]) - 1)]
def to_numeric(row):
    x = []
    for i, v in enumerate(row[:-1]):
        try:
            x.append(float(v))
        except ValueError:
            x.append(float(codes[i].setdefault(v, len(codes[i]))))
    return x + [row[-1]]

svm_w, svm_b = svm_train([to_numeric(r) for r in train_rows], epochs=100, learning_rate=0.01)
svm_predictions = svm_predict([to_numeric(r) for r in test_rows], svm_w, svm_b)
svm_accuracy = sum(p == r[-1] for p, r in zip(svm_predictions, test_rows)) / len(test_rows)

print('Decision Tree Accuracy:', tree_accuracy)
print('Support Vector Machine Accuracy:', svm_accuracy)
```
This should give us the accuracy of the decision tree and support vector machine algorithms on the given dataset.