Sentiment classification of English movie reviews (Kaggle competition): code answer
Date: 2023-07-05 10:29:57
Below is one possible solution, using Python's scikit-learn and NLTK libraries:
```python
import nltk
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Fetch the NLTK stop-word list if it is not already installed locally
nltk.download('stopwords', quiet=True)
# Load the dataset
df = pd.read_csv('train.csv')
# Separate the features and the labels
X = df['text']
# Encode the labels as integers (any label not listed here becomes NaN)
y = df['sentiment'].map({'negative': 0, 'neutral': 1, 'positive': 2})
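# Build the pipeline: token counts -> TF-IDF weights -> classifier
pipeline = Pipeline([
    ('vect', CountVectorizer(stop_words=stopwords.words('english'))),
    ('tfidf', TfidfTransformer()),
    # A higher iteration cap helps avoid convergence warnings on sparse text features
    ('clf', LogisticRegression(max_iter=1000))
])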
# Split the data into a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train the model
pipeline.fit(X_train, y_train)
# Predict on the test set
y_pred = pipeline.predict(X_test)
# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
```
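This model is a pipeline made up of a count vectorizer, a TF-IDF transformer, and a logistic regression classifier. English stop words from the NLTK library are removed during vectorization. No lemmatization is applied in this pipeline.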
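To make the count-then-TF-IDF step more concrete, here is a minimal pure-Python sketch of the weighting that scikit-learn's `TfidfTransformer` applies with its default settings: a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The `tfidf` helper and the three toy reviews are hypothetical and only for illustration. Tokenization here is a plain whitespace split, which is simpler than what `CountVectorizer` does, and no stop words are removed.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) of smoothed, L2-normalized TF-IDF weights."""
    # Simplified tokenization: lowercase and split on whitespace
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # Raw term count multiplied by the term's IDF
        raw = [counts[t] * idf[t] for t in vocab]
        # Scale the row to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


# Hypothetical toy reviews
docs = ["great movie", "terrible movie", "great great acting"]
vocab, rows = tfidf(docs)
print(vocab)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

Terms that appear in fewer documents, such as "terrible" here, receive a larger weight than terms shared across documents, such as "movie". These weighted rows are the kind of input the logistic regression classifier is trained on.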