English Movie Review Sentiment Classification (Kaggle Competition) Code
Date: 2023-07-05 11:29:38
The following is code for the Kaggle competition on sentiment classification of English movie reviews:
1. Data preprocessing:
```python
import re
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split

# Download the NLTK resources used below (only needed once per environment;
# some NLTK versions may also require 'omw-1.4' for the lemmatizer)
nltk.download('stopwords')
nltk.download('wordnet')

# Load the data
df = pd.read_csv('train.csv')

# Split into a training set and a held-out test set
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.2, random_state=42)

# Text cleaning
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # a set makes membership checks fast

def preprocess(text):
    text = text.lower()  # lowercase
    text = re.sub(r'\[.*?\]', '', text)  # remove square brackets and their contents
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # remove punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # remove words that contain digits
    text = re.sub('[‘’“”…]', '', text)  # remove curly quotes and ellipsis characters
    tokens = re.split(r'\W+', text)  # tokenize
    # Lemmatize and drop stop words (and the empty strings the split can leave behind)
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word and word not in stop_words]
    return ' '.join(tokens)

X_train = X_train.apply(preprocess)
X_test = X_test.apply(preprocess)
```
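To see what the regex cleaning steps do to a single review, here is a minimal sketch. The helper `clean_text` and the sample sentence are hypothetical and only for illustration; the sketch deliberately skips stop-word removal and lemmatization so that it runs with the standard library alone.

```python
import re
import string

def clean_text(text):
    """Hypothetical stand-in for the regex steps only (no stop words, no lemmatization)."""
    text = text.lower()                                              # lowercase
    text = re.sub(r'\[.*?\]', '', text)                              # drop [bracketed] spans
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # drop ASCII punctuation
    text = re.sub(r'\w*\d\w*', '', text)                             # drop words containing digits
    # re.split can leave empty strings at the edges, so filter them out
    return [tok for tok in re.split(r'\W+', text) if tok]

sample = "I LOVED it [spoiler]... 10/10, best film of 2005!"
tokens = clean_text(sample)
print(tokens)  # ['i', 'loved', 'it', 'best', 'film', 'of']
```

Note that "10/10" becomes "1010" once the slash is stripped and is then removed as a word containing digits, so numeric ratings do not survive this cleaning.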
2. Feature extraction:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
# Fit TF-IDF on the training set only, then reuse the same vocabulary for the test set
tfidf_vect = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf_vect.fit_transform(X_train)
X_test_tfidf = tfidf_vect.transform(X_test)
```
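As a rough illustration of what the TF-IDF step produces, the sketch below runs `TfidfVectorizer` with default settings on a three-document toy corpus. The corpus and the `toy_*` names are made up for this example and are not part of the competition data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy corpus, for illustration only
toy_docs = ["good movie", "bad movie", "good good plot"]

toy_vect = TfidfVectorizer()
toy_matrix = toy_vect.fit_transform(toy_docs)  # sparse matrix: one row per document

vocab = sorted(toy_vect.vocabulary_)  # the learned terms, in column order
print(vocab)
print(toy_matrix.shape)

# With default settings every row is L2-normalized
row_norms = np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))
print(row_norms.ravel())

# In "bad movie", the rarer word "bad" gets more weight than the common word "movie"
bad_weight = toy_matrix[1, toy_vect.vocabulary_["bad"]]
movie_weight = toy_matrix[1, toy_vect.vocabulary_["movie"]]
print(bad_weight > movie_weight)
```

Each document becomes one row and each vocabulary term one column; terms that appear in fewer documents receive a larger inverse-document-frequency weight.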
3. Model training and evaluation:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Train the model
lr = LogisticRegression()
lr.fit(X_train_tfidf, y_train)
# Evaluate on the held-out test set
y_pred = lr.predict(X_test_tfidf)
print('Test accuracy is {}'.format(accuracy_score(y_test, y_pred)))
```
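This example uses logistic regression as the classifier and TF-IDF for feature extraction. Other models and feature extraction methods can be substituted, for example a bag-of-words representation or a neural network.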
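As one possible alternative, the sketch below pairs a bag-of-words representation (`CountVectorizer`) with multinomial naive Bayes in a scikit-learn pipeline. It is not the method used above; the toy reviews and labels are invented so that the block runs on its own, and on the real data the pipeline would be fitted on the preprocessed training reviews instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus for illustration only; 1 = positive, 0 = negative
toy_reviews = [
    "a wonderful moving film",
    "great acting and a great story",
    "terrible plot and awful acting",
    "boring awful waste of time",
]
toy_labels = [1, 1, 0, 0]

# Bag-of-words counts feeding a multinomial naive Bayes classifier
nb_model = make_pipeline(CountVectorizer(), MultinomialNB())
nb_model.fit(toy_reviews, toy_labels)

toy_pred = nb_model.predict(["great wonderful story", "awful boring plot"])
print(toy_pred)  # [1 0]
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary fitted on the training text only, which matches the fit/transform split used in the feature extraction step.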