Create a new label sentiment, which equals 1 if star > 3 and 0 otherwise. Then develop a prediction model for sentiment.

* Split the dataset, using 80% of the data for training and 20% for testing.
* Train a model to predict the sentiment of the comments in the test dataset, and make the accuracy as high as possible.

I list some steps as follows. Please feel free to work in your own way.

* Load the data (encoding='gb18030').
* Make the new label sentiment.
* Tokenize the Chinese comments.
* Split the dataset into training and test sets.
* Train a model to predict sentiment. A naive Bayes model is just one option; if you have a better one, try it.
* Test the trained model on the test dataset and print the accuracy.

Some tools that may be helpful:

* jieba
* apply
* train_test_split
* CountVectorizer
* MultinomialNB
* pipeline
* accuracy_score
Time: 2024-02-20 17:02:00 Views: 95
Sure, here's how you can perform sentiment analysis of restaurant comments with the steps you mentioned:
```python
import pandas as pd
import jieba
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
# Load the data (the file name is a placeholder; the 'star' and 'comments' columns are assumed)
df = pd.read_csv('restaurant_comments.csv', encoding='gb18030')
# Create the sentiment label
df['sentiment'] = df['star'].apply(lambda x: 1 if x > 3 else 0)
# Tokenize the Chinese comments; missing values become empty strings so jieba.cut never receives NaN
df['comments'] = df['comments'].fillna('').astype(str).apply(lambda x: ' '.join(jieba.cut(x)))
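# Split the dataset into training (80%) and test (20%) sets;
# a fixed seed makes the split reproducible, and stratify keeps the class ratio equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    df['comments'], df['sentiment'], test_size=0.2, random_state=42, stratify=df['sentiment'])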
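# Create a pipeline with CountVectorizer and MultinomialNB.
# The default token_pattern drops single-character tokens, which discards
# many Chinese words (e.g. 好), so a pattern that keeps them is used instead.
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(token_pattern=r'(?u)\b\w+\b')),
    ('classifier', MultinomialNB())
])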
# Train the model
pipeline.fit(X_train, y_train)
# Test the trained model using test dataset and print the accuracy
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
```
In this code, we first load the data with pandas using encoding='gb18030', then create the sentiment label from the 'star' column. We tokenize the Chinese comments with jieba and join the tokens with spaces so that CountVectorizer can split them, and then divide the dataset into training and test sets with train_test_split. Next, we build a pipeline of CountVectorizer and MultinomialNB, fit it on the training data, predict on the test set, and print the accuracy computed by accuracy_score. To improve the accuracy, you can try other models and tune their hyperparameters.
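As one possible way to try another model and tune hyperparameters, here is a minimal sketch that swaps in TfidfVectorizer and LogisticRegression and searches a small grid with GridSearchCV. This is not part of the original task description: the toy comments, the labels, the parameter grid, and the choice of logistic regression are all assumptions for illustration. The texts are written as already-segmented, space-separated tokens, standing in for the output of the tokenization step. Whether this beats naive Bayes on the real data has to be checked by running it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy, already-segmented comments (illustrative data only, not from the real dataset)
texts = [
    "菜 很 好吃 服务 好", "味道 不错 环境 好", "非常 满意 好",
    "好吃 推荐 好", "很 喜欢 味道 好", "服务 热情 好吃",
    "菜 难吃 服务 差", "味道 差 环境 差", "非常 失望 差",
    "难吃 不 推荐", "太 贵 味道 差", "服务 冷淡 难吃",
]
labels = [1] * 6 + [0] * 6  # 1 = positive, 0 = negative

pipe = Pipeline([
    # token_pattern keeps single-character tokens such as 好 and 差
    ("tfidf", TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Small, assumed grid: unigrams vs. unigrams+bigrams, and regularization strength C
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# 3-fold cross-validation on the training data; the best setting is refit on all of it
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
print("Predictions:", search.predict(["味道 好", "服务 差"]))
```

On the real data, you would pass X_train and y_train to search.fit and then report accuracy_score(y_test, search.predict(X_test)), so that the test set is used only once for the final evaluation.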