News text classification experiment: Python code
Date: 2023-10-14 17:11:58
The following is example Python code for a text classification experiment:
```python
# -*- coding: utf-8 -*-
"""
Created on Mon Aug 23 12:23:47 2021
@author: Your Name
"""
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
# Load data
data = pd.read_csv("news.csv")
# Data cleaning
data = data.dropna()
data = data.reset_index(drop=True)
nltk.download('stopwords')
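stop_words = set(stopwords.words('english'))  # set for fast membership tests
# Compare in lower case: the NLTK list is lower-case, so "The" would otherwise be kept
data['text'] = data['text'].apply(lambda x: ' '.join(word for word in x.split() if word.lower() not in stop_words))
# Raw string avoids the invalid-escape warning for \s; keep only letters, digits, and whitespace
data['text'] = data['text'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))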
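# Train-test split, done before vectorization so the test text does not
# influence the TF-IDF vocabulary or IDF weights
X_train_text, X_test_text, y_train, y_test = train_test_split(
    data['text'], data['category'], test_size=0.3, random_state=42)
# TF-IDF vectorization: fit on the training text only, then transform the test text
tfidf_vect = TfidfVectorizer()
X_train = tfidf_vect.fit_transform(X_train_text)
X_test = tfidf_vect.transform(X_test_text)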
# Multinomial Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
# Predictions
y_pred = nb_classifier.predict(X_test)
# Evaluation
conf_mat = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
print("Confusion Matrix:\n", conf_mat)
print("Accuracy:", accuracy)
```
Here, `news.csv` is a data file with a `text` column and a `category` column, for example:
```
text,category
New York City is a bustling metropolis.,Travel
The latest technology news from Silicon Valley.,Technology
The best recipes for a summer BBQ.,Food
```
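One detail of this layout is worth checking before training: a text field that itself contains a comma has to be wrapped in double quotes, otherwise the row splits into too many columns. The sketch below parses a small in-memory sample with the standard-library `csv` module. The second data row is invented for illustration and is not part of the original file. `pd.read_csv` uses the same quoting convention by default.

```python
import csv
import io

# Hypothetical sample in the same text,category layout. The second data row is
# made up to show a text field that contains commas and is therefore quoted.
sample = '''text,category
The best recipes for a summer BBQ.,Food
"Phones, laptops, and tablets reviewed.",Technology
'''

# DictReader uses the header line as keys and respects double-quoted fields
rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    print(row["category"], "|", row["text"])
```

Each parsed row should expose exactly two keys, `text` and `category`. A row with extra fields usually points to an unquoted comma in the text.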
The script prints the confusion matrix and the accuracy.
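To make those two outputs easier to read, here is a minimal pure-Python sketch of how they are computed. It follows the convention used by `confusion_matrix` in scikit-learn: rows are true labels, columns are predicted labels, and labels appear in sorted order. The helper names `confusion_counts` and `accuracy` and the label lists are hypothetical, chosen only for illustration; they are not results from the script above.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Return (labels, matrix); matrix[i][j] counts samples whose true label
    is labels[i] and whose predicted label is labels[j]."""
    labels = sorted(set(y_true) | set(y_pred))
    pair_counts = Counter(zip(y_true, y_pred))
    matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels, made up for illustration only
y_true = ["Food", "Food", "Technology", "Travel", "Travel", "Travel"]
y_pred = ["Food", "Travel", "Technology", "Travel", "Travel", "Food"]

labels, matrix = confusion_counts(y_true, y_pred)
print(labels)
for label, row in zip(labels, matrix):
    print(label, row)
print("Accuracy:", accuracy(y_true, y_pred))
```

The diagonal entries are the correctly classified samples, so accuracy equals the sum of the diagonal divided by the total number of samples. Off-diagonal entries show which categories are confused with each other.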