Feature selection for text classification in Python
Date: 2023-11-20 13:58:52 Views: 44
In Python, feature selection for text classification can be done with a naive Bayes classifier. The steps are as follows:
1. Prepare the dataset and split it into a training set and a test set.
2. Preprocess the text: tokenization, stop-word removal, stemming, and so on.
3. Convert each text into a feature vector, e.g. with a bag-of-words or TF-IDF model.
4. Train a naive Bayes classifier, e.g. with the NaiveBayesClassifier class from the nltk library.
5. Evaluate the classifier's performance, e.g. with nltk.classify.accuracy() to compute accuracy.
6. Inspect the contribution of individual features: the classifier's show_most_informative_features() method lists the most discriminative features.
Here is an example:
```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# First run only: download the required NLTK resources
# nltk.download('punkt')
# nltk.download('stopwords')

# Prepare the dataset
documents = [("This is a sample sentence.", "positive"),
             ("This is another example sentence.", "positive"),
             ("This sentence is not good.", "negative"),
             ("I don't like this product.", "negative")]

# Tokenize, remove stop words, and stem
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')
all_words = []
for text, label in documents:
    words = [stemmer.stem(word.lower()) for word in word_tokenize(text)
             if word.lower() not in stop_words]
    all_words.extend(words)
all_words = nltk.FreqDist(all_words)

# Convert each text into a feature vector: use the 100 most frequent
# stems as boolean bag-of-words features
word_features = [w for w, _ in all_words.most_common(100)]

def document_features(document):
    document_words = set(stemmer.stem(word.lower()) for word in word_tokenize(document)
                         if word.lower() not in stop_words)
    return {'contains({})'.format(word): (word in document_words)
            for word in word_features}

featuresets = [(document_features(text), label) for text, label in documents]
train_set, test_set = featuresets[:3], featuresets[3:]

# Alternative TF-IDF representation (suitable for scikit-learn classifiers;
# nltk.NaiveBayesClassifier works on the dict features above)
vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform([text for text, _ in documents[:3]])
test_vectors = vectorizer.transform([text for text, _ in documents[3:]])

# Train the naive Bayes classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Evaluate the classifier's accuracy on the test set
print(nltk.classify.accuracy(classifier, test_set))

# Show the most discriminative features
classifier.show_most_informative_features(5)
```
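The same idea can also be expressed entirely in scikit-learn, where feature selection is an explicit step: score TF-IDF features with the chi-square statistic, keep only the top k, and feed them to a multinomial naive Bayes classifier. The sketch below is illustrative, not part of the original example; the tiny inline dataset and the choice k=5 are assumptions for demonstration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny illustrative dataset (same texts as the NLTK example above)
texts = ["This is a sample sentence.", "This is another example sentence.",
         "This sentence is not good.", "I don't like this product."]
labels = ["positive", "positive", "negative", "negative"]

# TF-IDF vectors -> keep the k features with the highest chi-square
# scores against the labels -> multinomial naive Bayes
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("select", SelectKBest(chi2, k=5)),
    ("clf", MultinomialNB()),
])
pipeline.fit(texts, labels)

print(pipeline.predict(["not a good product"]))
```

With a real corpus, k would be tuned (e.g. by cross-validation) rather than fixed at 5; the pipeline ensures the selection is fitted only on training data.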