Text mining Python code
Date: 2024-03-21 13:36:34
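Text mining is the process of extracting useful information and knowledge from large volumes of text data. Python has several capable libraries for text mining tasks, including NLTK, scikit-learn, and Gensim. A simple example follows: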
1. Install the required libraries (run these in a shell, not in the Python interpreter):
```bash
pip install nltk
pip install scikit-learn
pip install gensim
```
2. Import the required libraries:
```python
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
```
3. Tokenization:
```python
# Tokenize with NLTK
nltk.download('punkt')  # tokenizer models; newer NLTK releases may also need 'punkt_tab'
text = "This is an example sentence."
tokens = nltk.word_tokenize(text)
print(tokens)
```
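To make the tokenization step less opaque, here is a minimal dependency-free sketch. It is an approximation, not NLTK's algorithm: the `simple_tokenize` helper and its regular expression are assumptions made for illustration, and they will diverge from `nltk.word_tokenize` on contractions, abbreviations, and similar cases.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks.

    Hypothetical helper for illustration; this is not how NLTK tokenizes.
    """
    # \w+ matches a run of word characters; [^\w\s] matches one character
    # that is neither a word character nor whitespace (i.e. punctuation).
    return re.findall(r"\w+|[^\w\s]", text)


tokens_sketch = simple_tokenize("This is an example sentence.")
print(tokens_sketch)
```

For real text, a trained tokenizer such as the one NLTK provides is usually the better choice; the regex version is only meant to show what "splitting into tokens" means.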
4. Text vectorization:
```python
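# The vectorizers expect a list of documents (raw strings), not the tokens of one sentence
documents = ["This is an example sentence.", "This is another example."]
# Term-count vectors with CountVectorizer
count_vectorizer = CountVectorizer()
X_counts = count_vectorizer.fit_transform(documents)
print(count_vectorizer.get_feature_names_out())  # column order of the matrix
print(X_counts.toarray())
# TF-IDF vectors with TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(documents)
print(X_tfidf.toarray())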
```
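As a rough illustration of what the two vectorizers compute, the sketch below rebuilds term counts and TF-IDF weights with the standard library only. It assumes scikit-learn's default settings (lowercasing, tokens of at least two word characters, smoothed IDF, L2 row normalization); the sample documents and all helper names are made up for this example, and the result should approximate rather than replace the library output.

```python
import math
import re
from collections import Counter

# Two short sample documents (assumed for illustration)
docs = [
    "This is an example sentence.",
    "This is another example.",
]


def tokenize(doc):
    # Lowercase, then keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())


tokenized = [tokenize(d) for d in docs]
vocab = sorted({term for doc in tokenized for term in doc})

# Term-count matrix: one row per document, one column per vocabulary term
counts = []
for doc in tokenized:
    term_counts = Counter(doc)
    counts.append([term_counts[term] for term in vocab])

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
n_docs = len(docs)
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def l2_normalize(row):
    # Scale the row to unit Euclidean length; leave all-zero rows unchanged
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row


tfidf = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]

print(vocab)
print(counts)
print(tfidf)
```

Terms that appear in only one document (such as "an") end up with a larger weight than terms shared by both documents (such as "example"), which is the point of the IDF factor.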
5. Word similarity with Word2Vec:
```python
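# Train Word2Vec and compute the similarity between two words
# (a two-sentence toy corpus is far too small to give meaningful scores)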
sentences = [['I', 'love', 'coding'], ['I', 'love', 'programming']]
model = Word2Vec(sentences, min_count=1)
similarity = model.wv.similarity('coding', 'programming')
print(similarity)
```
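The score returned by `model.wv.similarity` is the cosine similarity between the two word vectors. The sketch below shows that calculation in plain Python. The three-dimensional vectors are invented stand-ins for learned embeddings, and `cosine_similarity` is a hypothetical helper, not a Gensim function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)


# Made-up vectors standing in for trained embeddings
vec_coding = [0.2, 0.7, 0.1]
vec_programming = [0.25, 0.65, 0.05]
vec_banana = [0.9, -0.1, 0.4]

print(cosine_similarity(vec_coding, vec_programming))
print(cosine_similarity(vec_coding, vec_banana))
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score near 0. With a realistically sized training corpus, words used in similar contexts tend to end up with similar directions.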