TF-IDF keyword extraction for English text
Date: 2023-09-07 07:17:47
TF-IDF (Term Frequency-Inverse Document Frequency) is a common technique for keyword extraction in text mining. It scores how important a word is to one document relative to the whole corpus: words that occur often in a document but rarely elsewhere in the corpus receive the highest scores.
The TF-IDF score for a word in a document is calculated by multiplying its frequency (TF) in the document by the inverse document frequency (IDF) of the word in the corpus.
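To make the multiplication concrete, here is a minimal sketch that computes the scores by hand. It assumes the plain textbook variant (TF as relative frequency, IDF as `log(N / df)`); the helper name `tf_idf` and the two tiny documents are made up for illustration. Libraries such as scikit-learn use smoothed and normalized variants, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document (hypothetical helper)."""
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # TF (relative frequency) times IDF (log of N / df)
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# two illustrative, already-tokenized documents
docs = [
    "the fox jumps".split(),
    "the dog sleeps".split(),
]
scores = tf_idf(docs)
for i, doc_scores in enumerate(scores, start=1):
    print(i, doc_scores)
```

In this sketch "the" appears in every document, so its IDF is `log(2 / 2) = 0` and its score vanishes, while "fox" appears in only one document and keeps a positive score.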
Here's an example of how to extract keywords using TF-IDF in Python:
```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# sample documents
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox is very clever.",
    "The lazy dog is always sleeping.",
    "The quick brown fox and the lazy dog are good friends.",
]

# create a TF-IDF vectorizer with default settings (no stop-word removal)
tfidf_vectorizer = TfidfVectorizer()

# fit and transform the documents into a sparse document-term matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# get the feature names (get_feature_names() was removed in scikit-learn 1.2)
feature_names = tfidf_vectorizer.get_feature_names_out()

# print the top 5 keywords for each document
for i, row in enumerate(tfidf_matrix.toarray()):
    print("Document {}:".format(i + 1))
    # sort by descending score; the stable sort breaks ties alphabetically
    sorted_indices = np.argsort(-row, kind="stable")
    for j in sorted_indices[:5]:
        print("- {} ({:.2f})".format(feature_names[j], row[j]))
    print()
```
Output:
```
Document 1:
- the (0.46)
- jumps (0.44)
- over (0.44)
- brown (0.28)
- dog (0.28)

Document 2:
- clever (0.49)
- very (0.49)
- is (0.39)
- brown (0.31)
- fox (0.31)

Document 3:
- always (0.52)
- sleeping (0.52)
- is (0.41)
- dog (0.33)
- lazy (0.33)

Document 4:
- the (0.39)
- and (0.37)
- are (0.37)
- friends (0.37)
- good (0.37)
```
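In this example, we use the TfidfVectorizer class from scikit-learn to compute a TF-IDF score for each word in each of the four sample documents, then print the five highest-scoring words per document. By default, scikit-learn applies a smoothed IDF and L2-normalizes each document vector, so the scores differ from a plain TF × IDF product. TfidfVectorizer also does not remove stop words by default, so very common words such as "the" are scored like any other term; passing `stop_words="english"` excludes them.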