How to implement TF-IDF in Java
Date: 2024-01-17 12:03:35
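To implement TF-IDF (Term Frequency-Inverse Document Frequency) in Java, you can follow these steps:
1. Compute the term frequency (TF) of every word in each document: for each document, divide the number of times a word occurs by the total number of words in that document. A HashMap can hold the per-word counts.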
```java
import java.util.HashMap;
import java.util.Map;

// `document` is assumed to hold the words of one document as a List<String>.
Map<String, Integer> wordCount = new HashMap<>();
int totalWords = 0;
// Walk the document and count how often each word occurs
for (String word : document) {
    wordCount.put(word, wordCount.getOrDefault(word, 0) + 1);
    totalWords++;
}
// Turn the raw counts into term frequencies
Map<String, Double> tfMap = new HashMap<>();
for (Map.Entry<String, Integer> entry : wordCount.entrySet()) {
    String word = entry.getKey();
    int count = entry.getValue();
    double tf = (double) count / totalWords;
    tfMap.put(word, tf);
}
```
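The snippet above takes an already tokenized `document` for granted. As a minimal sketch of how that input might be produced and the step wrapped in a method: the class name `TfSketch`, the `tokenize` helper, and its rule of splitting on anything that is not a letter or digit are assumptions made for illustration, not part of the original answer. Text without whitespace between words (Chinese, for example) would need a real word segmenter instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TfSketch {
    // Hypothetical helper: lowercases the text and splits on runs of
    // characters that are neither letters nor digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // TF = occurrences of the word / total number of words in the document.
    // An empty document yields an empty map, so no division by zero occurs.
    static Map<String, Double> termFrequency(List<String> document) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : document) {
            counts.merge(word, 1, Integer::sum);
        }
        Map<String, Double> tf = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            tf.put(e.getKey(), (double) e.getValue() / document.size());
        }
        return tf;
    }

    public static void main(String[] args) {
        List<String> doc = tokenize("The cat sat on the mat");
        // "the" occurs 2 times out of 6 tokens, every other word once.
        System.out.println(termFrequency(doc));
    }
}
```

`Map.merge` is used here in place of the `getOrDefault`/`put` pair; both count the same way.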
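2. Compute the inverse document frequency (IDF): for each word, take the logarithm of the total number of documents divided by the number of documents that contain the word. A HashSet deduplicates the words of each document so that a document is counted at most once per word, and a HashMap stores the resulting document counts. The code below adds 1 to the denominator as a smoothing term.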
```java
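import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// `allDocuments` is assumed to be a List<List<String>>, one word list per document.
Set<String> documentSet = new HashSet<>();
Map<String, Integer> documentCount = new HashMap<>();
// For every word, count the number of documents that contain it
for (List<String> doc : allDocuments) {
    documentSet.clear();
    documentSet.addAll(doc); // deduplicate: each document counts once per word
    for (String word : documentSet) {
        documentCount.put(word, documentCount.getOrDefault(word, 0) + 1);
    }
}
// Compute the inverse document frequency of each word
Map<String, Double> idfMap = new HashMap<>();
int totalDocuments = allDocuments.size();
for (Map.Entry<String, Integer> entry : documentCount.entrySet()) {
    String word = entry.getKey();
    int count = entry.getValue();
    // The +1 is a smoothing term; a word present in every document
    // therefore gets a slightly negative IDF.
    double idf = Math.log((double) totalDocuments / (count + 1));
    idfMap.put(word, idf);
}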
```
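To make the effect of the `+ 1` in the denominator concrete, here is a small sketch that applies the same formula, ln(N / (df + 1)), to a three-document toy corpus. The class name `IdfSketch`, the method name, and the sample documents are made up for illustration. If negative or zero weights for common words are unwanted, dropping the `+ 1` (safe here because every counted word occurs in at least one document) is one possible alternative.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class IdfSketch {
    // IDF = ln(N / (df + 1)), the same smoothed form as in the snippet above.
    static Map<String, Double> inverseDocumentFrequency(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // A fresh set per document counts each word once per document.
            for (String word : new HashSet<>(doc)) {
                df.merge(word, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log((double) docs.size() / (e.getValue() + 1)));
        }
        return idf;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("the", "cat", "sat"),
                Arrays.asList("the", "dog", "ran"),
                Arrays.asList("the", "cat", "ran"));
        Map<String, Double> idf = inverseDocumentFrequency(docs);
        System.out.println("the: " + idf.get("the")); // ln(3/4), negative
        System.out.println("cat: " + idf.get("cat")); // ln(3/3), zero
        System.out.println("dog: " + idf.get("dog")); // ln(3/2), positive
    }
}
```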
3. Compute TF-IDF: multiply each word's TF by its IDF to get its TF-IDF value.
```java
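import java.util.HashMap;
import java.util.Map;

Map<String, Double> tfidfMap = new HashMap<>();
// tfMap holds the TF values of one document, so the result is
// that document's TF-IDF score for each of its words
for (Map.Entry<String, Double> entry : tfMap.entrySet()) {
    String word = entry.getKey();
    double tf = entry.getValue();
    // Words missing from idfMap fall back to an IDF of 0
    double idf = idfMap.getOrDefault(word, 0.0);
    double tfidf = tf * idf;
    tfidfMap.put(word, tfidf);
}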
```
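A common next step is to rank a document's words by score, for example to pick keywords. The following is only a sketch of one way to do that with a map shaped like `tfidfMap`; the class name `TopTermsSketch`, the `topK` method, the alphabetical tie-breaking rule, and the sample scores are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopTermsSketch {
    // Returns up to k words ordered by descending score;
    // words with equal scores are ordered alphabetically.
    static List<String> topK(Map<String, Double> scores, int k) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        entries.sort((a, b) -> {
            int byScore = Double.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up scores standing in for a real tfidfMap.
        Map<String, Double> tfidfMap = new HashMap<>();
        tfidfMap.put("cat", 0.3);
        tfidfMap.put("dog", 0.5);
        tfidfMap.put("the", 0.0);
        System.out.println(topK(tfidfMap, 2)); // prints [dog, cat]
    }
}
```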
The above is a simple example implementation; you can modify and optimize it to fit your specific needs.