Java Implementation of the TF-IDF Algorithm for Property Listing Search
Date: 2023-12-27 17:05:00
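The TF-IDF algorithm can be used for text matching in search engines, where TF stands for "term frequency" and IDF stands for "inverse document frequency". In property listing search, we can compute the TF-IDF weight of every term in each listing and match those weights against the user's query to return the most relevant results.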
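For reference, one common textbook formulation of these quantities is written out below. The symbols $t$ (a term), $d$ (one tokenized listing), $D$ (the set of all listings), and $f_{t,d}$ (the number of times $t$ occurs in $d$) are notation introduced here rather than taken from the text above. The natural logarithm and the absence of smoothing are assumptions, chosen to line up with Java's `Math.log`.

$$
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t,D) = \ln\frac{|D|}{|\{d \in D : t \in d\}|}, \qquad
\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t,D)
$$

Under this definition a term that appears in every listing gets an IDF of $\ln 1 = 0$ and so contributes nothing to a match, while a term found in only one of three listings gets $\ln 3 \approx 1.0986$.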
Below is a simple Java example that computes TF-IDF weights for property listings:
```java
import java.util.*;
public class TFIDF {
    // Count how many times each term occurs in one tokenized document (raw term counts)
    public static Map<String, Integer> getTermFrequency(String[] tokens) {
        Map<String, Integer> freqMap = new HashMap<>();
        for (String token : tokens) {
            // Insert 1 for a new term, otherwise add 1 to the existing count
            freqMap.merge(token, 1, Integer::sum);
        }
        return freqMap;
    }
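    // Compute the inverse document frequency (IDF) of every term: ln(N / df),
    // where N is the number of documents and df the number of documents containing the term
    public static Map<String, Double> getInverseDocumentFrequency(List<String[]> documents) {
        // First pass: count, for each term, how many documents contain it
        Map<String, Integer> docFreq = new HashMap<>();
        for (String[] document : documents) {
            Set<String> uniqueTerms = new HashSet<>(Arrays.asList(document));
            for (String term : uniqueTerms) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        // Second pass: convert the document counts into IDF values
        int numDocuments = documents.size();
        Map<String, Double> idfMap = new HashMap<>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            idfMap.put(entry.getKey(), Math.log((double) numDocuments / entry.getValue()));
        }
        return idfMap;
    }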
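    // Compute the TF-IDF weight of every term in one tokenized document
    public static Map<String, Double> getTFIDF(String[] tokens, Map<String, Double> idfMap) {
        Map<String, Integer> freqMap = getTermFrequency(tokens);
        Map<String, Double> tfidfMap = new HashMap<>();
        for (Map.Entry<String, Integer> entry : freqMap.entrySet()) {
            // Normalize by the total number of tokens, not by the number of distinct terms
            double tf = (double) entry.getValue() / tokens.length;
            // A term absent from the corpus gets IDF 0 instead of triggering a NullPointerException
            double idf = idfMap.getOrDefault(entry.getKey(), 0.0);
            tfidfMap.put(entry.getKey(), tf * idf);
        }
        return tfidfMap;
    }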
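    // Example usage
    public static void main(String[] args) {
        // Three tokenized listings (city, district, property type)
        String[] document1 = {"北京", "朝阳", "公寓"}; // Beijing, Chaoyang, apartment
        String[] document2 = {"上海", "徐汇", "别墅"}; // Shanghai, Xuhui, villa
        String[] document3 = {"广州", "天河", "公寓"}; // Guangzhou, Tianhe, apartment
        List<String[]> documents = Arrays.asList(document1, document2, document3);
        // Compute the IDF once over the whole collection
        Map<String, Double> idfMap = getInverseDocumentFrequency(documents);
        // Compute the TF-IDF weights of each listing
        Map<String, Double> tfidf1 = getTFIDF(document1, idfMap);
        Map<String, Double> tfidf2 = getTFIDF(document2, idfMap);
        Map<String, Double> tfidf3 = getTFIDF(document3, idfMap);
        // Print the results
        System.out.println("TF-IDF weights for document1: " + tfidf1);
        System.out.println("TF-IDF weights for document2: " + tfidf2);
        System.out.println("TF-IDF weights for document3: " + tfidf3);
    }
}
```
The output looks like the following. The weights are rounded to four decimal places here, whereas the program prints full `double` precision, and because `HashMap` does not guarantee iteration order the terms may appear in a different order:
```
TF-IDF weights for document1: {北京=0.3662, 朝阳=0.3662, 公寓=0.1352}
TF-IDF weights for document2: {上海=0.3662, 徐汇=0.3662, 别墅=0.3662}
TF-IDF weights for document3: {广州=0.3662, 天河=0.3662, 公寓=0.1352}
```
Terms that occur in only one of the three listings get (1/3) × ln(3/1) ≈ 0.3662, while "公寓", which occurs in two listings, gets (1/3) × ln(3/2) ≈ 0.1352.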
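In this example we computed the TF-IDF weights of three property listings and printed the results. Note that each weight map covers the terms of a single listing only: TF is computed per listing, while IDF is computed over the whole collection. To weight terms across several listings as one unit, merge their tokens into a single document and then run the computation again.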
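The introduction mentions matching listings against a user query, a step the example stops short of. A minimal sketch of that step follows, assuming the query has already been tokenized the same way as the listings and that a listing's score is simply the sum of the TF-IDF weights of the query terms it contains. The class `ListingSearch` and its methods `computeIdf`, `score`, and `rank` are hypothetical names introduced for illustration, and summing weights is only one of several possible scoring choices (cosine similarity between weight vectors is another).

```java
import java.util.*;

// Hypothetical helper class: names and scoring rule are illustrative assumptions.
public class ListingSearch {

    // IDF = ln(N / df), computed once over all tokenized listings
    public static Map<String, Double> computeIdf(List<String[]> docs) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String[] doc : docs) {
            for (String term : new HashSet<>(Arrays.asList(doc))) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> idfMap = new HashMap<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            idfMap.put(e.getKey(), Math.log((double) docs.size() / e.getValue()));
        }
        return idfMap;
    }

    // Score of one listing: sum of the TF-IDF weights of the distinct query terms
    public static double score(String[] query, String[] doc, Map<String, Double> idfMap) {
        if (doc.length == 0) {
            return 0.0; // an empty listing cannot match anything
        }
        double total = 0.0;
        for (String term : new HashSet<>(Arrays.asList(query))) {
            long count = Arrays.stream(doc).filter(term::equals).count();
            total += ((double) count / doc.length) * idfMap.getOrDefault(term, 0.0);
        }
        return total;
    }

    // Indexes of the listings with a positive score, best match first
    public static List<Integer> rank(String[] query, List<String[]> docs) {
        Map<String, Double> idfMap = computeIdf(docs);
        double[] scores = new double[docs.size()];
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            scores[i] = score(query, docs.get(i), idfMap);
            if (scores[i] > 0) {
                hits.add(i);
            }
        }
        hits.sort((a, b) -> Double.compare(scores[b], scores[a]));
        return hits;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
            new String[] {"北京", "朝阳", "公寓"},
            new String[] {"上海", "徐汇", "别墅"},
            new String[] {"广州", "天河", "公寓"});
        String[] query = {"朝阳", "公寓"}; // Chaoyang, apartment
        System.out.println(rank(query, docs));
    }
}
```

For the query "朝阳 公寓", the first listing matches both terms and ranks first, the third listing matches only "公寓" and ranks second, and the second listing matches nothing and is left out. One consequence of this scoring rule is that a term present in every listing has an IDF of 0, so a query made up only of such terms returns no hits.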