Write Java code that deduplicates texts by similarity
Date: 2024-05-09 20:20:32
The following Java code removes near-duplicate texts based on text similarity:
```java
import java.util.*;

public class TextSimilarity {
    // Texts whose similarity exceeds this value are treated as duplicates
    private static final double THRESHOLD = 0.8;

    // Compute the Jaccard similarity of the two texts' word sets:
    // |intersection| / |union|. Words are lowercased and split on whitespace,
    // so punctuation stays attached to the word it follows ("text." != "text").
    public static double calculateSimilarity(String text1, String text2) {
        Set<String> set1 = new HashSet<>(Arrays.asList(text1.toLowerCase().split("\\s+")));
        Set<String> set2 = new HashSet<>(Arrays.asList(text2.toLowerCase().split("\\s+")));
        Set<String> unionSet = new HashSet<>(set1);
        unionSet.addAll(set2);
        // |A ∩ B| = |A| + |B| - |A ∪ B|
        int intersectionCount = set1.size() + set2.size() - unionSet.size();
        return (double) intersectionCount / unionSet.size();
    }

    // Remove duplicates: a text is dropped when any LATER text is similar to it,
    // so the last member of each group of near-duplicates is the one kept.
    // This compares every pair, so it runs in O(n^2) similarity computations.
    public static List<String> removeDuplicates(List<String> texts) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < texts.size(); i++) {
            String text1 = texts.get(i);
            boolean isDuplicate = false;
            for (int j = i + 1; j < texts.size(); j++) {
                String text2 = texts.get(j);
                if (calculateSimilarity(text1, text2) > THRESHOLD) {
                    isDuplicate = true;
                    break;
                }
            }
            if (!isDuplicate) {
                result.add(text1);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> texts = new ArrayList<>(Arrays.asList(
                "This is a sample text.",
                "This is a similar text.",
                "This is a different text.",
                "Another sample text.",
                "This is a text with a few words repeated."
        ));
        List<String> result = removeDuplicates(texts);
        System.out.println(result);
    }
}
```
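This code decides whether two texts are similar by computing the Jaccard similarity of their word sets (the number of shared words divided by the number of distinct words across both texts). During deduplication, any pair scoring above 0.8 is treated as duplicates, and the earlier text of the pair is dropped. In the example above, "This is a sample text." and "This is a similar text." share 4 of their 6 distinct words, for a similarity of about 0.67. That is below the 0.8 threshold, so neither is removed, and no other pair in the sample scores higher, so all five texts appear in the output. Lowering the threshold (for example to 0.6) would make the code treat these two as duplicates and drop the former.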
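To make the similarity computation concrete, here is a minimal, self-contained sketch that computes the word-set (Jaccard) similarity of the first two sample sentences. The class name `JaccardDemo` and the helpers `tokens` and `jaccard` are hypothetical names introduced for illustration. It assumes the same tokenization as the answer: lowercase, split on whitespace, punctuation left attached.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardDemo {
    // Lowercase and split on whitespace; punctuation stays attached to words.
    public static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\s+")));
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B|.
    public static double jaccard(String a, String b) {
        Set<String> intersection = tokens(a);
        intersection.retainAll(tokens(b));   // keep only the shared words
        Set<String> union = tokens(a);
        union.addAll(tokens(b));             // all distinct words in either text
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        double s = jaccard("This is a sample text.", "This is a similar text.");
        // Shared: this, is, a, text.  Distinct overall: those four plus sample, similar.
        System.out.println(s);        // about 0.667 (4 / 6)
        System.out.println(s > 0.8);  // false: not a duplicate at threshold 0.8
        System.out.println(s > 0.6);  // true: would be a duplicate at threshold 0.6
    }
}
```

Because a one-word difference in a five-word sentence already drops the score to about 0.67, a threshold of 0.8 on short texts flags only pairs that are nearly identical. A suitable threshold depends on typical text length and is best tuned on real data.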