Use Java code to analyze text similarity with cosine similarity
Time: 2024-03-31 13:21:19
The following is an example of computing cosine similarity in Java:
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class CosineSimilarity {

    public static void main(String[] args) {
        String text1 = "This is a sample text.";
        String text2 = "This is another sample text.";
        double similarity = calculateCosineSimilarity(text1, text2);
        System.out.println("The cosine similarity between the two texts is: " + similarity);
    }

    // Cosine similarity = dot product / (magnitude1 * magnitude2), over word-frequency vectors.
    public static double calculateCosineSimilarity(String text1, String text2) {
        Map<String, Integer> wordFrequency1 = getWordFrequency(text1);
        Map<String, Integer> wordFrequency2 = getWordFrequency(text2);
        double dotProduct = calculateDotProduct(wordFrequency1, wordFrequency2);
        double magnitude1 = calculateMagnitude(wordFrequency1);
        double magnitude2 = calculateMagnitude(wordFrequency2);
        // A text without any words is a zero vector; return 0 rather than dividing by zero.
        if (magnitude1 == 0 || magnitude2 == 0) {
            return 0.0;
        }
        return dotProduct / (magnitude1 * magnitude2);
    }

    // Counts how often each lower-cased, whitespace-separated word occurs.
    // Punctuation is not stripped, so "text." and "text" count as different words.
    private static Map<String, Integer> getWordFrequency(String text) {
        Map<String, Integer> wordFrequency = new HashMap<>();
        String[] words = text.split("\\s+");
        for (String word : words) {
            // split() yields an empty token for empty input or leading whitespace; skip it.
            if (word.isEmpty()) {
                continue;
            }
            wordFrequency.merge(word.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return wordFrequency;
    }

    // Sums the products of the counts of every word that occurs in both texts.
    private static double calculateDotProduct(Map<String, Integer> wordFrequency1, Map<String, Integer> wordFrequency2) {
        double dotProduct = 0;
        for (Map.Entry<String, Integer> entry : wordFrequency1.entrySet()) {
            Integer otherFrequency = wordFrequency2.get(entry.getKey());
            if (otherFrequency != null) {
                dotProduct += (double) entry.getValue() * otherFrequency;
            }
        }
        return dotProduct;
    }

    // Euclidean length of the word-frequency vector.
    private static double calculateMagnitude(Map<String, Integer> wordFrequency) {
        double magnitude = 0;
        for (int frequency : wordFrequency.values()) {
            magnitude += (double) frequency * frequency;
        }
        return Math.sqrt(magnitude);
    }
}
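In the code above, we first define two text strings. We then use the calculateCosineSimilarity method to compute the cosine similarity between them. The method takes two strings as parameters and returns their cosine similarity.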
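For reference, the quantity the method returns can be written out as follows. The symbols are notation introduced here for illustration, not taken from the code: $a_w$ and $b_w$ stand for the number of times word $w$ occurs in the first and second text. The second line works through the two sample strings, which share four whitespace-separated tokens ("this", "is", "sample", "text.") out of five distinct tokens each.

```latex
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{w} a_w b_w}{\sqrt{\sum_{w} a_w^{2}} \, \sqrt{\sum_{w} b_w^{2}}}

\cos\theta_{\text{sample}}
  = \frac{1 + 1 + 1 + 1}{\sqrt{5} \cdot \sqrt{5}}
  = \frac{4}{5}
  = 0.8
```

Because word counts are never negative, the value always lies between 0 (no shared words) and 1 (identical word-frequency proportions). The printed result may differ from 0.8 in the last decimal places because of floating-point rounding.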
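Inside calculateCosineSimilarity, we first call getWordFrequency to count how many times each word occurs in each text. Next, calculateDotProduct sums, over every word that appears in both texts, the product of its two counts. Then calculateMagnitude computes the length (Euclidean norm) of each text's word-frequency vector. Finally, we substitute these values into the cosine similarity formula, dividing the dot product by the product of the two magnitudes, and return the result.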
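The steps above split on whitespace only, so a trailing period or comma becomes part of the word. As a rough sketch of one possible variant (not the original answer's method), the class below splits on any run of non-letter, non-digit characters and returns 0 when either text has no words. The names PunctuationAwareCosine, wordCounts, and similarity are hypothetical, chosen only for this illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical variant for illustration; names are not from the original answer.
final class PunctuationAwareCosine {

    // Lower-cases the text and splits on runs of characters that are neither
    // letters nor digits, so "text." and "text" are counted as the same word.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) { // split() can yield an empty leading token
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Cosine similarity of the two word-count vectors; 0 if either text has no words.
    static double similarity(String text1, String text2) {
        Map<String, Integer> a = wordCounts(text1);
        Map<String, Integer> b = wordCounts(text2);
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            // Words missing from the second text contribute 0 to the dot product.
            dot += (double) e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        for (int count : b.values()) {
            normB += (double) count * count;
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // assumption: an empty text is treated as "not similar"
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // The second text has no trailing period, yet "text" still matches.
        System.out.println(similarity("This is a sample text.", "This is another sample text"));
        System.out.println(similarity("Hello, world!", "hello world"));
        System.out.println(similarity("", "anything"));
    }
}
```

Whether punctuation should be discarded depends on the texts being compared. For source code or identifiers, for instance, whitespace splitting may be the better choice.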