Write MATLAB code that builds a VSM model
Posted: 2024-01-10 11:02:02
The following is a simple MATLAB script that builds a text retrieval system based on the vector space model (VSM):
```
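% Assume a corpus of 3 documents
corpus = {'This is the first document.'
          'This is the second second document.'
          'And the third one.'};
numDocs = numel(corpus);

% Tokenizer: lowercase, strip punctuation, split on whitespace,
% so that 'document.' and 'document' map to the same term
tokenize = @(s) strsplit(strtrim(regexprep(lower(s), '[^a-z0-9\s]', '')));

% Build the vocabulary and drop stop words
stopWords = {'is', 'the', 'and', 'one'};
words = tokenize(strjoin(corpus', ' '));
vocab = setdiff(unique(words), stopWords);   % sorted, unique terms

% Build the document-term frequency matrix.
% Stop words are not in vocab, so they are never counted; the tokens are
% not passed through setdiff, which would collapse repeated words to one.
tfMatrix = zeros(numDocs, numel(vocab));
for i = 1:numDocs
    docWords = tokenize(corpus{i});
    for j = 1:numel(vocab)
        tfMatrix(i,j) = sum(strcmp(docWords, vocab{j}));
    end
end

% Compute the tf-idf vector of each document (one row per document)
idfVector = log(numDocs ./ sum(tfMatrix > 0, 1));
tfidfMatrix = tfMatrix .* idfVector;

% Query: tokenized and weighted the same way as the documents
query = 'second document';
queryWords = tokenize(query);
queryVector = zeros(1, numel(vocab));
for j = 1:numel(vocab)
    queryVector(j) = sum(strcmp(queryWords, vocab{j}));
end
queryTfidfVector = queryVector .* idfVector;

% Cosine similarity between the query and every document;
% max(..., eps) avoids division by zero for all-zero vectors
docNorms = vecnorm(tfidfMatrix, 2, 2);
queryNorm = norm(queryTfidfVector);
cosSim = (tfidfMatrix * queryTfidfVector') ./ max(docNorms * queryNorm, eps);

% Rank the documents and show at most the top 5
[sortedSim, sortedIdx] = sort(cosSim, 'descend');
topK = min(5, numDocs);
disp('Top matching documents:');
disp(corpus(sortedIdx(1:topK)));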
```
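For reference, the weighting and ranking steps in the script correspond to the formulas below. The symbols are notation introduced here for illustration, not names from the script: $N$ is the number of documents, $\mathrm{tf}_{ij}$ is the count of term $j$ in document $i$, $\mathrm{df}_j$ is the number of documents containing term $j$, and $q$ is the query vector weighted the same way as the documents.

$$
\mathrm{idf}_j = \ln\frac{N}{\mathrm{df}_j}, \qquad
w_{ij} = \mathrm{tf}_{ij}\,\mathrm{idf}_j, \qquad
\cos(d_i, q) = \frac{\sum_j w_{ij}\, q_j}{\lVert d_i \rVert_2 \,\lVert q \rVert_2}
$$

One consequence of this unsmoothed idf is that a term occurring in every document gets weight $\ln 1 = 0$ and therefore does not contribute to the similarity.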
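The code first defines a corpus of three documents, then builds a vocabulary and a document-term frequency matrix, from which it derives the tf-idf vector space model. The query is a hard-coded string: the code computes the cosine similarity between the query vector and every document vector, then displays the documents in descending order of similarity, up to the top 5.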
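The document-term frequency matrix described above can also be filled without an inner loop over the vocabulary. The following is only a minimal sketch under stated assumptions: `docs` is a hypothetical, already tokenized version of the corpus (lowercased, punctuation and stop words removed), and `tf` and `hits` are illustrative names that do not appear in the original script.

```matlab
% Hypothetical pre-tokenized corpus (assumption: lowercase, no punctuation,
% stop words already removed)
docs = {{'this', 'first', 'document'}, ...
        {'this', 'second', 'second', 'document'}, ...
        {'third'}};

% Vocabulary: sorted unique terms across all documents
vocab = unique([docs{:}]);

% Term-frequency matrix: one row per document, one column per term
tf = zeros(numel(docs), numel(vocab));
for i = 1:numel(docs)
    % idx holds the vocabulary position of each token (0 if not found)
    [found, idx] = ismember(docs{i}, vocab);
    hits = idx(found);
    % accumarray counts how often each vocabulary position occurs
    tf(i, :) = accumarray(hits(:), 1, [numel(vocab) 1])';
end

disp(vocab);
disp(tf);
```

Because the tokens are counted rather than deduplicated, the repeated word `second` in the second document is recorded with a count of 2.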