Write an inverted index in Python
Time: 2023-12-17 13:49:32
Here is a simple example of an inverted index:
```python
import re
# Sample documents
doc1 = "Hello world, this is a sample document."
doc2 = "Python is a popular programming language."
doc3 = "This is another document, it is also a sample."
# Inverted index: word -> set of document indices
inverted_index = {}
# Process each document; i is its zero-based index
for i, doc in enumerate([doc1, doc2, doc3]):
    # Strip punctuation, then lowercase and split on whitespace
    doc = re.sub(r'[^\w\s]', '', doc)
    doc = doc.lower().split()
    # Record the document index under each word
    for word in doc:
        if word not in inverted_index:
            inverted_index[word] = {i}
        else:
            inverted_index[word].add(i)
# Print the inverted index
print(inverted_index)
```
Output:
```
{'hello': {0}, 'world': {0}, 'this': {0, 2}, 'is': {0, 1, 2}, 'a': {0, 1, 2}, 'sample': {0, 2}, 'document': {0, 2}, 'python': {1}, 'popular': {1}, 'programming': {1}, 'language': {1}, 'another': {2}, 'it': {2}, 'also': {2}}
```
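The keys of the inverted index dictionary are words, and each value is the set of document numbers in which that word appears. Document numbers are the zero-based indices produced by `enumerate`, so `doc1` is document 0. For example, `inverted_index['python']` is `{1}`, meaning the word "python" appears in document 1, that is, in `doc2`.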
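Since each value is a set of document numbers, a multi-word lookup can be answered by intersecting those sets. The following is a minimal sketch of that idea, not part of the original answer: the helper names `build_inverted_index` and `search_all` are hypothetical, and it assumes AND semantics (a document matches only if it contains every query word) with the same tokenization as the code above.

```python
import re


def build_inverted_index(docs):
    """Map each lowercase word to the set of zero-based document indices containing it."""
    # Hypothetical helper: wraps the same loop as the example above in a function.
    index = {}
    for doc_id, text in enumerate(docs):
        # Strip punctuation, lowercase, and split on whitespace
        for word in re.sub(r"[^\w\s]", "", text).lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def search_all(index, query):
    """Return the sorted indices of documents that contain every word in the query."""
    # Hypothetical helper: assumes AND semantics across query words.
    words = re.sub(r"[^\w\s]", "", query).lower().split()
    if not words:
        return []
    # A word missing from the index contributes an empty set, so nothing matches
    postings = [index.get(word, set()) for word in words]
    return sorted(set.intersection(*postings))


docs = [
    "Hello world, this is a sample document.",
    "Python is a popular programming language.",
    "This is another document, it is also a sample.",
]
index = build_inverted_index(docs)
print(search_all(index, "sample document"))  # [0, 2]
print(search_all(index, "python language"))  # [1]
print(search_all(index, "python sample"))    # []
```

Storing sets rather than lists is what makes this cheap: the intersection only visits the posting sets of the query words and never rescans the documents themselves.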