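Using the TF-IDF model, build an index (such as an inverted index) over the texts in a text collection; the user may enter one or more keywords; for each returned result text, the occurrence frequency of every search keyword in that text must also be displayed; (4) the system supports viewing the returned result texts. 2. Extended requirements: (1) support dynamic loading and processing of the text collection, and support managing and maintaining stop words. Stop words are words with no retrieval value, such as is, am, are, a, an, the. Implement the complete code in C++.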

Posted: 2024-01-21 22:16:33

Below is a C++ implementation of a TF-IDF based text retrieval system. It loads the text collection from a directory at run time, filters stop words, builds an inverted index, and reports per-keyword frequencies for each result:

```cpp
#include <algorithm>
#include <cctype>
#include <cmath>
#include <cstdlib>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using namespace std;

// Per-document statistics.
struct TextInfo {
    string filename;             // file name
    int word_count = 0;          // total number of indexed words
    map<string, int> word_freq;  // word -> occurrences in this document
};

// Posting information for one word.
struct IndexInfo {
    int doc_count = 0;           // number of documents that contain the word
    map<string, int> doc_freq;   // file name -> occurrences of the word in that file
};

// Global state.
map<string, IndexInfo> inverted_index;  // inverted index
map<string, TextInfo> documents;        // file name -> document statistics
set<string> stop_words;                 // stop words (stored in lowercase)

// Read a whole file into a string.
string read_file(const string& filename) {
    ifstream ifs(filename);
    if (!ifs.is_open()) {
        cerr << "Error: cannot open file " << filename << endl;
        exit(1);
    }
    stringstream ss;
    ss << ifs.rdbuf();
    return ss.str();
}

// Split text into lowercase words and drop stop words.
// ASCII whitespace and punctuation act as separators, so "Cat," and "cat" match.
vector<string> tokenize(const string& text) {
    vector<string> words;
    string word;
    auto flush = [&]() {
        if (!word.empty() && stop_words.count(word) == 0) {
            words.push_back(word);
        }
        word.clear();
    };
    for (unsigned char c : text) {
        if (isspace(c) || ispunct(c)) {
            flush();
        } else {
            word += static_cast<char>(tolower(c));
        }
    }
    flush();
    return words;
}

// TF-IDF weight of a word in one document:
// tf = occurrences / words in the document, idf = log(N / documents containing the word).
double tf_idf(const string& word, const TextInfo& text_info) {
    auto it = text_info.word_freq.find(word);
    if (it == text_info.word_freq.end() || text_info.word_count == 0) {
        return 0.0;
    }
    double tf = static_cast<double>(it->second) / text_info.word_count;
    double idf = log(static_cast<double>(documents.size()) / inverted_index.at(word).doc_count);
    return tf * idf;
}

// Search: return matching files ranked by summed TF-IDF score,
// each with the frequency of every query keyword in that file.
vector<pair<string, map<string, int>>> search(const vector<string>& keywords) {
    map<string, double> scores;  // file name -> score
    for (const auto& keyword : keywords) {
        auto it = inverted_index.find(keyword);
        if (it == inverted_index.end()) {
            continue;  // keyword does not occur in any document
        }
        for (const auto& p : it->second.doc_freq) {
            scores[p.first] += tf_idf(keyword, documents.at(p.first));
        }
    }

    // Sort by score, highest first.
    vector<pair<string, double>> ranked(scores.begin(), scores.end());
    stable_sort(ranked.begin(), ranked.end(),
                [](const auto& a, const auto& b) { return a.second > b.second; });

    vector<pair<string, map<string, int>>> results;
    for (const auto& p : ranked) {
        const TextInfo& info = documents.at(p.first);
        map<string, int> keyword_freq;
        for (const auto& keyword : keywords) {
            auto it = info.word_freq.find(keyword);
            keyword_freq[keyword] = (it == info.word_freq.end()) ? 0 : it->second;
        }
        results.push_back({ p.first, keyword_freq });
    }
    return results;
}

int main() {
    // Load stop words.
    ifstream ifs_stop_words("stop_words.txt");
    if (!ifs_stop_words.is_open()) {
        cerr << "Error: cannot open file stop_words.txt" << endl;
        exit(1);
    }
    string stop_word;
    while (ifs_stop_words >> stop_word) {
        transform(stop_word.begin(), stop_word.end(), stop_word.begin(),
                  [](unsigned char c) { return static_cast<char>(tolower(c)); });
        stop_words.insert(stop_word);
    }

    // Load and index the text collection.
    string dir_name;
    cout << "Please input the directory name of text files: ";
    getline(cin, dir_name);
    for (const auto& entry : filesystem::directory_iterator(dir_name)) {
        if (!entry.is_regular_file()) {
            continue;
        }
        string filename = entry.path().string();
        vector<string> words = tokenize(read_file(filename));
        TextInfo text_info{ filename, static_cast<int>(words.size()), {} };
        for (const auto& word : words) {
            ++text_info.word_freq[word];
        }
        // Update the index once per distinct word, so doc_count counts documents.
        for (const auto& p : text_info.word_freq) {
            IndexInfo& info = inverted_index[p.first];
            ++info.doc_count;
            info.doc_freq[filename] = p.second;
        }
        documents[filename] = std::move(text_info);
    }

    // Query loop.
    string query;
    while (true) {
        cout << "Please input the query keywords (separated by spaces), or enter 'exit' to exit: ";
        if (!getline(cin, query) || query == "exit") {
            break;
        }
        vector<string> keywords = tokenize(query);
        vector<pair<string, map<string, int>>> results = search(keywords);
        if (results.empty()) {
            cout << "No matching documents." << endl;
        }
        for (const auto& result : results) {
            cout << "Filename: " << result.first << endl;
            for (const auto& keyword_freq : result.second) {
                cout << "Keyword: " << keyword_freq.first
                     << ", Frequency: " << keyword_freq.second << endl;
            }
            cout << endl;
        }
    }
    return 0;
}
```

Before running the program, create a text file named `stop_words.txt` in the program's working directory and write the stop-word list into it, one stop word per line. Put the text files to be searched into a single directory and enter that directory's name when prompted. The program scans the files in that directory and builds the inverted index at startup. At query time, the user enters one or more keywords; the program returns the text files that contain at least one of them, ranked by TF-IDF score, together with the number of times each keyword occurs in each file. The code uses `std::filesystem`, so it needs a C++17 compiler.
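The requirements also ask for stop words to be managed and maintained, while the program above only loads them from `stop_words.txt`. One possible way to cover that is sketched below, assuming the same one-word-per-line file format. The class name `StopWordStore` and its methods are hypothetical helpers introduced for illustration; they are not part of the original answer.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <fstream>
#include <set>
#include <string>

// Hypothetical helper (an assumption, not from the original program):
// an editable stop-word set that can be loaded from and saved to a
// one-word-per-line text file. Words are stored in lowercase.
class StopWordStore {
public:
    // Load words from `path`; returns false if the file cannot be opened.
    bool load(const std::string& path) {
        std::ifstream in(path);
        if (!in.is_open()) return false;
        std::string word;
        while (in >> word) words_.insert(normalize(word));
        return true;
    }

    // Write the current set to `path`, one word per line.
    bool save(const std::string& path) const {
        std::ofstream out(path);
        if (!out.is_open()) return false;
        for (const auto& w : words_) out << w << '\n';
        return static_cast<bool>(out);
    }

    // Add a word; returns true if it was not already present.
    bool add(const std::string& word) {
        std::string w = normalize(word);
        if (w.empty()) return false;
        return words_.insert(w).second;
    }

    // Remove a word; returns true if it was present.
    bool remove(const std::string& word) {
        return words_.erase(normalize(word)) > 0;
    }

    // Case-insensitive membership test.
    bool contains(const std::string& word) const {
        return words_.count(normalize(word)) > 0;
    }

    std::size_t size() const { return words_.size(); }

private:
    // Lowercase a copy of the word.
    static std::string normalize(std::string word) {
        std::transform(word.begin(), word.end(), word.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        return word;
    }

    std::set<std::string> words_;
};
```

A caller would typically `load` the file at startup, call `add` or `remove` in response to user commands, and `save` afterwards. Because stop words are filtered at tokenization time, the inverted index would need to be rebuilt after the set changes for the edit to affect search results.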
