AG News Dataset: Over One Million News Articles and a Rich Corpus

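Resource summary:

AG News is a dataset of news articles. The underlying AG corpus contains more than 1 million articles gathered from more than 2,000 news sources. The topic classification dataset built from it uses only the title and description fields and covers four news categories, each with 30,000 training samples and 1,900 test samples (120,000 training and 7,600 test samples in total). The corpus was collected by ComeToMyHead, an academic news search engine that has been running since July 2004, and the classification subset was constructed by Xiang Zhang. The corpus is associated with two papers: "Ranking a stream of news" (in Proceedings of the 14th International World Wide Web Conference) and "The anatomy of a news search engine".

Some background helps before looking at the data in detail. The full corpus of more than 1 million articles is a large base for research and development in news classification, text mining, and natural language processing. Because the articles come from many news sources, the samples vary in origin and style, which makes the data useful for training models and for testing how well they generalize across sources.

The dataset targets text classification, a basic task in machine learning and natural language processing that assigns a text to one or more categories. For news, text classification underpins applications such as news recommendation, sentiment analysis, and news aggregation. The goal here is to decide which of four categories an article belongs to; these are usually given as World, Sports, Business, and Science/Technology. Classification is a supervised learning task and needs a large amount of labeled data to train a model.

The resource is tagged simply as "dataset": it ships raw text with class labels and has no engineered features or complex data structure. Users therefore usually need to preprocess the text before training a model, for example by stemming, removing stop words, and vectorizing.

The dataset is one of the commonly used resources in the machine learning and natural language processing community, especially for text classification. Researchers and developers use it to train and evaluate algorithms that classify news text, and its size and diversity make it a useful tool for studying how to handle large-scale real-world text.

The corpus also relates to information retrieval, in particular ranking and search, which the associated papers address. Judging by their titles, "Ranking a stream of news" presumably deals with ranking a stream of news articles, and "The anatomy of a news search engine" presumably describes how a news search engine works. They offer background on how news data is processed and are a useful reference for related research.

Users should follow the applicable copyright and usage terms. The README states that the data is provided for research purposes and other non-commercial activity, so publishing results or building commercial applications calls for particular care. Although the four classes are balanced by construction, the articles may still be skewed toward particular sources, periods, or writing styles. Keep the representativeness of the data in mind during analysis and training so that a model does not carry dataset-specific bias into predictions on new, unseen data.

In summary, AG News is a widely used news classification dataset and a valuable experimental resource for machine learning and natural language processing research. Working with it supports the development of better text classification algorithms and news-related systems, and it is also relevant to understanding news search engines and information retrieval.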
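The preprocessing mentioned above (stop-word removal and vectorization) can be sketched in plain Python as follows. This is a minimal illustration, not the method used by the dataset authors: the stop-word list, the function names `tokenize`, `build_vocabulary`, and `vectorize`, and the two sample headlines are all assumptions made for the example, and stemming is left out to keep it short.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "for", "is"}


def tokenize(text):
    """Lowercase the text, keep runs of ASCII letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


def build_vocabulary(documents):
    """Map each distinct token to a column index, in alphabetical order."""
    vocab = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: index for index, tok in enumerate(vocab)}


def vectorize(text, vocabulary):
    """Return a bag-of-words count vector for one document."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical headlines, invented for illustration; not taken from the dataset.
docs = [
    "Stocks rise on the news of a merger",
    "The team wins the final",
]
vocabulary = build_vocabulary(docs)
vectors = [vectorize(doc, vocabulary) for doc in docs]
```

Each vector has one column per vocabulary word, so it can be fed to any classifier that accepts fixed-length numeric input. In practice a library vectorizer with TF-IDF weighting is a common replacement for this hand-rolled version.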
496,835 news articles from more than 2,000 news sources in the 4 largest categories of the AG news corpus. The dataset uses only the title and description fields. Each category has 30,000 training samples and 1,900 test samples.

README:

AG's News Topic Classification Dataset

Version 3, Updated 09/09/2015

ORIGIN

AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), xml, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .

The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

DESCRIPTION

The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600.

The file classes.txt contains a list of classes corresponding to each label.

The files train.csv and test.csv contain the training and testing samples as comma-separated values. There are 3 columns in them, corresponding to class index (1 to 4), title and description. The title and description are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed with an "n" character, that is "\n".
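The file format described in the README can be read with Python's standard `csv` module, which already treats a doubled double quote inside a quoted field as one literal quote. The sketch below is a hedged illustration rather than an official loader: the function name `parse_ag_news` and the sample rows are invented for the example, and it assumes the class index may itself be wrapped in quotes.

```python
import csv
import io


def parse_ag_news(stream):
    """Yield (class_index, title, description) tuples from an AG News CSV stream."""
    reader = csv.reader(stream, delimiter=",", quotechar='"', doublequote=True)
    for row in reader:
        if not row:
            continue  # skip blank lines
        label, title, description = row
        # The README says new lines are stored as a backslash followed by "n",
        # so turn that two-character sequence back into a real newline.
        title = title.replace("\\n", "\n")
        description = description.replace("\\n", "\n")
        yield int(label), title, description


# Hypothetical rows, invented for illustration; not taken from the dataset.
SAMPLE = (
    '"3","Example ""quoted"" title","First line\\nsecond line"\n'
    '"1","Another title","Plain description"\n'
)

rows = list(parse_ag_news(io.StringIO(SAMPLE)))
```

To read the real files, pass an open file object instead of the in-memory sample, for example `open("train.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption, since the README does not state one. The class index can then be mapped to a name by reading classes.txt line by line, with index 1 corresponding to the first line.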