Mastering Google Robots.txt: New Specification and Usage Guide

Updated 2024-07-16. 104 KB PDF.

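Google's robots.txt documentation describes how site owners control the way search engine crawlers access their sites. A robots.txt file tells crawlers which parts of a site they may fetch, which helps manage server load and supports an SEO strategy. It governs crawling rather than indexing, so on its own it is not a reliable way to keep a page out of search results. On July 1, 2019, Google announced (<https://webmasters.googleblog.com/2019/07/rep-id.html>) that it was working to make the Robots Exclusion Protocol an internet standard (<https://tools.ietf.org/html/draft-koster-rep-00>), and the document was updated to match. The main changes are:

1. **"Requirements Language" section removed**: The section was dropped from the document because its wording is specific to the Internet-Draft.
2. **All URI-based protocols accepted**: robots.txt is no longer tied to HTTP alone; in general it can be served over any URI-based protocol. For Google Search specifically, the relevant protocols are http and https, and Google also follows robots.txt files on FTP sites.
3. **Redirect handling**: Google follows at least five redirect hops. No rules have been fetched at that point, so if no robots.txt file is found after those hops, Google treats the result as a 404 for the robots.txt file.
4. **Logical redirects**: Redirects implemented in HTML content that returns a 2xx status (frames, JavaScript, or meta refresh) are discouraged for robots.txt. Google uses the content of the first page, not the redirect target, to find the applicable rules.
5. **File location**: The change list does not restate it, but the robots.txt file belongs in the top-level directory of the host (for example, www.example.com/robots.txt), where crawlers look for it.

Knowing these changes helps keep a site consistent with Google's current behavior, avoids crawling problems, and makes robots.txt more useful for site management and SEO. Review the file regularly, especially when adding redirects or changing the site structure, so that crawlers keep behaving as intended.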
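The redirect rule can be illustrated with a small Python sketch. It is not Google's implementation: the `Response` type, the `fetch_robots` name and the injected `fetch` callable are hypothetical, and the default limit of five hops is an assumption taken from Google's published robots.txt documentation. Injecting `fetch` keeps the hop-counting logic separate from any real network client.

```python
from typing import Callable, NamedTuple, Optional
from urllib.parse import urljoin


class Response(NamedTuple):
    """Hypothetical minimal response record used only for this sketch."""
    status: int               # HTTP status code
    location: Optional[str]   # Location header for 3xx responses, else None
    body: str                 # response body (robots.txt text for 2xx)


def fetch_robots(url: str,
                 fetch: Callable[[str], Response],
                 max_hops: int = 5) -> Response:
    """Follow at most max_hops redirects, then give up with a synthetic 404."""
    # One initial request plus up to max_hops redirected requests.
    for _ in range(max_hops + 1):
        resp = fetch(url)
        if 300 <= resp.status < 400 and resp.location:
            # Resolve relative Location values against the current URL.
            url = urljoin(url, resp.location)
            continue
        return resp
    # Still being redirected after max_hops hops: treat as "no robots.txt".
    return Response(404, None, "")
```

With a chain of five redirects the final file is returned; with six, the caller sees a 404 and would crawl as if no robots.txt existed.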



… (https://en.wikipedia.org/wiki/Uniform_Resource_Identifier), and for Google Search specifically (for example, crawling of websites) are "http" and "https". On http and https, the robots.txt file is fetched using an HTTP non-conditional GET request.

Google-specic: Google also accepts and follows robots.txt les for FTP sites. FTP-based robots.txt

les are accessed via the FTP protocol, using an anonymous login.
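The two retrieval paths described above (a non-conditional GET over http/https, and an anonymous login over FTP) might look roughly like this in Python using only the standard library. The function names `build_http_request` and `fetch_robots_txt` and the 10-second timeout are assumptions made for illustration, not part of Google's documentation.

```python
import ftplib
import io
import urllib.request
from urllib.parse import urlsplit


def build_http_request(url: str) -> urllib.request.Request:
    """Plain GET with no If-Modified-Since / If-None-Match headers,
    i.e. a non-conditional request."""
    return urllib.request.Request(url, method="GET")


def fetch_robots_txt(url: str, timeout: float = 10) -> bytes:
    """Fetch a robots.txt file over http, https or ftp (illustrative only)."""
    parts = urlsplit(url)
    if parts.scheme in ("http", "https"):
        with urllib.request.urlopen(build_http_request(url), timeout=timeout) as resp:
            return resp.read()
    if parts.scheme == "ftp":
        buf = io.BytesIO()
        with ftplib.FTP(timeout=timeout) as ftp:
            ftp.connect(parts.hostname, parts.port or 21)
            ftp.login()  # no credentials given: anonymous login
            ftp.retrbinary(f"RETR {parts.path}", buf.write)
        return buf.getvalue()
    raise ValueError(f"unsupported scheme for robots.txt: {parts.scheme!r}")
```

Note that `urllib.request.urlopen` follows redirects on its own and raises `HTTPError` for 4xx and 5xx responses, so a real crawler would need its own handling for those cases.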

The directives listed in the robots.txt file apply only to the host, protocol and port number where the file is hosted.

RL for the robots.txt le is - like other URLs - case-sensitive.

Examples of valid robots.txt URLs

Robots.txt URL examples

http://example.com/robots.txt

Valid for:
http://example.com/
http://example.com/folder/file

Not valid for:
http://other.example.com/
https://example.com/
http://example.com:8181/

This is the general case. It is not valid for other subdomains, protocols or port numbers. It is valid for all files in all subdirectories on the same host, protocol and port number.
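The scoping rule in this example can be checked mechanically. The sketch below is a minimal Python illustration, not an official algorithm: the names `origin` and `robots_applies` are hypothetical, and treating a missing port as the scheme's default (80, 443, 21) is an assumption of this sketch.

```python
from urllib.parse import urlsplit

# Assumed default ports, used when a URL does not name a port explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def origin(url: str) -> tuple:
    """Return the (protocol, host, port) triple that scopes a robots.txt file."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return scheme, host, port


def robots_applies(robots_url: str, page_url: str) -> bool:
    """True if the robots.txt at robots_url governs page_url."""
    # The file must sit at the top level, and the path is case-sensitive.
    if urlsplit(robots_url).path != "/robots.txt":
        return False
    return origin(robots_url) == origin(page_url)


robots = "http://example.com/robots.txt"
for page in ("http://example.com/folder/file",
             "https://example.com/",
             "http://example.com:8181/"):
    print(page, robots_applies(robots, page))
```

Only the first URL in the loop shares protocol, host and port with the robots.txt file, so it is the only one the file applies to.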
