How to Deduplicate Crawled Documents: A Worked Example
Suppose we want to crawl news articles from a website. How do we avoid fetching the same article twice? The example below uses a database to track which URLs have already been crawled:
1. Create a database table to store the URLs of articles that have already been crawled. (A caveat on the `url` column follows the schema.)
```sql
CREATE TABLE `crawled_urls` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `url` varchar(255) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `url` (`url`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
```
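One practical caveat: `url` is declared `varchar(255)`, but real URLs can be longer, and a long `utf8mb4` column can also run into MySQL's index-length limit on older server versions. A common workaround is to store and uniquely index a fixed-length hash of the URL instead, e.g. a 40-character SHA-1 hex digest. The sketch below shows only the helper; the corresponding schema change (a `url_hash` `CHAR(40)` column carrying the unique key) is assumed rather than shown:
```python
import hashlib

def get_url_hash(url):
    # A SHA-1 hex digest is always 40 characters, so it fits a fixed
    # CHAR(40) column and keeps the unique index small no matter how
    # long the original URL is.
    return hashlib.sha1(url.encode('utf-8')).hexdigest()
```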
2. In the crawling code, check whether the URL already exists in the database before fetching it. (A race-free variant is sketched after the code.)
```python
import requests
import mysql.connector

def is_url_crawled(url, conn):
    # Check whether this URL has already been recorded.
    cursor = conn.cursor()
    query = "SELECT COUNT(*) FROM crawled_urls WHERE url=%s"
    cursor.execute(query, (url,))
    result = cursor.fetchone()
    cursor.close()
    return result[0] > 0

def mark_url_crawled(url, conn):
    # Record the URL so future runs will skip it.
    cursor = conn.cursor()
    query = "INSERT INTO crawled_urls (url) VALUES (%s)"
    cursor.execute(query, (url,))
    conn.commit()
    cursor.close()

def crawl_article(url, conn):
    if is_url_crawled(url, conn):
        print(f"Skipping crawled URL: {url}")
        return
    # Fetch the article
    response = requests.get(url)
    response.raise_for_status()
    content = response.text
    # Process the article content
    # ...
    # Mark the URL as crawled only after processing succeeds
    mark_url_crawled(url, conn)
    print(f"Crawled URL: {url}")
```
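Note that `is_url_crawled` followed by `mark_url_crawled` leaves a small race window if several workers crawl concurrently: two of them can both pass the check before either inserts. Because the table already has a `UNIQUE KEY` on `url`, MySQL's `INSERT IGNORE` can collapse the check and the insert into one atomic step. A sketch of that variant (`try_mark_url` is an illustrative name, not part of the example above):
```python
def try_mark_url(url, conn):
    # INSERT IGNORE relies on the UNIQUE KEY: a duplicate insert is
    # silently skipped and reports 0 affected rows, so rowcount tells
    # us atomically whether this URL is new.
    cursor = conn.cursor()
    cursor.execute("INSERT IGNORE INTO crawled_urls (url) VALUES (%s)", (url,))
    conn.commit()
    is_new = cursor.rowcount == 1
    cursor.close()
    return is_new
```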
3. In the main function, open a database connection and call `crawl_article` for each article URL collected from the listing page. (A batching optimization follows the code.)
```python
import re

def extract_article_urls(html):
    # Placeholder extractor: a real crawler would use an HTML parser
    # and site-specific selectors; the pattern here is illustrative.
    return re.findall(r'href="(http://example\.com/[^"]+)"', html)

def main():
    conn = mysql.connector.connect(
        host="localhost",
        user="root",
        password="password",
        database="mydatabase"
    )
    # Fetch the listing page and extract the article URLs
    article_list_url = "http://example.com/article_list"
    response = requests.get(article_list_url)
    article_urls = extract_article_urls(response.text)
    # Crawl each article; already-seen URLs are skipped
    for url in article_urls:
        crawl_article(url, conn)
    conn.close()

if __name__ == "__main__":
    main()
```
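For large crawls, one database round-trip per URL adds up. A common optimization, sketched below under the same schema, is to load the already-crawled URLs into an in-memory set once at startup and check against that instead (`load_crawled_urls` is an illustrative helper name):
```python
def load_crawled_urls(conn):
    # One query up front replaces a per-URL SELECT with a set lookup.
    cursor = conn.cursor()
    cursor.execute("SELECT url FROM crawled_urls")
    urls = {row[0] for row in cursor.fetchall()}
    cursor.close()
    return urls

# Usage inside main(), before the crawl loop:
#   crawled = load_crawled_urls(conn)
#   for url in article_urls:
#       if url not in crawled:
#           crawl_article(url, conn)
#           crawled.add(url)
```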
With this approach, each article is fetched at most once across runs, which avoids wasted requests and improves the crawler's efficiency.