Scraping Douban group discussions with Python
Time: 2023-09-23 22:12:35 Views: 210
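To scrape Douban group discussions with Python, you can use the Scrapy framework. A good practice project is to crawl a Douban group's discussion list, save the extracted fields to a MongoDB database, and download the images to local disk; together these cover the main steps of collecting information from the web with Scrapy.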
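The MongoDB step mentioned above is usually handled in a Scrapy item pipeline. The following is only a minimal sketch under stated assumptions: the class name `MongoDiscussionPipeline`, the connection URI, the database and collection names, and the choice of `title` as the deduplication key are all hypothetical and not part of the original answer.

```python
class MongoDiscussionPipeline:
    """Upsert each scraped discussion into a MongoDB collection (sketch)."""

    def __init__(self, mongo_uri="mongodb://localhost:27017",
                 db_name="douban", collection_name="discussions"):
        # All three defaults are assumptions; adjust them to your setup.
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # Imported here so the module still loads where pymongo is missing.
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        doc = dict(item)
        # Upsert keyed on the title so re-crawling does not add duplicates.
        self.collection.update_one(
            {"title": doc.get("title")}, {"$set": doc}, upsert=True
        )
        return item
```

A pipeline like this is switched on through the `ITEM_PIPELINES` setting of the Scrapy project. For the image download step, Scrapy ships an `ImagesPipeline` that reads URLs from an `image_urls` field and writes files under the `IMAGES_STORE` directory, so a custom downloader is often unnecessary.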
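Below is sample code intended to crawl 1,000 entries from the discussion list of a Douban rental group and keep only those that contain a given keyword. Assume it is saved in a file named spider.py: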
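```python
import scrapy


class DiscussionSpider(scrapy.Spider):
    name = 'discussion_spider'

    def __init__(self, username, password, start, limit, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.username = username
        self.password = password
        # Stored as start_offset so it cannot shadow Spider.start() in newer
        # Scrapy releases; command-line values arrive as strings.
        self.start_offset = int(start)
        self.limit = int(limit)

    def start_requests(self):
        # Log in to Douban. The URL and form field names come from the original
        # example and may not match Douban's current login flow.
        login_url = 'https://accounts.douban.com/login'
        yield scrapy.FormRequest(
            login_url,
            formdata={'username': self.username, 'password': self.password},
            callback=self.parse_discussion_list,
        )

    def parse_discussion_list(self, response):
        # Parse the discussion list page and keep entries containing the keyword.
        # As written, this callback receives the login response; a real spider
        # would request the group's discussion list after logging in.
        # The XPath selectors are placeholders; adapt them to the real markup.
        keyword = '关键词'  # placeholder ("keyword"); replace with your own term
        for discussion in response.xpath('//div[@class="discussion-item"]'):
            title = discussion.xpath('.//h3/text()').get()
            content = discussion.xpath('.//p/text()').get()
            # .get() returns None when nothing matches, so guard before testing.
            if content and keyword in content:
                yield {'title': title, 'content': content}

    def parse(self, response):
        pass


# Run the spider
if __name__ == "__main__":
    import sys
    from scrapy.crawler import CrawlerProcess

    # The original listing is cut off after `username = sys.argv`; the rest of
    # this block is an assumed completion.
    username, password, start, limit = sys.argv[1:5]
    process = CrawlerProcess()
    process.crawl(DiscussionSpider, username=username, password=password,
                  start=start, limit=limit)
    process.start()
```
[1][2][3]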
#### References
- [1] [Python爬取豆瓣小组内的图片源码.zip](https://download.csdn.net/download/weixin_47367099/85291425)
- [2] [3] [使用selenium+requests爬取豆瓣小组讨论列表](https://blog.csdn.net/weixin_29305313/article/details/112051860)