
Common Python web scraping code

Date: 2023-11-03 09:58:13

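The following are commonly used Python snippets for web scraping:

1. Use the requests library to send an HTTP request and fetch the page content

```python
import requests

url = 'https://www.example.com'
# A timeout keeps the call from hanging indefinitely
response = requests.get(url, timeout=10)
response.raise_for_status()  # raise on 4xx/5xx responses
html_content = response.text
```

2. Use the BeautifulSoup library to parse the HTML and extract information

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'html.parser')
# soup.title is None when the page has no <title> tag
title = soup.title.text if soup.title else None
# href=True skips <a> tags that have no href attribute
links = [link['href'] for link in soup.find_all('a', href=True)]
```

3. Use a regular expression to match page content

```python
import re

# Non-greedy match; re.S lets '.' span newlines, re.I ignores tag case
pattern = r'<title>(.*?)</title>'
titles = re.findall(pattern, html_content, re.S | re.I)  # list of matches
```

4. Use the Selenium library to simulate browser behavior

```python
from selenium import webdriver

url = 'https://www.example.com'
driver = webdriver.Chrome()
try:
    driver.get(url)
    html_content = driver.page_source
finally:
    driver.quit()  # always close the browser
```

5. Use the Scrapy framework for structured crawling

```python
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://www.example.com']

    def parse(self, response):
        title = response.xpath('//title/text()').get()
        links = response.css('a::attr(href)').getall()
        # Yield the extracted data so Scrapy can pass it on as an item
        yield {'title': title, 'links': links}
```

Note that when running a crawler you should comply with the site's robots.txt file and terms of use, to avoid placing unnecessary load on the site.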
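The advice about respecting robots.txt can be checked in code before any request is sent. The following is a minimal sketch using the standard library's `urllib.robotparser`. The helper names `build_parser` and `is_allowed` and the sample rules in `SAMPLE_ROBOTS` are hypothetical, chosen only for illustration; a real site's rules will differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs offline
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = "*") -> bool:
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS)
print(is_allowed(parser, "https://www.example.com/index.html"))
print(is_allowed(parser, "https://www.example.com/private/data.html"))
```

Against a live site, one would typically call `parser.set_url('https://www.example.com/robots.txt')` followed by `parser.read()` instead of parsing a literal string, and skip any URL for which `can_fetch` returns `False`.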