Web scraping with Python Playwright
Date: 2023-10-05 14:07:11
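Playwright for Python is the Python binding for Playwright, a browser automation and testing library from Microsoft that was first released for Node.js. It drives Chromium, Firefox, and WebKit and simulates user actions in the browser, which makes it well suited to writing efficient, reliable browser automation tests. It can also be used for web scraping.
The basic steps for building a web scraper with Python Playwright are as follows:
1. Install Python Playwright: install the package with pip.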
```
pip install playwright
```
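2. Install the browsers: Playwright needs its own browser builds (Chromium, Firefox, WebKit), which are downloaded by running `playwright install` on the command line. Once they are installed, each engine is started with `launch()`:
```
from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    # Launch Chromium (the engine behind Chrome)
    browser = playwright.chromium.launch()
    # Launch Firefox
    browser = playwright.firefox.launch()
    # Launch WebKit (the engine behind Safari)
    browser = playwright.webkit.launch()
```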
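When the engine has to be chosen at run time, the three launch calls above can be folded into one helper. The following is a minimal sketch: `launch_browser` and `VALID_ENGINES` are hypothetical names, not part of Playwright, and the helper assumes it receives the object yielded by `sync_playwright()`.

```python
# Hypothetical helper (not part of Playwright): pick a browser engine by name.
VALID_ENGINES = ("chromium", "firefox", "webkit")


def launch_browser(playwright, engine="chromium", headless=True):
    """Launch the named engine on an object yielded by sync_playwright()."""
    if engine not in VALID_ENGINES:
        raise ValueError(f"unknown engine {engine!r}; expected one of {VALID_ENGINES}")
    # playwright.chromium, .firefox and .webkit each expose launch()
    return getattr(playwright, engine).launch(headless=headless)
```

With Playwright installed, it could be called as `launch_browser(playwright, "firefox")` inside a `with sync_playwright() as playwright:` block. Passing `headless=False` opens a visible browser window, which can help while debugging selectors.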
3. Open a web page: create a new page with `browser.new_page()`, then load the target URL with `page.goto()`.
```
from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    # Launch Chromium
    browser = playwright.chromium.launch()
    # Create a new page
    page = browser.new_page()
    # Open the target URL
    page.goto('https://www.baidu.com')
```
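A scraper usually wants to know whether the navigation actually succeeded. The sketch below wraps the two calls from this step and checks the HTTP status; `open_page` and its `timeout_ms` parameter are hypothetical names chosen for illustration, and the 30-second default is an assumption, not a recommendation from the original text.

```python
def open_page(browser, url, timeout_ms=30_000):
    """Open url in a new page and fail early on an HTTP error status."""
    page = browser.new_page()
    # wait_until="domcontentloaded" returns once the HTML has been parsed,
    # without waiting for every image and stylesheet to finish loading.
    response = page.goto(url, timeout=timeout_ms, wait_until="domcontentloaded")
    # goto() can return None (for example for about:blank), so guard for it.
    if response is not None and not response.ok:
        status = response.status
        page.close()
        raise RuntimeError(f"{url} returned HTTP {status}")
    return page
```

If the navigation exceeds the timeout, `page.goto()` itself raises Playwright's `TimeoutError`, so callers may want to catch that separately.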
4. Get the page content: `page.content()` returns the HTML of the current page, and `page.title()` returns its title.
```
from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    # Launch Chromium
    browser = playwright.chromium.launch()
    # Create a new page
    page = browser.new_page()
    # Open the target URL
    page.goto('https://www.baidu.com')
    # Get the page HTML
    content = page.content()
    # Get the page title
    title = page.title()
```
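Once `page.content()` has returned the HTML string, it can be processed without the browser. As one possible approach, the sketch below collects every link from that string using only the standard library; `LinkCollector` and `extract_links` are hypothetical names, and a dedicated HTML library could be swapped in instead.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative links such as "/about" into absolute URLs
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return all absolute link URLs found in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

In the example above it would be called as `extract_links(content, page.url)`, where `content` is the string returned by `page.content()`.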
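5. Find elements: `page.query_selector()` returns the first element that matches a CSS selector (or `None` if nothing matches), `element.get_attribute()` reads an attribute value, and `element.text_content()` returns the element's text.
```
from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    # Launch Chromium
    browser = playwright.chromium.launch()
    # Create a new page
    page = browser.new_page()
    # Open the target URL
    page.goto('https://www.baidu.com')
    # Find the search box
    search_input = page.query_selector('#kw')
    # Type the search query
    search_input.fill('Python Playwright')
    # Click the search button
    page.click('#su')
    # Wait for the first search result to appear (an immediate
    # query_selector() could run before the results have loaded)
    search_result = page.wait_for_selector('.result')
    # Get the text of the search result
    result_text = search_result.text_content()
```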
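The example above reads only the first match. To gather every result, `page.query_selector_all()` returns a list of matching elements. The following is a sketch under assumptions: `collect_results` is a hypothetical helper, the default `.result` selector is simply the one used above, and the assumption that each result contains an `<a>` link may not hold for every page.

```python
def collect_results(page, selector=".result", link_selector="a"):
    """Return the text and first link of every element matching selector."""
    rows = []
    for element in page.query_selector_all(selector):
        # text_content() may return None, so fall back to an empty string
        text = (element.text_content() or "").strip()
        # Look for a link inside this result only, not in the whole page
        link = element.query_selector(link_selector)
        href = link.get_attribute("href") if link is not None else None
        rows.append({"text": text, "href": href})
    return rows
```

Each entry in the returned list is a plain dictionary, so the data can be written to JSON or CSV after the browser has been closed.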
6. Close the browser: call `browser.close()` when you are done.
```
from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    # Launch Chromium
    browser = playwright.chromium.launch()
    # Create a new page
    page = browser.new_page()
    # Open the target URL
    page.goto('https://www.baidu.com')
    # Get the page HTML
    content = page.content()
    # Close the browser
    browser.close()
```
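In the example above, `browser.close()` is skipped if an earlier call raises. One common way to guarantee cleanup is `try`/`finally`, sketched below; `fetch_html` is a hypothetical helper that takes the object yielded by `sync_playwright()`.

```python
def fetch_html(playwright, url):
    """Return the HTML of url, closing the browser even if a step fails."""
    browser = playwright.chromium.launch()
    try:
        page = browser.new_page()
        page.goto(url)
        return page.content()
    finally:
        # Runs on success and on error, so no browser process is left behind
        browser.close()
```

It could be used as `html = fetch_html(playwright, 'https://www.baidu.com')` inside a `with sync_playwright() as playwright:` block.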
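These are the basic steps for building a web scraper with Python Playwright; you can modify and extend them to fit your own needs. Note that web scraping may infringe on a website's privacy, so be sure to comply with applicable laws and regulations and with the site's terms of use.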
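One practical part of respecting a site's rules is checking its `robots.txt` before fetching a page. The sketch below uses the standard library's `urllib.robotparser`; `is_allowed` is a hypothetical helper, and the rules passed to it here are assumed to have been downloaded already. Passing this check does not by itself establish that scraping is permitted.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access happens here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

To load the rules directly from a site instead, `RobotFileParser` also provides `set_url()` and `read()`.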