Web scraping with Python Playwright
Posted: 2023-06-20 18:05:32
Python Playwright is a cross-browser automation library that can also be used for web scraping. The following example shows how to scrape a page with it.
First, install Playwright and download the Chromium browser it drives:
```
pip install playwright
playwright install chromium
```
Then follow these steps:
1. Import Playwright and the asyncio library:
```python
import asyncio
from playwright.async_api import async_playwright
```
2. Create an async function that starts Playwright, launches a browser, and opens a page:
```python
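async def scrape(url):
    async with async_playwright() as p:
        # headless=False opens a visible browser window; use headless=True to run without a UI
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto(url)
        # Interact with the page and extract data here
        # ...
        await browser.close()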
```
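If navigation fails or times out, the function above never reaches its final line, so the browser is not closed explicitly. The following is a minimal sketch of one way to handle that with try/finally. The function name fetch_html, the 30-second default, and the choice to return None on a timeout are illustrative assumptions, not part of the original example.

```python
import asyncio
from typing import Optional


async def fetch_html(url: str, timeout_ms: int = 30_000) -> Optional[str]:
    """Return the rendered HTML of url, or None if navigation times out.

    Hypothetical helper: the name, default timeout and None-on-timeout
    behaviour are assumptions made for this sketch.
    """
    # Imported inside the function so this module can be loaded even on a
    # machine where Playwright is not installed.
    from playwright.async_api import TimeoutError as PlaywrightTimeoutError
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            # Wait only until the DOM is parsed, and give up after timeout_ms
            await page.goto(url, timeout=timeout_ms, wait_until="domcontentloaded")
            return await page.content()
        except PlaywrightTimeoutError:
            return None
        finally:
            # Runs on success, on timeout, and on any other exception
            await browser.close()
```

It can be called the same way as scrape, for example `html = asyncio.run(fetch_html('https://www.example.com'))`.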
3. Inside this function you can interact with the page and extract data, for example:
```python
# Click an element on the page
await page.click('#some-element')
# Get the text content of an element
text_content = await page.text_content('.some-class')
# Get the value of an attribute
attribute_value = await page.get_attribute('#some-element', 'href')
```
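The calls above each read a single element. As a rough sketch of extracting several values at once, the code below collects the href of every link on a page and resolves relative paths against the page URL. The helper names absolutize and collect_links are hypothetical, and `locator(...).all()` assumes a reasonably recent Playwright release.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve raw href values against base_url, skipping empty values and duplicates."""
    resolved = []
    for href in hrefs:
        if not href:
            continue  # the element had no href attribute, or it was empty
        full = urljoin(base_url, href)
        if full not in resolved:
            resolved.append(full)
    return resolved


async def collect_links(page, selector="a"):
    """Return the absolute link targets of all elements matching selector.

    page is expected to be a playwright.async_api.Page that has already
    navigated somewhere (hypothetical helper for illustration).
    """
    # locator(...).all() yields one Locator per matching element
    links = await page.locator(selector).all()
    hrefs = [await link.get_attribute("href") for link in links]
    return absolutize(page.url, hrefs)
```

Inside scrape, this could be used as `links = await collect_links(page)` after the `page.goto(url)` call.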
4. Finally, call the function with the URL of the page to scrape:
```python
url = 'https://www.example.com'
asyncio.run(scrape(url))
```
The complete example:
```python
import asyncio
from playwright.async_api import async_playwright


async def scrape(url):
    async with async_playwright() as p:
        # Launch Chromium with a visible window
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto(url)
        # Read the text of the whole <body> element
        text_content = await page.text_content('body')
        print(text_content)
        await browser.close()


url = 'https://www.example.com'
asyncio.run(scrape(url))
```
In this example, Playwright launches a Chromium browser, navigates to the given URL, prints the text content of the page body, and then closes the browser.
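The text returned for the whole body usually contains a lot of indentation and blank lines, and text_content is typed as possibly returning None. A small post-processing step can make the output easier to read. This is only a sketch: the helper name clean_text and its whitespace rules are assumptions, not something the original example prescribes.

```python
import re


def clean_text(raw):
    """Collapse runs of whitespace and drop blank lines from scraped text.

    Hypothetical helper; returns an empty string when raw is None.
    """
    if raw is None:
        return ""
    # Squeeze spaces and tabs inside each line, then trim the line
    lines = (re.sub(r"[ \t\r\f\v]+", " ", line).strip() for line in raw.splitlines())
    # Keep only lines that still contain text
    return "\n".join(line for line in lines if line)
```

In the complete example, `print(text_content)` could then become `print(clean_text(text_content))`.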