Use the regular expression `<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>` to scrape all the information from the Maoyan Movies TOP100 board. URL: https://www.maoyan.com/board/4
Posted: 2023-07-17 12:09:00 Views: 125
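The following Python script applies that regular expression to the Maoyan TOP100 board. It requests a single URL, so it returns only the entries listed on that one page: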
```python
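import re
import requests

# Request the page. The User-Agent value is an assumption: the site may
# reject requests that do not look like they come from a browser.
url = "https://www.maoyan.com/board/4"
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
html = response.text

# Extract the fields; re.S lets "." match newlines inside each <dd> block
pattern = r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>'
matches = re.findall(pattern, html, re.S)

# Print the results
for match in matches:
    index = match[0]
    image = match[1]
    title = match[2]
    actor = match[3].strip()[3:]         # drop the 3-character "主演:" prefix
    release_time = match[4].strip()[5:]  # drop the 5-character "上映时间:" prefix
    score = match[5] + match[6]          # integer part + fractional part
    print("Rank:", index)
    print("Poster URL:", image)
    print("Title:", title)
    print("Starring:", actor)
    print("Release date:", release_time)
    print("Score:", score)
    print("--------------------")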
```
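The question asks for all 100 entries, while a single request covers only one page. Below is a minimal sketch of how the same pattern could be applied across pages. It assumes the board is paginated through an `offset` query parameter in steps of 10 (`?offset=0`, `?offset=10`, and so on up to `?offset=90`), which the original answer does not confirm. The names `page_urls`, `parse_page`, `fetch_all`, and `SAMPLE_HTML` are hypothetical, and the sample markup and its values are made up to match what the regular expression expects.

```python
import re
import time

BASE_URL = "https://www.maoyan.com/board/4"

# Same pattern as in the answer, compiled once; re.S lets "." span newlines.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>'
    r'.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
    r'.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>',
    re.S,
)


def page_urls(total=100, page_size=10):
    """Build one URL per page. The `offset` parameter is an assumption."""
    return [f"{BASE_URL}?offset={offset}" for offset in range(0, total, page_size)]


def parse_page(html):
    """Turn every <dd> block matched by PATTERN into a dict."""
    movies = []
    for index, image, title, star, release, integer, fraction in PATTERN.findall(html):
        movies.append({
            "index": int(index),
            "image": image,
            "title": title.strip(),
            "actor": star.strip()[3:],            # drop the "主演:" prefix
            "release_time": release.strip()[5:],  # drop the "上映时间:" prefix
            "score": integer + fraction,
        })
    return movies


def fetch_all(delay=1.0):
    """Download every page and collect the parsed entries (needs network access)."""
    import requests  # imported here so the offline demo below runs without it

    headers = {"User-Agent": "Mozilla/5.0"}  # assumed value, see lead-in
    movies = []
    for url in page_urls():
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        movies.extend(parse_page(response.text))
        time.sleep(delay)  # pause between requests to avoid hammering the site
    return movies


# Made-up markup shaped like the structure the pattern expects.
SAMPLE_HTML = """
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/1"><img data-src="https://example.com/poster.jpg" /></a>
  <p class="name"><a href="/films/1">Example Film</a></p>
  <p class="star">主演:Actor A,Actor B</p>
  <p class="releasetime">上映时间:2000-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
"""

# Offline demo: parse the sample instead of calling fetch_all().
sample_movies = parse_page(SAMPLE_HTML)
print(sample_movies[0]["title"], sample_movies[0]["score"])
```

Keeping `parse_page` separate from the download step makes it possible to check the pattern against saved HTML before sending any requests.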
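Note: a regex-based scraper is tightly coupled to the page's markup, so any change to the page structure can make the match fail; adjust the expression to the markup you actually receive. For more stable and convenient scraping, consider a library such as BeautifulSoup or Scrapy.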
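As a rough illustration of the BeautifulSoup route, the sketch below extracts the same seven fields with CSS selectors. It assumes that `board-index`, `name`, `star`, `releasetime`, `integer`, and `fraction` are class names on `<i>` and `<p>` elements, which is inferred from the regular expression rather than confirmed. `parse_with_bs4` and `SAMPLE_HTML` are hypothetical names, and the sample markup is made up.

```python
from bs4 import BeautifulSoup


def parse_with_bs4(html):
    """Extract one dict per <dd> entry using CSS selectors instead of a regex."""
    soup = BeautifulSoup(html, "html.parser")  # parser bundled with Python
    movies = []
    for dd in soup.select("dd"):
        integer = dd.select_one("i.integer").get_text(strip=True)
        fraction = dd.select_one("i.fraction").get_text(strip=True)
        movies.append({
            "index": int(dd.select_one("i.board-index").get_text(strip=True)),
            "image": dd.select_one("[data-src]")["data-src"],
            "title": dd.select_one("p.name a").get_text(strip=True),
            # Slicing drops the "主演:" and "上映时间:" prefixes, as in the regex version.
            "actor": dd.select_one("p.star").get_text(strip=True)[3:],
            "release_time": dd.select_one("p.releasetime").get_text(strip=True)[5:],
            "score": integer + fraction,
        })
    return movies


# Made-up markup shaped like the structure the selectors assume.
SAMPLE_HTML = """
<dl>
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/1"><img data-src="https://example.com/poster.jpg" /></a>
  <p class="name"><a href="/films/1">Example Film</a></p>
  <p class="star">主演:Actor A,Actor B</p>
  <p class="releasetime">上映时间:2000-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
</dl>
"""

for movie in parse_with_bs4(SAMPLE_HTML):
    print(movie)
```

Each field is looked up by its own selector, so a change in attribute order or whitespace that would break the regular expression leaves this version unaffected; a renamed class would still require updating the selector.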