```python
def get_content_html(x):
    url = "https://movie.douban.com/top250?start={}&filter=".format(x)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
    }
```
Posted: 2024-03-29 19:39:45
This is a Python function named get_content_html() that takes one parameter, x. Its purpose is to fetch the HTML of a given page of the Douban Movie Top250 list.
The function defines a variable url, a string holding the Douban Top250 address, and uses the format() method to insert x into it. Here x can be 0, 25, 50, and so on: it is the paging offset into the Top250 list (each page shows 25 entries).
The headers variable is a dictionary of request headers used to mimic a browser visit, so the site is less likely to flag the request as a crawler.
As written, the function has no return value; it only defines the url and headers variables and never actually issues the request.
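To actually fetch the page, the function still needs to send the request and return the response body. A minimal completion sketch, assuming the `requests` library is available (the `raise_for_status()` call is an addition for basic error handling):

```python
import requests

def get_content_html(x):
    # x is the paging offset: 0, 25, 50, ... (25 movies per page)
    url = "https://movie.douban.com/top250?start={}&filter=".format(x)
    headers = {
        # Browser-like User-Agent so Douban does not reject the request
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/114.0.0.0 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # fail loudly on non-2xx responses
    return response.text
```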
```python
def get_mnnu_text(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content.decode('utf-8'), 'html.parser')
    content = soup.find_all('p')
    text = '\n'.join([p.get_text() for p in content])
    return text

url = 'http://www.mnnu.edu.cn/'
text = get_mnnu_text(url)
print(text)

def save_to_navicat(text):
    connection = pymysql.connect(
        host='localhost',
        user='root',
        password='root',
        database='pa',
        charset='utf8'
    )
    try:
        with connection.cursor() as cursor:
            sql = "INSERT INTO `my_table` (`content`) VALUES (%s)"
            encoded_text = text.encode('utf8')
            cursor.execute(sql, (encoded_text,))
            connection.commit()
    finally:
        connection.close()

save_to_navicat(text)
```
I want to change the url to https://movie.douban.com/subject/30391186/comments?start={}&limit=20&sort=new_score&status=P
Sure, you can modify the code as follows:
```python
import requests
from bs4 import BeautifulSoup
import pymysql

def get_movie_comments(url):
    # Douban rejects requests without a browser-like User-Agent
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content.decode('utf-8'), 'html.parser')
    # Each short comment sits in a <span class="short"> element
    content = soup.find_all('span', class_='short')
    comments = '\n'.join([c.get_text() for c in content])
    return comments

url_template = 'https://movie.douban.com/subject/30391186/comments?start={}&limit=20&sort=new_score&status=P'
comments = ''
for i in range(0, 100, 20):  # fetch the first 100 comments, 20 per page
    url = url_template.format(i)
    comments += get_movie_comments(url)

def save_to_navicat(text):
    connection = pymysql.connect(
        host='localhost',
        user='root',
        password='root',
        database='pa',
        charset='utf8'
    )
    try:
        with connection.cursor() as cursor:
            sql = "INSERT INTO `my_table` (`content`) VALUES (%s)"
            # pymysql escapes and encodes the string itself via the placeholder
            cursor.execute(sql, (text,))
            connection.commit()
    finally:
        connection.close()

save_to_navicat(comments)
```
This fetches the first 100 comments for the Douban movie 《误杀》 and saves them to the database.
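As a side note, storing one row per comment instead of a single concatenated blob makes the data easier to query later. A sketch under the same assumptions (the `my_table(content)` schema and an already-open pymysql-style connection; `save_comment_rows` is a hypothetical helper, not part of the original code):

```python
def save_comment_rows(comments, connection):
    # One row per comment; the %s placeholder lets the driver escape each value
    sql = "INSERT INTO `my_table` (`content`) VALUES (%s)"
    rows = [(c,) for c in comments if c.strip()]  # skip blank lines
    with connection.cursor() as cursor:
        cursor.executemany(sql, rows)  # single round trip for all rows
    connection.commit()
```

Called as `save_comment_rows(comments.split('\n'), connection)`, this would replace the single execute call inside save_to_navicat.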
Help me find the errors in this code and fix them:
```python
import urllib.parse
import urllib.request

def create_request(page):
    base_url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&'
    data = {
        'start': (page - 1) * 20,
        'limit': 20
    }
    data = urllib.parse.urlencode(data)
    url = base_url + data
    # print(url)
    headers = {
        'User - Agent': 'Mozilla / 5.0(Windows NT 10.0;WOW64) AppleWebKit / 537.36(KHTML, likeGecko) Chrome / 108.0.5359.95Safari / 537.36'
    }
    request = urllib.request.Request(url=url, headers=headers)
    return request

def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content

def down_load(page, content):
    with open('douban_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)

if __name__ == '__main__':
    start_page = int(input('Enter the start page: '))
    end_page = int(input('Enter the end page: '))
    for page in range(start_page, end_page + 1):
        # Build a request object for each page
        request = create_request(page)
        content = get_content(request)
        down_load(page, content)
```
This code has two problems:
1. The header key 'User - Agent' contains stray spaces and must be exactly 'User-Agent'; the value is also littered with extra spaces, which mangles the browser signature and can get the request rejected.
2. The function name down_load is non-idiomatic; it is renamed to download below.
The corrected code:
```python
import urllib.parse
import urllib.request

def create_request(page):
    base_url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&'
    data = {
        'start': (page - 1) * 20,  # 20 entries per page
        'limit': 20
    }
    data = urllib.parse.urlencode(data)
    url = base_url + data
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.5359.95 Safari/537.36'
    }
    request = urllib.request.Request(url=url, headers=headers)
    return request

def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content

def download(page, content):
    # Each page's JSON response goes to its own file
    with open('douban_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)

if __name__ == '__main__':
    start_page = int(input('Enter the start page: '))
    end_page = int(input('Enter the end page: '))
    for page in range(start_page, end_page + 1):
        # Build a customized request object for each page
        request = create_request(page)
        content = get_content(request)
        download(page, content)
```
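The query string that create_request assembles can be checked without touching the network: urlencode simply joins the dict into percent-encoded key=value pairs in insertion order.

```python
import urllib.parse

base_url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&'
page = 2
# Page 2 starts at offset (2 - 1) * 20 = 20
query = urllib.parse.urlencode({'start': (page - 1) * 20, 'limit': 20})
url = base_url + query
print(url)  # ...action=&start=20&limit=20
```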