Scraping Jinjiang Literature (jjwxc.net) with Jupyter
First, install the required Python libraries: requests, beautifulsoup4, pandas, and numpy. You can install them with the following command:
```python
!pip install requests beautifulsoup4 pandas numpy
```
Then you can use the following code to scrape novel content from Jinjiang Literature (jjwxc.net):
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

# Request headers: identify the scraper as a regular browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# Fetch basic information about the novel (title, author, introduction)
def get_novel_info(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.find('h1', {'class': 'book-detail-title'}).text
    author = soup.find('p', {'class': 'book-detail-author'}).text
    intro = soup.find('div', {'class': 'book-intro'}).text.strip()
    return title, author, intro

# Fetch the title and body text of a single chapter
def get_chapter_content(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    chapter_title = soup.find('div', {'class': 'chapter-title'}).text
    chapter_content = soup.find('div', {'class': 'chapter-entity'}).text.strip()
    return chapter_title, chapter_content

# Scrape the novel and save the result to a CSV file
novel_url = 'https://www.jjwxc.net/onebook.php?novelid=12345678'
title, author, intro = get_novel_info(novel_url)

chapter_url = 'https://www.jjwxc.net/onebook.php?novelid=12345678&chapterid=1'
chapter_title, chapter_content = get_chapter_content(chapter_url)

data = {'title': [title], 'author': [author], 'intro': [intro],
        'chapter_title': [chapter_title], 'chapter_content': [chapter_content]}
df = pd.DataFrame(data)
df.to_csv('novel.csv', index=False)
```
Note that this is only example code: you will need to replace the URLs and the CSS selectors with ones that match the pages you actually want to scrape.
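If you want more than one chapter, you would normally loop over chapter IDs instead of fetching a single page. Below is a minimal sketch of such a loop, assuming the chapterid query parameter simply increments from 1; the novel ID, the chapter count, and the selectors are illustrative placeholders and must be checked against the real page structure.
```python
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Assumed URL pattern -- verify it against the actual chapter links on the novel page
BASE = 'https://www.jjwxc.net/onebook.php?novelid=12345678&chapterid={}'

rows = []
for chapter_id in range(1, 6):  # assumed chapter count, for illustration only
    resp = requests.get(BASE.format(chapter_id), headers=headers, timeout=10)
    resp.encoding = resp.apparent_encoding  # the site may not serve UTF-8; let requests guess
    soup = BeautifulSoup(resp.text, 'html.parser')

    # Placeholder selectors -- inspect the page and adjust them
    title_tag = soup.find('div', {'class': 'chapter-title'})
    body_tag = soup.find('div', {'class': 'chapter-entity'})
    if title_tag is None or body_tag is None:
        print(f'chapter {chapter_id}: selectors did not match, skipping')
        continue

    rows.append({'chapter_id': chapter_id,
                 'chapter_title': title_tag.text.strip(),
                 'chapter_content': body_tag.text.strip()})
    time.sleep(1)  # be polite: pause between requests

pd.DataFrame(rows).to_csv('chapters.csv', index=False)
```
Pausing between requests and checking that the selectors actually matched keeps the scraper from hammering the site or silently writing empty rows.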