Scraping Jinjiang Literature City (jjwxc.net) with Jupyter
Posted: 2023-07-11 16:47:10
First, install the following libraries: requests, beautifulsoup4, and pandas. Open a new notebook in Jupyter Notebook, then follow these steps to scrape novel data from jjwxc.net (Jinjiang Literature City):
1. Import the required libraries:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
```
2. Set the URL of the novel to scrape:
```python
url = 'https://www.jjwxc.net/onebook.php?novelid=123456'
```
Replace "123456" with the actual ID of the novel you want to scrape.
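If you plan to scrape several novels, it is convenient to build the URL from the ID (the ID below is a hypothetical placeholder):

```python
novel_id = 123456  # hypothetical ID; substitute the novel you want
url = f'https://www.jjwxc.net/onebook.php?novelid={novel_id}'
print(url)  # https://www.jjwxc.net/onebook.php?novelid=123456
```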
3. Fetch the HTML source with the requests library:
```python
response = requests.get(url)
response.encoding = 'gb18030'  # jjwxc pages are GBK/GB18030-encoded; adjust if the site differs
html = response.text
```
4. Parse the HTML source with BeautifulSoup:
```python
soup = BeautifulSoup(html, 'html.parser')
```
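Before hitting the live site, it can help to verify the parsing logic against a small local snippet. The markup below is invented for illustration; the real jjwxc markup may differ, so inspect the actual page source:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking a chapter list; real jjwxc markup may differ.
sample_html = """
<div class="booklast">
  <a href="onebook.php?novelid=123456&chapterid=1">Chapter 1</a>
  <a href="onebook.php?novelid=123456&chapterid=2">Chapter 2</a>
</div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')
links = sample_soup.find('div', {'class': 'booklast'}).find_all('a')
print([a.text for a in links])  # ['Chapter 1', 'Chapter 2']
```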
5. Locate the HTML element that contains the chapter list:
```python
chapter_list = soup.find('div', {'class': 'booklast'})  # inspect the page source: this class name may differ or change
```
6. Collect every chapter's link and title:
```python
from urllib.parse import urljoin

chapter_links = chapter_list.find_all('a')
chapter_titles = [chapter.get_text(strip=True) for chapter in chapter_links]
chapter_urls = [urljoin('https://www.jjwxc.net/', chapter.get('href')) for chapter in chapter_links]
```
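Plain string concatenation only works when every href is relative; `urllib.parse.urljoin` handles both relative paths and already-absolute URLs. A quick check with made-up hrefs:

```python
from urllib.parse import urljoin

base = 'https://www.jjwxc.net/'
# Hypothetical href values: one relative path, one absolute URL.
hrefs = ['onebook.php?novelid=123456&chapterid=1',
         'https://my.jjwxc.net/onebook_vip.php?novelid=123456&chapterid=40']
urls = [urljoin(base, h) for h in hrefs]
print(urls[0])  # https://www.jjwxc.net/onebook.php?novelid=123456&chapterid=1
print(urls[1])  # the absolute URL is left unchanged
```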
7. Loop over the chapter links and fetch each chapter's HTML source:
```python
import time

chapter_html = []
for chapter_url in chapter_urls:
    chapter_response = requests.get(chapter_url)
    chapter_response.encoding = 'gb18030'  # match the site's encoding, as above
    chapter_html.append(chapter_response.text)
    time.sleep(1)  # pause between requests to avoid hammering the server
```
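Network requests can fail intermittently. A small retry helper (a sketch, not part of the original tutorial) keeps the loop robust; it takes any fetch function as an argument, so it can be exercised without network access:

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url); on exception, wait and retry up to `retries` times."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay)

# Demonstration with a fake fetcher that fails once, then succeeds:
calls = {'n': 0}
def flaky_fetch(url):
    calls['n'] += 1
    if calls['n'] == 1:
        raise ConnectionError('temporary failure')
    return '<html>ok</html>'

result = fetch_with_retry(flaky_fetch, 'https://example.com', delay=0)
print(result)  # <html>ok</html>
```

In the real loop you would pass something like `lambda u: requests.get(u).text` as the fetcher.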
8. Parse each chapter's HTML and extract the chapter text:
```python
chapter_content = []
for chapter in chapter_html:
    chapter_soup = BeautifulSoup(chapter, 'html.parser')
    text_div = chapter_soup.find('div', {'class': 'noveltext'})
    chapter_content.append(text_div.get_text() if text_div else '')
```
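It is worth guarding against pages where the `noveltext` div is absent (locked or VIP chapters, for example), since `.find` then returns `None` and calling `.text` on it raises an error. A self-contained check with invented markup:

```python
from bs4 import BeautifulSoup

def extract_chapter_text(html):
    # Return the chapter body, or '' when the expected div is missing.
    soup = BeautifulSoup(html, 'html.parser')
    text_div = soup.find('div', {'class': 'noveltext'})
    return text_div.get_text(strip=True) if text_div else ''

good = '<div class="noveltext">Once upon a time...</div>'
bad = '<div class="locked">VIP chapter</div>'   # hypothetical locked-chapter markup
print(extract_chapter_text(good))  # Once upon a time...
print(repr(extract_chapter_text(bad)))  # ''
```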
9. Store the chapter titles and contents in a pandas DataFrame:
```python
novel_df = pd.DataFrame({'title': chapter_titles, 'content': chapter_content})
```
You have now scraped the novel data from jjwxc.net and stored it in a DataFrame. You can use pandas' export functions to write the data to a CSV or Excel file for further analysis.
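For example, exporting to CSV works like this (sample data stands in for the scraped chapters; `utf-8-sig` adds a BOM so Excel displays Chinese text correctly):

```python
import pandas as pd

novel_df = pd.DataFrame({'title': ['Chapter 1', 'Chapter 2'],
                         'content': ['First chapter text.', 'Second chapter text.']})
novel_df.to_csv('novel.csv', index=False, encoding='utf-8-sig')
print(novel_df.shape)  # (2, 2)
```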