Python crawler for Sohu news with the data stored in MySQL: what concrete code do I run in PyCharm and in MySQL?
In Python, fetch the page with the requests library, parse the HTML and extract the news items with BeautifulSoup, then store the data in a MySQL database with pymysql. Below is an outline of the steps with partial code examples:
1. **Install the required libraries**:
- `pip install requests`
- `pip install beautifulsoup4`
- `pip install pymysql`
2. **Create a function that connects to the database** (assuming the database is named 'sohu_news'):
```python
import pymysql

def connect_mysql():
    # utf8mb4 covers the full Unicode range, so Chinese titles store correctly
    db = pymysql.connect(host='localhost', user='your_username', password='your_password',
                         database='sohu_news', charset='utf8mb4')
    return db
```
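The question also asks what to run on the MySQL side. Here is a minimal schema sketch: the database name `sohu_news` and table name `news_data` match the Python code in this answer, while the column types and sizes are assumptions to adjust as needed.
```sql
-- Run once in the MySQL client before starting the crawler.
CREATE DATABASE IF NOT EXISTS sohu_news DEFAULT CHARACTER SET utf8mb4;
USE sohu_news;

CREATE TABLE IF NOT EXISTS news_data (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL,  -- assumed size; widen if titles run longer
    link VARCHAR(512) NOT NULL
);
```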
3. **Fetch and parse the news data**:
```python
import requests
from bs4 import BeautifulSoup

def get_sohu_news(url):
    # A User-Agent header makes the request less likely to be blocked
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract each title together with its link so the two lists stay aligned;
    # the 'h2.title' selector is an example, adjust it to the actual page markup
    title_list, link_list = [], []
    for h2 in soup.find_all('h2', class_='title'):
        a = h2.find('a', href=True)
        if a:
            title_list.append(a.get_text(strip=True))
            link_list.append(a['href'])
    return title_list, link_list
```
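Sohu's page markup changes over time, so before wiring everything together you may want to confirm that the assumed `h2.title` selector actually matches something; a quick check:
```python
# Print the first few extracted pairs to verify the selector works
titles, links = get_sohu_news('http://news.sohu.com/')
for title, link in list(zip(titles, links))[:5]:
    print(title, '->', link)
```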
4. **Insert the data into the database**:
```python
def save_to_mysql(title_list, link_list, db):
    cursor = db.cursor()
    # Parameterized query; pymysql escapes the values for us
    insert_query = "INSERT INTO news_data (title, link) VALUES (%s, %s)"
    try:
        # zip keeps each title paired with its link
        for title, link in zip(title_list, link_list):
            cursor.execute(insert_query, (title, link))
        db.commit()  # commit once after all rows are inserted
    except Exception as e:
        print(f"Error occurred: {e}")
        db.rollback()
    finally:
        cursor.close()

# Call the functions defined above
db = connect_mysql()
titles, links = get_sohu_news('http://news.sohu.com/')
save_to_mysql(titles, links, db)
```
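As an alternative to executing the insert row by row, pymysql's `executemany` batches all rows into a single call; a sketch under the same table assumptions:
```python
def save_to_mysql_bulk(title_list, link_list, db):
    # executemany sends all parameterized rows in one batch
    with db.cursor() as cursor:
        cursor.executemany(
            "INSERT INTO news_data (title, link) VALUES (%s, %s)",
            list(zip(title_list, link_list)),
        )
    db.commit()
```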
5. **Close the database connection**:
```python
def close_db(db):
    db.close()

close_db(db)
```
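If you would rather not call `close_db` by hand, `contextlib.closing` from the standard library guarantees the connection is closed even if an exception is raised mid-run; a minimal sketch:
```python
from contextlib import closing

# The connection is closed automatically when the with-block exits
with closing(connect_mysql()) as db:
    titles, links = get_sohu_news('http://news.sohu.com/')
    save_to_mysql(titles, links, db)
```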
Remember to replace `your_username` and `your_password` with your actual database username and password.