Python crawler for Sohu news with the data stored in MySQL: what concrete code do I run in PyCharm and in MySQL?
In Python, fetch the page with the requests library, parse the HTML and extract the news items with BeautifulSoup, then store the data in a MySQL database with pymysql. Below is an outline of the steps with partial code examples:
1. **Install the required libraries**:
- `pip install requests`
- `pip install beautifulsoup4`
- `pip install pymysql`
2. **Create a function that connects to the database** (assuming the database is named 'sohu_news'):
```python
import pymysql

def connect_mysql():
    # utf8mb4 covers the full Unicode range, so Chinese titles store correctly
    db = pymysql.connect(host='localhost', user='your_username', password='your_password',
                         database='sohu_news', charset='utf8mb4')
    return db
```
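The question also asks what to run on the MySQL side. Here is a minimal schema sketch: the database name `sohu_news` and table name `news_data` match the Python code in this answer, while the column types and sizes are assumptions to adjust as needed.
```sql
-- Run once in the MySQL client before starting the crawler.
CREATE DATABASE IF NOT EXISTS sohu_news DEFAULT CHARACTER SET utf8mb4;
USE sohu_news;

CREATE TABLE IF NOT EXISTS news_data (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL,  -- assumed size; widen if titles run longer
    link VARCHAR(512) NOT NULL
);
```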
3. **Fetch and parse the news data**:
```python
import requests
from bs4 import BeautifulSoup

def get_sohu_news(url):
    # A User-Agent header makes the request less likely to be blocked
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract each title together with its link so the two lists stay aligned;
    # the 'h2.title' selector is an example, adjust it to the actual page markup
    title_list, link_list = [], []
    for h2 in soup.find_all('h2', class_='title'):
        a = h2.find('a', href=True)
        if a:
            title_list.append(a.get_text(strip=True))
            link_list.append(a['href'])
    return title_list, link_list
```
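Sohu's page markup changes over time, so before wiring everything together you may want to confirm that the assumed `h2.title` selector actually matches something; a quick check:
```python
# Print the first few extracted pairs to verify the selector works
titles, links = get_sohu_news('http://news.sohu.com/')
for title, link in list(zip(titles, links))[:5]:
    print(title, '->', link)
```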
4. **Insert the data into the database**:
```python
def save_to_mysql(title_list, link_list, db):
    cursor = db.cursor()
    # Parameterized query; pymysql escapes the values for us
    insert_query = "INSERT INTO news_data (title, link) VALUES (%s, %s)"
    try:
        # zip keeps each title paired with its link
        for title, link in zip(title_list, link_list):
            cursor.execute(insert_query, (title, link))
        db.commit()  # commit once after all rows are inserted
    except Exception as e:
        print(f"Error occurred: {e}")
        db.rollback()
    finally:
        cursor.close()

# Call the functions defined above
db = connect_mysql()
titles, links = get_sohu_news('http://news.sohu.com/')
save_to_mysql(titles, links, db)
```
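As an alternative to executing the insert row by row, pymysql's `executemany` batches all rows into a single call; a sketch under the same table assumptions:
```python
def save_to_mysql_bulk(title_list, link_list, db):
    # executemany sends all parameterized rows in one batch
    with db.cursor() as cursor:
        cursor.executemany(
            "INSERT INTO news_data (title, link) VALUES (%s, %s)",
            list(zip(title_list, link_list)),
        )
    db.commit()
```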
5. **Close the database connection**:
```python
def close_db(db):
    db.close()

close_db(db)
```
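If you would rather not call `close_db` by hand, `contextlib.closing` from the standard library guarantees the connection is closed even if an exception is raised mid-run; a minimal sketch:
```python
from contextlib import closing

# The connection is closed automatically when the with-block exits
with closing(connect_mysql()) as db:
    titles, links = get_sohu_news('http://news.sohu.com/')
    save_to_mysql(titles, links, db)
```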
Remember to replace `your_username` and `your_password` with your actual database username and password.