Writing data analysis code in Python
Date: 2023-11-12 22:55:39
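Python is a widely used programming language that is well suited to data analysis. The following example walks through a typical analysis workflow in Python:
1. Data collection and cleaning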
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
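# Fetch the page
url = 'https://www.example.com'
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
data = []
for item in soup.find_all('div', {'class': 'content'}):
    title_tag = item.find('h2')
    content_tag = item.find('p')
    # find() returns None when a tag is missing, so guard before reading .text
    data.append({
        'title': title_tag.text if title_tag else None,
        'content': content_tag.text if content_tag else None,
    })
# Convert to a DataFrame and clean it: drop duplicate rows, then rows with missing values
df = pd.DataFrame(data, columns=['title', 'content'])
df = df.drop_duplicates()
df = df.dropna()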
```
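To see what the cleaning step does without depending on a live website, the same extraction can be tried on a small inline HTML string. This is only a sketch: `SAMPLE_HTML` and `extract_rows` are hypothetical names introduced for illustration, and the markup simply mirrors the `div.content` / `h2` / `p` layout assumed above.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical inline HTML standing in for a downloaded page.
SAMPLE_HTML = """
<div class="content"><h2>Apples</h2><p>Fresh apples on sale.</p></div>
<div class="content"><h2>Apples</h2><p>Fresh apples on sale.</p></div>
<div class="content"><h2>Pears</h2></div>
<div class="content"><h2>Milk</h2><p>Organic whole milk.</p></div>
"""


def extract_rows(html):
    """Return one dict per div.content; a missing tag becomes None."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", class_="content"):
        title_tag = item.find("h2")
        body_tag = item.find("p")
        rows.append({
            "title": title_tag.get_text(strip=True) if title_tag else None,
            "content": body_tag.get_text(strip=True) if body_tag else None,
        })
    return rows


sample_df = pd.DataFrame(extract_rows(SAMPLE_HTML), columns=["title", "content"])
# Drop the repeated "Apples" row, then the "Pears" row that has no content.
clean_df = sample_df.drop_duplicates().dropna().reset_index(drop=True)
print(clean_df)
```

Of the four sample rows, one is an exact duplicate and one lacks a `<p>` tag, so two rows remain after cleaning.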
2. Data transformation and modeling
```python
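from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# Split into training and test sets. The 'title' column serves as the label
# here only for illustration; a real task needs a categorical label column
# whose classes repeat across rows.
X_train, X_test, y_train, y_test = train_test_split(
    df['content'], df['title'], test_size=0.2, random_state=42)
# Convert the text into numeric bag-of-words features
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)  # learn the vocabulary from training data only
X_test = vectorizer.transform(X_test)
# Train a logistic regression model and predict on the test set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Compute the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)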
```
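The modeling step only makes sense when the label column holds a small set of repeated classes. The sketch below runs the same vectorize-then-classify idea on a tiny made-up dataset; the texts, the `fruit`/`dairy` labels, and the use of `make_pipeline` are assumptions for illustration and are not part of the original example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up product descriptions with two repeated classes.
texts = [
    "fresh red apples", "sweet ripe pears", "juicy green apples",
    "ripe yellow bananas", "sweet fresh oranges",
    "organic whole milk", "creamy plain yogurt", "aged cheddar cheese",
    "salted farm butter", "fresh skim milk",
]
labels = ["fruit"] * 5 + ["dairy"] * 5

# stratify keeps both classes represented in the two-sample test set.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels)

# The pipeline fits the vectorizer on training text only, then the classifier.
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```

Bundling the vectorizer and the model in one pipeline means `fit` and `predict` accept raw strings, which avoids accidentally fitting the vocabulary on test data. With so few samples the printed accuracy is not meaningful; the point is only the shape of the workflow.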
3. Data visualization
```python
import matplotlib.pyplot as plt
# Plot the distribution of content lengths (in characters) as a histogram
plt.hist(df['content'].str.len(), bins=50)
plt.title('Length of Content')
plt.xlabel('Length')
plt.ylabel('Count')
plt.show()
```
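When a plot window is not available, the numbers behind the histogram can be inspected directly. The following sketch computes the same kind of bin counts with `numpy.histogram` on a small made-up column; the `sample` DataFrame and the choice of five bins are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up content column with lengths 1 through 6.
sample = pd.DataFrame({"content": ["a", "bb", "ccc", "dddd", "eeeee", "ffffff"]})

# Character length of each entry.
lengths = sample["content"].str.len()

# Bin counts and bin edges over the observed range of lengths.
counts, edges = np.histogram(lengths, bins=5)

print(lengths.describe())
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:.1f}, {right:.1f}): {count}")
```

Note that the last bin is closed on both ends, so the maximum length is counted in it rather than falling outside the range.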
Note that the code above is only an example; an actual implementation needs to be adapted and tuned to the specific data and analysis task.