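Read the "ramen-rating.csv" file and preprocess the data; count the number of ramen brands in each country and plot a histogram of the top 10 countries; find the most popular ramen brand and packaging type in each country; compute the average stars for each brand in each country; produce some cross-tabulations (contingency tables), with rows and columns of your own choosing.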
Posted: 2024-06-11 08:06:36
1. Data preprocessing:
```python
import pandas as pd
# Read the data
df = pd.read_csv("ramen-rating.csv")
# Drop columns that are not needed for the analysis
df = df.drop(columns=['Review #', 'Top Ten'])
# Derive a Packaging column from Style. Styles matching none of the keywords
# (e.g. 'Pack') keep their original value instead of becoming NaN, and
# na=False stops missing Style values from producing an invalid NaN mask
df['Packaging'] = df['Style']
for keyword in ['Cup', 'Bowl', 'Box', 'Tray']:
    df.loc[df['Style'].str.contains(keyword, case=False, na=False), 'Packaging'] = keyword
# Normalize brand names (case-insensitive substring match)
df.loc[df['Brand'].str.contains('nissin', case=False, na=False), 'Brand'] = 'Nissin'
df.loc[df['Brand'].str.contains('maruchan', case=False, na=False), 'Brand'] = 'Maruchan'
df.loc[df['Brand'].str.contains('samyang', case=False, na=False), 'Brand'] = 'Samyang'
df.loc[df['Brand'].str.contains('sapporo', case=False, na=False), 'Brand'] = 'Sapporo Ichiban'
# Normalize country names (case-insensitive match; 'USA' becomes 'United States')
country_names = {
    'usa': 'United States', 'south korea': 'South Korea', 'hong kong': 'Hong Kong',
    'taiwan': 'Taiwan', 'singapore': 'Singapore', 'japan': 'Japan',
    'thailand': 'Thailand', 'china': 'China', 'malaysia': 'Malaysia',
    'indonesia': 'Indonesia',
}
for pattern, name in country_names.items():
    df.loc[df['Country'].str.contains(pattern, case=False, na=False), 'Country'] = name
# Convert Stars to numeric. 'Unrated' entries become NaN so that mean()
# skips them instead of counting them as a rating of 0
df['Stars'] = pd.to_numeric(df['Stars'], errors='coerce')
# Save the cleaned data
df.to_csv("ramen_cleaned.csv", index=False)
```
2. Count the number of ramen brands per country and plot the top 10 countries as a bar chart (the requested histogram):
```python
import matplotlib.pyplot as plt
# Count the distinct brands in each country, largest first
country_counts = df.groupby('Country')['Brand'].nunique().sort_values(ascending=False)
# Plot the top 10 countries as a bar chart
top10 = country_counts.head(10)
plt.bar(top10.index, top10.values)
plt.xticks(rotation=45)
plt.xlabel('Country')
plt.ylabel('Number of brands')
plt.title('Top 10 countries with most ramen brands')
plt.show()
```
3. Find the most popular ramen brand and packaging type in each country:
```python
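# "Most popular" is taken here to mean the most frequently reviewed value.
# A group with no valid values returns None instead of raising IndexError
def most_common(values):
    counts = values.value_counts()
    return counts.index[0] if not counts.empty else None

popular_brand = df.groupby('Country')['Brand'].apply(most_common)
popular_packaging = df.groupby('Country')['Packaging'].apply(most_common)
# Print the results
print("Most popular brand by country:\n", popular_brand)
print("\nMost popular packaging by country:\n", popular_packaging)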
```
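Taking the first entry of `value_counts()` breaks ties arbitrarily. As a rough sketch of one alternative, `Series.mode()` can report every brand tied for first place. The `toy` frame below is made-up sample data for illustration, not rows from the actual CSV.

```python
import pandas as pd

# Made-up sample rows for illustration only (not taken from the real CSV)
toy = pd.DataFrame({
    'Country': ['Japan', 'Japan', 'Japan', 'Japan', 'Thailand', 'Thailand'],
    'Brand':   ['Nissin', 'Nissin', 'Maruchan', 'Maruchan', 'Mama', 'Mama'],
})

# Series.mode() returns all values tied for the highest frequency, sorted,
# so ties are listed instead of being silently broken
tied_brands = toy.groupby('Country')['Brand'].agg(lambda s: ', '.join(s.mode()))
print(tied_brands)
```

The same pattern should apply to the `Packaging` column of the cleaned `df`.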
4. Compute the average stars for each brand in each country:
```python
# Average stars for each (country, brand) pair
country_brand_stars = df.groupby(['Country', 'Brand'])['Stars'].mean()
# Print the result
print(country_brand_stars)
```
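An average based on one or two reviews can be misleading, so it may help to report the review count next to the mean. The sketch below assumes a hypothetical cutoff `MIN_REVIEWS` and uses made-up sample rows; neither comes from the original data.

```python
import pandas as pd

# Made-up sample rows for illustration only (not taken from the real CSV)
toy = pd.DataFrame({
    'Country': ['Japan', 'Japan', 'Japan', 'Japan', 'Thailand'],
    'Brand':   ['Nissin', 'Nissin', 'Nissin', 'Maruchan', 'Mama'],
    'Stars':   [4.0, 5.0, 3.0, 5.0, 2.5],
})

# Mean rating and number of reviews for each (country, brand) pair
summary = (
    toy.groupby(['Country', 'Brand'])['Stars']
       .agg(['mean', 'count'])
       .rename(columns={'mean': 'avg_stars', 'count': 'n_reviews'})
)

MIN_REVIEWS = 3  # hypothetical threshold; pick one that suits the data
reliable = summary[summary['n_reviews'] >= MIN_REVIEWS]
print(reliable)
```

With the cleaned data, replacing `toy` by `df` should give the same two columns for every country and brand.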
5. Cross-tabulations (contingency tables):
```python
# Number of reviews for each packaging type and brand
packaging_brand_count = pd.crosstab(df['Packaging'], df['Brand'])
print("Packaging vs Brand:\n", packaging_brand_count)
# Number of reviews for each country and packaging type
country_packaging_count = pd.crosstab(df['Country'], df['Packaging'])
print("\nCountry vs Packaging:\n", country_packaging_count)
# Number of reviews for each star rating and brand
stars_brand_count = pd.crosstab(df['Stars'], df['Brand'])
print("\nStars vs Brand:\n", stars_brand_count)
```
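Because `Stars` is nearly continuous, a table with one row per rating value gets long. One possible refinement, sketched here on made-up sample rows, is to bucket the ratings with `pd.cut` first and then add totals or row shares. The bin edges and the labels `low`, `mid`, `high` are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up sample rows for illustration only (not taken from the real CSV)
toy = pd.DataFrame({
    'Packaging': ['Cup', 'Cup', 'Pack', 'Pack', 'Pack', 'Bowl'],
    'Stars':     [2.0, 4.5, 3.5, 5.0, 4.0, 1.5],
})

# Bucket ratings into [0, 2], (2, 4], (4, 5]; cast to str so the band
# behaves like an ordinary label column in crosstab
star_band = pd.cut(
    toy['Stars'],
    bins=[0, 2, 4, 5],
    labels=['low', 'mid', 'high'],
    include_lowest=True,
).astype(str)

# Counts with row and column totals ('All')
counts = pd.crosstab(toy['Packaging'], star_band, margins=True)
# Share of each star band within every packaging type (rows sum to 1)
shares = pd.crosstab(toy['Packaging'], star_band, normalize='index')

print(counts)
print(shares)
```

Rows with a missing rating would turn into the string `'nan'` after the cast, so they may need to be dropped first when applying this to the cleaned `df`.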