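    import pandas as pd

    inputfile1 = 'data/GoodsOrder.csv'
    inputfile2 = 'data/GoodsTypes.csv'

    # Read the data (the files are GBK-encoded)
    data = pd.read_csv(inputfile1, encoding='gbk')
    types = pd.read_csv(inputfile2, encoding='gbk')

    # Count the records for each item, then sort from most to least frequent
    group = data.groupby(['Goods']).count().reset_index()
    group_sorted = group.sort_values('id', ascending=False).reset_index()

    # Total number of records
    data_nums = data.shape[0]

    # Drop the old index column left behind by the second reset_index()
    del group_sorted['index']
    group_sorted.head()

Add comments to this code.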

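This code reads the data files "GoodsOrder.csv" and "GoodsTypes.csv", then counts and ranks the sales of each item. It uses pandas' `read_csv` to load the files, `groupby` to group the records by item, `count` to count the records in each group (that is, how many times each item was sold), `sort_values` to sort the result by that count in descending order, and `reset_index` to rebuild the index. It also stores the total number of records in `data_nums`. Finally, the `del` statement (a Python statement, not a function) removes the leftover `index` column from the result, and `head` displays the first 5 rows. Note that `types` is loaded but not used in this snippet.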
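The counting-and-ranking steps can be tried on a small in-memory table. This is a minimal sketch: the rows below are made-up illustration data, not the contents of GoodsOrder.csv, and only the column names `id` and `Goods` come from the original snippet. The `share` column is an added illustration of what `data_nums` might be used for.

```python
import pandas as pd

# Hypothetical order records: one row per (order id, item) pair.
data = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3],
    "Goods": ["milk", "bread", "milk", "eggs", "milk", "bread"],
})

# Count rows per item, then rank from most to least frequent.
group = data.groupby("Goods").count().reset_index()
group_sorted = group.sort_values("id", ascending=False).reset_index(drop=True)

# Total number of records, used here to turn counts into shares.
data_nums = data.shape[0]
group_sorted["share"] = group_sorted["id"] / data_nums

print(group_sorted)
```

Passing `drop=True` to `reset_index` discards the old index directly, so no extra `index` column needs to be deleted afterwards.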

    import os

    import numpy as np
    import pandas as pd
    from sklearn.cluster import DBSCAN, KMeans


    def dbscan(input_file):
        # Latitude first, longitude second: [latitude, longitude]
        columns = ['lat', 'lon']
        in_df = pd.read_csv(input_file, sep=',', header=None, names=columns)

        # Represent GPS points as (lat, lon).
        # DataFrame.as_matrix() was removed in pandas 1.0; use to_numpy() instead.
        coords = in_df[['lat', 'lon']].to_numpy()

        # Earth's radius in km
        kms_per_radian = 6371.0086

        # Define epsilon as 0.5 kilometers, converted to radians for use by haversine.
        # The haversine formula gives the great-circle distance between two points,
        # that is, the shortest distance over the earth's surface.
        # http://www.movable-type.co.uk/scripts/latlong.html
        epsilon = 0.5 / kms_per_radian

        # np.radians() converts the coordinates from degrees to radians
        db = DBSCAN(eps=epsilon, min_samples=15, algorithm='ball_tree',
                    metric='haversine').fit(np.radians(coords))
        cluster_labels = db.labels_

        # Number of clusters (ignore noisy samples, which are given the label -1)
        num_clusters = len(set(cluster_labels) - {-1})
        print('Clustered ' + str(len(in_df)) + ' points to ' + str(num_clusters) + ' clusters')

        # With n_clusters=1, KMeans simply returns the centroid of the points it is given
        kmeans = KMeans(n_clusters=1, n_init=1, max_iter=20, random_state=20)
        for n in range(num_clusters):
            one_cluster = coords[cluster_labels == n]
            kk = kmeans.fit(one_cluster)
            print(kk.cluster_centers_)


    def main():
        path = './datas'
        filelist = os.listdir(path)
        for f in filelist:
            datafile = os.path.join(path, f)
            print(datafile)
            dbscan(datafile)


    if __name__ == '__main__':
        main()

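This Python code uses Pandas, NumPy, and sklearn.cluster to run DBSCAN and KMeans. It reads every file in a folder, where each file holds the latitude and longitude of a set of GPS points, clusters the points with DBSCAN using the haversine metric and a 0.5 km neighbourhood radius, and prints the number of clusters found (noise points labelled -1 are excluded). It then fits KMeans with `n_clusters=1` to each DBSCAN cluster. With a single cluster, KMeans does not subdivide anything; it only computes the centroid, so the final output is the centre coordinates of each DBSCAN cluster.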
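To make the radian conversion concrete, the following sketch computes a haversine great-circle distance with the standard library only and shows how a 0.5 km radius becomes the `eps` value passed to DBSCAN. The function name `haversine_km` and the two sample coordinates are assumptions chosen for illustration; the Earth radius constant is the one used in the code above.

```python
import math

KMS_PER_RADIAN = 6371.0086  # Earth's mean radius in km, as in the code above


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * KMS_PER_RADIAN * math.asin(math.sqrt(a))


# A 0.5 km neighbourhood radius expressed as an angle in radians.
epsilon = 0.5 / KMS_PER_RADIAN

# Two hypothetical points 0.001 degrees of latitude apart (roughly 111 m).
d = haversine_km(39.9000, 116.4000, 39.9010, 116.4000)
print(f"epsilon = {epsilon:.3e} rad, distance = {d * 1000:.1f} m")
print("within eps:", d / KMS_PER_RADIAN <= epsilon)
```

Dividing a distance in km by the Earth's radius gives the corresponding angle in radians, which is why scikit-learn's haversine metric needs both `eps` and the input coordinates in radians.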

Read the planetary data in the text file S12_planet.txt into a Pandas DataFrame. The file will be given in the WeChat group.

Task 1: Calculate the density of each planet relative to that of the Earth (i.e., the Earth's density is 1) and add the results as a "density" column in the DataFrame. Print out the DataFrame sorted from largest to smallest by planet diameter. The density formula is

$$\text{density} = \frac{M}{V} = \frac{3M}{4\pi r^3}$$

Task 2: Read the two items in the input as "A" and "Standard". A is a string and Standard is a float. Create a Pandas Series of the planets whose "A" value is greater than "Standard", sorted from largest to smallest by "A". For example, if the input is "year 500", collect the data of the planets whose years are greater than 500.

Input: A string and a float.

Output: Task 1: DataFrame sorted on diameter. Task 2: Pandas Series sorted on A.

Note: You don't need to deal with rounding format. Just leave the output in its default form, as in the example.
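Because only the ratio to the Earth's density is needed, the constant factors cancel. As a short worked step (writing $d = 2r$ for the diameter and using the subscript $E$ for the Earth's values, notation introduced here for illustration):

$$\frac{\rho}{\rho_E} = \frac{3M/(4\pi r^3)}{3M_E/(4\pi r_E^3)} = \frac{M}{M_E}\left(\frac{r_E}{r}\right)^3 = \frac{M/M_E}{(d/d_E)^3}$$

So the relative density can be computed directly from the mass and diameter columns, without converting to radius or volume first.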

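Task1:

```python
import pandas as pd

# Read the planetary data into a DataFrame.
# Assumption: the file is tab-delimited and its first column holds the planet
# names; adjust `delimiter` and `index_col` to match the actual file.
df = pd.read_csv('S12_planet.txt', delimiter='\t', index_col=0)

# Density is proportional to M / r^3. The constant 3 / (4 * pi) and the factor
# from r = diameter / 2 cancel when dividing by the Earth's value.
# Assumption: the columns are named 'mass' and 'diameter', and the Earth's row
# is labelled 'Earth'.
raw = df['mass'] / df['diameter'] ** 3
df['density'] = raw / raw['Earth']

# Sort the DataFrame from largest to smallest diameter
sorted_df = df.sort_values(by='diameter', ascending=False)

# Print the sorted DataFrame
print(sorted_df)
```

Task2:

```python
import pandas as pd

# Read the two input items from one line, e.g. "year 500"
A, Standard = input().split()
Standard = float(Standard)

# Create a Pandas Series with the planets whose A is greater than Standard,
# sorted from largest to smallest. A is a variable holding the column name.
filtered_series = df.loc[df[A] > Standard, A].sort_values(ascending=False)

# Print the filtered Series
print(filtered_series)
```

Note: `df` in the code above is the DataFrame read in Task 1. To run Task 2 on its own, read the file into `df` first.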
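Both tasks can be checked end to end on a small in-memory table. This is a sketch under stated assumptions: the three rows use approximate published values (mass in 10^24 kg, diameter in km, year length in Earth days) and are not the contents of S12_planet.txt; the column names `mass`, `diameter`, and `year` and the helper `select_greater` are hypothetical.

```python
import pandas as pd

# Illustrative values only (approximate; mass in 10^24 kg, diameter in km, year in days).
df = pd.DataFrame(
    {
        "mass": [5.97, 0.642, 1898.0],
        "diameter": [12756.0, 6792.0, 142984.0],
        "year": [365.2, 687.0, 4331.0],
    },
    index=["Earth", "Mars", "Jupiter"],
)

# Task 1: density relative to the Earth. M / d^3 is proportional to M / V,
# so the constant factors drop out of the ratio.
raw = df["mass"] / df["diameter"] ** 3
df["density"] = raw / raw["Earth"]
sorted_df = df.sort_values(by="diameter", ascending=False)
print(sorted_df)


def select_greater(frame, line):
    """Parse a line such as 'year 500' and return the matching column values, largest first."""
    column, standard = line.split()
    values = frame[column]
    return values[values > float(standard)].sort_values(ascending=False)


# Task 2 with the example input from the problem statement.
result = select_greater(df, "year 500")
print(result)
```

Keeping the planet names in the index means the filtered Series carries them along automatically, which makes the Task 2 output readable without an extra name column.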
