Efficient CDN Log Analysis with Python pandas: Ten Million Lines in 40 Seconds

Updated 2024-08-30


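This article describes how to use the pandas library in Python to analyze large CDN logs efficiently. As log files grow in number and size, the traditional bash shell approach becomes too slow, so the author switched to pandas. With its DataFrame structure and built-in functions, the script below processes about ten million log lines in roughly 40 seconds.

pandas is an open-source data analysis library built on NumPy. It provides data structures such as Series and DataFrame, together with tools for cleaning, filtering, aggregating, and transforming data, which makes it a good fit for structured data such as CSV, Excel, or JSON files, and for delimited text such as these logs.

Make sure pandas is installed before starting. The script header suggests `sudo pip install pandas`.

The author, Loya Chen, wrote a short script that parses and analyzes CDN logs. The main points are:

1. Log format: each line contains the IP address, response time, timestamp, method (GET or POST), URL, HTTP version, status code, response size, Referer, and User-Agent (UA). These fields are the basis for traffic statistics, anomaly detection, and user behavior analysis.
2. Log example: the script takes the log file path as its single command-line argument (`file_of_log`). In the sample line, the IP address is 101.226.66.179, the response time field is 68, the method is GET, and the target URL is `http://www.qn.com/1.jpg`. Note that after splitting on spaces, the quoted request string (method, URL, and protocol) lands in a single column, and that is the column the script treats as the "URL".
3. Loading the data: the script uses `pd.read_table()` with `sep=' '` and ten integer column names to load the log into a DataFrame. For logs in other formats, `pd.read_csv()` or `pd.read_json()` may be the better choice.
4. Analysis: `value_counts()` gives the most frequent IPs, URLs, UAs, Referers, and status codes, and its result is already sorted in descending order. For traffic per URL or per IP, the script groups with `groupby()` and sums the response size.
5. Timestamps: the script does not use the time field, but if time-series analysis is needed, the timestamp strings can be converted to datetime values with `pd.to_datetime()`.
6. Performance: pandas works in memory, so large files are read in chunks, either with `iterator=True` and `get_chunk()` as in the script, or with `pd.read_csv(..., chunksize=n)`. Note that the script concatenates all chunks afterwards, so the whole log still has to fit in memory.

In short, pandas makes CDN log analysis much faster than shell tooling on large inputs, while keeping the code short and maintainable.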
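Points 2 and 5 above can be illustrated with a minimal sketch that parses the sample log line from this document and converts its timestamp. The column names `time`, `method`, and `request_url` are assumptions introduced only for this example and do not appear in the original script, and `%b` parsing assumes an English locale.

```python
import io

import pandas as pd

# Sample line taken from the log example in this document.
SAMPLE = (
    '101.226.66.179 - 68 [16/Nov/2016:04:36:40 +0800] '
    '"GET http://www.qn.com/1.jpg -" 200 502 "-" '
    '"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"\n'
)

# Split on single spaces; double-quoted fields stay together as one column.
df = pd.read_csv(io.StringIO(SAMPLE), sep=' ', header=None, names=list(range(10)))

# Columns 3 and 4 hold the two halves of "[16/Nov/2016:04:36:40 +0800]".
raw_time = df[3].str.lstrip('[') + ' ' + df[4].str.rstrip(']')
df['time'] = pd.to_datetime(raw_time, format='%d/%b/%Y:%H:%M:%S %z')

# Column 5 is the whole request string "METHOD URL PROTOCOL".
request = df[5].str.split(' ', expand=True)
df['method'] = request[0]
df['request_url'] = request[1]

print(df[[0, 'time', 'method', 'request_url', 6, 7]])
```

Splitting column 5 is optional. The original script counts the full request string as the URL, which gives the same ranking as long as the method and protocol do not vary for a given URL.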


Analyzing CDN Logs with the pandas Library in Python: A Detailed Walkthrough

Preface

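I recently had a requirement at work to extract some figures from CDN logs, such as total traffic, status code statistics, and the top IPs, URLs, UAs, and Referers. I used to do this with bash shell, but when the logs get large (files of several GB, with tens or even hundreds of millions of lines), shell struggles and the processing time becomes too long. So I looked into the Python data processing library pandas. With it, ten million log lines are processed in about 40 seconds.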

Code

#!/usr/bin/python
# -*- coding: utf-8 -*-
# sudo pip install pandas
"""
Description: This script is used to analyse qiniu cdn log.
================================================================================
Log format
IP - ResponseTime [time +0800] "Method URL HTTP/1.1" code size "referer" "UA"
================================================================================
Log example (field indexes after splitting on spaces)
[0]            [1] [2] [3]                   [4]    [5]
101.226.66.179 -   68  [16/Nov/2016:04:36:40 +0800] "GET http://www.qn.com/1.jpg -"
[6] [7] [8] [9]
200 502 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
================================================================================
"""
__author__ = 'Loya Chen'

import sys
from collections import OrderedDict

import pandas as pd

if len(sys.argv) != 2:
    print('Usage:', sys.argv[0], 'file_of_log')
    sys.exit(1)
log_file = sys.argv[1]

# Column positions of the fields to analyse
ip = 0
url = 5          # whole quoted request string: "Method URL protocol"
status_code = 6
size = 7
referer = 8
ua = 9

# Read the log into a DataFrame, chunk by chunk
reader = pd.read_table(log_file, sep=' ', names=list(range(10)), iterator=True)
chunkSize = 10000000
chunks = []
loop = True
while loop:
    try:
        chunk = reader.get_chunk(chunkSize)
        chunks.append(chunk)
    except StopIteration:
        # Iteration is stopped.
        loop = False
df = pd.concat(chunks, ignore_index=True)

byte_sum = df[size].sum()                           # total traffic in bytes
status_counts = df[status_code].value_counts()      # status code statistics
top_status_code = pd.DataFrame({'count': status_counts,
                                'percent': status_counts / status_counts.sum() * 100})
top_ip = df[ip].value_counts().head(10)             # TOP IP
top_referer = df[referer].value_counts().head(10)   # TOP Referer
top_ua = df[ua].value_counts().head(10)             # TOP User-Agent
top_url = df[url].value_counts().head(10)           # TOP URL

# URLs with the most traffic, in MB
top_url_byte = (df.groupby(url)[size].sum().div(1024 * 1024).round(3)
                .sort_values(ascending=False).head(10))
# IPs with the most traffic, in MB
top_ip_byte = (df.groupby(ip)[size].sum().div(1024 * 1024).round(3)
               .sort_values(ascending=False).head(10))

# Store the results in an ordered dict
result = OrderedDict([
    ("Total traffic [GB]:", byte_sum / 1024 / 1024 / 1024),
    ("Status codes [count | percent]:", top_status_code),
    ("IP TOP 10:", top_ip),
    ("Referer TOP 10:", top_referer),
    ("UA TOP 10:", top_ua),
    ("URL TOP 10:", top_url),
    ("Top 10 URLs by traffic [MB]:", top_url_byte),
    ("Top 10 IPs by traffic [MB]:", top_ip_byte),
])

# Print the results
for k, v in result.items():
    print(k)
    print(v)
    print('=' * 80)
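The script above reads the log in chunks but then concatenates every chunk, so the full log still has to fit in memory. If that becomes a problem, one possible variation is to aggregate each chunk as it is read and merge only the small partial results. The sketch below is an assumption-laden illustration, not the author's method: `summarize`, `DEMO_LOG`, and the three demo lines are hypothetical, and only the column positions (0 = IP, 6 = status code, 7 = size) come from the script. It assumes a non-empty log.

```python
import io

import pandas as pd


def summarize(source, chunksize=1_000_000, top_n=10):
    """Aggregate a space-separated CDN log chunk by chunk (hypothetical helper)."""
    byte_sum = 0
    status_parts = []   # per-chunk status code counts
    ip_parts = []       # per-chunk IP counts

    reader = pd.read_csv(source, sep=' ', header=None,
                         names=list(range(10)), chunksize=chunksize)
    for chunk in reader:
        byte_sum += int(chunk[7].sum())
        status_parts.append(chunk[6].value_counts())
        ip_parts.append(chunk[0].value_counts())

    # Merge the partial counts: the same key may occur in several chunks,
    # so group by the index and add the counts up.
    status_counts = pd.concat(status_parts).groupby(level=0).sum()
    ip_counts = pd.concat(ip_parts).groupby(level=0).sum()

    return {
        'byte_sum': byte_sum,
        'status_counts': status_counts.sort_values(ascending=False),
        'top_ip': ip_counts.sort_values(ascending=False).head(top_n),
    }


# Made-up demo lines in the log format described above.
DEMO_LOG = (
    '1.1.1.1 - 10 [16/Nov/2016:04:36:40 +0800] "GET http://www.qn.com/1.jpg -" 200 500 "-" "UA-1"\n'
    '1.1.1.1 - 12 [16/Nov/2016:04:36:41 +0800] "GET http://www.qn.com/2.jpg -" 200 300 "-" "UA-1"\n'
    '2.2.2.2 - 15 [16/Nov/2016:04:36:42 +0800] "GET http://www.qn.com/1.jpg -" 404 200 "-" "UA-2"\n'
)

# A tiny chunksize forces the merge step to combine two chunks.
result = summarize(io.StringIO(DEMO_LOG), chunksize=2)
for name, value in result.items():
    print(name)
    print(value)
```

The trade-off is that exact top-N rankings still need the complete count table for each key, so memory use scales with the number of distinct IPs or URLs rather than with the number of log lines.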
