Multiprocess CSV Import into MySQL with Python


Updated 2023-05-04


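A while ago I helped a colleague import CSV data into MySQL. There were two large CSV files: one of 3 GB with 21 million records and one of 7 GB with 35 million records. At this scale a simple single-process, single-threaded import takes a very long time, so I ended up using multiple processes. I won't go through the whole process; here are the key points: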

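Insert in batches rather than row by row

To speed up inserts, hold off on creating indexes

Use a producer/consumer model: the main process reads the file and several worker processes perform the inserts

Limit the number of workers so that MySQL is not put under too much load

Handle exceptions caused by dirty data

The source data is GBK-encoded, so it also has to be converted to UTF-8

Wrap the command-line tool with click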
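Three of the points above (batching, dirty rows, and GBK decoding) can be tried out without a database. The following is a minimal Python 3 sketch, not the article's code. The function `read_batches`, its parameters, and the rule that a row counts as "dirty" when its field count differs from the expected column count are all assumptions made for illustration.

```python
import csv
import io


def read_batches(binary_stream, n_cols, batch_size, dirty_rows, encoding="gbk"):
    """Decode a CSV byte stream and yield lists of at most batch_size row tuples.

    Rows whose field count differs from n_cols are appended to dirty_rows
    and skipped (an assumed definition of "dirty data").
    """
    # Decoding happens here; newline="" lets the csv module handle line endings.
    text = io.TextIOWrapper(binary_stream, encoding=encoding, newline="")
    batch = []
    for row in csv.reader(text):
        if len(row) != n_cols:
            dirty_rows.append(row)
            continue
        batch.append(tuple(row))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final batch, which may be shorter than batch_size.
        yield batch
```

Once decoded, the rows are ordinary Python strings, so with a UTF-8 connection charset the database driver would normally handle the encoding on the way out. Each yielded batch could be passed to one bulk insert call, and `dirty_rows` could be written to a separate file for later inspection.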

The implementation is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import codecs
import csv
import logging
import multiprocessing
import os
import warnings

import click
import MySQLdb
import sqlalchemy

warnings.filterwarnings('ignore', category=MySQLdb.Warning)

# Number of rows per batch insert
BATCH = 5000

DB_URI = 'mysql://root@localhost:3306/example?charset=utf8'

# Note: Engine.execute() with a raw SQL string, as used below, is the
# pre-2.0 SQLAlchemy API.
engine = sqlalchemy.create_engine(DB_URI)


def get_table_cols(table):
    # LIMIT 0 returns no rows but still exposes the column names
    sql = 'SELECT * FROM `{table}` LIMIT 0'.format(table=table)
    res = engine.execute(sql)
    return res.keys()


def insert_many(table, cols, rows, cursor):
    # `cursor` is actually a SQLAlchemy engine; each tuple in `rows`
    # is one set of parameters for the INSERT statement
    sql = 'INSERT INTO `{table}` ({cols}) VALUES ({marks})'.format(
        table=table,
        cols=', '.join(cols),
        marks=', '.join(['%s'] * len(cols)))
    cursor.execute(sql, *rows)
    logging.info('process %s inserted %s rows into table %s', os.getpid(), len(rows), table)


def insert_worker(table, cols, queue):
    rows = []
    # Each child process creates its own engine object
    cursor = sqlalchemy.create_engine(DB_URI)
    while True:
        row = queue.get()
        # None is the end-of-work signal: flush the remaining rows and exit
        if row is None:
            if rows:
                insert_many(table, cols, rows, cursor)
            break

        rows.append(row)
        if len(rows) == BATCH:
            insert_many(table, cols, rows, cursor)
            rows = []


def insert_parallel(table, reader, w=10):
    cols = get_table_cols(table)
    # (the source excerpt is cut off here; the rest of insert_parallel is not shown)
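The worker loop in insert_worker stops when it reads None from the queue. As a self-contained illustration of that producer/consumer pattern, here is a minimal sketch. It is not the remainder of the author's insert_parallel: the names `batch_worker`, `log_batch`, and `run_parallel`, the queue size, and the use of a logging stand-in instead of a real database insert are assumptions.

```python
import multiprocessing
import os


def batch_worker(task_queue, flush, batch_size):
    """Collect rows from task_queue and hand them to flush() in batches.

    A None item is the shutdown signal: flush any leftovers, then return.
    """
    rows = []
    while True:
        row = task_queue.get()
        if row is None:
            if rows:
                flush(rows)
            return
        rows.append(row)
        if len(rows) == batch_size:
            flush(rows)
            rows = []


def log_batch(rows):
    """Hypothetical stand-in for the real bulk insert."""
    print("pid %s received %d rows" % (os.getpid(), len(rows)))


def run_parallel(rows, n_workers=4, batch_size=1000):
    """Feed rows to n_workers processes through a bounded queue."""
    # With a bounded queue, put() blocks when the workers fall behind,
    # so the reader cannot pile up unlimited data in memory.
    # The size used here is an arbitrary choice.
    task_queue = multiprocessing.Queue(maxsize=n_workers * batch_size * 2)
    workers = [
        multiprocessing.Process(
            target=batch_worker, args=(task_queue, log_batch, batch_size))
        for _ in range(n_workers)
    ]
    for p in workers:
        p.start()
    for row in rows:
        task_queue.put(row)
    # Each worker exits on the first None it reads, so one None per
    # worker is enough to shut all of them down.
    for _ in workers:
        task_queue.put(None)
    for p in workers:
        p.join()
```

A call such as `run_parallel(((i, "name") for i in range(10000)), n_workers=2)` should be placed under an `if __name__ == "__main__":` guard, which is required on platforms where multiprocessing starts child processes by spawning a fresh interpreter. The number of workers is the knob that limits how much concurrent load reaches the database.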


