MySQL Deduplication Optimization in Practice: From Theory to Extreme Tuning

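Updated 2024-08-31

"This article discusses how to optimize deduplication in MySQL and presents an efficient approach for tables that contain a large amount of duplicate data. It starts from a concrete scenario: a source table holds one million rows, of which 500,000 have duplicate `created_time` and `item_name` values, and the deduplicated data must be inserted into a target table. The test environment is CentOS 6.4 with 8 GB of memory, a 100 GB hard disk, two dual-core CPUs, and MySQL 8.0.16."

Deduplication in MySQL is usually a performance problem, especially on large data sets. For the scenario above, the following strategies can be considered:

1. Use the DISTINCT keyword: the most direct approach is `SELECT DISTINCT`, but it is slow on large data sets because, without a usable index, it scans the whole table and typically builds a temporary table to eliminate duplicates. It also removes only rows that are identical in every selected column.

2. Create a unique index: create a composite unique index on `created_time` and `item_name` in the target table. Combined with `INSERT IGNORE`, the database skips rows whose combination already exists; a plain `INSERT` would instead fail with a duplicate-key error. The index takes extra storage and can slow down writes.

3. Use a temporary table: load the data into a temporary table first, then use `GROUP BY` with `MIN`/`MAX` in an `INSERT INTO ... SELECT` statement to write the deduplicated rows to the target table. Grouping eliminates the duplicates, but memory consumption can be high.

4. Use the ROW_NUMBER() window function: MySQL 8.0 introduced window functions, so `ROW_NUMBER()` with `PARTITION BY` can number the rows in each duplicate group, and keeping only the first row of each group removes the duplicates. In MySQL the `WITH` clause follows `INSERT INTO`, and the helper column `rn` must be left out of the select list so that the column count matches the target table. For example:

```
INSERT INTO t_target
WITH cte AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY created_time, item_name ORDER BY item_id) AS rn
  FROM t_source
)
SELECT item_id, created_time, modified_time, item_name, other
FROM cte
WHERE rn = 1;
```

This keeps only the first row of each duplicate group, but it can be computationally expensive.

5. Process in parallel: if the hardware allows, split the large table into smaller chunks and deduplicate each chunk separately. This takes advantage of multiple CPU cores.

6. Optimize the execution plan: inspect and adjust the query's execution plan to make sure the database uses suitable indexes and optimizer strategies.

In practice, the best strategy depends on data volume, hardware resources, and business requirements. In a test environment, use the EXPLAIN command to analyze the query plan and tune the SQL further. Also monitor performance indicators such as CPU usage, memory usage, and disk I/O to make sure the system stays stable while processing large amounts of data.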

How to Optimize MySQL Deduplication to the Extreme

This article explains in detail how to push MySQL deduplication performance to its limit and may serve as a useful reference.
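The unique-index strategy listed in the summary above can be sketched as follows. This is a minimal illustration, not a benchmarked solution: it assumes t_target is empty and has the same columns as t_source, and the index name uk_created_item is hypothetical.

```sql
-- Assumption: t_target is empty and structured like t_source.
-- uk_created_item is a hypothetical index name chosen for this sketch.
alter table t_target
  add unique index uk_created_item (created_time, item_name);

-- INSERT IGNORE downgrades duplicate-key errors to warnings, so the first row
-- inserted for each (created_time, item_name) pair is kept and later rows
-- with the same pair are skipped.
insert ignore into t_target
select item_id, created_time, modified_time, item_name, other
from t_source;
```

Two caveats apply. A MySQL unique index allows multiple NULL values, so rows where created_time or item_name is NULL are not treated as duplicates of each other. INSERT IGNORE also turns some other errors, such as data conversion errors, into warnings, so the warning count is worth checking afterwards.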

•Problem statement

The source table t_source has the following structure:

item_id int,

created_time datetime,

modified_time datetime,

item_name varchar(20),

other varchar(20)

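Requirements:

1. The source table contains 1 million rows, 500,000 of which have duplicate created_time and item_name values.

2. The 500,000 rows that remain after deduplication must be written to the target table.

3. Where several rows share the same created_time and item_name, any one of them may be kept; no rule is imposed.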

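•Test environment

Linux virtual machine: CentOS release 6.4; 8 GB physical memory (4 GB configured for MySQL); 100 GB mechanical hard disk; two physical dual-core CPUs, four processors in total; MySQL 8.0.16.

•Creating the test tables and data

-- Create the source table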

create table t_source

( item_id int,

created_time datetime,

modified_time datetime,

item_name varchar(20),

other varchar(20)

);

-- Create the target table

create table t_target like t_source;

-- Generate 1 million test rows, 500,000 of which have duplicate created_time and item_name values

delimiter //

create procedure sp_generate_data()

begin

set @i := 1;

while @i<=500000 do

set @created_time := date_add('2017-01-01',interval @i second);

set @modified_time := @created_time;

set @item_name := concat('a',@i);

insert into t_source

values (@i,@created_time,@modified_time,@item_name,'other');

set @i:=@i+1;

end while;

commit;

set @last_insert_id := 500000;

insert into t_source

select item_id + @last_insert_id,

created_time,

date_add(modified_time,interval @last_insert_id second),

item_name,

'other'

from t_source;

commit;

end //

delimiter ;

call sp_generate_data();

-- The source table has no primary key or unique constraint, so two fully identical rows may exist; insert one more row to simulate this case.

insert into t_source select * from t_source where item_id=1;

The source table now holds 1,000,001 rows; the deduplicated target table should hold 500,000 rows.

mysql> select count(*),count(distinct created_time,item_name) from t_source;

+----------+----------------------------------------+

| count(*) | count(distinct created_time,item_name) |

+----------+----------------------------------------+

| 1000001 | 500000 |

+----------+----------------------------------------+

1 row in set (1.92 sec)
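The same kind of check can be run against the target table once any of the deduplication statements that follow has populated it. The queries below are a sketch; the column aliases are illustrative only.

```sql
-- After a correct deduplication both counts should be equal
-- (500000 for the test data described above).
select count(*) as total_rows,
       count(distinct created_time, item_name) as distinct_pairs
from t_target;

-- List any (created_time, item_name) pair that still occurs more than once.
-- An empty result means no duplicates remain.
select created_time, item_name, count(*) as cnt
from t_target
group by created_time, item_name
having count(*) > 1;
```

Note that count(distinct ...) ignores rows in which any of the listed columns is NULL. The generated test data contains no NULL values, so this does not affect the check here.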

I. Making Clever Use of Indexes and Variables

1. Baseline tests without an index

(1) Using a correlated subquery

truncate t_target;

insert into t_target

select distinct t1.* from t_source t1 where item_id in

(select min(item_id) from t_source t2 where t1.created_time=t2.created_time and t1.item_name=t2.item_name);

This statement ran for a long time without returning a result, so we only look at its execution plan.

mysql> explain select distinct t1.* from t_source t1 where item_id in

-> (select min(item_id) from t_source t2 where t1.created_time=t2.created_time and t1.item_name=t2.item_name);

+----+--------------------+-------+------------+------+---------------+------+---------+------+--------+----------+------------------------------+

| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |

+----+--------------------+-------+------------+------+---------------+------+---------+------+--------+----------+------------------------------+

| 1 | PRIMARY | t1 | NULL | ALL | NULL | NULL | NULL | NULL | 997282 | 100.00 | Using where; Using temporary |

| 2 | DEPENDENT SUBQUERY | t2 | NULL | ALL | NULL | NULL | NULL | NULL | 997282 | 1.00 | Using where |

+----+--------------------+-------+------------+------+---------------+------+---------+------+--------+----------+------------------------------+

2 rows in set, 3 warnings (0.00 sec)

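Both the outer query and the correlated subquery perform a full table scan, so about 1 million × 1 million rows must be scanned in total. No wonder no result comes back.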
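The cost is roughly 10^6 outer rows times 10^6 inner rows, or about 10^12 row comparisons. One possible way to give the subquery something better than a full scan is a composite index covering the correlation columns and item_id. The sketch below only shows how to create such an index and re-check the plan; the index name idx_dedup is hypothetical, and no timing or plan is claimed for it.

```sql
-- Hypothetical index name; the column order puts the two equality columns
-- first so that min(item_id) can be looked up within each matching group.
create index idx_dedup on t_source (created_time, item_name, item_id);

-- Refresh index statistics so the optimizer can take the new index into account.
analyze table t_source;

-- Re-run EXPLAIN on the same statement and compare the type, key and rows
-- columns for t2 with the plan shown above.
explain
select distinct t1.* from t_source t1 where item_id in
(select min(item_id) from t_source t2
  where t1.created_time = t2.created_time and t1.item_name = t2.item_name);
```

Since this part of the article is a baseline without indexes, the index should be dropped again (drop index idx_dedup on t_source;) before repeating the no-index tests.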

(2) Using a table join

truncate t_target;

insert into t_target

select distinct t1.* from t_source t1,

(select min(item_id) item_id,created_time,item_name from t_source group by created_time,item_name) t2

where t1.item_id = t2.item_id;

This method takes 14 seconds. The query plan is as follows:

mysql> explain select distinct t1.* from t_source t1, (select min(item_id) item_id,created_time,item_name from t_source group by created_time,item_name) t2 where t1.item_id = t2.item_id;

+----+-------------+------------+------------+------+---------------+-------------+---------+-----------------+--------+----------+------------------------------+

| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |

+----+-------------+------------+------------+------+---------------+-------------+---------+-----------------+--------+----------+------------------------------+

| 1 | PRIMARY | t1 | NULL | ALL | NULL | NULL | NULL | NULL | 997282 | 100.00 | Using where; Using temporary |

| 1 | PRIMARY | <derived2> | NULL | ref | <auto_key0> | <auto_key0> | 5 | test.t1.item_id | 10 | 100.00 | Distinct |

+----+-------------+------------+------------+------+---------------+-------------+---------+-----------------+--------+----------+------------------------------+

3 rows in set, 1 warning (0.00 sec)

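•The inner query scans all 1 million rows of t_source, builds a temporary table, finds the minimum item_id of each deduplicated group, and produces the derived table derived2, which has 500,000 rows.

•MySQL automatically creates an index, auto_key0, on the item_id column of the derived table derived2.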
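To make the two steps above concrete, the same work can be written out by hand with an explicit temporary table and index. This is only an illustration of what the optimizer does internally, not a faster alternative; the names tmp_min_id and idx_item_id are hypothetical.

```sql
truncate t_target;

-- Step 1: materialize the minimum item_id per (created_time, item_name) group,
-- which corresponds to the derived table derived2 in the plan.
create temporary table tmp_min_id as
select min(item_id) as item_id, created_time, item_name
from t_source
group by created_time, item_name;

-- Step 2: index the join column, which corresponds to auto_key0.
alter table tmp_min_id add index idx_item_id (item_id);

-- Step 3: join back to the source table to fetch the full rows.
insert into t_target
select distinct t1.*
from t_source t1
join tmp_min_id t2 on t1.item_id = t2.item_id;

drop temporary table tmp_min_id;
```

The distinct in the last step is still needed: the test data contains two fully identical rows with item_id = 1, and both of them match the single item_id = 1 entry in the temporary table.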
