将将MySQL去重操作优化到极致的操作方法去重操作优化到极致的操作方法
主要介绍了如何将MySQL去重操作优化到极致,本文给大家介绍的非常详细,具有一定的参考借鉴价值,需要的朋友可以参考下
•问题提出
源表t_source结构如下:
item_id int,
created_time datetime,
modified_time datetime,
item_name varchar(20),
other varchar(20)
要求:
1.源表中有100万条数据,其中有50万created_time和item_name重复。
2.要把去重后的50万数据写入到目标表。
3.重复created_time和item_name的多条数据,可以保留任意一条,不做规则限制。
•实验环境
Linux虚机:CentOS release 6.4;8G物理内存(MySQL配置4G);100G机械硬盘;双物理CPU双核,共四个处理器;MySQL 8.0.16。
•建立测试表和数据
-- 建立源表
create table t_source
( item_id int,
created_time datetime,
modified_time datetime,
item_name varchar(20),
other varchar(20)
);
-- 建立目标表
create table t_target like t_source;
-- 生成100万测试数据,其中有50万created_time和item_name重复
delimiter //
create procedure sp_generate_data()
begin
set @i := 1;
while @i<=500000 do
set @created_time := date_add('2017-01-01',interval @i second);
set @modified_time := @created_time;
set @item_name := concat('a',@i);
insert into t_source
values (@i,@created_time,@modified_time,@item_name,'other');
set @i:=@i+1;
end while;
commit;
set @last_insert_id := 500000;
insert into t_source
select item_id + @last_insert_id,
created_time,
date_add(modified_time,interval @last_insert_id second),
item_name,
'other'
from t_source;
commit;
end
//
delimiter ;
call sp_generate_data();
-- 源表没有主键或唯一性约束,有可能存在两条完全一样的数据,所以再插入一条记录模拟这种情况。
insert into t_source select * from t_source where item_id=1;
源表中有1000001条记录,去重后的目标表应该有500000条记录。
mysql> select count(*),count(distinct created_time,item_name) from t_source;
+----------+----------------------------------------+
| count(*) | count(distinct created_time,item_name) |
+----------+----------------------------------------+
| 1000001 | 500000 |
+----------+----------------------------------------+
1 row in set (1.92 sec)
一、巧用索引与变量一、巧用索引与变量
1. 无索引对比测试无索引对比测试
(1)使用相关子查询
truncate t_target;
insert into t_target
select distinct t1.* from t_source t1 where item_id in
(select min(item_id) from t_source t2 where t1.created_time=t2.created_time and t1.item_name=t2.item_name);
这个语句很长时间都出不来结果,只看一下执行计划吧。
mysql> explain select distinct t1.* from t_source t1 where item_id in
-> (select min(item_id) from t_source t2 where t1.created_time=t2.created_time and t1.item_name=t2.item_name);
+----+--------------------+-------+------------+------+---------------+------+---------+------+--------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-------+------------+------+---------------+------+---------+------+--------+----------+------------------------------+
| 1 | PRIMARY | t1 | NULL | ALL | NULL | NULL | NULL | NULL | 997282 | 100.00 | Using where; Using temporary |
| 2 | DEPENDENT SUBQUERY | t2 | NULL | ALL | NULL | NULL | NULL | NULL | 997282 | 1.00 | Using where |
+----+--------------------+-------+------------+------+---------------+------+---------+------+--------+----------+------------------------------+
2 rows in set, 3 warnings (0.00 sec)
主查询和相关子查询都是全表扫描,一共要扫描100万*100万数据行,难怪出不来结果。
(2)使用表连接
truncate t_target;
insert into t_target
select distinct t1.* from t_source t1,
(select min(item_id) item_id,created_time,item_name from t_source group by created_time,item_name) t2
where t1.item_id = t2.item_id;
这种方法用时14秒,查询计划如下:
mysql> explain select distinct t1.* from t_source t1, (select min(item_id) item_id,created_time,item_name from t_source group by created_time,item_name) t2 where t1.item_id = t2.item_id;
+----+-------------+------------+------------+------+---------------+-------------+---------+-----------------+--------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+------+---------------+-------------+---------+-----------------+--------+----------+------------------------------+
| 1 | PRIMARY | t1 | NULL | ALL | NULL | NULL | NULL | NULL | 997282 | 100.00 | Using where; Using temporary |
| 1 | PRIMARY | <derived2> | NULL | ref | <auto_key0> | <auto_key0> | 5 | test.t1.item_id | 10 | 100.00 | Distinct |
| 2 | DERIVED | t_source | NULL | ALL | NULL | NULL | NULL | NULL | 997282 | 100.00 | Using temporary |
+----+-------------+------------+------------+------+---------------+-------------+---------+-----------------+--------+----------+------------------------------+
3 rows in set, 1 warning (0.00 sec)
•内层查询扫描t_source表的100万行,建立临时表,找出去重后的最小item_id,生成导出表derived2,此导出表有50万行。
•MySQL会在导出表derived2上自动创建一个item_id字段的索引auto_key0。