如何优化数百万行的计数和按查询排序

rks48beu  于 2021-06-23  发布在  Mysql
关注(0)|答案(4)|浏览(425)

需要帮助优化order by和count查询,我有数百万(约300万)行的表。
我必须连接4个表并获取记录,当我运行简单的查询时,它只需要毫秒就可以完成,但是当我试图通过离开连接表来计数或排序时,它会被无限时间卡住。
请看下面的案例。

数据库服务器配置:

CPU Number of virtual cores: 4
Memory(RAM): 16 GiB
Network Performance: High

每个表中的行:

tbl_customers -  #Rows: 20 million.
tbl_customers_address -  #Row 25 million.
tbl_shop_setting - #Rows 50k
aio_customer_tracking - #Rows 5k

表架构:

CREATE TABLE `tbl_customers` (
    `id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
    `shopify_customer_id` BIGINT(20) UNSIGNED NOT NULL,
    `shop_id` BIGINT(20) UNSIGNED NOT NULL,
    `email` VARCHAR(225) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `accepts_marketing` TINYINT(1) NULL DEFAULT NULL,
    `first_name` VARCHAR(50) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `last_name` VARCHAR(50) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `last_order_id` BIGINT(20) NULL DEFAULT NULL,
    `total_spent` DECIMAL(12,2) NULL DEFAULT NULL,
    `phone` VARCHAR(20) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `verified_email` TINYINT(4) NULL DEFAULT NULL,
    `updated_at` DATETIME NULL DEFAULT NULL,
    `created_at` DATETIME NULL DEFAULT NULL,
    `date_updated` DATETIME NULL DEFAULT NULL,
    `date_created` DATETIME NULL DEFAULT NULL,
    PRIMARY KEY (`id`),
    UNIQUE INDEX `shopify_customer_id_unique` (`shopify_customer_id`),
    INDEX `email` (`email`),
    INDEX `shopify_customer_id` (`shopify_customer_id`),
    INDEX `shop_id` (`shop_id`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB;

CREATE TABLE `tbl_customers_address` (
    `id` BIGINT(20) NOT NULL AUTO_INCREMENT,
    `customer_id` BIGINT(20) NULL DEFAULT NULL,
    `shopify_address_id` BIGINT(20) NULL DEFAULT NULL,
    `shopify_customer_id` BIGINT(20) NULL DEFAULT NULL,
    `first_name` VARCHAR(50) NULL DEFAULT NULL,
    `last_name` VARCHAR(50) NULL DEFAULT NULL,
    `company` VARCHAR(50) NULL DEFAULT NULL,
    `address1` VARCHAR(250) NULL DEFAULT NULL,
    `address2` VARCHAR(250) NULL DEFAULT NULL,
    `city` VARCHAR(50) NULL DEFAULT NULL,
    `province` VARCHAR(50) NULL DEFAULT NULL,
    `country` VARCHAR(50) NULL DEFAULT NULL,
    `zip` VARCHAR(15) NULL DEFAULT NULL,
    `phone` VARCHAR(20) NULL DEFAULT NULL,
    `name` VARCHAR(50) NULL DEFAULT NULL,
    `province_code` VARCHAR(5) NULL DEFAULT NULL,
    `country_code` VARCHAR(5) NULL DEFAULT NULL,
    `country_name` VARCHAR(50) NULL DEFAULT NULL,
    `longitude` VARCHAR(250) NULL DEFAULT NULL,
    `latitude` VARCHAR(250) NULL DEFAULT NULL,
    `default` TINYINT(1) NULL DEFAULT NULL,
    `is_geo_fetched` TINYINT(1) NOT NULL DEFAULT '0',
    PRIMARY KEY (`id`),
    INDEX `customer_id` (`customer_id`),
    INDEX `shopify_address_id` (`shopify_address_id`),
    INDEX `shopify_customer_id` (`shopify_customer_id`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB;

CREATE TABLE `tbl_shop_setting` (
    `id` INT(11) NOT NULL AUTO_INCREMENT,   
    `shop_name` VARCHAR(300) NOT NULL COLLATE 'latin1_swedish_ci',
     PRIMARY KEY (`id`),
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB;

CREATE TABLE `aio_customer_tracking` (
    `id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
    `shopify_customer_id` BIGINT(20) UNSIGNED NOT NULL,
    `email` VARCHAR(255) NULL DEFAULT NULL,
    `shop_id` BIGINT(20) UNSIGNED NOT NULL,
    `domain` VARCHAR(255) NULL DEFAULT NULL,
    `web_session_count` INT(11) NOT NULL,
    `last_seen_date` DATETIME NULL DEFAULT NULL,
    `last_contact_date` DATETIME NULL DEFAULT NULL,
    `last_email_open` DATETIME NULL DEFAULT NULL,
    `created_date` DATETIME NOT NULL,
    `is_geo_fetched` TINYINT(1) NOT NULL DEFAULT '0',
    PRIMARY KEY (`id`),
    INDEX `shopify_customer_id` (`shopify_customer_id`),
    INDEX `email` (`email`),
    INDEX `shopify_customer_id_shop_id` (`shopify_customer_id`, `shop_id`),
    INDEX `last_seen_date` (`last_seen_date`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB;

正在运行和未运行的查询案例:

1. Running:  Below query fetch the records by joining all the 4 tables, It takes only 0.300 ms.

SELECT `c`.first_name,`c`.last_name,`c`.email, `t`.`last_seen_date`, `t`.`last_contact_date`, `ssh`.`shop_name`, ca.`company`, ca.`address1`, ca.`address2`, ca.`city`, ca.`province`, ca.`country`, ca.`zip`, ca.`province_code`, ca.`country_code`
FROM `tbl_customers` AS `c`
JOIN `tbl_shop_setting` AS `ssh` ON c.shop_id = ssh.id 
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
LEFT JOIN `tbl_customers_address` as ca ON (c.shopify_customer_id = ca.shopify_customer_id AND ca.default = 1)
GROUP BY c.shopify_customer_id
LIMIT 20

2. Not running: Simply when try to get the count of these row stuk the query, I waited 10 min but still running.

SELECT 
     COUNT(DISTINCT c.shopify_customer_id)   -- what makes #2 different
FROM `tbl_customers` AS `c`
JOIN `tbl_shop_setting` AS `ssh` ON c.shop_id = ssh.id 
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
LEFT JOIN `tbl_customers_address` as ca ON (c.shopify_customer_id = ca.shopify_customer_id AND ca.default = 1)
GROUP BY c.shopify_customer_id
LIMIT 20

3. Not running: In the #1 query we simply put the 1 Order by clause and it get stuck, I waited 10 min but still running. I study query optimization some article and tried by indexing, Right Join etc.. but still not working.

SELECT `c`.first_name,`c`.last_name,`c`.email, `t`.`last_seen_date`, `t`.`last_contact_date`, `ssh`.`shop_name`, ca.`company`, ca.`address1`, ca.`address2`, ca.`city`, ca.`province`, ca.`country`, ca.`zip`, ca.`province_code`, ca.`country_code`
FROM `tbl_customers` AS `c`
JOIN `tbl_shop_setting` AS `ssh` ON c.shop_id = ssh.id 
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
LEFT JOIN `tbl_customers_address` as ca ON (c.shopify_customer_id = ca.shopify_customer_id AND ca.default = 1)
GROUP BY c.shopify_customer_id
  ORDER BY `t`.`last_seen_date`    -- what makes #3 different
LIMIT 20

解释问题1:

解释问题2:

解释问题3:

欢迎提出优化查询、表结构的建议。

我想做的是: tbl_customers 表包含客户信息, tbl_customer_address 表包含客户的地址(一个客户可能有多个地址),以及 aio_customer_tracking 表包含客户的访问记录 last_seen_date 是访问日期。

现在,我只想获取和统计客户,以及他们的地址和访问信息。另外,我可以按这3个表中的任何列排序,在我的示例中,我是按上次看到的日期排序(默认顺序)。希望这个解释有助于理解我在做什么。

z9ju0rcb

z9ju0rcb1#

shopify_customer_id 是独一无二的 tbl_customers 表,然后在第二个查询中,为什么在中使用distinct和group by shopify_customer_id 列?
请把它扔掉。

jslywgbw

jslywgbw2#

在查询#1(而不是其他两个)中,优化器可以使用

UNIQUE INDEX `shopify_customer_id_unique` (`shopify_customer_id`)

缩短查询时间

GROUP BY c.shopify_customer_id
LIMIT 20

这是因为它可以在索引的20项之后停止。由于派生表(子查询)的原因,查询速度不是很快 t )大约有51k行。
查询#2可能很慢,因为优化器没有注意到并删除多余的 DISTINCT . 相反,它可能认为20岁以后就停不下来了。
查询#3必须完全遍历表 c 得到每一个 shopify_customer_id 组。这是因为 ORDER BY 防止短路到达 LIMIT 20 .
a中的列 GROUP BY 必须包含 SELECT 除了由group by列唯一定义的任何列。既然你说过一个地址可以有多个 shopify_customer_id ,然后获取 ca.address1 不适合与 GROUP BY shopify_customer_id . 类似地,子查询对于 last_seen_date, last_contact_date .
aio_customer_tracking ,这种变化(对“覆盖”索引)可能会有一些帮助:

INDEX (`shopify_customer_id`)

INDEX (`shopify_customer_id`, `last_seen_date`, `last_contact_date`)

剖析目标
现在,我只想。。。数一数顾客
要计算客户数量,请这样做,但不要试图将其与“获取”结合起来:

SELECT COUNT(*) FROM tbl_customers;

现在,我只想把。。。顾客们。。。
tbl#客户-#行:2000万。
你肯定不想拍到两千万排!我不想去想该怎么做。请澄清。我不接受这么多行的分页。也许有一个 WHERE 条款??这个 WHERE 子句通常是优化最重要的部分!
现在,我只想把。。。客户,其中一个地址和访问信息。
假设 WHERE 过滤到“少数”客户,然后 JOINing 转到另一个表以获取“any”地址和“any”访问信息,可能有问题和/或效率低下。要求“第一个”或“最后一个”而不是“任何”不会更容易,但可能更有意义。
我建议你的用户界面先找几个客户,然后如果用户愿意,转到另一个包含所有地址和访问量的页面。或者访问量能达到数百次或更多?
另外,我可以按这3个表中的任何列排序,在我的示例中,我是按上次看到的日期排序(默认顺序)。
让我们集中精力优化 WHERE ,然后钉住 last_seen_date 在任何索引的末尾。

rkue9o1l

rkue9o1l3#

你有太多的索引,这可能是一个真正的性能杀手当涉及到插入,更新和删除,以及偶尔选择取决于优化设置。
同时,拆下 GROUP BY 声明。
关于正确使用聚集索引和非聚集索引,我还有更多要说的, GROUP BY , ORDER BY , WHERE ,和视图,以进行查询优化。但是,我认为如果删除一些索引,查询速度会大大加快(也可以修改查询以遵循更严格的sql标准,使其更符合逻辑,但这超出了本问题的范围。)
还有一件事-你对查询结果做了什么?它是否被存储在某个地方并被访问以进行查找、用于计算、用于自动报告、通过web数据库连接显示等?这是有区别的,因为如果您只需要一个报表/备份或导出到一个平面文件,那么就有更有效的方法来获取这些数据。很多不同的选择取决于你在做什么。

cgvd09ve

cgvd09ve4#

查询2包含其他人指出的逻辑错误: count(distinct(c.shopify_customer_id)) 将返回单个值,因此您的groupby只会使查询复杂化(这可能确实会使mysql首先通过shopify\u customer\u id进行分组,然后执行 count(distinct(shopify_customer_id )) 这可能是执行时间长的原因
查询3的order by无法优化,因为您正在加入无法索引的子选择。它所花费的时间只是系统排序结果集所需的时间。
解决问题的方法是:
更改索引 shopify_customer_id ( shopify_customer_id )表tbl\u customers\u address to shopify_customer_id ( shopify_customer_id , default )优化以下查询
使用查询1的结果(result)创建一个表,但不使用 LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id .
更改结果表,为last\u seen\u date添加一列,并为last\u seen\u date和shopify\u customer\u id添加索引
为此查询的结果创建表(最后一天): SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id 使用表last\u date中的值更新结果表
现在,您可以使用创建的索引对按最后一天排序的结果表运行查询。
整个过程所用的时间应该比执行查询2或查询3少

相关问题