什么是直方图,在MySQL 8.0.3如何创建直方图?
MySQL的直方图是如何影响执行计划生成的?
创建直方图有哪些注意事项?
直方图和索引对优化器的选择上有什么差异,又该如何选择?
如何判断直方图对执行计划的影响?
MySQL官方blog的这篇文章用非常具体的示例回答了这一系列问题,let\'s go。
原文地址为https://dev.mysql.com/blog-archive/histogram-statistics-in-mysql/,以下为译文:
从MySQL 8.0.3开始,您可以创建直方图统计信息,以便向优化器提供更多的统计信息。在这篇博文中,我们将看看如何创建直方图统计数据,并解释何时使用直方图统计数据可能有用。
什么是直方图
查询优化器是数据库中负责将SQL查询转换为尽可能高效的执行计划的部分。有时,查询优化器无法找到最有效的计划,并最终花费比所实际所需更多的时间来执行查询。出现这种情况的主要原因通常是优化器对它要查询的数据分布没有足够的了解:
- 每个表中有多少行?
- 每一列有多少不同的值?
- 数据如何分布在每一列中?
一个查询优化器缺少重要信息的示例,人们每天睡觉时间的数据表:
CREATE TABLE bedtime ( person_id INT, time_of_day TIME);
对于“time_of_day”这个字段,大多数值很可能是在11:00PM左右,因为大多数人是在这个时间段睡觉的。所以下面第一个查询返回的数据行数要比第二个查询返回的数据要多。
1) SELECT * FROM bedtime WHERE time_of_day BETWEEN \"22:00:00\" AND \"23:59:00\" 2) SELECT * FROM bedtime WHERE time_of_day BETWEEN \"12:00:00\" AND \"14:00:00\"
在没有任何统计数据可用的情况下,优化器将假设“time_of_day”中的值是均匀分布的(即,一个人在下午3点左右睡觉的可能性与晚上11点左右睡觉的可能性相同)。如何使查询优化器意识到数据中的这种偏斜度?对此的一个解决方案是为该列创建直方图统计信息。
直方图是一列数据分布的近似值。它可以相当准确地告诉您,您的数据是否有偏差,这反过来将帮助数据库服务器理解它所包含的数据的性质。直方图有很多不同的风格,在MySQL中我们选择支持两种不同的类型:“单例(等宽)”直方图和“等高”直方图。所有直方图类型的共同点是,它们将数据集分割为一组“桶”,MySQL自动将值划分为桶,并自动决定创建什么类型的直方图。
如何创建和删除直方图统计
为了管理直方图统计数据,我们扩展了ANALYZE TABLE,增加了两个新的子句:
ANALYZE TABLE tbl_name UPDATE HISTOGRAM ON col_name [, col_name] WITH N BUCKETS; ANALYZE TABLE tbl_name DROP HISTOGRAM ON col_name [, col_name];
第一种语法允许你同时为一个或多个列创建直方图统计数据:
mysql> ANALYZE TABLE payment UPDATE HISTOGRAM ON amount WITH 32 BUCKETS; +----------------+-----------+----------+---------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+---------------------------------------------------+ | sakila.payment | histogram | status | Histogram statistics created for column \'amount\'. | +----------------+-----------+----------+---------------------------------------------------+ 1 row in set (0.27 sec) mysql> ANALYZE TABLE payment UPDATE HISTOGRAM ON amount, payment_date WITH 32 BUCKETS; +----------------+-----------+----------+---------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+---------------------------------------------------------+ | sakila.payment | histogram | status | Histogram statistics created for column \'amount\'. | | sakila.payment | histogram | status | Histogram statistics created for column \'payment_date\'. | +----------------+-----------+----------+---------------------------------------------------------+
请注意,必须指定桶的数量,并且可以在 1 到 1024 的范围内(默认为100)。您应该为数据集选择多少个桶取决于几个因素;您有多少个不同的值,您的数据集有多大偏差,您需要多高的准确性等。
但是,在一定数量的桶之后,(再继续加大桶的数据量)对准确性的提高效果相当低。所以我们建议从较低的数字开始,例如 32,如果您发现它不符合您的需求,则增加它。
在上面的例子中,我们可以看到我们已经为列“amount”构建了两次直方图。在第一个查询中,创建了一个新的直方图。在第二个查询中,“amount”的直方图会自动覆盖。
如果你想删除你创建的任何直方图统计数据,你只需使用DROP histogram语法:
mysql> ANALYZE TABLE payment DROP HISTOGRAM ON payment_date; +----------------+-----------+----------+---------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+---------------------------------------------------------+ | sakila.payment | histogram | status | Histogram statistics removed for column \'payment_date\'. | +----------------+-----------+----------+---------------------------------------------------------+
与UPDATE HISTOGRAM一样,您可以在同一个命令中指定多个列。值得注意的一个特性是,ANALYZE TABLE命令将尝试执行尽可能多的工作,即使在命令执行过程中出现了错误。假设您指定了三列,但是第二列不存在。服务器仍然会为第一和第三列创建和存储直方图:
mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_birth_day, c_foobar, c_birth_month WITH 32 BUCKETS; +----------------+-----------+----------+----------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+----------------------------------------------------------+ | tpcds.customer | histogram | status | Histogram statistics created for column \'c_birth_day\'. | | tpcds.customer | histogram | status | Histogram statistics created for column \'c_birth_month\'. | | tpcds.customer | histogram | Error | The column \'c_foobar\' does not exist. | +----------------+-----------+----------+----------------------------------------------------------+ 3 rows in set (0.15 sec)
直方图的创建在数据库内部是如何实现的?
如果您已经阅读了MySQL手册,您可能已经看到了新的系统变量histogram_generation_max_mem_size。这个变量将控制服务器在生成直方图统计数据时允许使用的内存大小(以字节计)。那你为什么要控制它呢?
当您指定想要构建一个直方图时,服务器将把所有数据读入内存并在内存中执行所有工作(包括排序)。如果您想在一个非常大的表上生成一个直方图,那么您可能要冒着将数百兆字节的数据读入内存的风险,这可能是不可取的。因此,为了处理这个问题,MySQL将计算在给定由系统变量histogram_generation_max_mem_size指定的内存量的情况下,它可以将多少行数据放入内存中。如果它意识到它只能在给定的内存限制内装入行的一个子集,它将求助于抽样。这可以通过查看属性“采样率”来观察:
mysql> SET histogram_generation_max_mem_size = 1000000; Query OK, 0 rows affected (0.00 sec) mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_birth_country WITH 16 BUCKETS; +----------------+-----------+----------+------------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+------------------------------------------------------------+ | tpcds.customer | histogram | status | Histogram statistics created for column \'c_birth_country\'. | +----------------+-----------+----------+------------------------------------------------------------+ 1 row in set (0.22 sec) mysql> SELECT histogram->>\'$.\"sampling-rate\"\' -> FROM information_schema.column_statistics -> WHERE table_name = \"customer\" -> AND column_name = \"c_birth_country\"; +---------------------------------+ | histogram->>\'$.\"sampling-rate\"\' | +---------------------------------+ | 0.048743243211626014 | +---------------------------------+
在这里,我们可以看到优化器通过读取“c_birth_country”列中大约4.8%的数据创建了一个直方图。值得注意的是,抽样是不确定的,因此如果使用抽样,在同一个数据集上的两次后续调用“ANALYZE TABLE tbl UPDATE HISTOGRAM…”可能会给您两个不同的直方图。
Query examples
那么,使用直方图统计可以得到什么呢?让我们看看TPC-DS Benchmark Suite中的几个查询,其中添加一个直方图可以在查询执行时间上产生很大的差异。下面我们将使用规模系数为1的TPC-DS,这意味着数据库的大小大约为1GB。这台机器是英特尔酷睿i7-4770,运行Debian Stretch和MySQL 8.0 RC1。这个配置是相当标准的,除了innodb_buffer_pool_size被增加到2G,以便我们可以将整个数据库放入缓冲池中。
为了让优化器实际使用直方图提供的统计数据,您只需确保优化器开关“condition_fanout_filter”处于打开状态。注意,这在默认情况下是打开的。
Query 90
Benchmark Suite 将此查询描述为“具有特定家属人数的客户早上通过互联网售出的商品数量与晚上售出的商品数量之间的比率是多少。仅考虑具有大量内容的网站。
mysql> SELECT CAST(amc AS DECIMAL(15, 4)) / CAST(pmc AS DECIMAL(15, 4)) am_pm_ratio -> FROM (SELECT COUNT(*) amc -> FROM web_sales, -> household_demographics, -> time_dim, -> web_page -> WHERE ws_sold_time_sk = time_dim.t_time_sk -> AND ws_ship_hdemo_sk = household_demographics.hd_demo_sk -> AND ws_web_page_sk = web_page.wp_web_page_sk -> AND time_dim.t_hour BETWEEN 9 AND 9 + 1 -> AND household_demographics.hd_dep_count = 2 -> AND web_page.wp_char_count BETWEEN 5000 AND 5200) at, -> (SELECT COUNT(*) pmc -> FROM web_sales, -> household_demographics, -> time_dim, -> web_page -> WHERE ws_sold_time_sk = time_dim.t_time_sk -> AND ws_ship_hdemo_sk = household_demographics.hd_demo_sk -> AND ws_web_page_sk = web_page.wp_web_page_sk -> AND time_dim.t_hour BETWEEN 15 AND 15 + 1 -> AND household_demographics.hd_dep_count = 2 -> AND web_page.wp_char_count BETWEEN 5000 AND 5200) pt -> ORDER BY am_pm_ratio -> LIMIT 100; +-------------+ | am_pm_ratio | +-------------+ | 1.27619048 | +-------------+ 1 row in set (1.48 sec)
View Code
正如我们所见,执行查询大约需要 1.5 秒。这看起来并不多,但是通过在单个列上添加直方图,我们可以使该查询的运行速度提高三倍(为了便于阅读,查询被截断了);
mysql> ANALYZE TABLE web_page UPDATE HISTOGRAM ON wp_char_count WITH 8 BUCKETS; +----------------+-----------+----------+----------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+----------------------------------------------------------+ | tpcds.web_page | histogram | status | Histogram statistics created for column \'wp_char_count\'. | +----------------+-----------+----------+----------------------------------------------------------+ 1 row in set (0.06 sec) mysql> SELECT ... +-------------+ | am_pm_ratio | +-------------+ | 1.27619048 | +-------------+ 1 row in set (0.50 sec)
View Code
对于这个直方图,查询现在大约需要0.5秒。为什么呢? 主要原因可以通过谓词“web_page.wp_char_count BETWEEN 5000 AND 5200 \"得到。在没有任何统计数据可用的情况下,优化器假定表“web_page”中有11.11%的行匹配给定的谓词。然而,这是错误的。通过验证表里的数据,我们可以看到只有1.6%匹配这个谓词(60行中有一行):
mysql> SELECT -> (SELECT COUNT(*) FROM web_page WHERE web_page.wp_char_count BETWEEN 5000 AND 5200) -> / -> (SELECT COUNT(*) FROM web_page) AS ratio; +--------+ | ratio | +--------+ | 0.0167 | +--------+ 1 row in set (0.00 sec)
有了直方图统计信息,优化器现在知道了这一点,并在连接顺序中提前推入表(译注:原文是pushes the table earlier in the join order,应该是将相关的表选为驱动表,符合小表驱动大表的原则),从而生成执行计划,执行速度提高三倍。
Query 61
该查询描述为“查找给定月份和年份中有促销和没有促销的商品的销售比例”。只有出售给生活在特定时区的客户的特定类别的产品才会被考虑。”这是一个包含多个连接的复杂大查询:
mysql> SELECT promotions, -> total, -> CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 -> FROM (SELECT SUM(ss_ext_sales_price) promotions -> FROM store_sales, -> store, -> promotion, -> date_dim, -> customer, -> customer_address, -> item -> WHERE ss_sold_date_sk = d_date_sk -> AND ss_store_sk = s_store_sk -> AND ss_promo_sk = p_promo_sk -> AND ss_customer_sk = c_customer_sk -> AND ca_address_sk = c_current_addr_sk -> AND ss_item_sk = i_item_sk -> AND ca_gmt_offset = -5 -> AND i_category = \'Home\' -> AND ( p_channel_dmail = \'Y\' -> OR p_channel_email = \'Y\' -> OR p_channel_tv = \'Y\' ) -> AND s_gmt_offset = -5 -> AND d_year = 2000 -> AND d_moy = 12) promotional_sales, -> (SELECT SUM(ss_ext_sales_price) total -> FROM store_sales, -> store, -> date_dim, -> customer, -> customer_address, -> item -> WHERE ss_sold_date_sk = d_date_sk -> AND ss_store_sk = s_store_sk -> AND ss_customer_sk = c_customer_sk -> AND ca_address_sk = c_current_addr_sk -> AND ss_item_sk = i_item_sk -> AND ca_gmt_offset = -5 -> AND i_category = \'Home\' -> AND s_gmt_offset = -5 -> AND d_year = 2000 -> AND d_moy = 12) all_sales -> ORDER BY promotions, -> total -> LIMIT 100; +------------+------------+--------------------------------------------------------------------------+ | promotions | total | CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 | +------------+------------+--------------------------------------------------------------------------+ | 3213210.07 | 5966836.78 | 53.85114741 | +------------+------------+--------------------------------------------------------------------------+ 1 row in set (2.78 sec)
View Code
从输出中可以看到,执行查询大约需要2.8秒。然而,查询优化器没有意识到列“s_gmt_offset”中只有一个不同的值。在没有任何统计数据可用的情况下,优化器使用一些硬编码的估计,这假设10%的行将匹配谓词“ca_gmt_offset = -5”。如果我们为这一列添加一个直方图,优化器现在知道表中的所有行都将满足条件,从而为我们提供一个更好的执行计划(为了更好的可读性,查询被截断):
mysql> ANALYZE TABLE store UPDATE HISTOGRAM ON s_gmt_offset WITH 8 BUCKETS; +-------------+-----------+----------+---------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +-------------+-----------+----------+---------------------------------------------------------+ | tpcds.store | histogram | status | Histogram statistics created for column \'s_gmt_offset\'. | +-------------+-----------+----------+---------------------------------------------------------+ 1 row in set (0.06 sec) mysql> SELECT ... +------------+------------+--------------------------------------------------------------------------+ | promotions | total | CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 | +------------+------------+--------------------------------------------------------------------------+ | 3213210.07 | 5966836.78 | 53.85114741 | +------------+------------+--------------------------------------------------------------------------+ 1 row in set (1.37 sec)
View Code
有了这个直方图,查询执行时间下降到不到1.4秒,提高了2倍。原因是在第一个计划中,优化器选择第一个派生表在store表上执行全表扫描,然后分别在<item、store_sales、date_dim、customer和customer_address中执行主键查找。但是,当它意识到store表将返回比它预期的更多的行,优化器选择对Item表执行全表扫描,并分别在store_sales、store、date_dim、customer和最后的customer_address中执行主键查找。
But, why not an index?
你们中的一些人现在可能认为一个索引也能做得同样好,事实也的确如此:
mysql> CREATE INDEX s_gmt_offset_idx ON store (s_gmt_offset); Query OK, 0 rows affected (0.53 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql> SELECT ... +------------+------------+--------------------------------------------------------------------------+ | promotions | total | CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 | +------------+------------+--------------------------------------------------------------------------+ | 3213210.07 | 5966836.78 | 53.85114741 | +------------+------------+--------------------------------------------------------------------------+ 1 row in set (1.41 sec)
然而,有两个原因可以解释为什么你可能会考虑直方图而不是索引:
1. 维护索引是有成本的。如果您有一个索引,那么每次INSERT/UPDATE/DELETE都会导致索引被更新。这不是免费的,而且会影响您的性能。另一方面,直方图只创建一次,并且永远不会更新,除非您明确地要求它。因此,它不会损害您的插入/更新/删除性能。
2. 如果您有一个索引,优化器将执行我们称为“index dives” 的操作,以估计给定范围内的记录数量。这也有一定的成本,如果查询中有非常长的in -list,那么成本可能会太高。在这种情况下,直方图统计要便宜得多,因此可能更合适。
译者注:简单地理解,index dives就是MySQL在对where id in (***,***,……)这种语句生成执行计划的时候,通过扫描索引页的方式来估算符合条件的数据行数,这种方式潜在的问题就是,如果In里面的值很多,以至于符合条件的数据页面很多,那么仅在执行计划评估阶段,就需要扫描大量的数据页面,可能会造成一定的性能损耗,如果换一种评估方式,也就是基于统计信息做评估,就可以避免潜在的扫描大量的索引页的情况(但是基于统计信息的预估也不是完美的,最大的问题是不够精准)。index dives的参数为eq_range_index_dive_limit,默认为200。
检查直方图统计信息
直方图统计数据作为JSON对象存储在数据字典中,这使得它们既灵活又可读。例如,您可以使用内置的JSON函数从直方图中提取信息。假设您想知道您的柱状图是何时为“payment”表中的“amount”列创建/更新的。你可以很容易地使用JSON反引用提取操作符来查找这些信息:
mysql> SELECT -> HISTOGRAM->>\'$.\"last-updated\"\' AS last_updated -> FROM INFORMATION_SCHEMA.COLUMN_STATISTICS -> WHERE -> SCHEMA_NAME = \"sakila\" -> AND TABLE_NAME = \"payment\" -> AND COLUMN_NAME = \"amount\"; +----------------------------+ | last_updated | +----------------------------+ | 2017-09-15 11:54:25.000000 | +----------------------------+
或者假设你想找出直方图中有多少个桶与你在ANALYZE TABLE语句中指定的桶的数量进行比较:
mysql> SELECT -> TABLE_NAME, -> COLUMN_NAME, -> HISTOGRAM->>\'$.\"number-of-buckets-specified\"\' AS num_buckets_specified, -> JSON_LENGTH(HISTOGRAM, \'$.buckets\') AS num_buckets_created -> FROM INFORMATION_SCHEMA.COLUMN_STATISTICS -> WHERE -> SCHEMA_NAME = \"sakila\"; +------------+--------------+-----------------------+---------------------+ | TABLE_NAME | COLUMN_NAME | num_buckets_specified | num_buckets_created | +------------+--------------+-----------------------+---------------------+ | payment | amount | 32 | 19 | | payment | payment_date | 32 | 32 | +------------+--------------+-----------------------+---------------------+
关于可以从直方图中提取什么样的信息,我们参考了手册中的更多信息。
优化器跟踪
如果你想知道直方图所做的估计,最简单的方法是查看EXPLAIN输出:
mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day BETWEEN 1 AND 10; +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | 1 | SIMPLE | customer | NULL | ALL | NULL | NULL | NULL | NULL | 98633 | 11.11 | Using where | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ 1 row in set, 1 warning (0.00 sec) mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_birth_day WITH 32 BUCKETS; +----------------+-----------+----------+--------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+--------------------------------------------------------+ | tpcds.customer | histogram | status | Histogram statistics created for column \'c_birth_day\'. | +----------------+-----------+----------+--------------------------------------------------------+ 1 row in set (0.10 sec) mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day BETWEEN 1 AND 10; +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | 1 | SIMPLE | customer | NULL | ALL | NULL | NULL | NULL | NULL | 98633 | 32.12 | Using where | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ 1 row in set, 1 warning (0.00 sec)
如果你查看“过滤”列,你会发现它从默认的11.11%变成了更精确的32.12%。然而,如果你有多个条件,其中一些列有直方图统计数据,而另一些没有,这将很难知道优化器已经估计了什么:
mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day <= 20 AND c_birth_year = 1967; +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | 1 | SIMPLE | customer | NULL | ALL | NULL | NULL | NULL | NULL | 98633 | 6.38 | Using where | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ 1 row in set, 1 warning (0.00 sec)
如果你想更详细地了解直方图所做的估计,你可以查看查询的跟踪:
mysql> SET OPTIMIZER_TRACE = \"enabled=on\"; Query OK, 0 rows affected (0.00 sec) mysql> SET OPTIMIZER_TRACE_MAX_MEM_SIZE = 1000000; Query OK, 0 rows affected (0.00 sec) mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day <= 20 AND c_birth_year = 1967; mysql> SELECT JSON_EXTRACT(TRACE, \"$**.filtering_effect\") FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE; +----------------------------------------------------------------------------------------+ | JSON_EXTRACT(TRACE, \"$**.filtering_effect\") | +----------------------------------------------------------------------------------------+ | [[{\"condition\": \"(`customer`.`c_birth_day` <= 20)\", \"histogram_selectivity\": 0.6376}]] | +----------------------------------------------------------------------------------------+ 1 row in set (0.00 sec)
这里我们使用JSON_EXTRACT函数从跟踪输出中提取相关部分。在这里我们可以看到,对于使用直方图的每个条件,我们可以看到估计的选择性。在本例中,只对其中一个条件(c_birth_day <= 20)的选择性进行了估计,并且直方图估计列中63.76%的行将匹配此条件。事实上,这与列中的实际数据相符的:
mysql> SELECT
-> (SELECT count(*) FROM customer WHERE c_birth_day <= 20)
-> /
-> (SELECT COUNT(*) FROM customer) AS ratio;
+--------+
| ratio |
+--------+
| 0.6376 |
+--------+
1 row in set (0.03 sec)
来源:https://www.cnblogs.com/wy123/p/16944785.html
本站部分图文来源于网络,如有侵权请联系删除。