PostgreSQL SQL optimization for IN, EXISTS, ANY/ALL, and JOIN

Test environment:

									postgres=# select version();       

									                         version                        

									---------------------------------------------------------------------------------------------------------

									 PostgreSQL 11.9 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit

									(1 row) 

									postgres=#

Data preparation:

									$ pgbench -i -s 10

									postgres=# \d

									       List of relations

									 Schema |    Name    | Type | Owner 

									--------+------------------+-------+----------

									 public | pgbench_accounts | table | postgres

									 public | pgbench_branches | table | postgres

									 public | pgbench_history | table | postgres

									 public | pgbench_tellers | table | postgres

									(4 rows)

									postgres=# select * from pgbench_accounts limit 1;

									 aid | bid | abalance |                    filler                    

									-----+-----+----------+--------------------------------------------------------------------------------------

									  1 |  1 |    0 |                                          

									(1 row)

									postgres=# select * from pgbench_branches limit 1;

									 bid | bbalance | filler

									-----+----------+--------

									  1 |    0 |

									(1 row)

									postgres=# select * from pgbench_history limit 1;

									 tid | bid | aid | delta | mtime | filler

									-----+-----+-----+-------+-------+--------

									(0 rows)

									postgres=# select * from pgbench_tellers limit 1;

									 tid | bid | tbalance | filler

									-----+-----+----------+--------

									  1 |  1 |    0 |

									(1 row)

									postgres=# select * from pgbench_branches;

									 bid | bbalance | filler

									-----+----------+--------

									  1 |    0 |

									  2 |    0 |

									  3 |    0 |

									  4 |    0 |

									  5 |    0 |

									  6 |    0 |

									  7 |    0 |

									  8 |    0 |

									  9 |    0 |

									 10 |    0 |

									(10 rows)

									postgres=# update pgbench_branches set bbalance=4500000 where bid in (4,7);

									UPDATE 2

									postgres=#

The IN clause

Query requirement: for each branch whose balance (bbalance) is greater than 0, count how many accounts it has in pgbench_accounts.

1. Using the IN clause

									SELECT

									  count( aid ),bid

									FROM

									  pgbench_accounts

									WHERE

									  bid IN ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 )

									GROUP BY

									  bid;

2. Using the ANY clause

									SELECT

									  count( aid ),bid

									FROM

									  pgbench_accounts

									WHERE

									  bid = ANY ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 )

									GROUP BY

									  bid;

3. Using the EXISTS clause

									SELECT

									  count( aid ),bid

									FROM

									  pgbench_accounts

									WHERE

									  EXISTS ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 AND pgbench_accounts.bid = pgbench_branches.bid )

									GROUP BY

									  bid;

4. Using INNER JOIN

									SELECT

									  count( aid ),a.bid

									FROM

									  pgbench_accounts a

									  JOIN pgbench_branches b ON a.bid = b.bid

									WHERE

									  b.bbalance > 0

									GROUP BY

									  a.bid;

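When writing this query, one might assume that EXISTS and INNER JOIN perform better because they can use join logic and join optimizations, whereas IN and ANY have to go through a subquery.

However, PostgreSQL (version 10 and later) is smart enough to generate the same execution plan for all four forms: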

									                                      QUERY PLAN                                      

									------------------------------------------------------------------------------------------------------------------------------------------------------------------

									 Finalize GroupAggregate (cost=23327.73..23330.26 rows=10 width=12) (actual time=97.199..99.014 rows=2 loops=1)

									  Group Key: a.bid

									  -> Gather Merge (cost=23327.73..23330.06 rows=20 width=12) (actual time=97.191..99.006 rows=6 loops=1)

									     Workers Planned: 2

									     Workers Launched: 2

									     -> Sort (cost=22327.70..22327.73 rows=10 width=12) (actual time=93.762..93.766 rows=2 loops=3)

									        Sort Key: a.bid

									        Sort Method: quicksort Memory: 25kB

									        Worker 0: Sort Method: quicksort Memory: 25kB

									        Worker 1: Sort Method: quicksort Memory: 25kB

									        -> Partial HashAggregate (cost=22327.44..22327.54 rows=10 width=12) (actual time=93.723..93.727 rows=2 loops=3)

									           Group Key: a.bid

									           -> Hash Join (cost=1.14..22119.10 rows=41667 width=8) (actual time=24.024..83.263 rows=66667 loops=3)

									              Hash Cond: (a.bid = b.bid)

									              -> Parallel Seq Scan on pgbench_accounts a (cost=0.00..20560.67 rows=416667 width=8) (actual time=0.023..43.151 rows=333333 loops=3)

									              -> Hash (cost=1.12..1.12 rows=1 width=4) (actual time=0.027..0.028 rows=2 loops=3)

									                 Buckets: 1024 Batches: 1 Memory Usage: 9kB

									                 -> Seq Scan on pgbench_branches b (cost=0.00..1.12 rows=1 width=4) (actual time=0.018..0.020 rows=2 loops=3)

									                    Filter: (bbalance > 0)

									                    Rows Removed by Filter: 8

									 Planning Time: 0.342 ms

									 Execution Time: 99.164 ms

									(22 rows)
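To reproduce a plan like the one above, any of the four queries can be prefixed with EXPLAIN (ANALYZE). The sketch below uses the IN form. The SET line, which turns off parallel workers for the session, is an optional addition for readability and was not part of the original test; with it the plan would no longer contain the Gather Merge node, and timings will differ from machine to machine.

```sql
-- Optional (not in the original test): disable parallel query for this session
-- so the plan is shorter and easier to compare across the four forms.
SET max_parallel_workers_per_gather = 0;

-- ANALYZE actually runs the query and reports real timings and row counts.
EXPLAIN (ANALYZE)
SELECT count(aid), bid
FROM pgbench_accounts
WHERE bid IN (SELECT bid FROM pgbench_branches WHERE bbalance > 0)
GROUP BY bid;

-- Restore the server default afterwards.
RESET max_parallel_workers_per_gather;
```

Swapping in the ANY, EXISTS, or JOIN form of the query and comparing the output is a quick way to check whether a given server version really plans them identically.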

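So can we conclude that we may write queries however we like and leave the rest to PostgreSQL's intelligence?

Not so fast!

Things change once we consider exclusion.

Exclusion queries

Query requirement: for each branch whose balance (bbalance) is not greater than 0, count how many accounts it has in pgbench_accounts.

1. Using NOT IN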

									SELECT

									  count( aid ),bid

									FROM

									  pgbench_accounts

									WHERE

									  bid NOT IN ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 )

									GROUP BY

									  bid;

Execution plan:

									                                    QUERY PLAN                                    

									----------------------------------------------------------------------------------------------------------------------------------------------------------

									 Finalize GroupAggregate (cost=23645.42..23647.95 rows=10 width=12) (actual time=128.606..130.502 rows=8 loops=1)

									  Group Key: pgbench_accounts.bid

									  -> Gather Merge (cost=23645.42..23647.75 rows=20 width=12) (actual time=128.598..130.490 rows=24 loops=1)

									     Workers Planned: 2

									     Workers Launched: 2

									     -> Sort (cost=22645.39..22645.42 rows=10 width=12) (actual time=124.960..124.963 rows=8 loops=3)

									        Sort Key: pgbench_accounts.bid

									        Sort Method: quicksort Memory: 25kB

									        Worker 0: Sort Method: quicksort Memory: 25kB

									        Worker 1: Sort Method: quicksort Memory: 25kB

									        -> Partial HashAggregate (cost=22645.13..22645.23 rows=10 width=12) (actual time=124.917..124.920 rows=8 loops=3)

									           Group Key: pgbench_accounts.bid

									           -> Parallel Seq Scan on pgbench_accounts (cost=1.13..21603.46 rows=208333 width=8) (actual time=0.078..83.134 rows=266667 loops=3)

									              Filter: (NOT (hashed SubPlan 1))

									              Rows Removed by Filter: 66667

									              SubPlan 1

									               -> Seq Scan on pgbench_branches (cost=0.00..1.12 rows=1 width=4) (actual time=0.020..0.021 rows=2 loops=3)

									                  Filter: (bbalance > 0)

									                  Rows Removed by Filter: 8

									 Planning Time: 0.310 ms

									 Execution Time: 130.620 ms

									(21 rows)

									postgres=#

2. Using <> ALL

									SELECT

									  count( aid ),bid

									FROM

									  pgbench_accounts

									WHERE

									  bid <> ALL ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 )

									GROUP BY

									  bid;

Execution plan:

									                                     QUERY PLAN                                    

									------------------------------------------------------------------------------------------------------------------------------------------------------------

									 Finalize GroupAggregate (cost=259581.79..259584.32 rows=10 width=12) (actual time=418.220..419.913 rows=8 loops=1)

									  Group Key: pgbench_accounts.bid

									  -> Gather Merge (cost=259581.79..259584.12 rows=20 width=12) (actual time=418.212..419.902 rows=24 loops=1)

									     Workers Planned: 2

									     Workers Launched: 2

									     -> Sort (cost=258581.76..258581.79 rows=10 width=12) (actual time=413.906..413.909 rows=8 loops=3)

									        Sort Key: pgbench_accounts.bid

									        Sort Method: quicksort Memory: 25kB

									        Worker 0: Sort Method: quicksort Memory: 25kB

									        Worker 1: Sort Method: quicksort Memory: 25kB

									        -> Partial HashAggregate (cost=258581.50..258581.60 rows=10 width=12) (actual time=413.872..413.875 rows=8 loops=3)

									           Group Key: pgbench_accounts.bid

									           -> Parallel Seq Scan on pgbench_accounts (cost=0.00..257539.83 rows=208333 width=8) (actual time=0.054..367.244 rows=266667 loops=3)

									              Filter: (SubPlan 1)

									              Rows Removed by Filter: 66667

									              SubPlan 1

									               -> Materialize (cost=0.00..1.13 rows=1 width=4) (actual time=0.000..0.001 rows=2 loops=1000000)

									                  -> Seq Scan on pgbench_branches (cost=0.00..1.12 rows=1 width=4) (actual time=0.001..0.001 rows=2 loops=337880)

									                     Filter: (bbalance > 0)

									                     Rows Removed by Filter: 8

									 Planning Time: 0.218 ms

									 Execution Time: 420.035 ms

									(22 rows) 

									postgres=#

3. Using NOT EXISTS

									SELECT

									  count( aid ),bid

									FROM

									  pgbench_accounts

									WHERE

									  NOT EXISTS ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 AND pgbench_accounts.bid = pgbench_branches.bid )

									GROUP BY

									  bid;

Execution plan:

									                                      QUERY PLAN                                     

									----------------------------------------------------------------------------------------------------------------------------------------------------------------

									 Finalize GroupAggregate (cost=28327.72..28330.25 rows=10 width=12) (actual time=152.024..153.931 rows=8 loops=1)

									  Group Key: pgbench_accounts.bid

									  -> Gather Merge (cost=28327.72..28330.05 rows=20 width=12) (actual time=152.014..153.917 rows=24 loops=1)

									     Workers Planned: 2

									     Workers Launched: 2

									     -> Sort (cost=27327.70..27327.72 rows=10 width=12) (actual time=147.782..147.786 rows=8 loops=3)

									        Sort Key: pgbench_accounts.bid

									        Sort Method: quicksort Memory: 25kB

									        Worker 0: Sort Method: quicksort Memory: 25kB

									        Worker 1: Sort Method: quicksort Memory: 25kB

									        -> Partial HashAggregate (cost=27327.43..27327.53 rows=10 width=12) (actual time=147.732..147.737 rows=8 loops=3)

									           Group Key: pgbench_accounts.bid

									           -> Hash Anti Join (cost=1.14..25452.43 rows=375000 width=8) (actual time=0.134..101.884 rows=266667 loops=3)

									              Hash Cond: (pgbench_accounts.bid = pgbench_branches.bid)

									              -> Parallel Seq Scan on pgbench_accounts (cost=0.00..20560.67 rows=416667 width=8) (actual time=0.032..45.174 rows=333333 loops=3)

									              -> Hash (cost=1.12..1.12 rows=1 width=4) (actual time=0.036..0.037 rows=2 loops=3)

									                 Buckets: 1024 Batches: 1 Memory Usage: 9kB

									                 -> Seq Scan on pgbench_branches (cost=0.00..1.12 rows=1 width=4) (actual time=0.025..0.027 rows=2 loops=3)

									                    Filter: (bbalance > 0)

									                    Rows Removed by Filter: 8

									 Planning Time: 0.322 ms

									 Execution Time: 154.040 ms

									(22 rows) 

									postgres=#

4. Using LEFT JOIN and IS NULL

									SELECT

									  count( aid ),a.bid

									FROM

									  pgbench_accounts a

									  LEFT JOIN pgbench_branches b ON a.bid = b.bid AND b.bbalance > 0

									WHERE

									  b.bid IS NULL

									GROUP BY

									  a.bid;

Execution plan:

									                                      QUERY PLAN                                      

									------------------------------------------------------------------------------------------------------------------------------------------------------------------

									 Finalize GroupAggregate (cost=28327.72..28330.25 rows=10 width=12) (actual time=145.298..147.096 rows=8 loops=1)

									  Group Key: a.bid

									  -> Gather Merge (cost=28327.72..28330.05 rows=20 width=12) (actual time=145.288..147.083 rows=24 loops=1)

									     Workers Planned: 2

									     Workers Launched: 2

									     -> Sort (cost=27327.70..27327.72 rows=10 width=12) (actual time=141.883..141.887 rows=8 loops=3)

									        Sort Key: a.bid

									        Sort Method: quicksort Memory: 25kB

									        Worker 0: Sort Method: quicksort Memory: 25kB

									        Worker 1: Sort Method: quicksort Memory: 25kB

									        -> Partial HashAggregate (cost=27327.43..27327.53 rows=10 width=12) (actual time=141.842..141.847 rows=8 loops=3)

									           Group Key: a.bid

									           -> Hash Anti Join (cost=1.14..25452.43 rows=375000 width=8) (actual time=0.087..99.535 rows=266667 loops=3)

									              Hash Cond: (a.bid = b.bid)

									              -> Parallel Seq Scan on pgbench_accounts a (cost=0.00..20560.67 rows=416667 width=8) (actual time=0.025..44.337 rows=333333 loops=3)

									              -> Hash (cost=1.12..1.12 rows=1 width=4) (actual time=0.026..0.027 rows=2 loops=3)

									                 Buckets: 1024 Batches: 1 Memory Usage: 9kB

									                 -> Seq Scan on pgbench_branches b (cost=0.00..1.12 rows=1 width=4) (actual time=0.019..0.020 rows=2 loops=3)

									                    Filter: (bbalance > 0)

									                    Rows Removed by Filter: 8

									 Planning Time: 0.231 ms

									 Execution Time: 147.180 ms

									(22 rows) 

									postgres=#

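Both NOT IN and <> ALL produce plans that contain a subquery (SubPlan), but the two plans differ: NOT IN uses a hashed SubPlan, while <> ALL rescans a materialized SubPlan for every row (loops=1000000 in its plan above).

NOT EXISTS and LEFT JOIN with IS NULL produce the same plan, a Hash Anti Join.

Hash joins (or hash anti joins) are the most flexible way to satisfy this kind of query, which is why the rule of thumb of preferring EXISTS or JOIN holds.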
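One likely reason the planner keeps NOT IN as a SubPlan rather than turning it into an anti join is NULL handling: if the subquery yields a NULL, NOT IN can never evaluate to true, whereas NOT EXISTS is unaffected. The sketch below uses inline VALUES lists, made up purely for illustration and not part of the pgbench data, to show the difference.

```sql
-- 1 <> 2 is true, but 1 <> NULL is unknown, so the NOT IN predicate is
-- unknown and the row is filtered out: this query returns no rows.
SELECT 1 AS x
WHERE 1 NOT IN (SELECT v FROM (VALUES (2), (NULL)) AS t(v));

-- NOT EXISTS only asks whether a matching row exists; no row has v = 1,
-- so this query returns one row.
SELECT 1 AS x
WHERE NOT EXISTS (SELECT 1 FROM (VALUES (2), (NULL)) AS t(v) WHERE t.v = 1);
```

In the pgbench tables bid contains no NULLs, so all four exclusion queries return the same rows; the plans above show that the planner in this version still does not rewrite NOT IN as an anti join.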

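But look more closely: even with a SubPlan in its plan, does the NOT IN query have the better execution time?

Yes (130.620 ms, against 147.180 ms and 154.040 ms for the anti-join plans). PostgreSQL does a good job here: it hashes the result of the subplan. In other words, PostgreSQL handles the IN clause well, which makes sense because many people tend to write IN. In this test the subquery returns very few rows, but the same happens even when it returns a few hundred rows.

But what if the subquery returns a large number of rows (hundreds of thousands or more)? Let's try a simple test: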

									CREATE TABLE t1 AS

									SELECT * FROM generate_series(0, 500000) id;

									CREATE TABLE t2 AS

									SELECT (random() * 4000000)::integer id

									FROM generate_series(0, 4000000);

									ANALYZE t1;

									ANALYZE t2;

									EXPLAIN SELECT id

									FROM t1

									WHERE id NOT IN (SELECT id FROM t2);

Execution plan:

									    QUERY PLAN                 

									--------------------------------------------------------------------------------

									 Gather (cost=1000.00..15195064853.01 rows=250000 width=4)

									  Workers Planned: 1

									  -> Parallel Seq Scan on t1 (cost=0.00..15195038853.01 rows=147059 width=4)

									     Filter: (NOT (SubPlan 1))

									     SubPlan 1

									      -> Materialize (cost=0.00..93326.01 rows=4000001 width=4)

									         -> Seq Scan on t2 (cost=0.00..57700.01 rows=4000001 width=4)

									(7 rows)

									postgres=#

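Here the plan materializes the subquery instead of hashing it, and the estimated cost of the scan on t1 climbs to 15195038853.01. (With PostgreSQL's default settings, the subquery is hashed only if t2 has fewer than about 100k rows.) A plan like this hurts performance badly. The IN clause therefore works well mainly when the subquery returns a small number of rows.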
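Two things may be worth trying in this situation. Both are sketches against the t1 and t2 tables created above, and no output is shown because the resulting plans depend on the server's settings and statistics. First, NOT EXISTS lets the planner consider an anti join; it returns the same rows as NOT IN here only because neither id column contains NULLs. Second, the hashed SubPlan is chosen when the planner estimates that the hash table fits in work_mem, so raising work_mem for the session may bring the hashed plan back. The 256MB value is an arbitrary assumption, not a recommendation.

```sql
-- Alternative 1: a rewrite that allows an anti join
-- (equivalent to NOT IN here only because id is never NULL).
EXPLAIN
SELECT id
FROM t1
WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.id = t1.id);

-- Alternative 2: give the session more memory, then re-check the NOT IN plan.
-- '256MB' is an illustrative value only.
SET work_mem = '256MB';

EXPLAIN
SELECT id
FROM t1
WHERE id NOT IN (SELECT id FROM t2);

-- Return to the configured default.
RESET work_mem;
```

Comparing the estimated cost of each EXPLAIN with the 15195038853.01 figure above gives a rough idea of which approach suits the data.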

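Other considerations

There is one more thing to watch: writing a query in different ways can introduce data type conversions.

For example, the statement: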

									EXPLAIN ANALYZE SELECT * FROM emp WHERE gen = ANY(ARRAY['M', 'F']);

triggers an implicit type conversion:

									Seq Scan on emp (cost=0.00..1.04 rows=2 width=43) (actual time=0.023..0.026 rows=3 loops=1)

									 Filter: ((gen)::text = ANY ('{M,F}'::text[]))

Here (gen)::text shows that the gen column itself is cast. On a large table the cost of this conversion can be high. PostgreSQL handles the IN form better:

									EXPLAIN ANALYZE SELECT * FROM emp WHERE gen IN ('M','F');

									 Seq Scan on emp (cost=0.00..1.04 rows=3 width=43) (actual time=0.030..0.034 rows=3 loops=1)

									  Filter: (gen = ANY ('{M,F}'::bpchar[]))

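The IN clause is rewritten as an ANY clause, and the gen column is not cast. Instead, the literals 'M' and 'F' are converted to bpchar (the internal name for char).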
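If the array form is preferred, an explicit cast on the array should keep the comparison in the column's own type. The sketch below assumes a hypothetical table emp_demo in which gen is char(1); the original article does not show the definition of emp, so the table name and column list here are invented for illustration.

```sql
-- Hypothetical table for illustration; only the gen column matters here.
CREATE TABLE emp_demo (
    id   integer PRIMARY KEY,
    name text,
    gen  char(1)
);

-- Casting the array to char(1)[] (that is, bpchar[]) lets PostgreSQL compare
-- bpchar with bpchar, so the gen column itself should not need a cast to text.
EXPLAIN
SELECT * FROM emp_demo WHERE gen = ANY (ARRAY['M', 'F']::char(1)[]);
```

Checking the Filter line of the resulting plan for (gen)::text is a simple way to confirm whether the column is still being cast.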

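Summary

In short, EXISTS and a direct JOIN are usually the better choice.

In many cases PostgreSQL turns an IN clause into a hashed SubPlan, and in some specific situations IN yields the better execution plan.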

Original link: https://www.cnblogs.com/abclife/p/13940948.html
