postgreSQL 非count方法算记录数操作

2023-07-07 17:14:03 267

一般方法

selectcount(1)fromtable_name;

全量扫描一遍表，记录越多，查询速度越慢

新法

PostgreSQL还真提供了一个这样的途径，那就是系统表pg_class，这个系统表里头，存储着每个表的统计信息，其中reltuples就是对应的表的统计行,统计行的数据是pg有个独立进程，定期扫描不同的表，收集这些表的统计信息，保存在系统表里头。

方法如下：

select
reltuples::intastotal
from
pg_class
where
relname='table_name'
andrelnamespace=(selectoidfrompg_namespacewherenspname='schema');

新方法不是通用的，如果要求特精确还是使用selectcount(1),如果是类似分页的，且分页数量超过比较多的，也不是要求特别精准的，这就是一个好方法！

count(1)over计算记录数

selectcount(1)over(),*fromtable_name;

补充

count是最常用的聚集函数之一，看似简单，其实还是有坑的，如：

1、count(*)：返回结果集的行数，是null也统计

2、count(1)：和count(*)基本没区别，pg92之前都是扫描全表的，pg92之后增加了indexonlyscan一般会变成扫主键索引，如果没有主键或者是表的列很多的情况下，count(1)快一些，因为不会考虑表的全部字段

3、count(field)：返回数据表中指定字段值不等于null的行数

拓展：理解PostgreSQL的count函数的行为

关于count函数的使用一直存在争议，尤其是在MySQL中，作为流行度越来越高的PostgreSQL是否也有类似的问题呢，我们通过实践来理解一下PostgreSQL中count函数的行为。

构建测试数据库

创建测试数据库，并创建测试表。测试表中有自增ID、创建时间、内容三个字段，自增ID字段是主键。

createdatabaseperformance_test;

createtabletest_tbl(idserialprimarykey,created_attimestamp,contentvarchar(512));

生成测试数据

使用generate_series函数生成自增ID，使用now()函数生成created_at列，对于content列，使用了repeat(md5(random()::text),10)生成10个32位长度的md5字符串。使用下列语句，插入1000w条记录用于测试。

performance_test=#insertintotest_tblselectgenerate_series(1,10000000),now(),repeat(md5(random()::text),10);INSERT010000000Time:212184.223ms(03:32.184)

由count语句引发的思考

默认情况下PostgreSQL不开启SQL执行时间的显示，所以需要手动开启一下，方便后面的测试对比。

\timingon

count(*)和count(1)的性能区别是经常被讨论的问题，分别使用count(*)和count(1)执行一次查询。

performance_test=#selectcount(*)fromtest_tbl;
count
----------
10000000
(1row)

Time:115090.380ms(01:55.090)

performance_test=#selectcount(1)fromtest_tbl;
count
----------
10000000
(1row)

Time:738.502ms

可以看到两次查询的速度差别非常大，count(1)真的有这么大的性能提升？接下来再次运行查询语句。

performance_test=#selectcount(*)fromtest_tbl;
count
----------
10000000
(1row)

Time:657.831ms

performance_test=#selectcount(1)fromtest_tbl;
count
----------
10000000
(1row)

Time:682.157ms

可以看到第一次查询时候会非常的慢，后面三次速度非常快并且时间相近，这里就有两个问题出现了：

为什么第一次查询速度这么慢？

count(*)和count(1)到底存不存在性能差别？

查询缓存

使用explain语句重新执行查询语句

explain(analyze,buffers,verbose)selectcount(*)fromtest_tbl;

可以看到如下输出：

FinalizeAggregate(cost=529273.69..529273.70rows=1width=8)(actualtime=882.569..882.570rows=1loops=1)
Output:count(*)
Buffers:sharedhit=96read=476095
->Gather(cost=529273.48..529273.69rows=2width=8)(actualtime=882.492..884.170rows=3loops=1)
Output:(PARTIALcount(*))
WorkersPlanned:2
WorkersLaunched:2
Buffers:sharedhit=96read=476095
->PartialAggregate(cost=528273.48..528273.49rows=1width=8)(actualtime=881.014..881.014rows=1loops=3)
Output:PARTIALcount(*)
Buffers:sharedhit=96read=476095
Worker0:actualtime=880.319..880.319rows=1loops=1
Buffers:sharedhit=34read=158206
Worker1:actualtime=880.369..880.369rows=1loops=1
Buffers:sharedhit=29read=156424
->ParallelSeqScanonpublic.test_tbl(cost=0.00..517856.98rows=4166598width=0)(actualtime=0.029..662.165rows=3333333loops=3)
Buffers:sharedhit=96read=476095
Worker0:actualtime=0.026..661.807rows=3323029loops=1
Buffers:sharedhit=34read=158206
Worker1:actualtime=0.030..660.197rows=3285513loops=1
Buffers:sharedhit=29read=156424
Planningtime:0.043ms
Executiontime:884.207ms

注意里面的sharedhit，表示命中了内存中缓存的数据，这就可以解释为什么后面的查询会比第一次快很多。接下来去掉缓存，并重启PostgreSQL。

servicepostgresqlstop
echo1>/proc/sys/vm/drop_caches
servicepostgresqlstart

重新执行SQL语句，速度慢了很多。

FinalizeAggregate(cost=529273.69..529273.70rows=1width=8)(actualtime=50604.564..50604.564rows=1loops=1)
Output:count(*)
Buffers:sharedread=476191
->Gather(cost=529273.48..529273.69rows=2width=8)(actualtime=50604.508..50606.141rows=3loops=1)
Output:(PARTIALcount(*))
WorkersPlanned:2
WorkersLaunched:2
Buffers:sharedread=476191
->PartialAggregate(cost=528273.48..528273.49rows=1width=8)(actualtime=50591.550..50591.551rows=1loops=3)
Output:PARTIALcount(*)
Buffers:sharedread=476191
Worker0:actualtime=50585.182..50585.182rows=1loops=1
Buffers:sharedread=158122
Worker1:actualtime=50585.181..50585.181rows=1loops=1
Buffers:sharedread=161123
->ParallelSeqScanonpublic.test_tbl(cost=0.00..517856.98rows=4166598width=0)(actualtime=92.491..50369.691rows=3333333loops=3)
Buffers:sharedread=476191
Worker0:actualtime=122.170..50362.271rows=3320562loops=1
Buffers:sharedread=158122
Worker1:actualtime=14.020..50359.733rows=3383583loops=1
Buffers:sharedread=161123
Planningtime:11.537ms
Executiontime:50606.215ms

sharedread表示没有命中缓存，通过这个现象可以推断出，上一小节的四次查询中，第一次查询没有命中缓存，剩下三次查询都命中了缓存。

count(1)和count(*)的区别

接下来探究count(1)和count(*)的区别是什么，继续思考最开始的四次查询，第一次查询使用了count(*)，第二次查询使用了count(1)，却依然命中了缓存，不正是说明count(1)和count(*)是一样的吗？

事实上，PostgreSQL官方对于isthereadifferenceperformance-wisebetweenselectcount(1)andselectcount(*)?问题的回复也证实了这一点：

Nope.Infact,thelatterisconvertedtotheformerduringparsing.[2]

既然count(1)在性能上没有比count(*)更好，那么使用count(*)就是更好的选择。

sequencescan和indexscan

接下来测试一下，在不同数据量大小的情况下count(*)的速度，将查询语句写在count.sql文件中，使用pgbench进行测试。

pgbench-c5-t20performance_test-r-fcount.sql

分别测试200w-1000w数据量下的count语句耗时

数据大小	count耗时(ms)
200w	738.758
300w	1035.846
400w	1426.183
500w	1799.866
600w	2117.247
700w	2514.691
800w	2526.441
900w	2568.240
1000w	2650.434

绘制成耗时曲线

曲线的趋势在600w-700w数据量之间出现了转折，200w-600w是线性增长，600w之后count的耗时就基本相同了。使用explain语句分别查看600w和700w数据时的count语句执行。

700w：

FinalizeAggregate(cost=502185.93..502185.94rows=1width=8)(actualtime=894.361..894.361rows=1loops=1)
Output:count(*)
Buffers:sharedhit=16344read=352463
->Gather(cost=502185.72..502185.93rows=2width=8)(actualtime=894.232..899.763rows=3loops=1)
Output:(PARTIALcount(*))
WorkersPlanned:2
WorkersLaunched:2
Buffers:sharedhit=16344read=352463
->PartialAggregate(cost=501185.72..501185.73rows=1width=8)(actualtime=889.371..889.371rows=1loops=3)
Output:PARTIALcount(*)
Buffers:sharedhit=16344read=352463
Worker0:actualtime=887.112..887.112rows=1loops=1
Buffers:sharedhit=5459read=118070
Worker1:actualtime=887.120..887.120rows=1loops=1
Buffers:sharedhit=5601read=117051
->ParallelIndexOnlyScanusingtest_tbl_pkeyonpublic.test_tbl(cost=0.43..493863.32rows=2928960width=0)(actualtime=0.112..736.376rows=2333333loops=3)
IndexCond:(test_tbl.id<7000000)
HeapFetches:2328492
Buffers:sharedhit=16344read=352463
Worker0:actualtime=0.107..737.180rows=2344479loops=1
Buffers:sharedhit=5459read=118070
Worker1:actualtime=0.133..737.960rows=2327028loops=1
Buffers:sharedhit=5601read=117051
Planningtime:0.165ms
Executiontime:899.857ms

600w：

FinalizeAggregate(cost=429990.94..429990.95rows=1width=8)(actualtime=765.575..765.575rows=1loops=1)
Output:count(*)
Buffers:sharedhit=13999read=302112
->Gather(cost=429990.72..429990.93rows=2width=8)(actualtime=765.557..770.889rows=3loops=1)
Output:(PARTIALcount(*))
WorkersPlanned:2
WorkersLaunched:2
Buffers:sharedhit=13999read=302112
->PartialAggregate(cost=428990.72..428990.73rows=1width=8)(actualtime=763.821..763.821rows=1loops=3)
Output:PARTIALcount(*)
Buffers:sharedhit=13999read=302112
Worker0:actualtime=762.742..762.742rows=1loops=1
Buffers:sharedhit=4638read=98875
Worker1:actualtime=763.308..763.308rows=1loops=1
Buffers:sharedhit=4696read=101570
->ParallelIndexOnlyScanusingtest_tbl_pkeyonpublic.test_tbl(cost=0.43..422723.16rows=2507026width=0)(actualtime=0.053..632.199rows=2000000loops=3)
IndexCond:(test_tbl.id<6000000)
HeapFetches:2018490
Buffers:sharedhit=13999read=302112
Worker0:actualtime=0.059..633.156rows=1964483loops=1
Buffers:sharedhit=4638read=98875
Worker1:actualtime=0.038..634.271rows=2017026loops=1
Buffers:sharedhit=4696read=101570
Planningtime:0.055ms
Executiontime:770.921ms

根据以上现象推断，PostgreSQL似乎在count的数据量小于数据表长度的某一比例时，才使用indexscan，通过查看官方wiki也可以看到相关描述：

Itisimportanttorealisethattheplannerisconcernedwithminimisingthetotalcostofthequery.Withdatabases,thecostofI/Otypicallydominates.Forthatreason,"count(*)withoutanypredicate"querieswillonlyuseanindex-onlyscaniftheindexissignificantlysmallerthanitstable.Thistypicallyonlyhappenswhenthetable'srowwidthismuchwiderthansomeindexes'.[3]

根据Stackoverflow上的回答，count语句查询的数量大于表大小的3/4时候就会用使用全表扫描代替索引扫描[4]。

结论

不要用count(1)或count(列名)代替count(*)

count本身是非常耗时的

count可能是indexscan也可能是sequencescan，取决于count数量占表大小的比例

以上为个人经验，希望能给大家一个参考，也希望大家多多支持毛票票。如有错误或未考虑完全的地方，望不吝赐教。