Optimizing sum() in PostgreSQL

Consider this situation: there is a statistics table with identifier columns and counter columns, and you want to sum the counters over some subset of rows. We are not interested here in how that subset gets selected — many books and articles have been written about indexes and partitioning. We assume the data is already being retrieved in the optimal way and look at how to sum it up quickly.

This is not the first place to optimize when a query is slow; it is probably the last. The ideas below make sense when the execution plan (EXPLAIN) already looks flawless, but you still want to squeeze out a little more.

Let's create a test table and insert 10 million records into it:
create table s (
    d date,
    browser_id int not null,
    banner_id int not null,
    views bigint,
    clicks bigint,
    primary key(d, browser_id, banner_id)
);

-- 100 days x 1000 banners x 100 browsers = 10 million rows
insert into s
select d, browser_id, banner_id, succ + insucc as views, succ as clicks
from (
    select d, browser_id, banner_id,
           (array[0,0,50,500])[ceil(random()*4)::int] as succ,
           (array[0,0,400,400000])[ceil(random()*4)::int] as insucc
    from generate_series(now() - interval '99 day', now(), '1 day') d
    cross join generate_series(0, 999) banner_id
    cross join generate_series(0, 99) browser_id
) _;

The date, the ids and the primary key are here only for decency — the task is simply to sum up the entire table. The odd way of generating views and clicks is needed to simulate the real situation, where values are often either zero or fairly large.

So, let's start. All timings are taken on a repeated execution of the query, to exclude the effect of a cold cache.
I did not run rigorous benchmarks: I ran each query a few times, took the average and rounded it. The machine is weak, so yours will be faster!
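For reference, such timings are easy to reproduce with psql's built-in \timing meta-command (a sketch of how one might measure, not a rigorous benchmark):

\timing on
select sum(clicks) from s;  -- run several times, look at the "Time:" line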

Method 1: head-on


select sum(clicks) from s;

9 seconds.

Method 2: change the type


Let's recreate our table, making the counters numeric:
drop table if exists s;  -- recreate the table, this time with numeric counters

create table s (
    d date,
    browser_id int not null,
    banner_id int not null,
    views numeric,
    clicks numeric,
    primary key(d, browser_id, banner_id)
);
-- repopulate it with the same insert as above

8 seconds. It would seem that numeric should be more cumbersome, since it allows storing very large numbers and is therefore far from the machine representation. In fact, it is faster.

The explanation: sum, whether it takes bigint or numeric as input, returns numeric in both cases. This is done to avoid overflow. When we feed it numeric right away, we avoid the implicit conversion.

Observation one: if an int counter is enough for us (it can hold values up to about 2 billion), things get even faster. In that case sum returns bigint.

Observation two: ordinary arithmetic operations (+, -, *, /) on numeric are slower than on bigint. But summation is faster with numeric.
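If you want to double-check these return types yourself, a quick query (my own illustration, not from the original text) does the job:

select pg_typeof(sum(1::int))     as from_int,     -- bigint
       pg_typeof(sum(1::bigint))  as from_bigint,  -- numeric
       pg_typeof(sum(1::numeric)) as from_numeric; -- numeric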


Method 3: skip the zeros


select sum(clicks) from s where clicks <> 0;

This method brings the time down to 7 seconds. But it has a drawback: when summing values from several columns it is not clear how to apply it, especially if the columns become zero independently of one another.
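To make the drawback concrete, here is a sketch of what the filter would have to look like for two columns (my own illustration, assuming the test table above): the WHERE clause can only drop rows where both counters are zero, so a row with just one zero still passes through both aggregates.

select sum(views), sum(clicks)
from s
where views <> 0 or clicks <> 0;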

Method 4: replace zeros with null


select sum(nullif(clicks, 0)) from s;

The same 7 seconds, but this method works better than the previous one when summing several columns.

The reason it works: sum, being a strict aggregate function, ignores null inputs.

Methods 3 and 4 are useful when a column contains a significant share of zeros (at least 10%, and preferably half).
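For two independently zeroed columns, the nullif trick simply applies per column, e.g. (again just an illustration):

select sum(nullif(views, 0)), sum(nullif(clicks, 0)) from s;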

Method 5: store zeros as nulls in the table


This optimization method is a bit "unethical": it proposes brazenly flouting the semantics of null values. But it works.
create table s2 (like s including all);
insert into s2 select d, browser_id, banner_id, nullif(views, 0), nullif(clicks, 0) from s;


select sum(clicks) from s2;

6 seconds. Apparently the reason is that Postgres stores nulls in a bitmap, so the tuple size decreases.
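One way to see the effect on row size (my own sanity check, not part of the original article) is to compare the relation sizes once both tables are filled:

select pg_size_pretty(pg_relation_size('s'))  as size_s,
       pg_size_pretty(pg_relation_size('s2')) as size_s2;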

Perhaps the reader will ask in surprise: why not simply discard the zero rows at selection time, for example with a partial index?
As explained above, the real table has many columns, and they become zero independently of one another.
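For completeness, a partial index on a single counter might look like this (a hypothetical sketch, not from the original article); it only helps when that one column is being summed, which is exactly the limitation discussed above:

create index s_clicks_nonzero on s (clicks) where clicks <> 0;
-- select sum(clicks) from s where clicks <> 0;  -- may now be answered from the index alone
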
Links to the documentation:
  1. Numeric Types
  2. Sum aggregate function
  3. User-defined aggregate functions, strictness

Article based on information from habrahabr.ru
