Optimizing sum() in PostgreSQL

Consider this situation: there is a statistics table with identifier columns and counter columns, and you want to sum the counters over some subset of rows. We are not interested here in how that subset gets selected — many books and articles have been written about indexes and partitioning. We assume the data is already being retrieved in the optimal way and look at how to sum it up quickly.

This is not the first place to optimize when a query is slow; it is probably the last. The ideas below make sense when the execution plan (EXPLAIN) already looks flawless, but you still want to squeeze out a little more.

Let's create a test table and insert 10 million records into it:
create table s (
    d date,
    browser_id int not null,
    banner_id int not null,
    views bigint,
    clicks bigint,
    primary key(d, browser_id, banner_id)
);

-- 100 days x 1000 banners x 100 browsers = 10 million rows
insert into s
select d, browser_id, banner_id, succ + insucc as views, succ as clicks
from (
    select d, browser_id, banner_id,
           (array[0,0,50,500])[ceil(random()*4)::int] as succ,
           (array[0,0,400,400000])[ceil(random()*4)::int] as insucc
    from generate_series(now() - interval '99 day', now(), '1 day') d
    cross join generate_series(0, 999) banner_id
    cross join generate_series(0, 99) browser_id
) _;

The date, the ids and the primary key are here only for decency — the task is simply to sum up the entire table. The odd way of generating views and clicks is needed to simulate the real situation, where values are often either zero or fairly large.

So, let's start. All timings are taken on a repeated execution of the query, to exclude the effect of a cold cache.
I did not run rigorous benchmarks: I ran each query a few times, took the average and rounded it. The machine is weak, so yours will be faster!
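For reference, such timings are easy to reproduce with psql's built-in \timing meta-command (a sketch of how one might measure, not a rigorous benchmark):

\timing on
select sum(clicks) from s;  -- run several times, look at the "Time:" line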

Method 1: head-on


select sum(clicks) from s;

9 seconds.

Method 2: change the type


Let's recreate our table, making the counters numeric:
drop table if exists s;  -- recreate the table, this time with numeric counters

create table s (
    d date,
    browser_id int not null,
    banner_id int not null,
    views numeric,
    clicks numeric,
    primary key(d, browser_id, banner_id)
);
-- repopulate it with the same insert as above

8 seconds. It would seem that numeric should be more cumbersome, since it allows storing very large numbers and is therefore far from the machine representation. In fact, it is faster.

The explanation: sum, whether it takes bigint or numeric as input, returns numeric in both cases. This is done to avoid overflow. When we feed it numeric right away, we avoid the implicit conversion.

Observation one: if an int counter is enough for us (it can hold values up to about 2 billion), things get even faster. In that case sum returns bigint.

Observation two: ordinary arithmetic operations (+, -, *, /) on numeric are slower than on bigint. But summation is faster with numeric.
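If you want to double-check these return types yourself, a quick query (my own illustration, not from the original text) does the job:

select pg_typeof(sum(1::int))     as from_int,     -- bigint
       pg_typeof(sum(1::bigint))  as from_bigint,  -- numeric
       pg_typeof(sum(1::numeric)) as from_numeric; -- numeric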


Method 3: skip the zeros


select sum(clicks) from s where clicks <> 0;

This method brings the time down to 7 seconds. But it has a drawback: when summing values from several columns it is not clear how to apply it, especially if the columns become zero independently of one another.
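To make the drawback concrete, here is a sketch of what the filter would have to look like for two columns (my own illustration, assuming the test table above): the WHERE clause can only drop rows where both counters are zero, so a row with just one zero still passes through both aggregates.

select sum(views), sum(clicks)
from s
where views <> 0 or clicks <> 0;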

Method 4: replace zeros with null


select sum(nullif(clicks, 0)) from s;

The same 7 seconds, but this method works better than the previous one when summing several columns.

The reason it works: sum, being a strict aggregate function, ignores null inputs.

Methods 3 and 4 are useful when a column contains a significant share of zeros (at least 10%, and preferably half).
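For two independently zeroed columns, the nullif trick simply applies per column, e.g. (again just an illustration):

select sum(nullif(views, 0)), sum(nullif(clicks, 0)) from s;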

Method 5: store zeros as nulls in the table


This optimization method is a bit "unethical": it proposes brazenly flouting the semantics of null values. But it works.
create table s2 (like s including all);
insert into s2 select d, browser_id, banner_id, nullif(views, 0), nullif(clicks, 0) from s;


select sum(clicks) from s2;

6 seconds. Apparently the reason is that Postgres stores nulls in a bitmap, so the tuple size decreases.
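One way to see the effect on row size (my own sanity check, not part of the original article) is to compare the relation sizes once both tables are filled:

select pg_size_pretty(pg_relation_size('s'))  as size_s,
       pg_size_pretty(pg_relation_size('s2')) as size_s2;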

Perhaps the reader will ask in surprise: why not simply discard the zero rows at selection time, for example with a partial index?
As explained above, the real table has many columns, and they become zero independently of one another.
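For completeness, a partial index on a single counter might look like this (a hypothetical sketch, not from the original article); it only helps when that one column is being summed, which is exactly the limitation discussed above:

create index s_clicks_nonzero on s (clicks) where clicks <> 0;
-- select sum(clicks) from s where clicks <> 0;  -- may now be answered from the index alone
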
Links to the documentation:
  1. Numeric Types
  2. Sum aggregate function
  3. User-defined aggregate functions, strictness

Article based on information from habrahabr.ru
