Million rows per second from Postgres using Python

asyncpg is a new open-source Python library for working with PostgreSQL. It is built on asyncio and requires Python 3.5. asyncpg is the fastest PostgreSQL driver among comparable implementations in Python, NodeJS and Go.
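For orientation, a minimal asyncpg session might look like the sketch below (not taken from the original article); the connection parameters are placeholders and asyncpg 0.5+ on Python 3.5 is assumed:

import asyncio
import asyncpg

async def main():
    # Connection parameters are placeholders; adjust for your setup.
    conn = await asyncpg.connect(user='postgres', database='postgres')
    try:
        # fetch() returns a list of Record objects.
        rows = await conn.fetch('SELECT typname, typlen FROM pg_type LIMIT 5')
        for row in rows:
            print(row['typname'], row['typlen'])
    finally:
        await conn.close()

asyncio.get_event_loop().run_until_complete(main())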
Why asyncpg?
We are building EdgeDB, a next-generation database with PostgreSQL on the backend. We need high performance, low-latency access and the advanced features of PostgreSQL.
The most obvious option is psycopg2, the most popular Python driver for PostgreSQL. It has a great community, and it is stable and time-tested. There is also aiopg, which implements an asynchronous interface on top of psycopg2. So the obvious question is: why reinvent the wheel? The short answer is performance and support for PostgreSQL features. We discuss this in more detail below.
Features
Data type support
Our main concern with psycopg2 was its mediocre handling of different data types, including arrays and composite types. A rich set of data types is one of the distinguishing features of PostgreSQL, yet out of the box psycopg2 supports only simple types: integers, strings, dates and times. For anything else you have to write your own typecasters.
The main reason for this is that psycopg2 exchanges data with the database server in text format, which requires a lot of parsing, especially for complex data types.
Unlike psycopg2, asyncpg implements the binary I/O protocol of PostgreSQL, which, besides the performance advantages, gives native support for container data types (arrays, range and composite types).
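For example, container types come back as native Python objects without registering any custom typecasters. A small sketch (connection parameters are placeholders):

import asyncio
import asyncpg

async def main():
    conn = await asyncpg.connect(user='postgres', database='postgres')  # placeholders
    try:
        # Arrays are decoded into Python lists out of the box.
        ints = await conn.fetchval('SELECT ARRAY[1, 2, 3]::int[]')
        print(ints)              # [1, 2, 3]
        # Range types come back as asyncpg Range objects.
        r = await conn.fetchval('SELECT int4range(10, 20)')
        print(r.lower, r.upper)  # 10 20
    finally:
        await conn.close()

asyncio.get_event_loop().run_until_complete(main())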
Prepared statements
asyncpg also makes extensive use of PostgreSQL prepared statements. This is a great optimization opportunity, since it avoids repeated parsing, analysis and planning of a query. In addition, asyncpg caches the data I/O pipeline for each prepared statement.
Prepared statements in asyncpg can be created and used explicitly. They provide an API to introspect and fetch query results. Most query methods on a connection are also implemented on top of prepared statements that asyncpg creates implicitly and keeps in a cache.
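A sketch of the explicit prepared-statement API (connection parameters are placeholders):

import asyncio
import asyncpg

async def main():
    conn = await asyncpg.connect(user='postgres', database='postgres')  # placeholders
    try:
        # The statement is parsed and planned once, then reused for each call.
        stmt = await conn.prepare('SELECT typname FROM pg_type WHERE oid = $1')
        print(stmt.get_attributes())        # introspect the result columns
        for oid in (16, 23, 25):
            print(await stmt.fetchval(oid))
    finally:
        await conn.close()

asyncio.get_event_loop().run_until_complete(main())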
Ease of deployment
Another important feature of asyncpg is the lack of dependencies. Implementing the PostgreSQL protocol directly means that you do not need to install libpq. Simply run
pip install asyncpg
In addition, we also provide pre-built packages and build instructions for Linux and macOS (Windows support is planned).
Performance
It soon became clear to us that by implementing the PostgreSQL protocol directly we could achieve a significant speed-up. Our experience with uvloop had shown that with Cython you can build an efficient, high-performance library. asyncpg is written entirely in Cython, with careful memory management and heavy optimization. As a result, asyncpg is on average 3 times faster than psycopg2 (and aiopg).
Testing
As with uvloop, we built a separate benchmarking utility, similar in spirit to pgbench, to run the tests and generate reports for asyncpg and other PostgreSQL driver implementations. We measured query throughput (rows per second) and latency. The main purpose of these tests is to understand the overhead added by each driver.
For fairness, all tests were run in a single thread (GOMAXPROCS=1 in the case of Go) in asynchronous mode. The Python drivers were run under uvloop.
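Running asyncpg under uvloop only requires installing uvloop's event loop policy; a minimal sketch (connection parameters are placeholders):

import asyncio
import asyncpg
import uvloop

# Use uvloop as the asyncio event loop, as in the benchmarks.
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

async def main():
    conn = await asyncpg.connect(user='postgres', database='postgres')  # placeholders
    rows = await conn.fetch('SELECT * FROM pg_type')
    print(len(rows), 'rows')
    await conn.close()

asyncio.get_event_loop().run_until_complete(main())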
The tests were run on a dedicated server with the following configuration:
- CPU: Intel Xeon E5-1620 v2 @ 3.70 GHz, 64 GiB DDR3 RAM
- Go 1.6.3, Python 3.5.2, NodeJS 6.3.0
- PostgreSQL 9.5.2
- Drivers used:
  - Python: asyncpg-0.5.2, psycopg2-2.6.2, aiopg-0.10.0, uvloop-0.5.0 (aiopg is a thin wrapper over psycopg2 for asynchronous operation)
  - NodeJS: pg-6.0.0, pg-native-1.10.0
  - Go: github.com/lib/pq@4dd446efc1, github.com/jackc/pgx@b3eed3cce0
- Gentoo Linux, GCC 4.9.3


The graphs show the average results obtained by running four different types of queries (a rough sketch of comparable queries follows the list):
- A simple query selecting all rows from the pg_type table (about 350 rows). This is close to a typical application query, and it is the test where asyncpg reaches the 1 million rows per second claimed in the title.
- A query generating 1000 rows, each consisting of a single integer. This test measures the overhead of creating records and fetching results.
- A query returning 100 rows, each containing 1 KiB of binary data (a blob). This is an I/O stress test.
- A query returning 100 rows, each containing an array of 100 integers. This test measures the speed of array decoding. Here asyncpg is slower than the fastest Go implementation, which is due to the cost of creating and destroying tuples in Python.
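The exact queries live in the benchmarking utility mentioned above; the following is only a rough sketch of comparable queries, with assumed SQL, run through asyncpg:

import asyncio
import asyncpg

# Assumed SQL approximating the four query shapes; the real benchmark
# queries may differ.
QUERIES = [
    'SELECT * FROM pg_type',                                               # ~350 rows
    'SELECT generate_series(1, 1000)',                                     # 1000 single-integer rows
    "SELECT convert_to(repeat('x', 1024), 'UTF8') FROM generate_series(1, 100)",  # 100 rows of ~1 KiB blobs
    'SELECT ARRAY(SELECT generate_series(1, 100)) FROM generate_series(1, 100)',  # 100 arrays of 100 ints
]

async def main():
    conn = await asyncpg.connect(user='postgres', database='postgres')  # placeholders
    try:
        for query in QUERIES:
            rows = await conn.fetch(query)
            print(len(rows), 'rows')
    finally:
        await conn.close()

asyncio.get_event_loop().run_until_complete(main())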
Conclusion
We are confident that it is possible to build high-performance, scalable systems with Python. To do that, we need to invest effort in fast, high-quality drivers, event loops and frameworks.
asyncpg is one step in that direction. It is the result of careful design, informed by our experience building uvloop and by the effective use of Cython and asyncio.
This is a translation of the article "1M rows/s from Postgres to Python".