Million rows per second from Postgres using Python

asyncpg is a new open-source Python library for working with PostgreSQL. It is built on asyncio and requires Python 3.5. asyncpg is the fastest PostgreSQL driver among comparable implementations in Python, NodeJS and Go.
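For orientation, a minimal asyncpg session might look like the sketch below (not taken from the original article); the connection parameters are placeholders and asyncpg 0.5+ on Python 3.5 is assumed:

import asyncio
import asyncpg

async def main():
    # Connection parameters are placeholders; adjust for your setup.
    conn = await asyncpg.connect(user='postgres', database='postgres')
    try:
        # fetch() returns a list of Record objects.
        rows = await conn.fetch('SELECT typname, typlen FROM pg_type LIMIT 5')
        for row in rows:
            print(row['typname'], row['typlen'])
    finally:
        await conn.close()

asyncio.get_event_loop().run_until_complete(main())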
Why asyncpg?
We are building EdgeDB, a next-generation database with PostgreSQL on the backend. We need high performance, low-latency access and the advanced features of PostgreSQL.
The most obvious option is psycopg2, the most popular Python driver for PostgreSQL. It has a great community, and it is stable and time-tested. There is also aiopg, which implements an asynchronous interface on top of psycopg2. So the obvious question is: why reinvent the wheel? The short answer is performance and support for PostgreSQL features. We discuss this in more detail below.
Features
Data type support
Our main concern with psycopg2 was its mediocre handling of different data types, including arrays and composite types. A rich set of data types is one of the distinguishing features of PostgreSQL, yet out of the box psycopg2 supports only simple types: integers, strings, dates and times. For anything else you have to write your own typecasters.
The main reason for this is that psycopg2 exchanges data with the database server in text format, which requires a lot of parsing, especially for complex data types.
Unlike psycopg2, asyncpg implements the binary I/O protocol of PostgreSQL, which, besides the performance advantages, gives native support for container data types (arrays, range and composite types).
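For example, container types come back as native Python objects without registering any custom typecasters. A small sketch (connection parameters are placeholders):

import asyncio
import asyncpg

async def main():
    conn = await asyncpg.connect(user='postgres', database='postgres')  # placeholders
    try:
        # Arrays are decoded into Python lists out of the box.
        ints = await conn.fetchval('SELECT ARRAY[1, 2, 3]::int[]')
        print(ints)              # [1, 2, 3]
        # Range types come back as asyncpg Range objects.
        r = await conn.fetchval('SELECT int4range(10, 20)')
        print(r.lower, r.upper)  # 10 20
    finally:
        await conn.close()

asyncio.get_event_loop().run_until_complete(main())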
Prepared statements
asyncpg also makes extensive use of PostgreSQL prepared statements. This is a great optimization opportunity, since it avoids repeated parsing, analysis and planning of a query. In addition, asyncpg caches the data I/O pipeline for each prepared statement.
Prepared statements in asyncpg can be created and used explicitly. They provide an API to introspect and fetch query results. Most query methods on a connection are also implemented on top of prepared statements that asyncpg creates implicitly and keeps in a cache.
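A sketch of the explicit prepared-statement API (connection parameters are placeholders):

import asyncio
import asyncpg

async def main():
    conn = await asyncpg.connect(user='postgres', database='postgres')  # placeholders
    try:
        # The statement is parsed and planned once, then reused for each call.
        stmt = await conn.prepare('SELECT typname FROM pg_type WHERE oid = $1')
        print(stmt.get_attributes())        # introspect the result columns
        for oid in (16, 23, 25):
            print(await stmt.fetchval(oid))
    finally:
        await conn.close()

asyncio.get_event_loop().run_until_complete(main())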
Ease of deployment
Another important feature of asyncpg is the lack of dependencies. Implementing the PostgreSQL protocol directly means that you do not need to install libpq. Simply run
pip install asyncpg
In addition, we also provide pre-built packages and build instructions for Linux and macOS (Windows support is planned).
Performance
It soon became clear to us that by implementing the PostgreSQL protocol directly we could achieve a significant speed-up. Our experience with uvloop had shown that with Cython you can build an efficient, high-performance library. asyncpg is written entirely in Cython, with careful memory management and heavy optimization. As a result, asyncpg is on average 3 times faster than psycopg2 (and aiopg).
Testing
As with uvloop, we built a separate benchmarking utility, similar in spirit to pgbench, to run the tests and generate reports for asyncpg and other PostgreSQL driver implementations. We measured query throughput (rows per second) and latency. The main purpose of these tests is to understand the overhead added by each driver.
For fairness, all tests were run in a single thread (GOMAXPROCS=1 in the case of Go) in asynchronous mode. The Python drivers were run under uvloop.
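Running asyncpg under uvloop only requires installing uvloop's event loop policy; a minimal sketch (connection parameters are placeholders):

import asyncio
import asyncpg
import uvloop

# Use uvloop as the asyncio event loop, as in the benchmarks.
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

async def main():
    conn = await asyncpg.connect(user='postgres', database='postgres')  # placeholders
    rows = await conn.fetch('SELECT * FROM pg_type')
    print(len(rows), 'rows')
    await conn.close()

asyncio.get_event_loop().run_until_complete(main())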
The tests were run on a dedicated server with the following configuration:
- CPU: Intel Xeon E5-1620 v2 @ 3.70 GHz, 64 GiB DDR3 RAM
- Go 1.6.3, Python 3.5.2, NodeJS 6.3.0
- PostgreSQL 9.5.2
- Drivers used:
  - Python: asyncpg-0.5.2, psycopg2-2.6.2, aiopg-0.10.0, uvloop-0.5.0 (aiopg is a thin wrapper over psycopg2 for asynchronous operation)
  - NodeJS: pg-6.0.0, pg-native-1.10.0
  - Go: github.com/lib/pq@4dd446efc1, github.com/jackc/pgx@b3eed3cce0
- Gentoo Linux, GCC 4.9.3


The graphs show the average results obtained by running four different types of queries (a rough sketch of comparable queries follows the list):
- A simple query selecting all rows from the pg_type table (about 350 rows). This is close to a typical application query, and it is the test where asyncpg reaches the 1 million rows per second claimed in the title.
- A query generating 1000 rows, each consisting of a single integer. This test measures the overhead of creating records and fetching results.
- A query returning 100 rows, each containing 1 KiB of binary data (a blob). This is an I/O stress test.
- A query returning 100 rows, each containing an array of 100 integers. This test measures the speed of array decoding. Here asyncpg is slower than the fastest Go implementation, which is due to the cost of creating and destroying tuples in Python.
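The exact queries live in the benchmarking utility mentioned above; the following is only a rough sketch of comparable queries, with assumed SQL, run through asyncpg:

import asyncio
import asyncpg

# Assumed SQL approximating the four query shapes; the real benchmark
# queries may differ.
QUERIES = [
    'SELECT * FROM pg_type',                                               # ~350 rows
    'SELECT generate_series(1, 1000)',                                     # 1000 single-integer rows
    "SELECT convert_to(repeat('x', 1024), 'UTF8') FROM generate_series(1, 100)",  # 100 rows of ~1 KiB blobs
    'SELECT ARRAY(SELECT generate_series(1, 100)) FROM generate_series(1, 100)',  # 100 arrays of 100 ints
]

async def main():
    conn = await asyncpg.connect(user='postgres', database='postgres')  # placeholders
    try:
        for query in QUERIES:
            rows = await conn.fetch(query)
            print(len(rows), 'rows')
    finally:
        await conn.close()

asyncio.get_event_loop().run_until_complete(main())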
Conclusion
We are confident that it is possible to build high-performance, scalable systems with Python. To do that, we need to invest effort in fast, high-quality drivers, event loops and frameworks.
asyncpg is one step in that direction. It is the result of careful design, informed by our experience building uvloop and by the effective use of Cython and asyncio.
This is a translation of the article "1M rows/s from Postgres to Python".