Twitter: 1 billion queries a day and a new search engine

The load on Twitter's servers has now grown to 1,000 TPS (tweets per second) and 12,000 QPS (queries per second), which is more than 1 billion queries per day. The current infrastructure is still holding up, but to build in headroom for the next few years the company decided to replace the backend of its search engine. "If we did our job well, you should not have noticed anything over the last week," the developers wrote on the Twitter blog.

Until recently, Twitter's search backend was based on the old SQL-based system from a company called Summize. Twitter bought Summize in July 2008 for this purpose and took on five of its six developers. The need to upgrade Twitter's search became clear immediately after the launch of the iPhone 3G, which is when the collaboration with Summize began. Now it is time to upgrade again.

About six months ago, the team decided to develop a new, modern search architecture based on an efficient inverted index instead of a relational database. Because Twitter loves open source, Apache Lucene, a search library written in Java, was chosen as the starting point.
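
To illustrate what an inverted index is, here is a minimal sketch in Java. It is not Twitter's or Lucene's code: the class `TinyInvertedIndex` and its methods are hypothetical names made up for this example, and the tokenizer is deliberately naive. The idea is simply that each term maps to a list of the documents (tweets) that contain it, so a lookup by term does not need to scan every document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy index for illustration only; not Lucene's API.
class TinyInvertedIndex {
    // term -> posting list of document ids, in the order documents were added
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Splits the text into lowercase terms and records each term once per document.
    int add(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue; // split() yields an empty token if the text starts with a separator
            }
            List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Ids only grow, so a repeated term in the same document can only match the last entry.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
        return docId;
    }

    // Returns the ids of all documents containing the term (case-insensitive).
    List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyList());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("Lucene is written in Java");
        index.add("Twitter search uses Lucene");
        System.out.println(index.search("lucene")); // prints [0, 1]
        System.out.println(index.search("java"));   // prints [0]
    }
}
```

A relational table would typically answer the same question with a text scan or a general-purpose index. Here the lookup is a single hash-map access followed by reading a ready-made list.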

The requirements for the new search engine were good scalability and the highest possible indexing speed. The goal was that no more than 10 seconds should pass between a tweet's publication and its availability in full-text search. Since the indexer is only one stage of that pipeline, it had to work as quickly as possible (in less than 1 second).

To achieve these goals, Lucene had to be modified somewhat, because out of the box it is not well suited to real-time search. The main in-memory data structures were rewritten, especially the posting lists, while support for the standard Lucene API was preserved, so almost none of the library's search code had to be redone. Here are the key benefits of the modifications:

* significantly improved garbage collection performance
* lock-free data structures and algorithms
* posting lists that can be traversed in reverse order
* efficient early termination of queries
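
Two of the items above, reverse traversal of posting lists and early termination, can be sketched together. This is an assumption-laden illustration, not the modified Lucene code: `ReversePostingList`, `append` and `newest` are hypothetical names, and the sketch is single-threaded, so it does not show the lock-free part. It assumes document ids are appended in chronological order, which means the newest tweets sit at the end of the array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical posting list for one term; illustration only.
class ReversePostingList {
    private int[] docIds = new int[8];
    private int size = 0;

    // Appends a document id; ids are assumed to arrive in increasing (chronological) order.
    void append(int docId) {
        if (size == docIds.length) {
            docIds = Arrays.copyOf(docIds, size * 2); // grow the backing array
        }
        docIds[size++] = docId;
    }

    // Walks from the newest entry to the oldest and stops as soon as `limit` hits are collected.
    List<Integer> newest(int limit) {
        List<Integer> hits = new ArrayList<>();
        for (int i = size - 1; i >= 0 && hits.size() < limit; i--) {
            hits.add(docIds[i]);
        }
        return hits;
    }

    public static void main(String[] args) {
        ReversePostingList list = new ReversePostingList();
        for (int id = 1; id <= 5; id++) {
            list.append(id);
        }
        System.out.println(list.newest(2)); // prints [5, 4]
    }
}
```

Because real-time search usually wants the most recent matches first, reading the list backwards lets the query stop after a handful of entries instead of visiting the whole list.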

According to the developers, some of the techniques they used may be interesting and useful to other programmers (not only in the search field), so a more detailed discussion of the topic may follow in the future.

In any case, all of the modifications to Lucene will be contributed back to Apache. Some are already included in the Lucene core and in its new real-time search branch.

After the upgrade, the load on the search backend dropped significantly (it now uses only 5% of its resources), which leaves a good reserve for the future. The new indexer can index roughly 50 times more tweets per second than are published today. The new search engine is also completely stable, with no complaints so far.

One of the frustrations of Twitter search has always been the inability to search the archive of tweets further back than a few days. Twitter explained this by a "lack of space". To get around the limit, users have to rely on third-party search engines that index tweets, such as Topsy.

On January 14, 2010, Danny Sullivan checked the search results for the word [today] and found that the oldest tweet had been posted 7 days earlier.

The same test in mid-September showed that the depth of the index had decreased to 4 days.

With the introduction of the new search, Twitter announced "a twofold increase in the index without any impact on the speed of search queries". Apparently, this means a return to the same seven-day limit.
Article based on information from habrahabr.ru
