Index construction for search engines

The full table of contents and a list of my articles on search engines will be updated here.

In previous articles I talked about how the search engine works, and now I have reached a tricky technical point. Let me remind you that there are 2 types of indexes: forward and inverted. A forward index maps a document to the list of words it contains. An inverted index maps a word to the list of documents that contain it. Logically, the inverted index is the best fit for fast search. An interesting question is what order the documents in each list should be stored in.

In the previous step, the indexer (the DataFlow module) produced a small chunk of data: a forward index, link information and page metadata. Each such chunk is usually 200-300 MB and covers about 100 thousand pages. Eventually I abandoned the strategy of storing the whole forward index; instead I store all of these chunks plus the full inverted index in several versions, so that I can roll back.

The layout of the index is simple: a single stored file that begins with a table of offsets to the start of each word's data, followed by the data itself. (I am simplifying a little.) This turns out to be the best format for speed: no need to jump around the file, as Brin and Page wrote, just 1 seek, 1 read. At each rebuild iteration I use 20-50 of the chunks described above; obviously I cannot load all of the old data into memory, especially since it is useful to keep a lot of extra overhead data alongside the index.
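To make that concrete, here is a minimal sketch of such a barrel file in Python; the slot count, record layout and function names are my own assumptions, not the actual implementation. In practice the offset table would be kept in memory, so fetching a word's list really is one seek and one read.

```python
import struct

WORDS_PER_BARREL = 1000  # hypothetical number of word slots in one barrel


def write_barrel(path, postings_by_slot):
    """postings_by_slot: dict mapping word slot -> sorted list of page ids."""
    table_size = 8 * WORDS_PER_BARREL          # one 8-byte offset per word slot
    offsets, blobs, pos = [], [], table_size   # data begins right after the table
    for slot in range(WORDS_PER_BARREL):
        pages = postings_by_slot.get(slot, [])
        blob = struct.pack("<I", len(pages)) + struct.pack(f"<{len(pages)}I", *pages)
        offsets.append(pos)
        blobs.append(blob)
        pos += len(blob)
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{WORDS_PER_BARREL}Q", *offsets))  # offset table first
        for blob in blobs:
            f.write(blob)                                        # then the posting data


def read_postings(path, slot):
    """Look up the offset for this word slot, then read its page list."""
    with open(path, "rb") as f:
        f.seek(8 * slot)
        (offset,) = struct.unpack("<Q", f.read(8))
        f.seek(offset)
        (count,) = struct.unpack("<I", f.read(4))
        return list(struct.unpack(f"<{count}I", f.read(4 * count)))
```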

To solve this, and many other problems related to filesystem limits, disk access and parallel access to multiple page lists, I split the index into 256 parts. Part I holds the page lists for the words with WORD_ID%256=I.
This divides the index into roughly equal parts. In addition, all processing for the next rebuild iteration stays within a single part: out of the 20-50 chunks of 200-300 MB each, I only need to load the data for the words that fall into the part currently being processed.
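The partitioning rule and the filtering of an input chunk down to one part might look like the following sketch (the chunk record layout is a hypothetical example); each of the 256 rebuild passes then touches only roughly 1/256 of every input chunk.

```python
NUM_BARRELS = 256


def barrel_of(word_id: int) -> int:
    # Part I holds the posting lists of every word with WORD_ID % 256 == I.
    return word_id % NUM_BARRELS


def records_for_barrel(chunk_records, barrel_id):
    """chunk_records: iterable of (word_id, page_id, payload) tuples taken
    from one forward-index chunk; keep only what belongs to this barrel."""
    return [rec for rec in chunk_records if barrel_of(rec[0]) == barrel_id]
```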

In total, the same procedure is repeated 256 times:

  1. Read the needed data from the input chunks of the database, loading all of the information about words, pages and page URLs. Sort it in memory and build the lists into the correct structure. Pack each new page and assign it an ID, HASH and other data.
  2. Open the "latest version" barrel (that is what I call the pieces of the index) of the inverted index for reading. Open an empty "new version" barrel of the index for writing.
  3. For each word in turn, merge the 2 lists: one sorted and built in memory, the other already sorted and read from the file (a sketch of the merge follows this list).
  4. Close everything, append whatever is needed, rejoice.
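The merge in step 3 is the classic merge of two sorted sequences. A minimal sketch, assuming both lists are sorted by page id and that a page appearing in both lists should take its freshly built entry:

```python
def merge_postings(old_list, new_list, key=lambda entry: entry[0]):
    """Merge two posting lists already sorted by `key`.

    old_list comes from the previous barrel on disk, new_list was built
    in memory from the fresh input chunks."""
    merged = []
    i = j = 0
    while i < len(old_list) and j < len(new_list):
        if key(old_list[i]) < key(new_list[j]):
            merged.append(old_list[i])
            i += 1
        elif key(new_list[j]) < key(old_list[i]):
            merged.append(new_list[j])
            j += 1
        else:
            # Same page appears in both lists: keep the freshly built entry.
            merged.append(new_list[j])
            i += 1
            j += 1
    merged.extend(old_list[i:])
    merged.extend(new_list[j:])
    return merged
```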


In practice, of course, it turned out that the list already stored in the "previous version" of the index is still correctly sorted at the start of a new iteration only if none of the external information about the pages has changed, and obviously it has.

While the data for the index was being accumulated, the PR, page metrics, the nature of the sites and so on could have been updated, and as a result the coefficients used for sorting change. Whether the lists need to be sorted by such parameters at all is still an open question. Rumor has it that Google sorts the pages in each list by PR, but then you obviously need a way to quickly discard the chaff at query time: pages whose PR has been deliberately pumped up but which are not topical enough.

I use a combined coefficient built from general information about the word on the page (number of occurrences; where it occurs: title, meta, links, highlighted text, formatting), topical metrics (so far I have developed these very little), PR, per-word PR (I also keep a PR value for every word) and a few others.
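As an illustration only, such a combined coefficient could be a weighted sum of those signals; the signal names and weights below are my guesses, not the formula actually used.

```python
# Hypothetical weights for folding per-word, per-page signals into one sort key.
WEIGHTS = {
    "occurrences": 1.0,     # how many times the word appears on the page
    "in_title": 5.0,        # word appears in the title
    "in_meta": 2.0,         # word appears in meta tags
    "in_link_text": 3.0,    # word appears in anchor text
    "emphasis": 1.5,        # bold / highlighted occurrences
    "topic_score": 2.0,     # topical metrics
    "pagerank": 4.0,        # page-level PR
    "word_pagerank": 2.5,   # per-word PR kept for every word
}


def combined_score(signals):
    """signals: mapping from signal name to its value for one (word, page) pair."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)
```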

Moreover, it is clear that for high-frequency queries we realistically only need the first 1000-2000 results in a list; to be safe, say the first 262 thousand. All of the popular answers are bound to be among them, and then the ranking function can take over and sort them.

In general, if we can guarantee that the first N thousand pages in a list are the best of the remaining couple of million, the problem is solved, and that can be ensured by fairly simple heuristics. Then, to rebuild each barrel correctly (a sketch follows the list):
  1. Remember which pages' PR has changed significantly since the last rebuild.
  2. Load the first N thousand entries from the list, plus a small remainder selected by PR, page metrics and other heuristics.
  3. Also load the input data: the portion of the words belonging to the current barrel.
  4. For the selected pages only, update all the data to the new, correct values: PR, metrics.
  5. Sort hundreds of thousands of results instead of a couple of tens of millions, and do it in memory.
  6. Write out the sorted part; the rest is simply copied from the previous index, removing duplicates.
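Put together, the partial rebuild of one posting list could look roughly like the sketch below. The data shapes are my own assumptions, the changed-PR bookkeeping is folded into the refreshed signals, and score_fn stands for something like the hypothetical combined_score above.

```python
def rebuild_list(old_list, fresh_entries, score_fn, n_keep=2000):
    """old_list: previous version of the posting list, best entries first.
    fresh_entries: entries built in memory from the new input chunks.
    Each entry is a dict with at least "page_id" and "signals"."""
    head, tail = old_list[:n_keep], old_list[n_keep:]
    # Only the head and the fresh entries get refreshed data and new scores.
    candidates = {entry["page_id"]: entry for entry in head}
    candidates.update({entry["page_id"]: entry for entry in fresh_entries})
    for entry in candidates.values():
        entry["score"] = score_fn(entry["signals"])  # updated PR / metrics already in signals
    ranked = sorted(candidates.values(), key=lambda e: e["score"], reverse=True)
    kept = {entry["page_id"] for entry in ranked}
    # The tail is copied from the previous index unchanged, dropping duplicates.
    return ranked + [entry for entry in tail if entry["page_id"] not in kept]
```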


And of course, there has to be a "full update": once every month to six months, for example, the whole index is rebuilt from scratch with the simple, straightforward algorithm.

Clearly, even this amount of work applied to 3 million commonly used words and numbers is not easy, but compared to a full rebuild the gain is significant: there is no need to sort on disk and no need to load the external metrics, PR and other parameters, which also cannot all be cached in memory.

It is funny that the iterative rebuild method gives significantly better performance than the naive construction of an inverted index from the forward one. Some search engines certainly manage to do that fast enough on sufficiently powerful hardware, but it is still neither the best nor the most correct approach: the sheer number of long insertions into the inverted index negates all of the advantages, and every storage structure I know of sacrifices either insertion speed or sequential read speed.
Article based on information from habrahabr.ru
