A public index of the web (5 billion web pages)
The Common Crawl organization has made a generous gift to developers and companies working in search and information processing: it has published an open index of 5 billion web pages on Amazon S3, complete with metadata, PageRank scores, and hyperlink counts.
If you have seen CCBot/1.0 in your web server logs, that is their crawler. The non-profit organization Common Crawl advocates freedom of information and aims to make a search index publicly available to every developer and startup, in the expectation that this will spawn a whole galaxy of innovative web services.
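Site owners who would rather not be crawled can address the bot by its user-agent token in robots.txt, which CCBot honors. A minimal sketch (the path is a placeholder, not a recommendation):

```
# Keep the Common Crawl bot out of one section while leaving other crawlers untouched
User-agent: CCBot
Disallow: /private/
```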
The Common Crawl search cluster runs on Hadoop: the data is stored in HDFS and processed with MapReduce, and the content is then compressed into ARC-format files of about 100 MB each (the whole corpus is 40-50 TB). You can either download the files or process them directly on EC2 with the same MapReduce. The bucket is accessible only with the Amazon Requester Pays flag, i.e. to registered AWS users (see the Amazon S3 Requester Pays documentation). Downloading 40-50 TB over the external network would cost roughly $130 at current Amazon rates, while processing the data with MapReduce inside EC2 incurs no transfer charges.
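For a quick look at the data without spinning up a Hadoop job, a single ARC file can be fetched straight from the Requester Pays bucket. The sketch below uses boto3 with configured AWS credentials; the bucket and key names are illustrative placeholders rather than the actual Common Crawl paths (those are listed in the access guide), and the transfer is billed to the requesting AWS account.

```python
# Minimal sketch: list and download from a Requester Pays bucket with boto3.
import boto3

s3 = boto3.client("s3")

BUCKET = "commoncrawl"                       # assumption: illustrative bucket name
KEY = "crawl-001/segments/example.arc.gz"    # assumption: illustrative ARC file key

# Requester Pays buckets reject anonymous requests; the RequestPayer
# parameter tells S3 to bill the transfer to your own AWS account.
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix="crawl-001/", RequestPayer="requester")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download a single ARC file; ExtraArgs passes the same Requester Pays flag.
s3.download_file(BUCKET, KEY, "example.arc.gz", ExtraArgs={"RequestPayer": "requester"})
```

The same Requester Pays flag has to accompany every request against the bucket, including listings; without it S3 simply returns Access Denied.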
The data is available almost without restriction: see the access guide and terms of use. The only prohibitions are re-publishing the downloaded data elsewhere, selling it or access to it, and using it in any unlawful way.
Add to this that the Common Crawl Foundation is headed by Gilad Elbaz, widely known in narrow circles as the main developer of Google AdSense and the CEO of the startup Factual.
Article based on information from habrahabr.ru