How does a search robot work?

Yandex has a detailed FAQ on this subject at http://help.yandex.ru/webmaster/?id=995296
Detailed, but not informative enough. For example, it asks itself a direct question, "What is a search engine robot and what does it do?", and Yandex answers:

A robot (crawler) maintains a list of URLs that it can index and regularly downloads the corresponding documents. If the robot finds a new link while analyzing a document, it adds the link to its list. Thus any document or site that has links pointing to it can be found by the robot, and hence by Yandex search.
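The list-keeping that Yandex describes can be sketched in a few lines of Python. This is only an illustration, not Yandex's actual algorithm: the `PAGES` dictionary is a made-up stand-in for the web (each URL maps to the links found in its document), and the function name `crawl` is hypothetical. Nothing is downloaded.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> links found in that document.
PAGES = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
    "http://example.com/orphan": [],  # nothing links here
}


def crawl(seed, pages):
    """Return the URLs in the order the robot would fetch them."""
    frontier = deque([seed])  # the robot's list of URLs still to fetch
    known = {seed}            # every URL that was ever put on the list
    fetched = []
    while frontier:
        url = frontier.popleft()
        fetched.append(url)
        # "Analyzing the document": look at each link it contains.
        for link in pages.get(url, []):
            if link not in known:  # a new link is added to the list once
                known.add(link)
                frontier.append(link)
    return fetched


if __name__ == "__main__":
    for url in crawl("http://example.com/", PAGES):
        print(url)
```

Note that `http://example.com/orphan` is never fetched: no document links to it, so the robot has no way to learn that it exists.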

As you can see, this answers only the second part of the question: we still have not learned what a robot is. Let us turn to the independent experts at Wikipedia.

A search robot (web spider, spider, crawler) is a program that is part of a search engine and is designed to traverse Internet pages in order to add information about them (keywords) to the search engine's database. In its operating principle the spider resembles an ordinary web browser. It scans the content of a page, passes it to the server of the search engine it belongs to, and follows links to the next pages. Search engine owners often limit how deep the spider goes into a site and the maximum size of the text it scans, so very large sites may not be fully indexed. Besides ordinary spiders, there are so-called "woodpeckers": robots that "tap" an indexed site to determine whether it is still connected to the Internet.

The order in which pages are traversed, the frequency of visits, protection against loops, and the criteria for selecting keywords are determined by the search engine's algorithms.

In most cases, the robot moves from one page to another by following the links contained on the first and subsequent pages.
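To make "following the links" concrete, here is a rough sketch of how a program might pull links out of a page using only the Python standard library. The names `LinkExtractor` and `extract_links` are invented for this example, and a real spider certainly does much more (other tags, redirects, broken markup).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Relative links are resolved against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = '<a href="next.html">next</a> <a href="/top">top</a>'
    print(extract_links("http://example.com/dir/page.html", page))
```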

Many search engines also let users add a site to the indexing queue themselves. This usually speeds up indexing considerably, and when no external links lead to the site it is practically the only way to announce its existence.

Indexing of a site can be restricted with a robots.txt file, but some search engines may ignore this file. Full protection from indexing is provided only by mechanisms that spiders cannot get around: usually a password on the page, or a requirement to fill in a registration form before the page content can be viewed.

That is clearer. A robot is a program: one built into the search engine as a component and subordinate to that engine's algorithms. The robot also answers to the author or administrator of the website. To bend the search robot to their will, the site admin has to perform a competent dance with a tambourine and write instructions in the robots.txt file, which tells the robot which pages must not appear in its index.

Note that these pages remain technically reachable for the robot if they have inbound links; it merely does not put them in the index. But since the robot obeys the search engine's algorithm, and that algorithm changes often, it is better to play it safe if you want to be absolutely sure your sensitive data will not become public property by mistake: put a password on the page, or some other obstacle for robots, such as an SMS lock :) Robots are, of course, getting smarter all the time, but something tells me they will practically never learn to pay by card or by SMS.
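For those who want to see what such instructions look like from the robot's side, here is a small sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` name and the `is_allowed` helper are all made up for the example; this shows how a well-behaved robot could read the file, not how Yandex's robot actually does it.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: every robot is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text, no network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/data.html"):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

The check happens entirely on the robot's side, which is exactly why the file is a request and not a lock: a robot that never runs such a check can fetch `/private/` like any other page.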

Below is a link to a script that checks which pages on a server are closed to Yandex robots according to the instructions in robots.txt: script
