Indexing a new page in Yandex

What is indexing? This is the process of a robot receiving the content of your site's pages and including that content in search results. If we look at the numbers, the indexing robot’s database contains trillions of website page addresses. Every day the robot requests billions of such addresses.

This large process of indexing the Internet can be broken down into smaller stages:


First, the indexing robot must learn that a page has appeared on your site, for example by indexing other pages on the Internet and finding links, or by downloading a sitemap file. Once we learn about the page, we schedule a crawl, send a request for that page to your server, receive the content, and include it in the search results.

This entire process is an exchange between the indexing robot and your website. While the requests sent by the robot hardly change, only the page address differing, your server's response to the robot's page request depends on many factors:

  • your CMS settings;
  • your hosting provider's settings;
  • the operation of intermediate providers.

This response can vary. First of all, when requesting a page, the robot receives the following service response from your site:


These are HTTP headers. They contain various service information that lets the robot understand what content is about to be transmitted.
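For illustration, a set of response headers for a successfully returned page might look like this (the values here are hypothetical):

    HTTP/1.1 200 OK
    Date: Tue, 04 Jun 2024 10:15:00 GMT
    Content-Type: text/html; charset=UTF-8
    Content-Length: 24512
    Last-Modified: Mon, 03 Jun 2024 18:02:11 GMT
    Server: nginx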

I would like to focus on the first header, the HTTP response code, which indicates to the indexing robot the status of the page the robot requested.

There are several dozen such HTTP status codes:


I'll tell you about the most popular ones. The most common response code is HTTP-200. The page is available, it can be indexed, included in search results, everything is fine.

The opposite of this status is HTTP-404. The page is not on the site; there is nothing to index and nothing to include in the search. When changing the structure of a site and the addresses of its internal pages, we recommend setting up a 301 server-side redirect. It tells the robot that the old page has moved to a new address and that the new address should be included in the search results.
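As a sketch, such a redirect could be configured in an Apache .htaccess file like this (the page paths are hypothetical):

    # Permanently redirect the old address to the new one
    Redirect 301 /old-section/page.html /new-section/page.html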

If the page content has not changed since the robot last visited the page, it is best to return an HTTP-304 code. The robot will understand that there is no need to update the page in the search results, and the content will not be transferred either.
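Schematically, the exchange looks roughly like this (the address and date are hypothetical):

    Robot's request:
        GET /page.html HTTP/1.1
        Host: example.com
        If-Modified-Since: Mon, 03 Jun 2024 18:02:11 GMT

    Server's response:
        HTTP/1.1 304 Not Modified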

If your site is unavailable for a short period of time, for example while you are doing some work on the server, it is best to configure HTTP-503. It indicates to the robot that the site and server are currently unavailable and that it should come back a little later. In the case of short-term unavailability, this will prevent pages from being excluded from the search results.
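As a sketch, such a temporary maintenance response with a hint about when to come back could be configured in nginx roughly like this (the 3600-second interval is an assumption):

    # Return 503 with a Retry-After hint while maintenance is in progress
    location / {
        add_header Retry-After 3600 always;
        return 503;
    }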

In addition to these HTTP codes and page statuses, the robot also needs to obtain the content of the page itself. If for a regular visitor the page looks like this:


with pictures, text, navigation, everything laid out very nicely, then for the indexing robot any page is just a set of source code, HTML:
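For illustration, a fragment of the source code the robot receives might look like this (the content is hypothetical):

    <!DOCTYPE html>
    <html>
      <head>
        <meta charset="utf-8">
        <title>Roller skates - Example Shop</title>
        <meta name="description" content="Catalogue of roller skates">
        <link rel="canonical" href="https://example.com/catalog/roller-skates/">
      </head>
      <body>
        <h1>Roller skates</h1>
        <!-- text content, links, scripts... -->
      </body>
    </html>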


Various meta tags, text content, links, scripts - a lot of all kinds of information. The robot collects it and includes it in the search results. It all seems simple: request a page, receive the status, receive the content, include it in the search.

But it’s not for nothing that the Yandex search service receives more than 500 letters from webmasters and site owners stating that certain problems have arisen with the server’s response.

All these problems can be divided into two parts:

These are problems with the HTTP response code and problems with the HTML code, that is, with the content of the pages themselves. There can be a huge number of reasons for these problems. The most common is that the indexing robot is blocked by the hosting provider.


For example, you launched a website and added a new section. The robot begins to visit your site more often, increasing the load on the server. The hosting provider sees this in their monitoring, blocks the indexing robot, and as a result the robot cannot access your site. You go to your resource and everything is fine: everything works, the pages look good, everything opens, but the robot cannot index the site. Or the site is temporarily unavailable, for example because you forgot to pay for the domain name and the site has been down for several days. The robot comes to the site, finds it inaccessible, and under such conditions the site can disappear from the search results literally within a short time.

Incorrect CMS settings, for example when updating or switching to another CMS or when updating the design, can also cause pages on your site to disappear from the search results: a prohibiting robots meta tag in the source code of the site's pages, or an incorrectly set canonical attribute. Make sure that after any changes you make to the site, the pages remain accessible to the robot.
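For reference, a prohibiting robots meta tag in a page's source code typically looks like this:

    <meta name="robots" content="noindex, nofollow">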

The server response check tool in Yandex.Webmaster will help you with this:


You can see what HTTP headers your server returns to the robot, and the contents of the pages themselves.


The “Indexing” section contains statistics where you can see which pages have been excluded and how these indicators change over time, and apply various sorting and filtering.


There is also the “Site diagnostics” section, which I already mentioned today. If your site becomes unavailable to the robot, you will receive a notification and recommendations on how to fix it. If no such problems arise, the site is accessible, responds with code 200, and contains correct content, then the robot begins to visit, in automatic mode, all the pages it knows about. This does not always lead to the desired results, so the robot's activity can be limited in a certain way. The robots.txt file exists for this. We'll talk about it in the next section.

Robots.txt

The robots.txt file itself is a small text document. It lies in the root folder of the site and contains strict rules for the indexing robot that must be followed when crawling the site. The advantage of the robots.txt file is that you do not need any special or specialized knowledge to use it.

All you have to do is open Notepad, enter the rules in a certain format, and then simply save the file on the server. Within a day, the robot begins to use these rules.

Let's take an example of a simple robots.txt file.
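A minimal sketch of such a file (the paths and domain here are hypothetical):

    User-agent: *
    Disallow: /admin/
    Disallow: /search/
    Sitemap: https://example.com/sitemap.xml
    Host: example.com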


The “User-agent” directive shows which robots the rules are intended for; then come the allowing and denying directives and the auxiliary Sitemap and Host directives. That's enough theory - I would like to move on to practice.

A few months ago I wanted to buy a pedometer, so I turned to Yandex.Market for help with the choice. I switched from the Yandex home page to Yandex.Market and landed on the main page of the service.


Below you can see the address of the page I went to. In addition to the address of the service itself, an identifier for me as a user of the site was appended to it.

Then I went to the “Catalog” section.


I selected the desired subsection and configured the sorting parameters: the price filter, the sort order, and the manufacturer.

I received a list of products, and the page address had already grown.

I went to the desired product, clicked on the “add to cart” button and continued checkout.

During my short journey, the page addresses changed in a certain way.


Service parameters were added to them: they identified me as a user, set up the sorting, and told the site owner where I had come from to reach this or that page of the site.

I think such pages, service pages, will not be very interesting to search engine users. But if they are available to the indexing robot, they may be included in the search, since the robot essentially behaves like a user.

It goes to one page, sees a link it can follow, goes to it, loads the data into its database, and continues crawling the entire site in this way. The same category of addresses also includes pages with users' personal data, such as delivery or contact information.

Naturally, it is better to prohibit indexing them, and this is exactly what the robots.txt file will help you with. You can go to your website this evening after the Webmaster Workshop, click around it, and see which pages are actually accessible.
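As a sketch, such service pages and parameters could be hidden from the robot with rules like these (the paths and parameter names are hypothetical):

    User-agent: Yandex
    Disallow: /cart/
    Disallow: /checkout/
    Clean-param: session_id&from /catalog/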

To check robots.txt, there is a special tool in Webmaster:


You can load your robots.txt file, enter page addresses, and see whether or not they are accessible to the robot.


Make some changes, see how the robot reacts to these changes.

Errors when working with robots.txt

Besides its positive effect of closing off service pages, robots.txt can play a cruel joke on you if handled incorrectly.

Firstly, the most common problem when using robots.txt is the closing of really necessary site pages, those that should be in the search and shown for queries. Before you make changes to robots.txt, be sure to check whether the page you want to close is showing up for search queries. Perhaps a page with some parameters is in the search results and visitors come to it from search. Therefore, be sure to check before using and making changes to robots.txt.

Secondly, if your site uses Cyrillic addresses, you cannot specify them in robots.txt directly; they must be percent-encoded. Since robots.txt is an international standard that all indexing robots follow, Cyrillic cannot be written in it explicitly.
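For example, a rule for a hypothetical Cyrillic section /корзина would have to be written in percent-encoded form:

    # Instead of: Disallow: /корзина
    Disallow: /%D0%BA%D0%BE%D1%80%D0%B7%D0%B8%D0%BD%D0%B0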

The third most popular problem is different rules for the robots of different search engines: all pages are closed off for one indexing robot while nothing at all is closed for another. As a result, everything is fine in one search engine, the desired pages are in the search, while in another there may be junk and various garbage pages. Make sure that if you set a ban, it applies to all indexing robots.

The fourth most popular problem is using the Crawl-delay directive when it is not necessary. This directive allows you to influence the frequency of requests from the indexing robot. Here is a practical example: a small site was placed on small hosting and everything was fine. Then a large catalogue was added; the robot came, saw a bunch of new pages, started accessing the site more often, increased the load, overloaded it, and the site became unavailable. The owners set the Crawl-delay directive, the robot saw it and reduced the load; everything was fine, the site worked, everything was indexed perfectly and was in the search results. After some time the site grew even more and was moved to new hosting ready to cope with a large number of requests, but they forgot to remove the Crawl-delay directive. As a result, the robot understands that a lot of pages have appeared on your site but cannot index them simply because of the directive that is still in place. If you have ever used the Crawl-delay directive, make sure it is not there now and that your server is ready to handle the load from the indexing robot.
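For reference, the directive itself is a single line in robots.txt; the two-second value below is only an example:

    User-agent: Yandex
    Crawl-delay: 2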


In addition to the described functionality, the robots.txt file allows you to solve two very important tasks - get rid of duplicates on the site and indicate the address of the main mirror. This is exactly what we will talk about in the next section.

Duplicates


By duplicates we mean several pages of the same site that contain absolutely identical content. The most common example is pages with and without a slash at the end of the address. Also, a duplicate can be understood as the same product in different categories.

For example, roller skates can be for girls or for boys, and the same model can be in two sections at the same time. And thirdly, these are pages with an insignificant parameter. As in the example with Yandex.Market, this could be a session ID: such a parameter does not change the content of the page at all.

To detect duplicates and see which pages the robot is accessing, you can use Yandex.Webmaster.


In addition to statistics, you can also see the addresses of pages the robot has downloaded, along with the response code and the time of the last visit.

Problems that duplicates lead to

What's so bad about duplicates?

Firstly, the robot begins to access absolutely identical pages of the site, which creates an additional load not only on your server, but also affects the crawling of the site as a whole. The robot begins to pay attention to duplicate pages, and not to those pages that need to be indexed and included in search results.


The second problem is that duplicate pages, if they are accessible to the robot, can end up in search results and compete with the main pages for queries, which, naturally, can negatively affect the site being found for certain queries.

How can you deal with duplicates?

First of all, I recommend using the “canonical” tag in order to point the robot to the main, canonical page, which should be indexed and found in search queries.
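As a sketch, a duplicate page would carry a tag pointing to the main version (the address is hypothetical):

    <link rel="canonical" href="https://example.com/catalog/roller-skates/">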

Secondly, you can use a 301 server-side redirect, for example for the situation with and without a slash at the end of the address. Set up the redirect and there are no more duplicates.
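A sketch of such a redirect in an Apache .htaccess file, assuming the site's canonical addresses end with a slash:

    # Add a trailing slash to URLs that lack one (existing files are excluded)
    RewriteEngine On
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteRule ^(.*[^/])$ /$1/ [R=301,L]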


And thirdly, as I already said, this is the robots.txt file. You can use both deny directives and the Clean-param directive to get rid of insignificant parameters.

Site mirrors

The second task that robots.txt allows you to solve is to point the robot to the address of the main mirror.


Mirrors are a group of sites that are absolutely identical, like duplicates, except that they are different sites. Webmasters usually encounter mirrors in two cases: when they want to move to a new domain, or when they need to make several website addresses available to users.

For example, you know that when users type your website address in the address bar, they often make the same mistakes: a typo, a wrong character, or something else. You can purchase an additional domain in order to show such users not a stub page from the hosting provider but the site they really wanted to visit.

Let's focus on the first point, because it is with this that problems most often arise when working with mirrors.

I advise you to carry out the entire move according to the following instructions, a short checklist that will allow you to avoid various problems when moving to a new domain name:

First, you need to make both sites accessible to the indexing robot and place absolutely identical content on them. Also make sure that the robot knows about the existence of the sites. The easiest way is to add them to Yandex.Webmaster and confirm your rights to them.

Secondly, using the Host directive, point the robot to the address of the main mirror - the one that should be indexed and be in the search results.
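A sketch of such an instruction in robots.txt (the domain is hypothetical; for an HTTPS site the protocol is included):

    User-agent: Yandex
    Host: https://new-domain.example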

Then wait for the mirrors to be merged and for all indicators to be transferred from the old site to the new one.


After that, you can set up a redirect from the old address to the new one. It is a simple set of instructions; if you are moving, be sure to use it. I hope there won't be any problems with the move.

But, naturally, errors arise when working with mirrors.

First of all, the most important problem is the lack of an explicit instruction to the indexing robot about the address of the main mirror, the address that should be in the search. Check that your sites have a Host directive in their robots.txt and that it points to exactly the address you want to see in the search.

The second most popular problem is using a redirect to change the main mirror in an existing group of mirrors. What happens? The old address, since it redirects, is not indexed by the robot and is excluded from the search results, while the new site does not appear in the search because it is not the main mirror. You lose traffic and visitors; I don't think anyone needs that.


And the third problem is the inaccessibility of one of the mirrors during the move. The most common example is when the site's content is copied to a new address while the old address is simply switched off: the domain name is not paid for and it becomes unavailable. Naturally, such sites will not be merged; both must be accessible to the indexing robot.

Useful links:

  • You will find more useful information in the Yandex.Help service.
  • All the tools I talked about, and even more, are available in the beta version of Yandex.Webmaster.

Answers to questions

“Thank you for the report. Is it necessary to disable indexing of CSS files for the robot in robots.txt or not?”

We do not recommend closing them at this time. It is better to leave CSS and JavaScript open, because we are now working on having the indexing robot recognize both the scripts and the styles on your site and see the page the way a visitor sees it in a regular browser.

“Tell me, if the site URLs are the same for the old and the new, is that normal?”

Yes, that's fine. Essentially, you are just updating the design and adding some content.

“The site has a category that consists of several pages: slash, page1, page2, up to page 10, for example. All the pages contain the same category text, so it turns out to be duplicated. Will this text count as a duplicate, or should it somehow be closed with noindex on the second and subsequent pages?”

First of all, since the pagination on the first page and the content on the second page are generally different, they will not be duplicates. But you should expect that the second, third, and further pagination pages can get into the search and show up for some relevant query. On pagination pages I would recommend using the canonical attribute, in the best case pointing to the page on which all the products are collected, so that the robot does not include the pagination pages in the search. People also very often point canonical at the first page of pagination: the robot comes to the second page, sees the products and the text, does not include the page in the search, and understands from the attribute that it is the first pagination page that should be included in the search results. Use canonical; as for closing off the text itself, I don't think there is any need.
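As a sketch, the second and subsequent pagination pages could carry a tag like this (the "view all" address is hypothetical):

    <!-- on /catalog/page2/, /catalog/page3/ and so on -->
    <link rel="canonical" href="https://example.com/catalog/all/">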

Source (video): How to set up site indexing - Alexander Smirnov

Magomed Cherbizhev

From this material you will learn:

  • What is indexing
  • How to add a site to the database
  • How to speed up indexing

What is indexing?

We have already talked about search results: in short, these are the answers to the queries users type into a search engine. If you search for “buy an elephant”, you will get sites that offer elephants wholesale and retail; Yandex or Google will not show plastic windows or call girls for such a query. And now, attention, the question: does every site get into the search results? No, not every one. At a minimum, search engines need to know that the site exists and what content is posted on it. After all, how can they show something nobody knows about? There are website databases for this, and, in short, adding a site and its content to such a database is called indexing.

How does indexing work? The Yandex or Google robot operates according to its own algorithms: it looks for information about your site (articles, texts, photos and so on, in a word, all the content), scans every page of the site like an X-ray, analyzes it, and adds the site to the database. The site will then appear in search results in response to user queries, and whether it ends up among the leaders or the outsiders depends on the content it is filled with. Of course, site indexing is simply necessary: when search engines start to see your site, visitors come to it and it grows.

How to add a site to the database?

Let's say you created the website Mostbestsite.rf, filled it, of course, with the best content, and are looking forward to it reaching the top. For the site and its content to be indexed and included in the database, you can use one of two methods.
  1. Wait for self-indexing. Sooner or later, robots will find your site - to do this, you just need to leave active links to it from other resources. It is believed that the more links there are, the faster the indexing will be.
  2. Add the site manually by filling out a special form in the search engine. There you need to provide a link to the site and brief information about it. In Yandex, for example, this is done in the Yandex.Webmaster service on the “Report a new site” page.
There, in Webmaster, you can then monitor the statistics of the queries that bring users to your site. Everything is also simple in Google: you can register a site using the Google Webmaster Center link. How fast is indexing? It's hard to give exact numbers; it depends on your luck. But we know one thing for sure: Yandex indexes more slowly than Google. There have been cases when indexing there took several weeks.

How to speed up indexing?

Search engine algorithms are unpredictable and, as already mentioned, there is no exact recipe. We can recommend proven methods (essentially the same ones that affect website promotion in search results):
  1. Unique content, which search engines love so much. One caveat: if you have posted an article on a website or blog and it has not yet been indexed, anyone can in theory steal it and post it on their own site. If the article is indexed earlier on another site, you are in trouble: your article will be considered non-unique, and your competitor will come out ahead.
  2. Clear site structure. Follow the three-click rule: every page should be reachable from the main page in no more than three clicks. Fewer is good, more is bad! This way you make the task easier for search engines, which will index not only the main page but the other pages as well.
  3. Linking between internal pages is useful for both visitors and the robot.
  4. Broadcasting posts to RSS directories. Your posts will be duplicated in the RSS directories to which your RSS feed has been added. These are feeds for subscribing to blog or website updates so that readers receive the latest posts by email. With their help, the search engine will find your new articles and index them faster.

Here's an example of successful indexing:

A week ago, an article about the VPO-213 gun appeared on one of the sites. There was practically no information about this gun on the Internet: it had not yet gone on sale, and there was only a short video presentation on YouTube. Accordingly, the text was completely unique and almost the only one on the Internet on this topic. Around midnight the text was published on the site (not even on the main page!) and was indexed within a short time. At half past nine the next morning the article was in third place in the Yandex search results. At 9:50 it took first place and still holds that position.

In this guide we will look at adding our new site for indexing to various search engines.

I decided to cover both popular search engines and those you may not have heard of.

Site indexing in Yandex

To add a site for indexing, just enter the URL of the main page and the captcha. A captcha is a few characters that protect against automated submissions. After you click the “add” button, several outcomes are possible.

1) The message “your site has been added” signals the successful addition of a resource to the queue for indexing in Yandex.
2) If the message “Your hosting is not responding” appears, it means that your server is down at this moment. You can try adding a site later or find better hosting.
3) But if a message appears that “the specified URL is prohibited from indexing,” then things are bad. This indicates that sanctions have been imposed on your resource in the form of a site ban. It is quite possible that the domain you purchased already had a website that received sanctions. Using addurl, webmasters often check sites for bans in Yandex.

Site indexing in Google

The next most important search engine for our site is Google. The process of adding a site for indexing in Google is exactly the same as in Yandex. Google also has its own addurl form, located at: https://www.google.com/webmasters/tools/submit-url.

You also need to enter a captcha when adding a site, but there are two differences. While in Yandex you can simply add a URL without any extra steps, in Google you need to be logged in to your account; otherwise it won't work. Accordingly, if you don't have an account there yet, you will have to create one. The second difference between Google and Yandex is the speed of indexing: Google indexes websites very quickly.

Site indexing in Rambler (Rambler.ru)

Of course, Rambler is not what it used to be, as many will say, and it provides very little traffic. Still, why neglect it? Site indexing in Rambler used to take the longest among the domestic search engines, and its addurl form has not worked for a long time; it used to be at: robot.rambler.ru/cgi-bin/addsite.cgi

Rambler has been using the Yandex search database for a long time, so to get into the Rambler.ru index it is enough to add your site to Yandex.

Site indexing in Mail.ru (Mail)

The search engine Mail.ru also has a webmaster’s account. Adding a site for indexing in Mail.ru occurs through the addurl form, which is located at: go.mail.ru/addurl

Moreover, as with Google, to submit a site for indexing you need to create an account and log into it; it won't work otherwise. Mail.ru has recently been trying to develop its own tools for webmasters.

Above we looked at the main domestic search engines in which we would like our site indexed. The following search engines are listed more for your general SEO erudition than for any specific action.

Search engine Aport.ru (Aport)

Aport.ru was once a search engine with its own index database and addurl form. It has now been turned into a product search engine in which you can compare prices for goods and services.

Search engine Nigma.ru (Nigma)

Nigma.ru is our Russian intelligent search engine. The total volume of its traffic is about three million queries per day. Obviously, traffic from Nigma should not be neglected. You can add your site for indexing in Nigma on the page nigma.ru/index_menu.php?menu_element=add_site.

Media navigator Tagoo.ru

The Tagoo.ru system is a media search engine that indexes media content: music, videos, and programs. For your site to be indexed by the Tagoo system, you need to use its add form: tagoo.ru/ru/webmaster.php?mode=add_site.

Search engine Turtle.ru (Turtle)

The international search engine Turtle searches across the CIS countries in any language. Resources in the following domain zones are accepted for indexing: ru, su, ua, am, az, ge, by, kz, kg, uz, md. To add a site for indexing in Turtle, use the addurl form: http://www.turtle.ru/add.html. It is advisable to wait for a message confirming that your site has been accepted; it may not be added, and you won't even know.

Foreign search engines

The search engines above are domestic; below is a list of foreign ones.

Search engine Yahoo.com (Yahoo)

Search engine Bing.com (Bing)

The Bing search engine is owned by Microsoft and was created to replace Live Search. Microsoft hopes that its new brainchild will be much more popular than its predecessor. If you want your site indexed by Bing.com, you can submit it at http://www.bing.com/toolbox/submit-site-url.

Ukrainian search engines

In conclusion of my review, I will mention two search engines that are popular in Ukraine.

Ukrainian search engine Meta.ua (Meta)

For a number of reasons, search engines do not index all pages of a site or, conversely, add unwanted ones to the index. As a result, it is almost impossible to find a site that has the same number of pages in Yandex and Google.

If the discrepancy does not exceed 10%, not everyone pays attention to it. But this holds for media and information sites, where the loss of a small share of pages does not affect overall traffic. For online stores and other commercial sites, though, the absence of product pages from the search (even one out of ten) means lost income.

Therefore, it is important to check the indexing of pages in Yandex and Google at least once a month, compare the results, identify which pages are missing in the search, and take action.

The problem with monitoring indexing

Viewing indexed pages is not difficult. This can be done by exporting reports from the webmaster panels:

  • Yandex.Webmaster (“Indexing” / “Pages in search” / “All pages” / “Download XLS / CSV table”);

Tool capabilities:

  • simultaneous checking of indexed pages in Yandex and Google (or in just one search engine);
  • the ability to check all of the site's URLs at once;
  • no limit on the number of URLs.

Features:

  • work “in the cloud” - no need to download and install software or plugins;
  • uploading reports in XLSX format;
  • notification by email about the end of data collection;
  • storage of reports for an unlimited time on the PromoPult server.