What is robots.txt? How to edit the robots.txt file

Hello, dear readers of the “Webmaster’s World” blog!

The robots.txt file is a very important file that directly affects how well your site is indexed, and therefore its search engine promotion.

That is why you need to be able to format robots.txt correctly, so that you do not accidentally block important documents of your project from the index.

This article discusses how to use the robots.txt file, what syntax to follow, and how to allow and disallow documents in the index.

About the robots.txt file

First, let's find out in more detail what kind of file this is.

The robots file is a file that shows search engines which pages and documents on a site may be added to the index and which may not. It is needed because search engines initially try to index the entire site, and that is not always correct. For example, if you build a site on an engine (WordPress, Joomla, etc.), you will have folders that support the administrative panel. Clearly, the information in these folders should not be indexed; in this case the robots.txt file is used to restrict search engines' access.

The robots.txt file also contains the address of the site map (it improves indexing by search engines), as well as the main domain of the site (the main mirror).

A mirror is an exact copy of a site: when the same site is available at two addresses, one of them is called the main domain and the other is its mirror.

Thus, the file has quite a lot of functions, and important ones at that!

Robots.txt file syntax

The robots file contains blocks of rules that tell a particular search engine what may be indexed and what may not. There can be a single block of rules (for all search engines), or there can be several, each for a specific search engine.

Each such block begins with a “User-Agent” operator, which indicates which search engine these rules apply to.

User-agent: A
(rules for robot “A”)

User-agent: B
(rules for robot “B”)

The example above shows that the “User-Agent” operator takes a parameter – the name of the search engine robot to which the rules apply (for example, Yandex or Googlebot).

After “User-Agent” come the other operators (Allow, Disallow, Host, Sitemap and so on); they are described in detail later in this article.

All operators have the same syntax, i.e. they are used as follows:

Operator1: parameter1

Operator2: parameter2

Thus, we first write the name of the operator (capitalization does not matter), then put a colon and, after a space, indicate the operator's parameter. Then, on a new line, we describe the second operator in the same way.

Important! An empty line means that the block of rules for a given search engine is complete, so do not separate statements within a block with an empty line.

Example robots.txt file

Let's look at a simple example of a robots.txt file to better understand the features of its syntax:

User-agent: Yandex
Allow: /folder1/
Disallow: /file1.html
Host: www.site.ru

User-agent: *
Disallow: /document.php
Disallow: /folderxxx/
Disallow: /folderyyy/folderzzz
Disallow: /feed/

Sitemap: http://www.site.ru/sitemap.xml

Now let's look at the described example.

The file consists of three blocks: the first for Yandex, the second for all search engines, and the third contains the sitemap address (applied automatically to all search engines, so there is no need to specify “User-Agent”). We allowed Yandex to index the folder “folder1” and all its contents, but prohibited it from indexing the document “file1.html” located in the root directory of the hosting. We also told Yandex the main domain of the site. The second block is for all search engines. There we disallowed the document "document.php", as well as the folders "folderxxx", "folderyyy/folderzzz" and "feed".

Please note that in the second block we did not disallow the entire “folderyyy” folder, only the folder inside it, “folderzzz”. I.e. we provided the full path to "folderzzz". This should always be done when disallowing a document located not in the root directory of the site but inside other folders.

Creating the file takes less than two minutes.

The created robots file can be checked in the Yandex webmaster panel. If errors are found in the file, Yandex will show them.

Be sure to create a robots.txt file for your site if you don't have one yet. This will help your site grow in search engines. You can also read our other article about meta tags and .htaccess.

Good afternoon, dear friends! As you all know, search engine optimization is a responsible and delicate matter. You need to take absolutely every little detail into account to get an acceptable result.

Today we will talk about robots.txt, a file familiar to every webmaster. It contains the most basic instructions for search robots. As a rule, they gladly follow the prescribed instructions and, if the file is composed incorrectly, refuse to index the web resource. Next, I will tell you how to compose a correct version of robots.txt and how to configure it.

In the preface I already described what it is; now I'll tell you why it is needed. Robots.txt is a small text file stored in the root of the site. It is used by search engines. It clearly states the rules of indexing, i.e. which sections of the site should be indexed (added to the search) and which should not.

Typically, the technical sections of a site are closed from indexing. Occasionally, non-unique pages are blacklisted (a copy-pasted privacy policy is an example). Here the robots are "told" how to work with the sections that do need to be indexed. Very often rules are prescribed separately for several robots. We will talk about this further on.

With a correctly configured robots.txt, your website is guaranteed to rise in search engine rankings. Robots will take only useful content into account, ignoring duplicated or technical sections.

Creating robots.txt

To create the file, just use the standard tools of your operating system (a plain text editor) and then upload it to the server via FTP. Where it belongs on the server is easy to guess: in the root. Typically this folder is called public_html.

You can easily get there using any FTP client or your hosting's built-in file manager. Naturally, we will not upload an empty robots file to the server. Let's write a few basic directives (rules) in it.

User-agent: *
Allow: /

With these lines in your robots file, you address all robots (the User-agent directive) and allow them to index your entire site, including all technical pages (Allow: /).

Of course, this option is not particularly suitable for us. The file will not be particularly useful for search engine optimization; it definitely needs proper tuning. But before that, let's look at all the main directives and values of robots.txt.

Directives

  • User-agent – one of the most important, because it indicates which robots should follow the rules that come after it. The rules are taken into account until the next User-agent in the file.
  • Allow – allows indexing of any blocks of the resource. For example: “/” or “/tag/”.
  • Disallow – on the contrary, prohibits indexing of sections.
  • Sitemap – the path to the site map (in xml format).
  • Host – the main mirror (with or without www, or if you have several domains). The secure protocol https (if available) is also indicated here. If you have standard http, you don't need to specify it.
  • Crawl-delay – lets you set the interval at which robots visit and download files from your site. Helps reduce the load on the host.
  • Clean-param – allows you to disable indexing of parameters on certain pages (like www.site.com/cat/state?admin_id8883278). Unlike the previous directives, two values are specified here (the address and the parameter itself).

These are all the rules supported by the flagship search engines. It is with their help that we will compose our robots.txt, using different variations for different types of sites.
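
Of the directives above, Crawl-delay and Clean-param do not appear in the simple examples that follow, so here is a rough sketch of how they might be combined (the paths and parameter names are made up purely for illustration):

User-agent: Yandex
Disallow: /admin/ # a hypothetical technical section
Crawl-delay: 2 # ask the robot to wait at least 2 seconds between downloads
Clean-param: utm_source&utm_medium /catalog/ # ignore these GET parameters on /catalog/ pages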

Settings

To properly configure the robots file, we need to know exactly which sections of the site should be indexed and which should not. In the case of a simple one-page website using html + css, we just need to write a few basic directives, such as:

User-agent: *
Allow: /
Sitemap: site.ru/sitemap.xml
Host: www.site.ru

Here we have specified the rules and values ​​for all search engines. But it’s better to add separate directives for Google and Yandex. It will look like this:

User-agent: *
Allow: /

User-agent: Yandex
Allow: /
Disallow: /politika

User-agent: GoogleBot
Allow: /
Disallow: /tags/

Sitemap: site.ru/sitemap.xml
Host: site.ru

Now absolutely all files on our html site will be indexed. If we want to exclude some page or picture, then we need to specify a relative link to this fragment in Disallow.
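
For instance (the file and image names here are hypothetical), excluding one page and one picture could look like this:

User-agent: *
Allow: /
Disallow: /old-page.html # a single page
Disallow: /images/photo1.jpg # a single picture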

You can also use services that generate robots files automatically. I don't guarantee that they will produce a perfectly correct version, but you can try them as an introduction.

There are a number of such services online; with their help you can create robots.txt in automatic mode. Personally, I strongly advise against this option, because it is much easier to do it manually, tailoring the file to your platform.

When I talk about platforms, I mean all kinds of CMS, frameworks, SaaS systems and much more. Next, we will talk about how to set up the robots file for WordPress and Joomla.

But before that, let’s highlight a few universal rules that can guide you when creating and setting up robots for almost any site:

Disallow from indexing:

  • site admin;
  • personal account and registration/authorization pages;
  • cart, data from order forms (for an online store);
  • cgi folder (located on the host);
  • service sections;
  • ajax and json scripts;
  • UTM and Openstat tags;
  • various parameters.

Open (Allow):

  • pictures;
  • JS and CSS files;
  • other elements that must be taken into account by search engines.

In addition, at the end, do not forget to indicate the sitemap (path to the site map) and host (main mirror) data.
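
Put together, such a universal skeleton might look roughly like this (all paths are illustrative and depend on your CMS):

User-agent: *
Disallow: /admin/ # site admin (illustrative path)
Disallow: /login/ # registration/authorization pages
Disallow: /cart/ # cart and order forms
Disallow: /cgi-bin/ # cgi folder
Disallow: *utm= # UTM tags
Disallow: *openstat= # Openstat tags
Allow: /*.js # JS files
Allow: /*.css # CSS files
Allow: /*.jpg # pictures
Allow: /*.png

Sitemap: https://site.ru/sitemap.xml
Host: https://site.ru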

Robots.txt for WordPress

To create the file, we similarly need to drop robots.txt into the root of the site. You can change its contents using the same FTP clients and file managers.

There is a more convenient option - create a file using plugins. In particular, Yoast SEO has such a function. Editing robots directly from the admin panel is much more convenient, so I myself use this method of working with robots.txt.

How you decide to create this file is up to you; it is more important for us to understand exactly what directives should be there. On my sites running WordPress I use this option:

User-agent: * # rules for all robots, except Google and Yandex

Disallow: /cgi-bin # folder with scripts
Disallow: /? # query parameters on the home page
Disallow: /wp- # files of the CMS itself (with the wp- prefix)
Disallow: *?s= # \
Disallow: *&s= # everything related to search
Disallow: /search/ # /
Disallow: /author/ # author archives
Disallow: /users/ # and users
Disallow: */trackback # notifications from WP that someone is linking to you
Disallow: */feed # feed in xml
Disallow: */rss # and rss
Disallow: */embed # built-in elements
Disallow: /xmlrpc.php #WordPress API
Disallow: *utm= # UTM tags
Disallow: *openstat= # Openstat tags
Disallow: /tag/ # tags (if available)
Allow: */uploads # open downloads (pictures, etc.)

User-agent: GoogleBot # for Google
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: /xmlrpc.php
Disallow: *utm=
Disallow: *openstat=
Disallow: /tag/
Allow: */uploads
Allow: /*/*.js # open JS files
Allow: /*/*.css # and CSS
Allow: /wp-*.png # and images in png format
Allow: /wp-*.jpg # \
Allow: /wp-*.jpeg # and other formats
Allow: /wp-*.gif # /
# works with plugins

User-agent: Yandex # for Yandex
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: /xmlrpc.php
Disallow: /tag/
Allow: */uploads
Allow: /*/*.js
Allow: /*/*.css
Allow: /wp-*.png
Allow: /wp-*.jpg
Allow: /wp-*.jpeg
Allow: /wp-*.gif
Allow: /wp-admin/admin-ajax.php
Clean-Param: utm_source&utm_medium&utm_campaign # clean UTM tags
Clean-Param: openstat # and don’t forget about Openstat

Sitemap: # specify the path to the site map
Host: https://site.ru # main mirror

Attention! When copying lines to a file, do not forget to remove all comments (text after #).

This robots.txt variant is the most popular among webmasters who use WP. Is it ideal? No. You can try to add something or, on the contrary, remove something. But keep in mind that mistakes are common when tailoring robots.txt to your engine. We will talk about them further on.

Robots.txt for Joomla

And although few people still use Joomla in 2018, I believe this wonderful CMS cannot be ignored. When promoting projects on Joomla, you will certainly have to create a robots file; otherwise, how else would you block unnecessary elements from indexing?

As in the previous case, you can create a file manually by simply uploading it to the host, or use a module for these purposes. In both cases, you will have to configure it correctly. This is what the correct option for Joomla will look like:

User-agent: *
Allow: /*.css?*$
Allow: /*.js?*$
Allow: /*.jpg?*$
Allow: /*.png?*$
Disallow: /cache/
Disallow: /*.pdf
Disallow: /administrator/
Disallow: /installation/
Disallow: /cli/
Disallow: /libraries/
Disallow: /language/
Disallow: /components/
Disallow: /modules/
Disallow: /includes/
Disallow: /bin/
Disallow: /component/
Disallow: /tmp/
Disallow: /index.php
Disallow: /plugins/
Disallow: /*mailto/

Disallow: /logs/
Disallow: /component/tags*
Disallow: /*%
Disallow: /layouts/

User-agent: Yandex
Disallow: /cache/
Disallow: /*.pdf
Disallow: /administrator/
Disallow: /installation/
Disallow: /cli/
Disallow: /libraries/
Disallow: /language/
Disallow: /components/
Disallow: /modules/
Disallow: /includes/
Disallow: /bin/
Disallow: /component/
Disallow: /tmp/
Disallow: /index.php
Disallow: /plugins/
Disallow: /*mailto/

Disallow: /logs/
Disallow: /component/tags*
Disallow: /*%
Disallow: /layouts/

User-agent: GoogleBot
Disallow: /cache/
Disallow: /*.pdf
Disallow: /administrator/
Disallow: /installation/
Disallow: /cli/
Disallow: /libraries/
Disallow: /language/
Disallow: /components/
Disallow: /modules/
Disallow: /includes/
Disallow: /bin/
Disallow: /component/
Disallow: /tmp/
Disallow: /index.php
Disallow: /plugins/
Disallow: /*mailto/

Disallow: /logs/
Disallow: /component/tags*
Disallow: /*%
Disallow: /layouts/

Host: site.ru # don't forget to change the address here to yours
Sitemap: site.ru/sitemap.xml # and here

As a rule, this is enough to keep extra files out of the index.

Errors during setup

Very often people make mistakes when creating and setting up a robots file. Here are the most common of them:

  • The rules are specified only for User-agent.
  • Host and Sitemap are missing.
  • The presence of the http protocol in the Host directive (you only need to specify https).
  • Failure to comply with nesting rules when opening/closing images.
  • UTM and Openstat tags are not closed.
  • Writing host and sitemap directives for each robot.
  • Superficial elaboration of the file.

It is very important to configure this small file correctly. If you make serious mistakes, you can lose a significant part of the traffic, so be extremely careful when setting up.

How to check a file?

For these purposes, it is better to use the special services from Yandex and Google. Since these search engines are the most popular and in demand (and most often the only ones used), there is no point in considering search engines such as Bing, Yahoo or Rambler.

First, let's consider the option with Yandex. Go to Webmaster. Then go to Tools – Analysis of robots.txt.

Here you can check the file for errors, as well as check in real time which pages are open for indexing and which are not. Very convenient.

Google has exactly the same service. Let's go to Search Console. Find the Scanning tab and select Robots.txt File Check Tool.

The functions here are exactly the same as in the Yandex service.

Please note that it shows me 2 errors. This is due to the fact that Google does not recognize the directives for clearing the parameters that I specified for Yandex:

Clean-Param: utm_source&utm_medium&utm_campaign
Clean-Param: openstat

You should not pay attention to this, because Google robots only use GoogleBot rules.

Conclusion

The robots.txt file is very important for SEO optimization of your website. Approach its setup with all responsibility, because if implemented incorrectly, everything can go to waste.

Keep in mind all the instructions I've shared in this article, and don't forget that you don't have to copy my robots variations exactly. It is quite possible that you will have to further understand each of the directives, adjusting the file to suit your specific case.

And if you want to understand robots.txt and creating websites on WordPress in more depth, then I invite you to. Here you will learn how you can easily create a website, not forgetting to optimize it for search engines.

Almost every project that comes to us for audit or promotion has an incorrect robots.txt file, and often it is missing altogether. This happens because when creating a file, everyone is guided by their imagination, and not by the rules. Let's figure out how to correctly compose this file so that search robots work with it effectively.

Why do you need to configure robots.txt?

Robots.txt is a file located in the root directory of a site that tells search engine robots which sections and pages of the site they can access and which they cannot.

Setting up robots.txt is an important part of search engine optimization; a properly configured robots file also improves site performance. A missing robots.txt won't stop search engines from crawling and indexing your site, but without this file you may run into the following problems:

    The search robot will crawl the entire site, which will "eat into" the crawling budget. The crawling budget is the number of pages that a search robot is able to crawl in a given period of time.

    Without a robots file, the search engine will get access to draft and hidden pages and to the hundreds of pages used to administer the CMS. It will index them, and by the time it reaches the pages that serve actual content to visitors, the crawling budget will have "run out."

    The index may include the site's login page and other administrative resources, so an attacker can easily find them and launch a DDoS attack or hack the site.

How search robots see a site with and without robots.txt:


Robots.txt syntax

Before we start understanding the syntax and setting up robots.txt, let's look at what the “ideal file” should look like:
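
The illustration is not reproduced here, but as a rough sketch (the addresses and paths are placeholders), such a file could look like this:

User-agent: *
Disallow: /admin/ # technical section
Disallow: /search/ # internal search results

Sitemap: https://site.ru/sitemap.xml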


But you shouldn’t use it right away. Each site most often requires its own settings, since we all have a different site structure and different CMS. Let's look at each directive in order.

User-agent

User-agent defines the search robot that must follow the instructions described in the file. If you need to address all of them at once, use the * symbol. You can also address a specific search robot, for example Yandex or Google:
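
The original illustration is missing, so here is a sketch of the idea (the Disallow lines are placeholders):

User-agent: * # rules for all robots at once
Disallow: /tmp/

User-agent: Yandex # rules only for the Yandex robot
Disallow: /search/

User-agent: Googlebot # rules only for the Google robot
Disallow: /feed/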


Disallow

Using this directive, the robot understands which files and folders are prohibited from indexing. If you want your entire site to be open for indexing, leave the Disallow value empty. To hide all the content on the site, put “/” after Disallow.

We can deny access to a specific folder, file or file extension. In our example, we address all search robots and block access to the bitrix and search folders and to the pdf extension.
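
Judging by that description, the missing example presumably looked something like this:

User-agent: *
Disallow: /bitrix/ # the bitrix folder
Disallow: /search/ # the search folder
Disallow: /*.pdf # files with the pdf extension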


Allow

Allow forces pages and sections of the site to be indexed. In the example above, we address the Google search robot and block access to the bitrix and search folders and to the pdf extension, but inside the bitrix folder we force three folders open for indexing: components, js, tools.
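
Reconstructed from that description (a sketch, not the original screenshot), the block might look like this:

User-agent: Googlebot
Disallow: /bitrix/
Disallow: /search/
Disallow: /*.pdf
Allow: /bitrix/components/
Allow: /bitrix/js/
Allow: /bitrix/tools/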


Host - site mirror

A mirror site is a duplicate of the main site. Mirrors are used for a variety of purposes: changing the address, security, reducing the load on the server, etc.

Host is one of the most important rules. If this rule is specified, the robot will understand which of the site's mirrors should be taken into account for indexing. This directive is needed by the Yandex and Mail.ru robots; other robots will ignore it. Host is specified only once!

For the “https://” and “http://” protocols, the syntax in the robots.txt file will be different.
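
In practice that means the following (site.ru is a placeholder; use one line or the other, not both):

Host: site.ru # site served over plain http, protocol omitted
Host: https://site.ru # site served over https, protocol written explicitly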

Sitemap - site map

A sitemap is a form of site navigation that is used to inform search engines about new pages. Using the sitemap directive, we “forcibly” show the robot where the map is located.
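
A typical entry (the address is a placeholder) is a single line with the absolute URL of the map:

Sitemap: https://site.ru/sitemap.xml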


Symbols in robots.txt

Symbols used in the file: “/, *, $, #”.
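
The explanatory illustration is missing, so here is a brief sketch of their usual meanings: "/" limits a rule to a path, "*" stands for any sequence of characters, "$" marks the end of a URL, and "#" starts a comment.

Disallow: /cgi-bin/ # "/" - the rule covers everything under /cgi-bin/
Disallow: /*.pdf$ # "*" - any characters; "$" - the URL must end exactly in .pdf
# "#" - everything after it on a line is ignored by robots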


Checking functionality after setting up robots.txt

After you have placed robots.txt on your website, you need to add it and check it in Yandex.Webmaster and Google Search Console.

Yandex check:

  1. Follow the link.
  2. Select: Indexing settings - Robots.txt analysis.

Google check:

  1. Follow the link.
  2. Select: Scan - Robots.txt file inspection tool.

This way you can check your robots.txt for errors and make any necessary adjustments. Also keep a few general rules in mind:

  1. The file must be named in lowercase letters (robots.txt, not Robots.txt).
  2. Specify only one file or directory per Disallow directive.
  3. The "User-agent" line must not be empty.
  4. User-agent should always come before Disallow.
  5. Don't forget to include a slash if you need to disable indexing of a directory.
  6. Before uploading a file to the server, be sure to check it for syntax and spelling errors.

Good luck to you!

Video review of 3 methods for creating and customizing the Robots.txt file

Robots.txt is a text file that contains site indexing parameters for the search engine robots.

Yandex supports the following directives:

  • User-agent * – indicates the robot to which the rules listed in robots.txt apply.
  • Disallow – disables indexing of sections or individual pages of the site.
  • Sitemap – specifies the path to the Sitemap file located on the site.
  • Clean-param – indicates to the robot that the page URL contains parameters (for example, UTM tags) that do not need to be taken into account when indexing.
  • Allow – allows indexing of sections or individual pages of the site.
  • Crawl-delay – sets the minimum time period (in seconds) for the robot between finishing loading one page and starting to load the next.

* Mandatory directive.

The most common directives you may need are Disallow, Sitemap and Clean-param. For example:

User-agent: * # specify for which robots the directives are set
Disallow: /bin/ # disallows links from the "Shopping cart"
Disallow: /search/ # disallows links to built-in site search pages
Disallow: /admin/ # disallows links from the admin panel
Sitemap: http://example.com/sitemap # point the robot to the sitemap file for the site
Clean-param: ref /some_dir/get_book.pl

Robots of other search engines and services may interpret directives differently.

Note. The robot takes case into account in substring values (file names and paths, robot names), but not in the names of directives.

Using the Cyrillic alphabet

The use of Cyrillic is prohibited in the robots.txt file and server HTTP headers.
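
For domain names, Punycode is normally used instead, and Cyrillic in URL paths is percent-encoded. An illustrative example (the domain and path are samples):

# instead of: Sitemap: сайт.рф/sitemap.xml
Sitemap: https://xn--80aswg.xn--p1ai/sitemap.xml
# instead of: Disallow: /корзина
Disallow: /%D0%BA%D0%BE%D1%80%D0%B7%D0%B8%D0%BD%D0%B0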

Greetings, friends and subscribers of my blog. Today on the agenda is Robots.txt, everything you wanted to know about it, in brief, without unnecessary fluff.

What is Robots.txt and why is it needed?

Robots.txt is needed to indicate to the search engine (Yandex, Google, etc.) how correctly (from your point of view) the site should be indexed. Which pages, sections, products, articles need to be indexed, and which ones, on the contrary, are not necessary.

Robots.txt is a plain text file (with the .txt extension); the robots exclusion standard it follows dates back to 1994 and is supported by most search engines. A typical file consists of just a few User-agent, Disallow and Allow lines.

How does it affect the promotion of your site?

To successfully promote a website, it is necessary that the index (base) of Yandex and Google contains only the necessary pages of the site. By the required pages I mean the following:

  1. Home;
  2. pages of sections, categories;
  3. Goods;
  4. Articles;
  5. Pages “About the company”, “Contacts”, etc.

By NOT needed pages I mean the following:

  1. Duplicate pages;
  2. Print pages;
  3. Search results pages;
  4. System pages, registration, login, logout pages;
  5. Subscription pages (feed);

For example, if the search engine index contains duplicates of the main promoted pages, this will cause problems with the uniqueness of the content within the site, and will also negatively affect positions.

Where is it?

The file is usually located in the root of the public_html folder on your hosting.

What you should know about the Robots.txt file

  1. Robots.txt instructions are advisory in nature. This means that the settings are directions and not direct commands. But as a rule, both Yandex and Google follow the instructions without any problems;
  2. The file can only be hosted on the server;
  3. It should be at the root of the site;
  4. Violations of the syntax make the file invalid, which can negatively affect indexing;
  5. Be sure to check the correct syntax in the Yandex Webmaster panel!

How to block a page, section, file from indexing?

For example, I want to block the page from indexing in Yandex: http://site/page-for-robots/

To do this, I need to use the “Disallow” directive and the URL of the page (section, file). It looks like this:

User-agent: Yandex
Disallow: /page-for-robots/
Host: website

If I want to close an entire category from indexing, it looks like this:

User-agent: Yandex
Disallow: /category/case/
Host: website

If I want to block the entire site from indexing, except for the section http://site/category/case/, then you will need to do this:

User-agent: Yandex
Disallow: /
Allow: /category/case/
Host: website

The “Allow” directive, on the contrary, indicates which page, section, file needs to be indexed.

I think the logic of the construction is clear to you now. Please note that these rules will apply only to Yandex, since the User-agent specified is Yandex. Google will ignore this block and index the entire site.

If you want to write universal rules for all search engines, use: User-agent: *. Example:

User-agent: *
Disallow: /
Allow: /category/case/
Host: website

User-agent is the name of the robot for which the instructions are intended. The default value is * (asterisk) - this means that the instructions are intended for absolutely all search robots.
The most common robot names:

  • Yandex – all robots of the Yandex search engine
  • YandexImages – image indexer
  • Googlebot - Google robot
  • BingBot – robot of the Bing system
  • YaDirectBot – the robot of the Yandex.Direct contextual advertising system.

Links to a detailed review of all Yandex and Google directives.

What must be in a proper Robots.txt file

  1. The Host directive must be configured. It must specify the main mirror of your website: site.ru or www.site.ru. If your site uses https, this must also be indicated. The main mirror in Host and in Yandex.Webmaster must match.
  2. Sections and pages of the site that do not carry any useful load, as well as pages with duplicate content, print pages, search results and system pages should be closed from indexing (using the Disallow: directive).
  3. Provide a link to sitemap.xml (your sitemap in xml format).
    Sitemap: http://site.ru/sitemap.xml

Indication of the main mirror

First you need to find out which mirror is your main one by default. To do this, enter your site's URL into Yandex, hover over the URL in the search results, and at the bottom left of the browser window you will see whether the domain is shown with www or without. In this case, it is without www.

If the domain is specified with https, then in both Robots and Yandex.Webmaster you must specify https! It looks like this:
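
Presumably the missing screenshot showed a line of this kind (site.ru is a placeholder):

Host: https://site.ru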
