What is a robots.txt file? How do you create one? What is the best and most suitable robots.txt file for WordPress and Joomla? You can read the answers to all these questions in this article from the Seo Teaching site. This article also provides 8 practical examples of robots.txt files. Stay with me.
What is a robots.txt file?
The robots.txt file is a text file in which we set guidelines telling search engine robots which URLs they may crawl and index and which they may not. Note that crawling is different from indexing. A robot must first crawl a page and then decide whether or not to store it in the search engine database. If the page is stored, indexing has occurred. In this article from the Seo Teaching site, we will discuss how to create a robots.txt file; after reading it, you will be able to create and manage this file regardless of which CMS you have (WordPress, Joomla, etc.).
Search engines crawl the pages of your site, index them, and follow the links to other pages or sites. Before crawling a page from a domain, any standard robot first reads the robots.txt file and crawls only what the instructions in that file allow, so that it can then index those pages. So you have to be careful which pages you forbid robots to crawl: if you block important pages, especially landing pages, through this file, you will hurt your site's SEO and consequently your online business. (You can read about the importance of landing pages and how to build them, with an example, in the comprehensive article What is a landing page.)
The robots.txt file is for robots, but it is interesting to note that a similar file is also written for humans. This file is called humans.txt; it contains a message for users and visitors to your site and is written primarily in English.
Sample robots.txt file and humans.txt file for Google site:
Robots.txt file or robots meta tag
If you want a page and the links on it to be checked by the robot but not displayed in the search results, you should use the robots meta tag instead of the robots.txt file. To do this optimally, be sure to read our robots meta tag article for some interesting tips about this meta tag.
Where should we put the robots.txt file?
After you create a robots.txt file in ASCII or UTF-8 format, bots must be able to access it at http://domain.com/robots.txt (put your site's domain name in place of domain.com). Two points about this:
- If crawlers are to access our subdomains, we need to create a separate robots.txt file for each subdomain.
- If your site opens both with and without www, you need to set up the robots.txt file the same way for both; the same is true for http and https. Keep in mind, however, that a site available in all these forms creates duplicate content, which is very detrimental to the site's internal SEO and is one of the reasons a site is penalized by Google, because Google considers each subdomain a separate site. Also note that a robots.txt file applies only to the protocol and host it is served from, so if these variants open separately, each one is read with its own robots.txt file.
Another important point is that the name of the robots.txt file must be written in lower case. The file name, like the rest of its URL path, is case sensitive. For example, the following addresses are completely different and only the first address is correct.
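As an illustration, using the placeholder domain.com from above (the capitalized variants are assumptions chosen only to show the difference in case):

```text
http://domain.com/robots.txt
http://domain.com/Robots.txt
http://domain.com/ROBOTS.TXT
```

A robot requests only the first, all-lowercase address; a file saved under either of the other two names would not be found.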
How to create a robots.txt file
If you see an error after visiting http://domain.com/robots.txt, it means that your site does not have a robots.txt file. To build robots.txt, it is enough to go to the control panel of the site host. If the host control panel is cPanel, create a simple file in the root of the site and name it robots.txt.
In the article Create a robots.txt file, Google explains how to create a robots.txt file. We will look at example commands for the robots.txt file later in this article, but first it is better to define three keywords.
User-agent: After this keyword we write the name of the desired robot, which is not case sensitive. Using the User-agent keyword, you can target a specific robot or give a command to all of them. Some search engines have several different bots; Google, for example, has separate bots for images, news, and more. Here are a few examples to help you better understand this.
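A few User-agent lines, sketched for illustration. These are fragments, not a complete file: in a real robots.txt each User-agent line would be followed by its own Disallow or Allow commands.

```text
# Address every robot
User-agent: *

# Address only Google's main crawler
User-agent: Googlebot

# Address only Google's image crawler
User-agent: Googlebot-Image
```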
If you want to learn more about Google bots, read this article on Google.
Disallow: After this keyword, you enter the directories that you do not want the robot to crawl. After a User-agent line, you can use Disallow as many times as necessary. Note that search engine crawlers treat directory names as case sensitive. Here are a few examples to help you better understand this.
Allow: The opposite of the Disallow command. Although the Allow command was not part of the original robots.txt convention, it is recognized by most popular robots.
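A minimal sketch of several Disallow lines under one User-agent. The directory names /private/ and /tmp/ are hypothetical and stand in for whatever folders you want to keep robots out of.

```text
User-agent: *
# Paths are case sensitive: this line blocks /private/ but not /Private/
Disallow: /private/
# Disallow can be repeated as often as needed
Disallow: /tmp/
```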
Sample commands in the Robots.txt file
First of all, know that:
* refers to all, for example all robots or all characters. The following examples show how * is used.
/ alone means all addresses.
$ indicates the end of a URL path.
Example 1 – Blocking access to the entire site
In the first line of the following example, we address all search engine robots by inserting *, and in the second line, by inserting /, we forbid crawling and indexing of all addresses on the domain. The following command therefore means: no search engine is allowed to crawl your site.
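A minimal sketch of such a file, reconstructed from the description above:

```text
User-agent: *
Disallow: /
```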
Example 2 – Access to the whole site
The following command, contrary to the one above, says that all search engine bots have access to all site URLs.
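One common way to write this, offered as a sketch, is an empty Disallow, which forbids nothing:

```text
User-agent: *
Disallow:
```

Writing Allow: / in place of the empty Disallow line is another way to express the same thing for robots that recognize Allow.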
Example 3 – Blocking access to a specific directory
The following command means that the Google bot does not have access to the blog folder or any of its subfolders. The restriction covers both Seo-teaching.com/blog and URLs like Seo-teaching.com/blog/example. In this case, all bots except Google's have access to this directory.
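A sketch consistent with the description above:

```text
User-agent: Googlebot
Disallow: /blog
```

Because the rule matches by prefix, it covers /blog, /blog/example, and also any other path that merely begins with /blog. Writing Disallow: /blog/ would limit the rule to the contents of the folder.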
Example 4 – Robot priority
As we said, search engines may have a large number of bots for specific purposes, so priority matters for them. If a robots.txt file contains several different blocks that are valid for one robot (crawler), the robot will always select the block that most explicitly refers to it and will execute the commands in that block. For example, suppose a robots.txt file contains a Googlebot block and a Googlebot-Video block as shown below. In this case, if the Google image bot (Googlebot-Image) enters your site, it will follow the first of these blocks. If the Google video bot (Googlebot-Video) enters the site, it will follow the second block and ignore the commands of the first block, because the second block takes priority for it.
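A sketch of two such blocks. The paths /private/ and /videos/drafts/ are hypothetical and chosen only to make the two blocks distinguishable.

```text
# Block 1: followed by Googlebot and by Google bots that have
# no block of their own, such as Googlebot-Image
User-agent: Googlebot
Disallow: /private/

# Block 2: followed by Googlebot-Video, which then ignores block 1
User-agent: Googlebot-Video
Disallow: /videos/drafts/
```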
Example 5 – Regular Expression
You can also use simple pattern-matching characters, loosely called regular expressions, in robots.txt commands. These patterns were not part of the original robots.txt convention, but most of the world-famous robots support them. For example, the following command states that no bot should have access to PDF files in the test directory.
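A sketch of such a command, assuming the directory is located at /test/ in the site root:

```text
User-agent: *
Disallow: /test/*.pdf$
```

Here * stands for any sequence of characters in the file name, and $ requires the address to end in .pdf.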
Example 6 – Specifying Exceptions to Access Directory Content
Now we are going to talk a little about WordPress and create a robots.txt file for WordPress in an optimized and appropriate way. The following command, which is used on many WordPress sites, means that no bot has access to the wp-admin folder, but all bots have access to the admin-ajax.php file in that folder. There is no harm in having such commands in the robots.txt file: although a robot cannot enter the WordPress admin environment anyway, coding errors are naturally possible on both Google's side and ours. Notice that in this example we used the User-agent keyword once and then entered two command lines; you can enter as many command lines as you need under one User-agent.
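A sketch of the command described above, as it commonly appears on WordPress sites:

```text
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```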
Example 7 – Common Mistakes
Another common command in the WordPress robots.txt file that is wrong from an SEO expert's point of view, and that is used on many famous Iranian sites, is the following code. Personally, I did not find such code on any of the reputable foreign sites that run on WordPress. Blocking access to wp-includes, which contains a series of important files such as jQuery, prevents the site from being rendered as it should be for the search engine. Sites like Yoast, Neil Patel, searchengineland, and dozens of other well-known WordPress SEO sites do not use commands that block access to wp-includes files. Blocking bots from such files is not good for SEO.
Note: In the above example we have introduced disallowing /wp-includes/ as a common mistake, not /wp-admin/. Including /wp-admin/ in the WordPress robots.txt file is a default practice, and we preferred to include it in every example. (Of course, disallowing /wp-admin/ is not a prescription that fits all websites; it depends entirely on the site.)
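A sketch of the kind of file being criticized here, reconstructed from the description. The second Disallow line is the mistake:

```text
User-agent: *
Disallow: /wp-admin/
# Common mistake: this hides scripts such as jQuery from the robot
Disallow: /wp-includes/
```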
Example 8 – Blocking access to addresses with a special character
The following command is used when your addresses contain a character like ?. Sometimes, due to technical problems on the site, the same article may be published under several URLs with different parameter values. In that case, temporarily enter the following command so that bots do not index addresses with question marks.
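A sketch of such a command, which matches any address that contains a question mark:

```text
User-agent: *
Disallow: /*?
```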
In the example below, we add $ to specify that an address should be disallowed only if the question mark (?) is at the end of the address. As a result, no robot is allowed to crawl addresses that end with ?. Addresses that contain ? but do not end with it are not affected.
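A sketch of the variant with $, reconstructed from the description above:

```text
User-agent: *
Disallow: /*?$
```

With this rule, an address such as /page? is blocked, while /page?id=5 (a hypothetical address used only for illustration) is still allowed, because it does not end with the question mark.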