What is it? The robots.txt file is a plain text file that must meet the robots exclusion standard.
You can create the file with any plain-text editor, such as Windows Notepad, and save it with the name robots.txt.
This file consists of one or more rules, each of which blocks or allows a specific crawler access to a specific file path on a website.
The robots.txt file is used to manage crawler traffic to your site.
It helps prevent the requests your website receives from overloading it. With a well-configured robots.txt file, you can avoid the situation where several of these indexers visit at the same time and the speed of your website, or even the Cloud itself, is adversely affected.
What do we block? The crawler, also known as a tracker, spider, robot or bot, is a program that analyzes the documents on a website. Search engines use very powerful crawlers that navigate and analyze websites, building a database from the information collected.
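To see how a crawler interprets these rules in practice, here is a minimal sketch using Python's standard urllib.robotparser module, which implements the robots exclusion standard on the client side (the rules and URLs are made-up examples):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt given as a list of lines. A real crawler would
# instead fetch it with rp.set_url("https://example.com/robots.txt")
# followed by rp.read().
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A well-behaved crawler checks each URL before fetching it.
print(rp.can_fetch("*", "https://example.com/public/page.html"))   # True
print(rp.can_fetch("*", "https://example.com/private/data.html"))  # False
```

This is exactly the check polite bots perform before requesting a page from your site.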
What elements make up the robots.txt? When generating the robots.txt file you have to take into account the specific commands and rules.
Commands
User-agent: This command specifies which search engine robots/spiders are allowed to crawl our website.
The syntax of this command is: User-agent: (robot name)
(In each rule there must be at least one Disallow or Allow entry)
Disallow: Indicates a directory or page of the root domain that you do not want the user-agent to crawl.
Allow: Indicates the directories or pages of the root domain that the user-agent specified in the group may crawl. It is used to override a Disallow directive and allow crawling of a specific subdirectory or page inside a blocked directory.
One option is to use an asterisk, which means that you allow all search engines to crawl the site:
User-agent: *

Disallow
The following command instructs search engines not to crawl, access or index a specific part of the site, such as the wp-admin folder.
Disallow: /wp-admin/

Allow
With the following command you indicate the opposite: you tell search engines what they may crawl. For example, you can allow a single file from a folder that is otherwise blocked.
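A common pattern of this kind on WordPress sites (admin-ajax.php is the usual example; adapt the paths to your own site) blocks the wp-admin folder but re-allows one file inside it:

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```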
Other elements to consider.
When adding elements to block, place the forward slash (/) at the beginning and at the end of a directory path. The code can be simplified with two wildcard characters:
*: The asterisk is used to block any sequence of characters.
$: The dollar symbol is used to block URLs with a specific ending.
Examples of commands used in robots.txt.
Exclude all robots from the server:
User-agent: *
Disallow: /

Allow all robots access to scan everything:
User-agent: *
Disallow:

Exclude only one bot, in this case BadBot:
User-agent: BadBot
Disallow: /

Allow only one bot, in this case Googlebot:
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

Exclude a directory for all bots:
User-agent: *
Disallow: /directory-name/

Exclude a specific page:
User-agent: *
Disallow: /page-url.html

Block all images on the site:
User-agent: Googlebot-Image
Disallow: /

Block a single image for one bot only:
User-agent: Googlebot-Image
Disallow: /images/blocked.jpeg

Exclude a specific file type:
User-agent: Googlebot
Disallow: /*.jpeg$

Exclude URLs with a specific ending:
User-agent: *
Disallow: /*.pdf$
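Rules like these can be sanity-checked before deployment. Here is a minimal sketch with Python's standard urllib.robotparser, applied to the "allow only one bot" example above (using the Googlebot token):

```python
from urllib.robotparser import RobotFileParser

# Rules from the "allow only one bot" example: Googlebot may crawl
# everything, every other bot is excluded.
rp = RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
])

print(rp.can_fetch("Googlebot", "https://example.com/page.html"))  # True
print(rp.can_fetch("BadBot", "https://example.com/page.html"))     # False
```

Note that the standard-library parser does not support the * and $ wildcards inside paths, so the wildcard examples should be tested against the crawler's own documentation instead.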
These are examples of common uses; pick the one that suits your needs or create your own.
Once the robots.txt file has been created, upload it via FTP into the directory /tudominio/datos/web/.
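The upload step can also be scripted. Below is a hedged sketch using Python's standard ftplib; the host, credentials, and remote directory are placeholders you must replace with your own hosting details:

```python
import io
from ftplib import FTP

def build_robots(rules):
    """Join robots.txt directive lines into file content, one per line."""
    return "\n".join(rules) + "\n"

def upload_robots(host, user, password, remote_dir, content):
    """Upload the content as robots.txt into remote_dir over FTP."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        ftp.cwd(remote_dir)
        ftp.storbinary("STOR robots.txt", io.BytesIO(content.encode("utf-8")))

content = build_robots(["User-agent: *", "Disallow: /wp-admin/"])
# Placeholder credentials -- fill in your own before running:
# upload_robots("ftp.example.com", "user", "password",
#               "/tudominio/datos/web/", content)
```

Afterwards, confirm the file is reachable at https://yourdomain.com/robots.txt, since crawlers only look for it at the root of the domain.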