Robots.txt File: what it is and how to configure it
09/04/2024
Bruno Díaz, Marketing Manager

For technically minded SEOs, the robots.txt file is a world of its own. It is largely unknown to the general public, but for us it is fundamental: something to review in our own SEO projects and to consult when learning from others. In this article you will learn many aspects you may not have known, and you will get comfortable with something that can seem intimidating at first for being very technical and delicate. You will see that, in the end, it all follows logical and predictable rules.

What it is and what it's for

robots.txt is a file we create and host on our website whose purpose is to tell robots which URLs of our site they may access and which they may not.

Its main advantage is that, by filtering the requests bots can make, it prevents requests that would overload the server, helping keep the website available and performing optimally.
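As a minimal illustration (the domain and paths here are hypothetical), a robots.txt file looks like this:

```txt
# Applies to every crawler
User-agent: *
# Don't crawl the internal search results
Disallow: /search/
# Everything else may be crawled
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```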

What it's NOT for

Many beginner SEOs often think it is a tool to index or deindex content, but nothing could be further from the truth. In fact, a URL blocked by robots can still be indexed if, for example, it is linked somewhere or appears in the sitemap. And the worst part is that it will be indexed without content, because Google cannot access it.

However, by shaping the robots file we can partially influence the crawling and indexing process of our URLs. So a good robots blocking strategy, combined with other implementations such as internal linking, noindex tags, or canonicals, can help.

Things you should know about this file

  • It can be used to block the crawling of image, video, or PDF files, among others
  • Google will follow our robots file, but it is not mandatory for all other bots on planet Earth
  • If we want to index a URL (or deindex it), we should not block it
  • There must be one robots file per subdomain and protocol
  • The robots file is public and can always be viewed at domain/robots.txt. Not only will Google see it, but also your competitors. This leads some people to hide things, while others are 100% transparent. You can check ours at latevaweb.com/robots.txt and draw your own conclusions
[Screenshot: the robots.txt file at latevaweb.com]

Ours is a very simple robots file. We allow all robots, block access to certain folders (they caused crawling issues due to server configurations), and indicate where to find our sitemaps.

Syntax in robots.txt

This file has its own syntax and commands that you need to know in order to first interpret them and later modify them. They are:

  • user-agent: indicates which robot the rule applies to. If we use *, it applies to all; if it is specific, we can define the robot(s) we are addressing
  • Allow: by here, go ahead
  • Disallow: access forbidden
  • Sitemap: here, dear Google, you will find all the URLs I want you to index. You must include the full URLs of the sitemaps, including protocol, www if applicable, host and slug
  • #: comments for humans, not for robots
  • $: marks the end of the URL; useful for matching file extensions (e.g. Disallow: /*.pdf$)
  • *: a wildcard that matches any sequence of characters, including an empty one

The syntax of robots files is very strict and any error will prevent it from working properly. Watch out for uppercase and lowercase letters — they matter.

If there is a conflict between two contradictory rules, Google will apply the least restrictive one.

If there are two rules going in the same direction but with different depth, Google will apply the most specific one.
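To make those two precedence rules concrete, consider this hypothetical fragment:

```txt
User-agent: *
Disallow: /blog/
Allow: /blog/robots-txt-guide/

# /blog/any-other-post/ is blocked by "Disallow: /blog/".
# /blog/robots-txt-guide/ is crawlable: both rules match it, but the
# Allow rule is longer (more specific), so Google applies that one.
```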

Practical application cases

What to do with parameters

This is an important dilemma. Let’s assume we have a classifieds portal or an e-commerce site. When we navigate a product listing, we can sort, filter, or paginate results, and this generates a large number of parameters in the URLs. And you may wonder what you should do about that.

Well, the key question is whether applying a parameter significantly changes the content of the page. If it doesn’t, we call them passive parameters — the most common are UTMs and ordering options. Conversely, those that alter content are considered active, and the most typical examples are pagination or language parameters.

In general, for active parameters we shouldn't do anything, because they create original content we want indexed. Passive parameters, however, can cause indexing and crawling issues. We can control indexing with canonicals and noindex, but if log analysis shows that passive-parameter URLs receive many bot requests, it's best to block them in robots with a Disallow. This helps Google prioritize and crawl only the URLs that matter.
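A sketch of what such blocking might look like (the parameter names are hypothetical; adapt them to what your log analysis actually shows):

```txt
User-agent: *
# Tracking parameters (passive): same content, no crawl value
Disallow: /*?*utm_
# Sorting options (passive): same listing, different order
Disallow: /*?*orderby=
# Internal search results
Disallow: /*?s=
```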

Here is a real example of a manually defined robots file from one of our e-commerce projects. You will see that the goal here is to block access to URLs with applied filters, listings with different sorting, language parameters, or internal search URLs:

[Screenshot: robots.txt with Disallow rules for filters, sorting, language parameters and internal search]

Robots for WordPress and other CMS

For those of you using WordPress, you should keep some things in mind because it has its quirks.

Although you can write a robots file manually, the ideal approach is to generate it with an SEO plugin such as Rank Math or Yoast, which will produce a base document from your SEO settings that you can then edit. From there, these are the things you should review and adjust, reasonable defaults for most projects:

  • Allow CSS, JS and Ajax files, at least in the front-end
  • Disallow wp-admin (backoffice), plugin files and theme files
  • Allow the main blog feed, but block the per-post feed URLs
  • Disallow tag paginations
  • Disallow internal search queries
  • Disallow parameterized URLs, either all of them or only specific types
  • Allow resources such as PDFs, images or videos

All this is because WordPress generates thousands of URLs you don’t need, and it’s better for Google not to waste time crawling them.
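Putting the points above together, a base WordPress robots file might look like this (a sketch rather than a drop-in file; paths such as the sitemap URL are assumptions to adapt to your site):

```txt
User-agent: *
# Back office stays out...
Disallow: /wp-admin/
# ...except the endpoint WordPress needs for front-end Ajax
Allow: /wp-admin/admin-ajax.php
# Plugin and theme files
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
# Internal search
Disallow: /?s=
Disallow: /search/
# Per-post feeds (the main /feed/ stays crawlable)
Disallow: /*/feed/

Sitemap: https://www.example.com/sitemap_index.xml
```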

Here’s an example of a robots file from a simple WordPress site with WooCommerce, generated by Rank Math and edited by us:

[Screenshot: robots.txt generated by Rank Math for a WooCommerce site]

Here the goal was to let all robots access the front end but not the back office. We block the parameterized URLs generated when adding products to the cart, attribute filters, listing sorting, and cache, while allowing access to the back-end path needed to process Ajax requests.

For PrestaShop, you can check its particularities in our post SEO for Prestashop. The CMS does not provide a very complete file, and it cannot be edited.

Testing robots.txt

As mentioned at the beginning of the article, the robots.txt file is extremely delicate, and a misplaced comma or dot can ruin a website’s SEO in very little time. So it must always be carefully controlled, with very few people having access to it, and even if you are an expert, you must always test the rules before and after applying them.

How can we test it? Historically, SEOs used Google’s robots.txt tester, but it was discontinued, so you will now have to rely on third-party tools such as the one from TechnicalSEO. With it, we can check that a robots file is available and readable in general, but the most useful feature is that we can enter the URL of a page or file on our website and it will tell us whether the robots file allows access. If the URL is blocked, it also tells us which rule causes the block and on which line of the file it appears.

An alternative, or even better as a complement, is to install the excellent Robots Exclusion Checker extension in Chrome. Very intuitively, by visiting a URL, it shows whether it’s blocked by robots or not.
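As a further complement, you can sanity-check simple rules locally with Python's standard-library urllib.robotparser (the rule set and URLs below are hypothetical):

```python
from urllib import robotparser

# A small rule set like the ones discussed above
rules = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /search/
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# Ask whether a generic crawler ("*") may fetch each URL
print(parser.can_fetch("*", "https://www.example.com/blog/robots-guide/"))   # True
print(parser.can_fetch("*", "https://www.example.com/wp-admin/options.php")) # False
```

Note that urllib.robotparser implements the original robots exclusion standard, not Google's wildcard extensions, so rules using * or $ in paths should still be verified with a dedicated tool.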

Checklist for a valid robots.txt

We finish with a list of basic things that a robots file should include, in checklist format:

  1. It is located in the root of the website
  2. Returns a 200 server response
  3. Is in UTF-8 format
  4. Is under 500 KB
  5. Uses valid syntax
  6. Passes validation in a robots.txt testing tool
  7. Does not block CSS or JS files
  8. Indicates where to find the Sitemaps
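Several of these checks can be scripted. A minimal sketch in Python covering points 3, 4 and 5 (the size limit and the line-format check are deliberate simplifications, not a full validator):

```python
def robots_problems(raw: bytes) -> list[str]:
    """Return a list of problems found in a robots.txt payload (empty list = OK)."""
    problems = []
    if len(raw) > 500 * 1024:
        problems.append("file is over 500 KB; Google ignores the excess")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return problems + ["file is not valid UTF-8"]
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.split("#", 1)[0].strip()  # drop comments
        if line and ":" not in line:
            problems.append(f"line {number}: missing 'directive: value' separator")
    return problems

print(robots_problems(b"User-agent: *\nDisallow: /wp-admin/\n"))  # []
print(robots_problems(b"User-agent *\n"))  # one syntax problem reported
```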

Would you add anything else? I’d say these are the basics applicable to all websites; more specific aspects will depend on each case.

About the author
Bruno Díaz — Marketing Manager
Professional with a long career as a communication and digital marketing consultant, specializing in SEO, SEM and web projects. As Marketing Manager of the agency, I coordinate a great team of digital marketing technicians of whom I am very proud.
