

The robots.txt file is a world of its own for technically minded SEOs. It is largely unknown to the general public, but for us it is fundamental: something to review in our own SEO projects and to consult when learning from other people's. In this article you will learn in detail many aspects you may not have known, and you will get comfortable with something that can seem intimidating at first for being very technical and delicate. In the end, you will see that everything follows logical and predictable rules.
The robots.txt is a file we create and host on our website, whose purpose is to indicate to robots which URLs of our site they can access and which ones they cannot.
Its main purpose is to filter the requests that bots can make and so prevent them from overloading the server, which helps keep the website available and performing well at all times.
Many beginner SEOs think it is a tool for indexing or deindexing content, but nothing could be further from the truth. In fact, a URL blocked by robots.txt can still be indexed if, for example, it is linked from somewhere or appears in the sitemap. Worse still, it will be indexed without content, because Google cannot access it.
However, by shaping the robots file we can partially influence the crawling and indexing process of our URLs. So a good robots blocking strategy, combined with other implementations such as internal linking, noindex tags, or canonicals, can help.

Ours is a very simple robots.txt file. We allow all robots, block access to certain folders (they caused crawling issues due to server configuration), and indicate where to find our sitemaps.
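A file following that pattern would look something like this (the folder names and domain below are placeholders, not our actual paths):

```
User-agent: *
# Block folders that caused crawling issues (placeholder names)
Disallow: /internal-folder/
Disallow: /another-folder/

Sitemap: https://www.example.com/sitemap.xml
```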
This file has its own syntax and commands that you need to know in order to first interpret them and later modify them. The main ones are:
- User-agent: which robot the rules that follow apply to (* means all robots).
- Disallow: a path the robot must not request.
- Allow: an exception that re-opens access to a path inside a disallowed one.
- Sitemap: the absolute URL of an XML sitemap.
The syntax of robots files is very strict and any error will prevent it from working properly. Watch out for uppercase and lowercase letters — they matter.
If there is a conflict between two contradictory rules, Google will apply the least restrictive one.
If there are two rules going in the same direction but with different depth, Google will apply the most specific one.
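To make these two precedence rules concrete, here is a minimal Python sketch of Google's documented matching logic (the most specific, i.e. longest, matching rule wins; on a tie, the least restrictive, Allow, wins). It is a deliberate simplification that ignores wildcards and the `$` anchor:

```python
def is_allowed(rules, path):
    """Decide access for `path` given a list of robots rules.

    `rules` is a list of ("allow" | "disallow", prefix) tuples.
    Precedence as Google documents it: the rule with the longest
    matching prefix wins; if Allow and Disallow tie, Allow wins.
    Simplified: no wildcard (*) or end-anchor ($) support.
    """
    best_len = -1
    allowed = True  # no matching rule means allowed by default
    for directive, prefix in rules:
        if path.startswith(prefix):
            length = len(prefix)
            # Longer prefix wins; on a tie, Allow (least restrictive) wins
            if length > best_len or (length == best_len and directive == "allow"):
                best_len = length
                allowed = (directive == "allow")
    return allowed

rules = [
    ("disallow", "/folder/"),
    ("allow", "/folder/page.html"),
]
print(is_allowed(rules, "/folder/page.html"))   # more specific Allow wins -> True
print(is_allowed(rules, "/folder/other.html"))  # only the Disallow matches -> False
```

Note how `/folder/page.html` stays accessible even though `/folder/` is disallowed: the longer Allow rule takes precedence.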
This is an important dilemma. Let’s assume we have a classifieds portal or an e-commerce site. When we navigate a product listing, we can sort, filter, or paginate results, and this generates a large number of parameters in the URLs. And you may wonder what you should do about that.
Well, the key question is whether applying a parameter significantly changes the content of the page. If it doesn’t, we call them passive parameters — the most common are UTMs and ordering options. Conversely, those that alter content are considered active, and the most typical examples are pagination or language parameters.
In general, we shouldn't touch active parameters, because they create original content we want indexed. Passive parameters, however, can cause indexing and crawling issues. We can control their indexing with canonicals and noindex, but if log analysis shows that passive-parameter URLs receive many bot requests, the best option is to block them in robots.txt with a Disallow. This helps Google prioritize and crawl only the URLs that matter.
Here you have a real example of a robots file of one of our e-commerce projects defined manually. You will see that the goal here is to block access to URLs with applied filters, listings with different sorting, language parameters, or internal search URLs:
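The parameter names below are illustrative placeholders rather than the exact rules from that project, but the pattern looks like this:

```
User-agent: *
# Block faceted filters and listing sorting (parameter names are placeholders)
Disallow: /*?filter=
Disallow: /*?orderby=
# Block language parameters duplicated as query strings
Disallow: /*?lang=
# Block internal search result URLs
Disallow: /search?
Disallow: /*?s=

Sitemap: https://www.example.com/sitemap.xml
```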
For those of you using WordPress, you should keep some things in mind because it has its quirks.
Although you can write a robots file manually, the ideal approach is to generate it with an SEO plugin such as Rank Math or Yoast, which produces a base file according to your SEO settings that you can then edit. From there, these are the things you should review and adjust, reasonable defaults for most projects:
All this is because WordPress generates thousands of URLs you don’t need, and it’s better for Google not to waste time crawling them.
Here’s an example of a robots file from a simple WordPress with WooCommerce, generated by RankMath and edited by us:
Here the goal was to allow all robots to access the front-end, not the backoffice. We block specific parameters from URLs generated when adding products to the cart, attribute filters, listing sorting, and cache. However, we allow access to a back-end section needed for processing Ajax.
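A sketch matching those goals could look like the following. The parameter names are the common WooCommerce defaults, labeled here as assumptions: check the actual URLs your theme and plugins generate before copying anything.

```
User-agent: *
# Keep bots out of the backoffice...
Disallow: /wp-admin/
# ...but allow the endpoint WordPress needs for processing Ajax
Allow: /wp-admin/admin-ajax.php
# Block URLs generated when adding products to the cart
Disallow: /*?add-to-cart=
Disallow: /cart/
# Block attribute filters and listing sorting (WooCommerce defaults)
Disallow: /*?filter_
Disallow: /*?orderby=
# Block cache-related parameters (name is a placeholder)
Disallow: /*?nocache

Sitemap: https://www.example.com/sitemap_index.xml
```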
For PrestaShop, you can check its particularities in our post SEO for Prestashop. The CMS generates a file that is not very complete, and it cannot be edited directly.
As mentioned at the beginning of the article, the robots.txt file is extremely delicate, and a misplaced comma or dot can ruin a website’s SEO in very little time. So it must always be carefully controlled, with very few people having access to it, and even if you are an expert, you must always test the rules before and after applying them.
How can we test it? Historically, SEOs used Google’s robots.txt tester, but it was discontinued. So now you will have to rely on third-party tools, such as the one from TechnicalSEO. With this tool, we can check that a robots file is available and readable in general, but the most useful feature is that we can enter the URL of a page or file on our website and it will tell us whether the robots file allows access or not. It will also tell us, in case of blocking, which rule is causing it and on which line it can be edited.
An alternative, or even better as a complement, is to install the excellent Robots Exclusion Checker extension in Chrome. Very intuitively, by visiting a URL, it shows whether it’s blocked by robots or not.
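If you prefer to script these checks, Python's standard library includes `urllib.robotparser`. Here is a small sketch; note that this parser applies rules in file order and does not reproduce every nuance of Google's longest-match precedence, so treat it as a quick sanity check rather than a definitive verdict:

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt given as text; you could instead use
# parser.set_url("https://www.example.com/robots.txt") + parser.read()
# to fetch the live file from your site.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("*", "https://www.example.com/private/report.pdf"))  # False
print(parser.can_fetch("*", "https://www.example.com/products/"))           # True
```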
We finish with a list of basic things that a robots file should include, in checklist format:
Would you add anything else? I’d say these are the basics applicable to all websites; more specific aspects will depend on each case.
