Robots.txt
What is a Robots.txt File?
A robots.txt file is used to instruct search engine crawlers about which URLs they should avoid visiting or crawling on a website. This can be used to prevent bots from accessing low-quality pages or pages that are essential for user experience but unnecessary for search engine crawlers to explore.
A website's robots.txt file must be located at the root of the host, such as domain.com/robots.txt. The file applies only to the host it is served from and won't affect the crawling behaviour of any other version of the site, such as domain.co.uk, which would require its own file at domain.co.uk/robots.txt.
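As a rough illustration of the point above, the following Python sketch derives the robots.txt location that governs a given page. The function name robots_txt_url is hypothetical, and the sketch assumes the file always sits at /robots.txt on the same scheme and host as the page.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that would govern page_url (illustrative helper)."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host (netloc includes any port);
    # the path, query string and fragment of the page are dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://domain.com/cats/siamese?page=2"))
# https://domain.com/robots.txt
print(robots_txt_url("https://domain.co.uk/cats"))
# https://domain.co.uk/robots.txt
```

Each host resolves to its own file, which is why rules written for domain.com have no effect on domain.co.uk.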
Google recommends using robots.txt to combat issues related to crawl efficiency or server resources, such as preventing Googlebot from spending excessive time crawling parts of the site that hold little value.
The Syntax of a Robots.txt File
When creating a robots.txt file, there are several key components you should include:
1. Specify The User Agent: Begin your robots.txt file by specifying the user agent to which the rules should apply. User agents represent different search engine bots, and you can target specific ones with your rules.
Examples of user agents include:
• User-agent: * (applies to all bots unless overridden by more specific rules)
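• User-agent: Googlebot (Google's main search crawler; some other Google crawlers, such as Googlebot-Image, also follow these rules unless they are given their own group)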
• User-agent: Bingbot (specific to Bing's crawler)
• User-agent: Yandex (specific to Yandex's crawler)
• User-agent: Baiduspider (specific to Baidu's crawler)
• User-agent: Twitterbot (specific to Twitter's crawler)
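To see how user agent groups are selected, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the /private/ and /drafts/ paths, and the helper name allowed are made up for illustration. Note that this module implements the original robots.txt conventions (plain prefix rules, first matching rule wins), so it only approximates how a particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a catch-all group and a Googlebot group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Ask the parser whether agent may fetch path on domain.com."""
    return parser.can_fetch(agent, "https://domain.com" + path)


print(allowed("Bingbot", "/private/page"))    # False: falls back to the * group
print(allowed("Googlebot", "/private/page"))  # True: only the Googlebot group applies
print(allowed("Googlebot", "/drafts/post"))   # False: blocked by the Googlebot group
```

The second result shows the "unless overridden" behaviour: once a bot has its own group, the rules in the * group no longer apply to it.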
2. Pattern Match URLs: To improve efficiency and avoid the need to list out every URL, you can use wildcards to pattern match URLs. Robots.txt does not support full regular expressions (regex); only two special characters, * and $, are available to refine URL paths:
• * represents any number of any characters and can be used anywhere in a URL path. At the end of a rule it is redundant, because a rule already matches every URL that begins with the given path.
• $ signifies the end of a URL string.
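The two symbols above can be sketched in Python by translating a robots.txt pattern into a regular expression. This is an illustrative approximation, not any crawler's actual implementation; the function name rule_matches and the example paths are assumptions.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Check a robots.txt path pattern against a URL path (illustrative only)."""
    # A trailing $ anchors the rule to the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces; each * becomes "any run of characters".
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # Without $, a rule is a prefix match; with $, it must consume the whole path.
    matcher = re.fullmatch if anchored else re.match
    return matcher(regex, path) is not None


print(rule_matches("/*.pdf$", "/files/report.pdf"))             # True
print(rule_matches("/*.pdf$", "/files/report.pdf?download=1"))  # False: $ requires the URL to end in .pdf
print(rule_matches("/search*sort=", "/search?q=cats&sort=price"))  # True
```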
3. Directive Rules: The paths in directive rules are case-sensitive and are matched against URL paths only, excluding the protocol or domain. A slash at the start of a directive matches the beginning of the URL path. For instance, Disallow: /cats would apply to domain.com/cats, as well as to any other path that begins with /cats, such as domain.com/cats/siamese.
• A directive match must start with either a / or *; otherwise, it won't match anything. For example, Disallow: dogs wouldn't match anything on the site.
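The following Python sketch illustrates these points for simple rules without wildcards. The helper name plain_rule_applies and the example URLs are hypothetical, and the sketch ignores query strings for brevity.

```python
from urllib.parse import urlsplit


def plain_rule_applies(rule: str, url: str) -> bool:
    """Prefix-match a wildcard-free rule against the path of url (illustrative only)."""
    if not rule.startswith("/"):
        # e.g. "dogs": without a leading slash the rule can never match.
        return False
    # Only the path is compared; protocol and domain are discarded.
    path = urlsplit(url).path or "/"
    # str.startswith is case-sensitive, mirroring how rule paths are compared.
    return path.startswith(rule)


print(plain_rule_applies("/cats", "https://domain.com/cats"))          # True
print(plain_rule_applies("/cats", "https://domain.com/cats/siamese"))  # True
print(plain_rule_applies("/cats", "https://domain.com/Cats"))          # False: case differs
print(plain_rule_applies("dogs", "https://domain.com/dogs"))           # False: no leading slash
```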
4. Prioritisation and Allow Directives: If you wish to allow search engines to crawl a particular type of URL that you've disallowed, you can use an "allow" directive. However, remember that if a URL matches both an "allow" rule and a "disallow" rule, the longer matching rule takes precedence.
For example, if your robots.txt file looks like this:
Disallow: */dog-breeds/*
Allow: /husky
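The "disallow" rule will be applied to a URL such as /dog-breeds/husky, and the page will not be crawled: Allow: /husky only matches paths that begin with /husky, and at 6 characters it is also shorter than the 14-character "disallow" rule.
However, you can use the * character to build an "allow" rule that both matches the page and is longer than the "disallow" rule. For example: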
Disallow: */dog-breeds/*
Allow: /*********husky
Additionally, if two matching rules are of the same length, the less restrictive "allow" rule will be followed; this is the behaviour described in Google's documentation and in RFC 9309.
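The precedence logic can be sketched in Python as follows. This is a simplified model, not a crawler's real code: the names rule_matches and is_allowed are hypothetical, rule length is assumed to be the literal character count of the pattern, and ties are assumed to resolve to the less restrictive allow rule, as Google's documentation and RFC 9309 describe.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Match a robots.txt pattern (with * and $) against a URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # A run of consecutive * matches the same URLs as a single *.
    pattern = re.sub(r"\*+", "*", pattern)
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    matcher = re.fullmatch if anchored else re.match
    return matcher(regex, path) is not None


def is_allowed(rules, path: str) -> bool:
    """rules is a list of (directive, pattern) pairs, e.g. ("disallow", "/cats")."""
    # Record (pattern length, is_allow) for every rule that matches the path.
    matching = [(len(p), d == "allow") for d, p in rules if rule_matches(p, path)]
    if not matching:
        return True  # nothing matches, so crawling is permitted
    # The longest pattern wins; on a tie, True sorts above False, so allow wins.
    return max(matching)[1]


short_allow = [("disallow", "*/dog-breeds/*"), ("allow", "/husky")]
long_allow = [("disallow", "*/dog-breeds/*"), ("allow", "/*********husky")]

print(is_allowed(short_allow, "/dog-breeds/husky"))  # False: only the disallow rule matches
print(is_allowed(long_allow, "/dog-breeds/husky"))   # True: the allow pattern is longer
```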