
    Mastering the Use of Robots.txt file: A Detailed Guide

By Raj Maliyala | July 18, 2023

A robots.txt file is a simple text file that website owners can use to tell search engine bots (crawlers) which pages or sections of their site should not be crawled. It is important to note that the robots.txt file is not a guarantee that a website will not be crawled or indexed by search engines, but it is a strong indication that certain pages should not be accessed.

    How robots.txt works

The way robots.txt works is simple: when web crawlers, or "bots", belonging to search engines such as Google, Bing, and Yahoo visit a website, they look for a file named "robots.txt" at the root of the site. If the file is present, the bots read the instructions it contains and follow them while crawling the website. If the file is not present, the bots assume they are free to crawl the entire website.

Let's say your domain name is yourdomain.com; the robots.txt URL would then be:

    http://yourdomain.com/robots.txt

    The instructions provided in the robots.txt file are in the form of a set of “User-agent” and “Disallow” or “Allow” directives.
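For instance, a minimal robots.txt combining these directives might look like the sketch below (the paths are placeholders; each directive is explained in the sections that follow):

User-agent: *
Disallow: /private/
Allow: /private/public-page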

    User-agent directive in robots.txt file

    User-agent is a useful feature when it comes to managing access to your website by various web crawlers. With hundreds of crawlers potentially trying to access your site, it can be beneficial to set specific boundaries for each of them based on their intentions. User-agent allows you to do this by providing a way to identify the specific crawler and apply different instructions for each.

You can also use User-agent to target specific web crawlers and give each of them different instructions, depending on the crawler's capabilities and behavior. Some crawlers have their own names and formats, for example Googlebot-Image, Googlebot-News, and Googlebot-Video. You can find the user-agent name of each crawler on its official website.

    Also, you can use the wildcard * to apply the instructions to all web crawlers that visit your site.
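As a rough sketch, a file that gives different instructions to different crawlers might look like this (Googlebot-Image and Bingbot are real crawler user-agent names; the paths are only placeholders):

User-agent: Googlebot-Image
Disallow: /private-images/

User-agent: Bingbot
Disallow: /drafts/

User-agent: *
Disallow: /admin/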

    Disallow: directive

    The “Disallow” directive is used to specify which pages or sections of a website should not be accessed by web crawlers.

    User-agent: *
    Disallow: /admin

    The above example will block all URLs whose path starts with “/admin”:

    http://yourdomain.com/admin
    http://yourdomain.com/admin?test=0
    http://yourdomain.com/admin/somethings
    http://yourdomain.com/admin-example-page-keep-them-out-of-search-results

    Allow: directive

    The Allow directive is used to specify which parts of a website should be accessible to web crawlers or search engine bots.

    User-agent: *
    Allow: /some-directory/important-page
    Disallow: /some-directory/

The above example will block the following URLs:

    http://yourdomain.com/some-directory/
    http://yourdomain.com/some-directory/everything-blocked-but

    But it will not block any of the following:

    http://yourdomain.com/some-directory/important-page
http://yourdomain.com/some-directory/important-page-its-something
    http://yourdomain.com/some-directory/important-page/anypage-here

    Sitemap directive

    The Sitemap directive can be used to specify the location of a sitemap for a website. A sitemap is an XML file that lists all of the URLs on a website and provides information about each URL, such as when it was last updated. This can be useful for search engines when they are crawling a website, as it allows them to find all of the URLs on the site more easily.

    Example:

    User-agent: *
    Sitemap: http://yourdomain.com/sitemap.xml

    Crawl-delay: directive

    The Crawl-delay directive can be used to specify the number of seconds that a web crawler should wait between requests to a website. This can be useful for preventing a website from being overwhelmed by too many requests from a single crawler.

Note: This directive is not part of the robots.txt standard, and not all web crawlers support it. In particular, Google does not support it; instead, it uses other methods to control the crawling rate, such as the crawl rate setting in Google Search Console.

    Example:

    User-agent: *
    Crawl-delay: 2

    Wildcards “*” asterisk in robots.txt

Wildcards can be used to specify a pattern of URLs that should be blocked or allowed for web crawlers. They can be used in both the Disallow and Allow directives. For example:

    Disallow: /names/*/details

Below are the URLs that would be blocked by the above directive:

    http://yourdomain.com/names/ravi/details
    http://yourdomain.com/names/rohit/account/details
    http://yourdomain.com/names/aarush/details-about-something
    http://yourdomain.com/names/varma/search?q=/details

    End-of-string operator “$” (Dollar sign)

    The dollar sign $ can be used to indicate the end of a URL. This can be useful in cases where you want to block a specific file type or extension.

    User-agent: *
    Disallow: /junk-page$
    Disallow: /*.pdf$

In the above example, any URL whose path ends with ".pdf", as well as the exact path "/junk-page", will be blocked.

    But it will not block any of the following:

    http://yourdomain.com/junk-page-and-how-to-avoid-creating-them
    http://yourdomain.com/junk-page/
    http://yourdomain.com/junk-page?a=b

    What if you want to block all URLs that contain a dollar sign?

    http://yourdomain.com/store?price=$50

    The following will not work:

    Disallow: /*$

This directive will actually block everything on your website: a trailing dollar sign is read as the end-of-URL anchor, so "/*$" matches every URL. To make the dollar sign a literal character, place an extra asterisk after it:

    Disallow: /*$*

    Common Robots.txt Configuration Mistakes

    Not placing the robots.txt file in the correct location

Placing the robots.txt file anywhere other than the site root will result in it being ignored by search engines. If you do not have access to the site root, you can block pages using alternative methods, such as the robots meta tag or the X-Robots-Tag HTTP header set in your server configuration (for example, an .htaccess file).

    Blocking subdomains in robots.txt

Trying to target specific subdomains from a single robots.txt file is a common mistake. A robots.txt file applies only to the host it is served from and will not affect any subdomains. To block pages on a subdomain, you need to create a separate robots.txt file for that subdomain and place it in the root directory of that subdomain.
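For example, assuming a hypothetical blog subdomain, each host needs its own file:

http://yourdomain.com/robots.txt (applies only to yourdomain.com)
http://blog.yourdomain.com/robots.txt (applies only to blog.yourdomain.com)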

    Case Consistency in Robots.txt

The rules in robots.txt are case-sensitive in their values: "/Admin" and "/admin" are treated as different paths, so a Disallow rule for one does not cover the other. The directive names themselves (such as "User-agent" and "Disallow") are handled case-insensitively by major crawlers, but it is good practice to keep their capitalization consistent.
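As an illustration (the path is a placeholder), the following rule blocks the capitalized path but not the lowercase one:

User-agent: *
Disallow: /Admin

http://yourdomain.com/Admin (blocked)
http://yourdomain.com/admin (not blocked)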

    Forgetting the user-agent line

    The “User-agent” line tells search engines which crawlers the rules in the file apply to. Without this line, search engines will not know which rules to follow, and may ignore the entire file.

    How to test robots.txt file?

Testing the robots.txt file can be done with web-based tools such as Google Search Console and Bing Webmaster Tools, which allow you to enter the URL you want to verify and see whether it is allowed or disallowed. If you have technical knowledge, you can also use Google's open-source robots.txt library to test the file locally on your computer.
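If you prefer to test rules from code, here is a minimal sketch using Python's standard-library parser, urllib.robotparser (the rules and URLs are placeholders, and note that this basic parser does not understand the * and $ pattern extensions):

from urllib.robotparser import RobotFileParser

# Placeholder rules; alternatively, fetch the live file with
# rp.set_url("http://yourdomain.com/robots.txt") followed by rp.read()
rules = """
User-agent: *
Disallow: /admin
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# can_fetch(user_agent, url) returns True if the URL may be crawled
print(rp.can_fetch("*", "http://yourdomain.com/admin"))    # False
print(rp.can_fetch("*", "http://yourdomain.com/contact"))  # True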

    Meta Robots Tag

Robots.txt is not the sole method of communicating with web crawlers. Alternative methods include the meta robots tag and the X-Robots-Tag HTTP header, which let you specify crawling and indexing instructions for specific pages or file types.
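For reference, a meta robots tag is placed in a page's <head>, and the same instructions can be sent as an HTTP response header; for example, to keep a page out of the index and stop link-following:

<meta name="robots" content="noindex, nofollow">

X-Robots-Tag: noindex, nofollow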

    Learn more about Meta Robots Tag
