Robots.txt File

Updated on July 4, 2018

WordPress offers many SEO benefits out of the box, but some tweaking is still needed to get a well-optimized site. In this article we’ll talk about the robots.txt file and how to hide the archive pages for custom post types from search engines.


What Is the Robots.txt File?

Robots.txt is the file defined by the Robots Exclusion Protocol (REP), a web standard that regulates how robots and search engines crawl a site.

The robots.txt file tells search engine robots which parts of your site to crawl and which to avoid. When a search engine's bot or spider comes to your site, it reads the robots.txt file first and follows its directions when deciding which pages to crawl.

If you are using WordPress, you will find the robots.txt file in the root of your WordPress installation. It is easy to access: simply type the domain name and add “robots.txt” at the end of the URL, for example: http://yourdomain.com/robots.txt


Structure of the Robots.txt File

Robots.txt is a plain text file. If you don’t have this file on your website, open any text editor you like (for example, Notepad), add one or more records, and save the file as “robots.txt”. Every record carries instructions for search engines. Usually, an unmodified WordPress robots.txt file looks like this:

User-agent: *
Disallow: /wp-admin/


The * (asterisk) after User-agent means the record applies to all search engine robots. The Disallow line asks them not to crawl the listed path, here the wp-admin directory, which holds the administration area and has no value in search results. Keep in mind that robots.txt only instructs well-behaved robots; it does not protect sensitive files.

Another example:

User-agent: googlebot
Disallow: /cgi-bin

This record applies only to Googlebot. Googlebot may crawl every page of your site except the cgi-bin folder in the root directory.

With the Disallow directive, you can keep any search bot or spider away from any page or folder. Many sites disallow their archive folders or pages to avoid duplicate content.


Editing the Robots.txt File

You can edit your WordPress robots.txt file either by logging into your server via FTP or by using a plugin such as Robots TXT to edit it from the WordPress dashboard. Along with your rules, you should add your sitemap URL to the file. This helps search engine bots find your sitemap and therefore index your pages faster.

Here is a sample robots.txt file. In the Sitemap line, replace the URL with the sitemap URL of your own site:

Sitemap: https://www.your_domain.com/sitemap.xml

User-agent: *
# disallow all files in these directories
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /archives/
Disallow: /comments/feed/

User-agent: Mediapartners-Google*
Allow: /

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Mobile
Allow: /
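If your site has no physical robots.txt file, WordPress serves a generated one, and its contents can be adjusted from code through the robots_txt filter. The following is a minimal sketch, assuming it is placed in your theme's functions.php; the function name example_append_robots_rules and the rules it adds are illustrative only, so adapt them to your own site.

```php
<?php
// Hypothetical callback: appends extra rules to the robots.txt that WordPress generates.
// $output is the text generated so far; $public reflects the "Search Engine Visibility" setting.
function example_append_robots_rules( $output, $public ) {
    if ( ! $public ) {
        // The site already discourages search engines, so leave the output untouched.
        return $output;
    }

    // Start a separate group so the rules stay valid whatever precedes them.
    $output .= "\nUser-agent: *\n";
    $output .= "Disallow: /archives/\n";
    $output .= "Disallow: /comments/feed/\n";

    return $output;
}

// Register the filter only when running inside WordPress.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'robots_txt', 'example_append_robots_rules', 10, 2 );
}
```

This filter has no effect when a real robots.txt file exists in the site root, because the web server then delivers that file directly.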


Stopping Search Engines from Indexing Specific Posts and Pages

Search engine spiders crawl your whole website and cache its pages for their index. In general, most website owners are happy for search engines to crawl and index any page they want; however, there are situations where you would not want pages to be indexed.

For example, if you are developing a new website, it is usually best if you block search engines from indexing your website so that your incomplete website is not listed on search engines. This can be done easily through the Settings > Reading page in the WordPress Dashboard.

All you have to do is scroll down to the Search Engine Visibility section and enable the option entitled “Discourage search engines from indexing this site”.


Robots Meta Tag Overview

Google advises website owners to block URLs using the robots meta tag. The robots meta tag follows this format:

<meta name="value" content="value">

The robots meta tag should be placed within the <head> section of your WordPress theme header, i.e. between <head> and </head>. There are a few different values available for the name and content attributes. The values that Google advises using to block access to a page are robots and noindex:

<meta name="robots" content="noindex">

Here, robots addresses all search engines, while noindex tells a search engine not to show the page in its index.

If you want to block content from a specific search engine, you need to replace the value of robots with the name of the search engine spider. Some common search engine spiders are:

Name              Description

googlebot         Google
googlebot-news    Google News
googlebot-image   Google Images
bingbot           Bing
teoma             Ask

For example, to block only Google News:

<meta name="googlebot-news" content="noindex">

To block several search engines, add one meta tag per spider.

<meta name="googlebot-news" content="noindex">
<meta name="bingbot" content="noindex">

So far, you have only seen the noindex value being used; however, there are many values that can be used with the content attribute. These values are normally referred to as directives.

For reference, here is a list of the most common directives that are available to you:

Name                                     Description

all                                      No restrictions on indexing or linking.
index                                    Show the page in search results and show a cached link in search results.
noindex                                  Do not show the page in search results and do not show a cached link in search results.
follow                                   Follow links on the page.
nofollow                                 Do not follow links on the page.
none                                     The same as using “noindex, nofollow”.
noarchive                                Do not show a cached link in search results.
nocache                                  Do not show a cached link in search results.
nosnippet                                Do not show a snippet for the page in search results.
noodp                                    Do not use the meta data from the Open Directory Project for titles or snippets for this page.
noydir                                   Do not use the meta data from the Yahoo! Directory for titles or snippets for this page.
notranslate                              Do not offer translation for the page in search results.
noimageindex                             Do not index images from this page.
unavailable_after: [RFC-850 date/time]   Do not show the page in search results after a date and time specified in the RFC 850 format.
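Several directives can be combined in one tag by separating them with commas. As a rough illustration, the standalone script below builds a robots meta tag from a spider name and a list of directives; the helper name example_robots_meta is hypothetical and not part of WordPress.

```php
<?php
// Hypothetical helper: builds a robots meta tag for one spider and a list of directives.
function example_robots_meta( $spider, array $directives ) {
    // Join the directives with commas, e.g. "noindex,nofollow".
    $content = implode( ',', array_map( 'trim', $directives ) );

    // Escape both values so they are safe inside HTML attributes.
    return sprintf(
        '<meta name="%s" content="%s">',
        htmlspecialchars( $spider, ENT_QUOTES ),
        htmlspecialchars( $content, ENT_QUOTES )
    );
}

// Prints: <meta name="robots" content="noindex,nofollow">
echo example_robots_meta( 'robots', array( 'noindex', 'nofollow' ) ), "\n";

// Prints: <meta name="googlebot-news" content="nosnippet">
echo example_robots_meta( 'googlebot-news', array( 'nosnippet' ) ), "\n";
```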

Adding the Robots Meta Tag to the Theme Header

In order to block a specific post or page, you need to know its post ID. The easiest way to find the ID of a page is to edit it. When you edit any type of page on WordPress, you will see a URL such as https://www.yourwebsite.com/wp-admin/post.php?post=15&action=edit in your browser address bar. The number denoted in the URL is the post ID. It refers to the row in the wp_posts database table.

Once you know the ID of the post or page you want to block, you can block search engines from indexing it by adding the code below to the head section of your theme’s header.php template, between <head> and </head>. You can place the code anywhere within the head section; however, we recommend placing it directly below or above your other meta tags, as that makes it easier to find later.

<?php if ($post->ID == X) { echo '<meta name="robots" content="noindex,nofollow">'; } ?>

In the code above, X denotes the ID of the post you want to block. Therefore, if your page had an ID of 15, the code would be:

<?php if ($post->ID == 15) { echo '<meta name="robots" content="noindex,nofollow">'; } ?>

As all post types are stored in the wp_posts database table, the above code will work with any type of page; be it a post, page, attachment, or custom types such as galleries and portfolios.

You can block additional pages on your website by using the OR operator.

<?php if ($post->ID == X || $post->ID == Y) { echo '<meta name="robots" content="noindex,nofollow">'; } ?>

You simply need to specify the ID of the pages that you want to block. For example, say you want to block search engines from indexing posts and pages with ID 15, 137, and 4008. You can do this easily using:

<?php if ($post->ID == 15 || $post->ID == 137 || $post->ID == 4008) { echo '<meta name="robots" content="noindex,nofollow">'; } ?>
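When the list of IDs grows, a chain of OR comparisons becomes hard to maintain. One possible alternative, sketched here under the assumption that you prefer to keep the logic in functions.php rather than header.php, stores the IDs in an array and prints the tag through the wp_head action. The helper name example_noindex_tag_for is hypothetical, and the IDs are the ones from the example above.

```php
<?php
// Hypothetical helper: returns the noindex tag for blocked IDs, or an empty string.
function example_noindex_tag_for( $post_id, array $blocked_ids ) {
    if ( in_array( (int) $post_id, $blocked_ids, true ) ) {
        return '<meta name="robots" content="noindex,nofollow">' . "\n";
    }
    return '';
}

// Register the hook only when running inside WordPress.
if ( function_exists( 'add_action' ) ) {
    add_action( 'wp_head', function () {
        // is_singular() restricts the check to single posts, pages and attachments,
        // so listing pages that merely contain a blocked post are not affected.
        if ( is_singular() ) {
            echo example_noindex_tag_for( get_queried_object_id(), array( 15, 137, 4008 ) );
        }
    } );
}
```

To block another post, add its ID to the array; no further conditions are needed.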

To confirm that everything is configured correctly, verify that the right pages are blocked. The simplest way to do this is to view the source of a page you want to block. If you have added the code correctly, you will see <meta name="robots" content="noindex,nofollow"> listed in the head section of the page. If not, the code has not been added correctly.
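If you have many pages to check, the same verification can be scripted. The sketch below assumes PHP's DOM extension is available and that you already have the page source in a string; the function name example_has_noindex is hypothetical.

```php
<?php
// Hypothetical checker: reports whether a page's HTML contains a robots meta tag
// with the noindex directive (or "none", which implies noindex).
function example_has_noindex( $html ) {
    if ( trim( $html ) === '' ) {
        return false;
    }

    $doc      = new DOMDocument();
    $previous = libxml_use_internal_errors( true ); // silence warnings about imperfect markup
    $doc->loadHTML( $html );
    libxml_clear_errors();
    libxml_use_internal_errors( $previous );

    foreach ( $doc->getElementsByTagName( 'meta' ) as $meta ) {
        if ( strtolower( $meta->getAttribute( 'name' ) ) !== 'robots' ) {
            continue;
        }
        // Split "noindex, nofollow" into individual directives.
        $directives = array_map( 'trim', explode( ',', strtolower( $meta->getAttribute( 'content' ) ) ) );
        if ( in_array( 'noindex', $directives, true ) || in_array( 'none', $directives, true ) ) {
            return true;
        }
    }
    return false;
}

$source = '<html><head><meta name="robots" content="noindex,nofollow"></head><body></body></html>';
var_dump( example_has_noindex( $source ) ); // bool(true)
```

Inside WordPress, the page source could be fetched with wp_remote_get() and wp_remote_retrieve_body() before passing it to the function.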

Stop Search Engines Crawling a Post or Page Using Robots.txt

The robots.txt file serves a similar purpose to the robots meta tag, but it works by asking spiders not to crawl the URLs you list. There are only a few basic rules.

  • User-agent – The search engine spider that the rule should be applied to
  • Disallow – The URL or directory that you want to block

Here are a few examples to help you understand how to use a robots.txt file to block search engines.

The code below will block search engines from indexing your whole website. Only add this to your robots.txt file if you do not want any pages on your website indexed.

User-agent: *
Disallow: /

To stop search engines from indexing your recent announcement post, you could use something like this:

User-agent: *
Disallow: /2014/06/big-announcement/

Another rule that is available to you is Allow. It specifies a URL or directory that a user agent may crawl even though a broader Disallow rule covers it. The example below shows how this works in practice. It blocks all search engines but lets Google Images crawl the content inside your images folder. A spider obeys only the group that matches it most specifically, so the Googlebot-Image group needs its own Disallow line.

User-agent: *
Disallow: /

User-agent: Googlebot-Image
Allow: /images/
Disallow: /
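When Allow and Disallow rules in the same group both match a URL, Google applies the most specific rule, meaning the one with the longest matching path, and prefers Allow on a tie. The standalone script below is a simplified sketch of that precedence, assuming plain path prefixes without wildcards; the function name example_is_allowed and the rule format are hypothetical.

```php
<?php
// Hypothetical evaluator: decides whether a path may be crawled under one group's rules.
// Each rule is array( 'allow' | 'disallow', '/path/prefix' ). Wildcards are not handled.
function example_is_allowed( $path, array $rules ) {
    $best_length = -1;
    $allowed     = true; // with no matching rule, crawling is permitted

    foreach ( $rules as $rule ) {
        list( $type, $prefix ) = $rule;

        // Skip empty rules and rules whose path is not a prefix of the URL path.
        if ( $prefix === '' || strpos( $path, $prefix ) !== 0 ) {
            continue;
        }

        $length   = strlen( $prefix );
        $is_allow = ( $type === 'allow' );

        // The longest match wins; on equal length, Allow takes precedence.
        if ( $length > $best_length || ( $length === $best_length && $is_allow ) ) {
            $best_length = $length;
            $allowed     = $is_allow;
        }
    }

    return $allowed;
}

$rules = array(
    array( 'allow', '/images/' ),
    array( 'disallow', '/' ),
);

var_dump( example_is_allowed( '/images/logo.png', $rules ) );           // bool(true)
var_dump( example_is_allowed( '/2014/06/big-announcement/', $rules ) ); // bool(false)
```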

To stop search engines from crawling a custom post type, use this syntax in the robots.txt file:

User-agent: *
Disallow: /your-custom-post-type-slug/

For example, if you have a custom post type “Portfolios” with the slug “portfolios”, the robots.txt rule will be:

User-agent: *
Disallow: /portfolios/

There is also a way to do it with a function. You can add the noindex tag by placing this code in your functions.php file:

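// Print a noindex meta tag on single "portfolios" posts.
function noindex_for_portfolios()
{
    if ( is_singular( 'portfolios' ) ) {
        echo '<meta name="robots" content="noindex, follow">';
    }
}

// The callback name must match the function defined above.
add_action('wp_head', 'noindex_for_portfolios');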
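The function above covers single portfolio items only. If the archive page of the custom post type should be hidden as well, a variation along the following lines might work. It is a sketch that assumes the same “portfolios” post type; the helper name example_portfolios_robots_tag is hypothetical.

```php
<?php
// Hypothetical helper: returns the noindex tag when the current view is a single
// portfolio item or the portfolios archive, and an empty string otherwise.
function example_portfolios_robots_tag( $is_single_item, $is_archive ) {
    if ( $is_single_item || $is_archive ) {
        return '<meta name="robots" content="noindex, follow">' . "\n";
    }
    return '';
}

// Register the hook only when running inside WordPress.
if ( function_exists( 'add_action' ) ) {
    add_action( 'wp_head', function () {
        echo example_portfolios_robots_tag(
            is_singular( 'portfolios' ),          // single portfolio item
            is_post_type_archive( 'portfolios' )  // archive page of the post type
        );
    } );
}
```

Use either this variation or the function above, not both, so that the tag is not printed twice.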

 

If you have any questions or difficulties setting up the robots.txt file, you can open a ticket with our support team to get some help.
