
Robots.txt for WordPress – How To Manage Your Robots.txt File


The robots.txt is a simple text file used to tell search engine robots and spiders which pages on your website should be crawled and indexed. Pages you typically don't want crawled include junk pages and duplicate content.

Controlling which content is visible or hidden to search engines can be important in optimising how search engine robots and spiders access your website. Fortunately, the major search engines observe the Robots Exclusion Standard (also known as the Robots Exclusion Protocol), a convention to prevent cooperating web spiders and robots from accessing all or part of a website which is otherwise publicly viewable. Be warned though, this protocol is purely advisory.

Under the Robots Exclusion Protocol, a search engine reads a robots.txt file residing in the top-level directory of your site to learn where it should and should not look for content and links. Pointing search engines towards your quality content can help improve the ranking of your site.

  • This method is recommended by all the major Search Engines.
  • A list of all the major Search Engines can be found in the Top Search Engines article.

The robots.txt file contains rules that robots should obey, but remember: they don't have to obey those rules; it's just a convention.

  • Not every robot will play along: unscrupulous robots and spiders will ignore robots.txt and crawl blocked areas anyway, but that's another story.
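To see the advisory nature of the protocol in practice, here is a minimal sketch using Python's standard urllib.robotparser module (the URLs and bot name are placeholders for illustration): a polite crawler asks before fetching, but nothing stops a rogue one.

```python
# A polite crawler checks robots.txt before requesting a URL.
# Nothing in HTTP enforces this -- compliance is entirely voluntary.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /wp-admin",
])

# A well-behaved bot asks first and stays out of blocked areas...
print(rp.can_fetch("MyBot", "http://example.com/wp-admin/"))  # False
print(rp.can_fetch("MyBot", "http://example.com/about/"))     # True
# ...an unscrupulous bot can simply ignore the file and fetch anyway.
```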

Public vs. Private

There are parts of your site you do not want crawled by search engines, areas such as installation directories and archives. So the first step in managing the robots and spiders is deciding what content should be public and what content should be private.

  • With no robots.txt file, everything is public. So if you want the search engines to access all the content on your site, you don’t need a robots.txt file at all.

As every website and every webmaster is different, the following are just a few pointers to the areas you might like to consider making private:

  • Private data such as user information.
  • Non-content such as images used for navigation.
  • Printer-friendly pages, to avoid duplicate content issues and visitors arriving at the wrong page.
  • Affiliate links and advertising.
  • Landing pages used for advertising purposes.
  • Experimental pages.

Implementing the Robots Exclusion Protocol

Implementation is best achieved by specifying policies for your entire website (or sub-domain) and then more granularly at the page or link level as needed.

  • Important: a robots.txt file must be located in the root directory of each domain or sub-domain (e.g. http://yourwebsite.com/robots.txt).
  • A robots.txt file is a UTF-8 (plain text) encoded file.

What should the robots.txt file contain?

A simple robots.txt file should contain a User-agent line, which specifies the search engine robots the rules apply to, followed by one or more Disallow directives specifying what content is blocked.

 

User-agent: *    # the asterisk means this entry applies to all robots and spiders
Disallow: /      # specifies blocked content; the path must begin with a slash (/)
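Note the two extremes: `Disallow: /` blocks the whole site, while an empty `Disallow:` blocks nothing. A quick sketch checking both cases with Python's standard urllib.robotparser module:

```python
from urllib import robotparser

# "Disallow: /" blocks the whole site for every robot;
# an empty "Disallow:" blocks nothing at all.
block_all = robotparser.RobotFileParser()
block_all.parse(["User-agent: *", "Disallow: /"])

allow_all = robotparser.RobotFileParser()
allow_all.parse(["User-agent: *", "Disallow:"])

print(block_all.can_fetch("Googlebot", "/about/"))  # False
print(allow_all.can_fetch("Googlebot", "/about/"))  # True
```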

 

 

A WordPress robots.txt file

In WordPress we can use the Allow directive to specify which areas should be crawled, and the Disallow directive to specify which areas should not. An example WordPress robots.txt file showing which WordPress directories to Allow/Disallow is shown below:

 

User-agent: *

Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /feed
Disallow: /comments
Disallow: /category/*/*
Disallow: */trackback
Disallow: */feed
Disallow: */comments
Disallow: /*?*
Disallow: /*?
Allow: /wp-content/uploads

# Google Image
User-agent: Googlebot-Image
Disallow:
Allow: /*

# Google AdSense
User-agent: Mediapartners-Google*
Disallow:
Allow: /*

# Internet Archiver Wayback Machine
User-agent: ia_archiver
Disallow: /

# digg mirror
User-agent: duggmirror
Disallow: /

Sitemap: http://www.example.com/sitemap.xml
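You can sanity-check a rule set like the one above before deploying it. One caveat: Python's standard urllib.robotparser matches plain path prefixes only, so Googlebot-style wildcard lines such as `Disallow: /*?` are not interpreted the way Google interprets them; the sketch below therefore tests only the prefix rules.

```python
from urllib import robotparser

# Prefix rules taken from the WordPress example above. robotparser
# treats rules as simple path prefixes, so the wildcard lines
# (e.g. "Disallow: /*?") are omitted from this check.
rules = """\
User-agent: *
Disallow: /wp-admin
Disallow: /wp-content/plugins
Allow: /wp-content/uploads
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

for path in ("/wp-admin/options.php",
             "/wp-content/plugins/akismet/",
             "/wp-content/uploads/logo.png",
             "/2011/05/hello-world/"):
    print(path, rp.can_fetch("*", path))
# /wp-admin/options.php False
# /wp-content/plugins/akismet/ False
# /wp-content/uploads/logo.png True
# /2011/05/hello-world/ True
```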

For a comprehensive view of how Googlebot crawls your site, Google provides a very good tutorial for webmasters. A visit to Google Webmaster Central is also a good idea, with plenty of helpful information and tools available.

 

