Search Engine Indexing and Robots.txt Files

What Is It?

Use a special text file called robots.txt, placed in the root directory of your web server, to tell search engine crawlers which parts of your site are off-limits. Robots.txt implements the Robots Exclusion Protocol, which asks cooperating search engines not to crawl certain directories. Directories listed in the robots.txt file (such as private or temporary directories) will not be crawled by those search engines, so their content generally won't appear in search results.

Robots.txt File

The robots exclusion standard, or robots.txt protocol, is a convention that asks cooperating web spiders and other web robots not to access all or part of a website. Each record in a robots.txt file has two parts: the User-agent and the Disallow. The User-agent line names the robots the record applies to (an asterisk means all robots), and each Disallow line names a directory those robots should not crawl. Be aware that compliance is voluntary: some crawlers may ignore a robots.txt file, including one that disallows all crawling.

Example of a recommended robots.txt file:

User-agent: *
Disallow: /cgi-bin/
Disallow: /scripts/
Disallow: /images/
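
To see how a cooperating crawler would read the example above, the rules can be checked with Python's standard urllib.robotparser module. This is a minimal sketch, not an official tool: the helper name is_allowed, the crawler name "ExampleBot", and the sample paths are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example rules from this page, split into lines for the parser.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /scripts/
Disallow: /images/
""".splitlines()


def is_allowed(rules, user_agent, path):
    """Return True if a cooperating crawler may fetch path under these rules.

    is_allowed is a hypothetical helper name used only for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(rules)  # parse() accepts an iterable of robots.txt lines
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "ExampleBot" and the paths below are made-up values for illustration.
    for path in ("/index.html", "/images/logo.png", "/cgi-bin/form.cgi"):
        print(path, is_allowed(EXAMPLE_RULES, "ExampleBot", path))
```

Paths under a disallowed directory are reported as not allowed, while the rest of the site stays open to crawling.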

Why It’s Important

If your web content is not showing up in search results, check whether your robots.txt file contains a directive that disallows searchbots from crawling your site. If you have disallowed all search engines from crawling your site, your content will NOT be included in any search engine, and people will not be able to find it through search. At a minimum, you should allow Bing's searchbot to crawl your site, so it is included in search results for USA.gov, the official web portal of the U.S. government.
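
One way to run this check is sketched below in Python, using the standard urllib.robotparser module. It is an illustration only: the function name blocks_whole_site and the two sample rule sets are assumptions, and "bingbot" is used here as the user-agent name for Bing's crawler. The sketch tests only the site root ("/"), so a file that blocks individual directories would need path-by-path checks.

```python
from urllib.robotparser import RobotFileParser


def blocks_whole_site(robots_txt_text, user_agent):
    """Return True if these rules stop user_agent from crawling the site root.

    blocks_whole_site is a hypothetical helper name used only for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt_text.splitlines())
    return not parser.can_fetch(user_agent, "/")


# Sample rule sets, made up for illustration.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_BING_ONLY = (
    "User-agent: bingbot\n"
    "Disallow:\n"          # an empty Disallow means nothing is off-limits
    "\n"
    "User-agent: *\n"
    "Disallow: /\n"
)

if __name__ == "__main__":
    print(blocks_whole_site(BLOCK_ALL, "bingbot"))
    print(blocks_whole_site(ALLOW_BING_ONLY, "bingbot"))
```

To test a live site, the contents of its robots.txt file can be passed in as robots_txt_text.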

In addition, OMB Memo M-06-02 "Improving Public Access to and Dissemination of Government Information and Using the Federal Enterprise Architecture Data Reference Model" (PDF, 64 KB, 6 pages, December 2005) says: "when disseminating information to the public-at-large, publish your information directly to the Internet. This procedure exposes information to freely available and other search functions and adequately organizes and categorizes your information."

This memorandum assumes that your robots.txt file allows search engines to crawl your site. If you are disallowing search engine crawlers, you are not exposing information to search engines, and therefore not complying with this guidance.

Best Practices

  • Include the robots.txt file in your server's root directory. This is a standard web management best practice.
  • Search your server for stray robots.txt files and delete any robots.txt file below the root directory. Crawlers that follow the standard read robots.txt only from the root directory, so a copy in a subdirectory is generally ignored and can mislead site managers about what is actually blocked.
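
The search for stray files can be sketched in Python as follows. This is a minimal example under stated assumptions: the function name find_stray_robots_files and the path /var/www/html are hypothetical, and the script only lists files so that each one can be reviewed before it is deleted.

```python
import os


def find_stray_robots_files(web_root):
    """Return robots.txt files found below web_root, excluding the root copy.

    find_stray_robots_files is a hypothetical helper name for this sketch.
    """
    root = os.path.abspath(web_root)
    stray = []
    for dirpath, _dirnames, filenames in os.walk(web_root):
        if os.path.abspath(dirpath) == root:
            continue  # the copy in the root directory is the one to keep
        for name in filenames:
            if name.lower() == "robots.txt":
                stray.append(os.path.join(dirpath, name))
    return sorted(stray)


if __name__ == "__main__":
    # "/var/www/html" is an assumed document root; substitute your own.
    for path in find_stray_robots_files("/var/www/html"):
        print(path)
```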

Meta-Tag Robots Exclusion

Review your pages to make sure you are not using robots exclusion in the meta tags of pages that should be publicly disseminated. Meta-tag robots exclusion is an HTML meta tag that tells robots not to index a web page or follow its links. Note that this relies on the cooperation of the robot programs; some crawlers may ignore these tags.

Example of a meta-tag robots exclusion:

<head><meta name="robots" content="noindex, nofollow"></head>
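
A page review like this can be partly automated. The Python sketch below uses the standard html.parser module to list the directives in any robots meta tag on a page. The class name RobotsMetaFinder, the function name robots_directives, and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, nofollow".
        self.directives += [d.strip().lower() for d in content.split(",") if d.strip()]


def robots_directives(html_text):
    """Return the robots meta directives found in html_text, in page order."""
    finder = RobotsMetaFinder()
    finder.feed(html_text)
    return finder.directives


if __name__ == "__main__":
    # A made-up page for illustration.
    page = '<head><meta name="robots" content="noindex, nofollow"></head>'
    print(robots_directives(page))
```

A public page that returns "noindex" here is asking cooperating search engines to leave it out of their results.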


Content Lead: Ammie Farraj Feijoo
Page Reviewed/Updated: April 25, 2013
