Search Engine Indexing and Robots.txt Files
What Is It?
Use a special text file called robots.txt, placed in the root directory of your web server, to define which parts of your site are off-limits to search engine crawlers. Robots.txt implements the Robots Exclusion Protocol, which asks cooperating search engines not to crawl certain directories. Directories listed in the robots.txt file (such as private or temporary directories) will not be crawled by compliant search engines, so their content generally won't appear in search results.
Robots.txt File
The robots exclusion standard, or robots.txt protocol, is a convention that asks cooperating web spiders and other web robots not to access all or part of a website. A robots.txt record is made up of two parts: the User-agent line and one or more Disallow lines. The User-agent line names the robot the record applies to (an asterisk means all robots), and each Disallow line names a directory or path that robot should not crawl. Be aware that compliance is voluntary, and some crawlers may ignore a robots.txt file, including one that disallows all crawling.
Example of a recommended robots.txt file:
User-agent: *
Disallow: /cgi-bin/
Disallow: /scripts/
Disallow: /images/
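To see how a cooperating crawler might interpret the example above, here is a minimal sketch using Python's standard urllib.robotparser module. The host name www.example.gov, the crawler name ExampleBot, and the helper is_allowed are placeholders invented for illustration; real crawlers may apply additional rules of their own.

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, held in a string so nothing is fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /scripts/
Disallow: /images/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # www.example.gov and ExampleBot are hypothetical placeholders.
    for path in ("/index.html", "/images/logo.png", "/cgi-bin/form.cgi"):
        url = "https://www.example.gov" + path
        print(path, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

With these rules, /index.html is reported as allowed, while the paths under /images/ and /cgi-bin/ are reported as disallowed.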
Why It’s Important
If your web content is not showing up in search results, check whether your robots.txt file contains a rule that disallows search engine crawlers from crawling your site. If you have disallowed all search engines from crawling your site, your content will not be included in search engine results, and people will not be able to find it through search. At a minimum, you should allow Bing's crawler to crawl your site, so it is included in search results for USA.gov, the official web portal of the U.S. government.
- Learn about how Bing crawls your site.
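One way to check for an accidental blanket block is to test the site root against the crawlers you care about. The sketch below is illustrative only: the function blocked_crawlers, the two sample rule sets, and the host www.example.gov are assumptions made for this example, and "bingbot" is assumed to be the user-agent token that Bing's crawler matches, so confirm it against Bing's own documentation.

```python
from urllib.robotparser import RobotFileParser

# Two hypothetical rule sets used only for illustration.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_BING_ONLY = (
    "User-agent: bingbot\n"
    "Disallow:\n"          # an empty Disallow means nothing is blocked
    "\n"
    "User-agent: *\n"
    "Disallow: /\n"
)


def blocked_crawlers(robots_txt, user_agents,
                     site_root="https://www.example.gov/"):
    """Return the user agents that may not crawl the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [ua for ua in user_agents if not parser.can_fetch(ua, site_root)]


if __name__ == "__main__":
    agents = ["bingbot", "Googlebot"]
    print(blocked_crawlers(BLOCK_ALL, agents))
    print(blocked_crawlers(ALLOW_BING_ONLY, agents))
```

An empty list means none of the named crawlers is blocked at the root. A crawler that appears in the list is being turned away from the whole site by the rules as written.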
In addition, OMB Memo M-06-02 "Improving Public Access to and Dissemination of Government Information and Using the Federal Enterprise Architecture Data Reference Model" (PDF, 64 KB, 6 pages, December 2005) says: "when disseminating information to the public-at-large, publish your information directly to the Internet. This procedure exposes information to freely available and other search functions and adequately organizes and categorizes your information."
This memorandum assumes that your robots.txt file allows search engines to crawl your site. If you disallow search engine crawlers, you are not exposing information to search engines, and therefore are not complying with this guidance.
Best Practices
- Include the robots.txt file in your server's root directory. This is a standard web management best practice.
- Search your server for stray robots.txt files and delete any robots.txt file below the root directory. Crawlers request robots.txt only from the root of a site, so a copy in a subdirectory is generally ignored and may give the false impression that the subdirectory is protected.
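The search for stray files can be scripted. The following is a minimal sketch that assumes you have file-system access to the web server's document root; the function name find_stray_robots and the path /var/www/html are hypothetical, so substitute your own document root.

```python
from pathlib import Path


def find_stray_robots(web_root):
    """Return every robots.txt found below, but not directly in, web_root."""
    root = Path(web_root)
    return sorted(
        path for path in root.rglob("robots.txt")
        if path.parent != root  # keep the legitimate root-level file
    )


if __name__ == "__main__":
    # /var/www/html is a placeholder for your server's document root.
    for stray in find_stray_robots("/var/www/html"):
        print("Stray robots.txt:", stray)
```

The sketch only reports the files it finds. Review each one before deleting it, since a subdirectory may be served as the root of a different host.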
Meta-Tag Robots Exclusion
If your pages should be publicly disseminated, review them to make sure they do not use robots exclusion in their meta tags. A robots meta tag is an HTML tag in a page's head section that tells robots not to index that page or not to follow its links. Note that this relies on the cooperation of the robot programs; some crawlers may ignore these tags.
Example of a meta-tag robots exclusion:
<head><meta name="robots" content="noindex, nofollow"></head>
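A page review like this can be partly automated. The sketch below uses Python's standard html.parser module to list the directives found in any robots meta tag on a page. The class RobotsMetaFinder and the function robots_directives are names invented for this example, and the sketch looks only at meta tags named "robots", not at crawler-specific tags or HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, nofollow".
        self.directives += [
            item.strip().lower() for item in content.split(",") if item.strip()
        ]


def robots_directives(html):
    """Return the robots meta directives found in an HTML string."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives


if __name__ == "__main__":
    page = '<head><meta name="robots" content="noindex, nofollow"></head>'
    print(robots_directives(page))
```

A page that returns "noindex" or "nofollow" here is asking robots to exclude it, which is worth a second look if the page is meant for the public.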
Resources
- How to create a robots.txt file
- Search Indexing Robots and Robots.txt
- Search Indexing Robots and the Robots META Tag
- OMB Policy 5: Search Public Websites
- How Search Engines Work
Content Lead:
Ammie Farraj Feijoo
Page Reviewed/Updated: April 25, 2013
