Exactly what is a robots.txt file, and what does it do?
The official Google definition of robots.txt is: “A robots.txt file tells search engine crawlers which pages or files the crawler can or can’t request from your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google.”
How To Use a Robots.txt File on Your Website
What is a Robots.txt file?
- Basically, it’s just a text file at the root of your website that tells search engine crawlers (Google, Bing, Yahoo) which pages and files to crawl and which to ignore.
- If properly written, a robots.txt file will keep images, videos, audio files, and even script and style files from being crawled and indexed.
- HTML and other page filetypes can be excluded for traffic management. If you actually want to keep a page out of search results, either password-protect it, require authentication, or use a noindex directive on the page itself.
- If you use a hosting service or page builder, you might not have a way to provide a robots.txt file. Many of these platforms do offer a setting to discourage search engines from indexing the site.
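To give a sense of what this looks like, here is a minimal sketch of a robots.txt file. The /private/ folder and the sitemap URL are hypothetical placeholders, not values from this episode; the file itself always lives at the root of the domain, e.g. https://www.example.com/robots.txt.

  # Applies to all crawlers
  User-agent: *
  # Don't crawl anything under /private/
  Disallow: /private/
  # Point crawlers at the sitemap
  Sitemap: https://www.example.com/sitemap.xml

This tells every crawler it may crawl everything except /private/, and points it at the sitemap.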
What Are Some Examples of Robots.txt Use?
- The basic syntax for a robots.txt file is:
- User-agent: the crawler the rule applies to, such as Googlebot, Bingbot, or * (wildcard for all crawlers)
- List of Top 10 Web Crawlers and Bots: https://www.keycdn.com/blog/web-crawlers
- Directives: Disallow, Crawl-delay, Sitemap
- Rules are case-sensitive, so be careful
- By default, any search engine can crawl and index the entire website, so robots.txt provides directions to change or refine that.
- Any number of “groups” can be created.
- Groups are an easy way to separate instructions for multiple engines (see the example after this list).
- Group 1 – Googlebot not allowed to crawl a certain directory
- Group 2 – All other engines allowed to crawl the entire site
- For example, this syntax would block all search engines from all content (notice the slash after Disallow:):
  User-agent: *
  Disallow: /
- And this syntax would ALLOW all search engines to index all content:
  User-agent: *
  Disallow:
- Block a specific search engine from a specific page:
  User-agent: Bingbot
  Disallow: /example-subfolder/blocked-page.html
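Putting the “groups” idea above into a single file, here is a hypothetical sketch; /private-reports/ is an invented directory name, not one from this episode:

  # Group 1 – Googlebot is not allowed to crawl one directory
  User-agent: Googlebot
  Disallow: /private-reports/

  # Group 2 – every other crawler may crawl the entire site
  User-agent: *
  Disallow: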
What is Crawl Budget?
- Many tools and resources will mention the “crawl budget” of a website. Basically, it’s a number known only to the search engine: how many pages, images, and other files the engine will crawl, or how long the engine will stay on a site.
- If you think pages aren’t being fully indexed, it may be a good idea to identify the pages you absolutely need to have indexed and make sure they are allowed (an empty Disallow: or an explicit Allow:), so the search engines spend their crawl budget on them first.
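As a rough sketch of that idea, you could disallow a low-value section so the crawl budget isn’t wasted there, and explicitly allow the pages that matter; /old-archive/ and /products/ are invented example paths:

  User-agent: *
  # Skip a section that doesn't need to rank
  Disallow: /old-archive/
  # Explicitly allow the pages you care about most
  Allow: /products/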
Why would I use crawl-delay?
- Another directive you can use is Crawl-delay, followed by a number of seconds.
- A crawl delay will slow down a search engine like Bing, which tends to be a little quick out of the gate. This can improve crawl accuracy while decreasing the load on the site and its bandwidth.
- Heads up: Google does not use the crawl-delay directive.
- Syntax: Crawl-delay: 10 tells a crawler to wait ten seconds between requests (see the sketch below).
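Here is a minimal sketch of that directive aimed at Bing, assuming a ten-second pause between requests is appropriate for your server:

  # Ask Bing's crawler to wait 10 seconds between requests
  User-agent: Bingbot
  Crawl-delay: 10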
Common Robots.txt Rules
https://developers.google.com/search/docs/advanced/robots/create-robots-txt
Disallow crawling of the entire website. Keep in mind that in some situations URLs from the website may still be indexed, even if they haven’t been crawled. This does not match the various AdsBot crawlers, which must be named explicitly.
  User-agent: *
  Disallow: /

Disallow crawling of a directory and its contents by following the directory name with a forward slash. Remember that you shouldn’t use robots.txt to block access to private content: use proper authentication instead. URLs disallowed by the robots.txt file might still be indexed without being crawled, and the robots.txt file can be viewed by anyone, potentially disclosing the location of your private content.
  User-agent: *
  Disallow: /calendar/
  Disallow: /junk/

Allow access to a single crawler.
  User-agent: Googlebot-news
  Allow: /
  User-agent: *
  Disallow: /

Allow access to all but a single crawler.
  User-agent: Unnecessarybot
  Disallow: /
  User-agent: *
  Allow: /

Disallow crawling of a single web page by listing the page after the slash:
  User-agent: *
  Disallow: /private_file.html

Block a specific image from Google Images:
  User-agent: Googlebot-Image
  Disallow: /images/dogs.jpg

Block all images on your site from Google Images:
  User-agent: Googlebot-Image
  Disallow: /

Disallow crawling of files of a specific file type (for example, .gif):
  User-agent: Googlebot
  Disallow: /*.gif$

Disallow crawling of an entire site, but show AdSense ads on those pages, and disallow all web crawlers other than Mediapartners-Google. This implementation hides your pages from search results, but the Mediapartners-Google web crawler can still analyze them to decide what ads to show visitors to your site.
  User-agent: *
  Disallow: /
  User-agent: Mediapartners-Google
  Allow: /

To match URLs that end with a specific string, use $. For instance, the sample code blocks any URLs that end with .xls:
  User-agent: Googlebot
  Disallow: /*.xls$
Viewing of websites on mobile devices has increased from just over 30% in 2015 to over 50% today, and there’s no sign of it slowing down. Even if your customers are thought to be mainly on desktop and laptop computers, mobile-first indexing will force you to get your website designed for mobile use starting in March 2021.
If you need help getting this process done, especially in a WordPress environment, please contact BeBizzy Consulting at bebizzy.com and let’s get your site ready for mobile use.
Thanks for listening to this episode of the WP Wednesday Podcast
Do you have questions or experiences related to today’s topic? Head over to @Bebizzy on Twitter and send them there.
Don’t forget to check out SEM Rush for all your SEO needs. Visit bebizzy.com/semrush.
And remember to subscribe to the WP Wednesday Podcast for more great tips on managing your WordPress website.
Just click subscribe in your podcast player and leave us a review. Then you can sit back, relax, and leave the technical stuff to us.
WordPress News
- WordPress 5.7 Released March 9, 2021
- Reusable Blocks
- Easier font-size adjustments
- Drag and drop from the inserter right into your page or post
- Switch from HTTP to HTTPS in one click. No database edits
- Lazy loading of iFrames