How To Use Sitemap & Robots.txt To Give Google a Complete Picture of Your Blog

So you’ve started a new blog. That’s great. How are you going to tell Google about it? Well, if you’re like most, you wait until Google finds out about you. However, there’s a way for you to let Google know that you’re alive.

The Google Sitemap

A sitemap is just that: a map of all the pages on your blog. A Google sitemap is a file that informs Google and other search engines about the URLs on your blog that are available for crawling.

A sitemap is an XML file that lists the URLs for your blog. It allows bloggers to include additional information about each URL like when it was last updated, how often it changes, and how important it is in relation to other URLs in the site.
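
To make that concrete, here’s a minimal Python sketch of what such a file contains when built by hand. The example.com URLs, the dates, and the build_sitemap helper are made up for illustration; the tag names (loc, lastmod, changefreq, priority) and the namespace come from the sitemaps.org protocol. Treat it as a peek under the hood rather than a finished generator.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return a sitemap XML string for a list of URL dictionaries.

    Each entry needs a "loc" key; "lastmod", "changefreq" and
    "priority" are optional and are written only when present.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages, used only to show the shape of the output.
entries = [
    {"loc": "https://example.com/", "lastmod": "2024-01-15",
     "changefreq": "daily", "priority": "1.0"},
    {"loc": "https://example.com/my-first-post/", "lastmod": "2024-01-10",
     "changefreq": "monthly", "priority": "0.8"},
]

sitemap_xml = build_sitemap(entries)
print(sitemap_xml)
```

Each url element holds one page, and the optional tags are hints to the search engine, not commands.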

Basically, you generate a sitemap and then let Google know about it. If you’re running WordPress, this is very easy to do. But the first step is to get a Google Webmaster Tools account (Google has since renamed the service Google Search Console).

Webmaster Tools is where you’ll be communicating with Google. From there, you’ll be able to add your blog and tell Google where your sitemap is located. Just by adding your blog to Webmaster Tools, you’ve told Google that you have a blog and that their bot should go pay it a visit. The sitemap is kinda like a guide for the bot: it tells the bot which pages are available for indexing.

Google Sitemaps Generator for WordPress

The nice thing about running WordPress on your blog is you never have to worry about creating an XML sitemap because there’s a plugin that will do it all for you.

The Google Sitemaps Generator for WordPress generates an XML sitemap of your WordPress blog that complies with the Sitemaps protocol. This format is supported by Ask.com, Google, Yahoo and MSN Search.

Installation is a little more involved than the average WordPress plugin, but once installed, it pretty much runs itself. Whenever you add a new blog post, the sitemap will automatically update with the new page, so the next time the Google bot comes by, it will know about it.

While the sitemap will allow Google and other search engines to crawl your blog more intelligently, it is only a URL inclusion protocol. In other words, it just tells the bots which URLs to include. To give the bots a complete picture, you need to complement the sitemap with a robots.txt file.

Using The Robots.txt File

A sitemap is a URL inclusion protocol. Robots.txt is a URL exclusion protocol. Together, they give Google a complete picture of your blog and how it should be indexed.

Now you might ask: why would you want to exclude some URLs from Google? Wouldn’t you get more traffic if you had as many pages in there as possible? The answer is no. There are some pages you’ll want excluded from the index.

The robots.txt file tells the Google bot which parts of your blog it can and cannot crawl. The most common use for a robots.txt file is to keep duplicate content or members-only areas out of the index.

Sample Robots.txt File

sitemap: https://johnchow.com/sitemap.xml

User-agent: *
Disallow: /cgi-bin/
Disallow: /go/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /author/
Disallow: /page/
Disallow: /category/
Disallow: /wp-images/
Disallow: /images/
Disallow: /backup/
Disallow: /banners/
Disallow: /archives/
Disallow: /trackback/
Disallow: /feed/

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

User-agent: Mediapartners-Google
Allow: /

User-agent: duggmirror
Disallow: /

The above is the robots.txt file that powers my blog. The first line tells the bot the location of my sitemap. This is followed by a bunch of folders that I don’t want the bot to index.

Your blog is a huge generator of duplicate content, so you’ll want to use the robots.txt file to block it out. I only want the bot to index the actual blog posts. However, each post is generally repeated in category, archives, trackback and feed pages. Other areas I don’t want the bot going into include my WordPress admin folder and the folder where I keep my redirects.
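
If you want to check what a set of rules actually blocks before you upload it, Python’s standard library ships a robots.txt parser. Below is a rough sketch using a handful of the rules from the sample file above. The post URL and the is_crawlable helper are hypothetical, and real crawlers may handle edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# A few rules copied from the sample robots.txt above.
RULES = """\
User-agent: *
Disallow: /go/
Disallow: /wp-admin/
Disallow: /category/
Disallow: /feed/
"""


def is_crawlable(rules_text, path, user_agent="Googlebot"):
    """Return True if the given rules let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, path)


# Hypothetical paths: one blog post plus three blocked areas.
for path in ("/my-first-post/", "/category/tech/", "/feed/", "/wp-admin/"):
    status = "allowed" if is_crawlable(RULES, path) else "blocked"
    print(f"{path}: {status}")
```

The post itself comes back as allowed, while the category, feed and admin paths come back as blocked, which is the split you’re after.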

By combining a sitemap with a good robots.txt file, you give Google a complete picture of your blog, and that’ll help get you ranked faster.