12 May 2011

10 Step SEO: Sitemaps, part 2

Let’s put this 10 Step thing to bed, then, shall we? Last week we talked about HTML sitemaps, the kind that live on your site and are linked to from your pages and that actual real people might even be able to access and use. This week, we’ll delve into their more esoteric cousin: the XML sitemap.

In 2005, Google unveiled what they called the “Sitemaps Protocol.”  The idea was to create a single format for building a sitemap file that all (or at least “most”) search engines could use to find and index pages that might be otherwise difficult to crawl.  This protocol uses XML as a formatting medium. It’s simple enough to code by hand, but robust enough to support dynamic, database-driven systems.

At first, only Google crawled sitemap.xml files, but they encouraged webmasters to create and publish them by opening a submission service. You would build an XML sitemap, upload it to your web server, then submit the URL to Google via their webmaster interface. The Goog would crawl it, and—in theory—follow all the links and index all your pages.

It actually worked rather well. Pretty soon, all the web pros were calling the system “Google Sitemaps” and uploading and submitting like crazy. With so many sitemaps installed on so many websites, it wasn’t long before the other major engines adopted the protocol.

Are XML sitemaps a magic bullet?

No. Don’t be silly. But they are useful additions to a website’s structural navigation, especially for complex architectures that may be resistant to spider crawls. We’ve used them on many sites and find that a valid XML sitemap can lead to faster, more accurate indexing.

So what is this thing?

It’s really a pretty simple construction. You could easily make one without any understanding of XML at all. The Sitemaps Protocol dictates a text file, with the extension “xml,” using this template:

<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <url>
                <loc>http://example.com/</loc>
                <lastmod>2006-11-18</lastmod>
                <changefreq>daily</changefreq>
                <priority>0.8</priority>
        </url>
</urlset>

Every page on your site that you want crawled gets its own <url> entry, with the address between <loc></loc> markers. You do not have to set every parameter. This would be a valid sitemap for a site with a home page and three internal pages:

<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <url>
                <loc>http://example.com/</loc>
        </url>
        <url>
                <loc>http://example.com/page.html</loc>
        </url>
        <url>
                <loc>http://example.com/page2.html</loc>
        </url>
        <url>
                <loc>http://example.com/page3.html</loc>
        </url>
</urlset>

The other parameters—lastmod, changefreq, and priority—are nice ideas, but ideas we’ve never seen have any effect. So use ’em or don’t. You can write an XML sitemap with any text editor. Just be sure to save it with “utf-8” encoding and with the name sitemap.xml. (To save in “utf-8” encoding in Notepad, click “save as” and you’ll find it in a pull-down menu at the very bottom of the box.)
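If you’d rather script it than type it, a sitemap is an easy thing to generate. Here’s a minimal sketch in Python using only the standard library; the page URLs are placeholders, so swap in your own list:

```python
# Minimal sketch: build a Sitemaps-Protocol file from a list of URLs.
# The URLs below are stand-ins for illustration.
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a sitemap <urlset> as a string, one <url> entry per page."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page
    return ET.tostring(urlset, encoding="unicode")

pages = [
    "http://example.com/",
    "http://example.com/page.html",
]

xml = '<?xml version="1.0" encoding="utf-8"?>\n' + build_sitemap(pages)

# Write the file in utf-8, per the protocol.
with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write(xml)
```

The optional tags (lastmod, changefreq, priority) could be added the same way with extra `ET.SubElement` calls if you want them.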

And wait! It can be even simpler! The Sitemaps Protocol also accepts a plain list of URLs in a text file, like this:

http://example.com/
http://example.com/page.html
http://example.com/page2.html
http://example.com/page3.html

(The file would be named “sitemap.txt” instead of “sitemap.xml” and also must be “utf-8” encoded.)

And wait again! Even simpler than that! There are a host of online tools that will turn a list of URLs into an XML sitemap, or even spider your site for you and produce the sitemap file from that.

There are just a couple of rules to be mindful of:

  • Sitemap files cannot be over 10 MB uncompressed
  • Sitemap files can be compressed as a gzip file
  • The maximum number of URLs per file is 50,000
  • Multiple sitemaps can be linked together with a sitemap index file
  • Sitemaps should not contain duplicate URLs
  • Sitemaps should be referenced in your robots.txt file using this notation:
    • Sitemap: <sitemap_location>
      (of course, “sitemap_location” would be the actual URL address of your sitemap file)
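So, for instance, if the file sits at your site’s root (using example.com as a stand-in here), the robots.txt entry would read:

```
Sitemap: http://example.com/sitemap.xml
```

The Sitemap line is independent of any User-agent block, so it can sit anywhere in the robots.txt file.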

When you have the file ready, you should use one of the many XML sitemap verification services. An invalid sitemap won’t help much.
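Even a quick local check will catch the most common mistake: a broken or mismatched tag. Here’s a sketch in Python that confirms the file parses as XML and counts its entries; note this only tests well-formedness, not full conformance to the sitemap schema, so a proper validator is still worthwhile:

```python
# Sanity check: does the sitemap parse, and how many <url> entries
# does it contain? Raises ParseError on malformed XML.
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def count_url_entries(xml_text):
    root = ET.fromstring(xml_text)          # fails loudly if broken
    return len(root.findall(SITEMAP_NS + "url"))

# A tiny inline sample; in practice you'd read your own sitemap.xml.
sample = """<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <url><loc>http://example.com/</loc></url>
</urlset>"""

print(count_url_entries(sample), "URL entries found")
```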

Should you submit the file to search engines?

You can. If your site is brand new, it might help. But if you’ve done it right—complete with an entry in the robots.txt file—you really shouldn’t have to. Google, Bing, and Yahoo all know where you live.

Other sitemap resources

Sitemaps.org
Wikipedia on Sitemaps
Google’s List of Sitemap Generators
XML Sitemap Validator