
isham research

Google XML and HTML Sitemaps

Phil Payne is a Google Bionic Poster and Top Contributor based in Sheffield, UK. The opinions here are not Google's.

Sitemaps have many uses, though they have no effect on search engine ranking and are not a way to get a site indexed. Sitemaps are optional. If your problem is ranking in the SERPs, stop reading now - look at your site's content and the value the web places on it, as expressed by organic inbound links. Check also that the site conforms with Google's Quality Guidelines. There are three types of sitemap - XML, HTML and plain text.

XML Sitemaps

XML sitemaps - often just called "Google sitemaps", though now also read by Yahoo, Ask Jeeves, Bing, etc. - have been in vogue since Google launched them in June 2005. But they are optional, do not force indexing, and have no effect on ranking. So what do they do? A lot of people give them little thought - this thread from the [old] Google Webmaster Support group contains a Google employee's views at some length:

http://groups.google.com/group/Google_Webmaster_Help-Indexing/browse_thread/thread/2bace1a95e24fb87/20ded381449c7d5a#20ded381449c7d5a

There's also Google's official Q&A on the subject:

http://googlewebmastercentral.blogspot.com/2008/01/sitemaps-faqs.html

Dynamic vs. Static XML Sitemaps

Some large sites with high change rates have tried using dynamic code to generate a new sitemap each time it is downloaded. This eases the maintenance of the site and ensures the sitemap is always up-to-date, but most scripting languages are very inefficient and the operation depends on the availability of server resources. If the process takes more than a few seconds, the Googlebot may simply move on to its next task and drop the sitemap. It is trivial to arrange for static sitemaps to be rebuilt periodically using a cron job or similar - as in the sketch below.
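
As an illustration only - a sketch in Python rather than the isham research REXX generator, with an invented document root and domain - a periodic static rebuild can be a short script run from cron:

#!/usr/bin/env python3
# Sketch: rebuild a static sitemap.xml from the files on disk.
# Run from cron, e.g.  30 2 * * *  /usr/local/bin/rebuild_sitemap.py
# DOCROOT and BASE_URL are placeholders for this example.
import os, datetime
from xml.sax.saxutils import escape

DOCROOT = "/var/www/html"
BASE_URL = "http://www.example.com"

entries = []
for dirpath, _, filenames in os.walk(DOCROOT):
    for name in filenames:
        if not name.endswith(".html"):
            continue
        path = os.path.join(dirpath, name)
        lastmod = datetime.date.fromtimestamp(os.path.getmtime(path)).isoformat()
        url = BASE_URL + "/" + os.path.relpath(path, DOCROOT).replace(os.sep, "/")
        entries.append((url, lastmod))

with open(os.path.join(DOCROOT, "sitemap.xml"), "w") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    for url, lastmod in entries:
        f.write(" <url>\n  <loc>%s</loc>\n  <lastmod>%s</lastmod>\n </url>\n"
                % (escape(url), lastmod))
    f.write("</urlset>\n")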

<Priority>

Priority is the most useful of the XML sitemap values. It is the only way to express a view of a page's importance within a site - setting every page to 1.0 will NOT beat other pages on the web - it just wastes an opportunity. Priority helps decide crawl priorities - it currently has no effect on ranking.

Priority must be managed. First, it should reflect business goals - which products are most profitable, where inventory is a problem, what other campaigns are running, etc. Second, a site's priorities should be asymptotic - or "long tail" - just a very few high priority pages with most really quite low. Google's default of 0.5 is too high to allow fine definition of the upper part of the curve, so the isham research system uses a default of 0.1 - though this is easily changed. Sitemap entries are sorted by descending priority; this allows the important pages to appear first in the HTML sitemap - just in case a crawler does impose a limit - and also permits the creation of a CSV file for further analysis.
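
To make that concrete, here is a minimal sketch - Python rather than the REXX generator, with invented page data - of sorting entries by descending priority, applying the 0.1 default, and writing a CSV for analysis:

# Sketch: sort pages by descending priority (default 0.1) and dump a CSV.
# The page data below is invented for the example.
import csv

pages = [
    {"loc": "http://www.example.com/bestseller.html", "priority": 1.0},
    {"loc": "http://www.example.com/index.html",      "priority": 0.3},
    {"loc": "http://www.example.com/19990126.html"},   # no value: falls back to the default
]

for page in pages:
    page.setdefault("priority", 0.1)        # site default - deliberately low

pages.sort(key=lambda p: p["priority"], reverse=True)   # important pages first

with open("priorities.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["loc", "priority"])
    for page in pages:
        writer.writerow([page["loc"], "%.1f" % page["priority"]])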

<Lastmod>

Lastmod tells a search engine the date a page was last changed. This is useful because FTP does not normally transfer the file date to the web server - the date the Googlebot sees is the date the file was uploaded (some FTP clients, such as CuteFTP, claim to be able to set the server-side date). Many CMSes refresh entire sites regularly, so the search engines may never see the true date of a file and can waste bandwidth (theirs and the site's) crawling unchanged pages. They cannot be fooled - even if the headers say the file has changed, a checksum will show it hasn't - but lastmod is a way of showing up front that a page with a newer server-side date has not in fact changed and does not need to be crawled again.
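
Because the true date exists only on the development system, the generator has to read it from the local file before upload. A minimal sketch, assuming a Python helper rather than the actual REXX code:

# Sketch: derive <lastmod> from the local file's modification time,
# in the ISO 8601 date format the sitemap protocol expects.
import os, datetime

def lastmod(path):
    mtime = os.path.getmtime(path)                         # last edit on the development system
    return datetime.date.fromtimestamp(mtime).isoformat()  # e.g. "2007-12-31"

print(lastmod("index.html"))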

<Changefreq>

Changefreq is the least useful and most dangerous of the XML sitemap parameters and may be omitted - the search engines can work it out for themselves. Many sites have XML sitemaps with changefreq set to "hourly" (or even "always") in the hope that search engines will crawl them more often - but crawling and search ranking are not related. And if the search engine crawlers are told a page changes daily and they see it only changes once or twice a year (or in one recent case not since 1996), will this affect their trust of the sitemap and the site? Possibly the only useful value is 'never' on archive pages - saving bandwidth by asking a search engine not to bother with pages that don't change. And if such a page should change, an updated <lastmod> will get it crawled.

This is a sample URL entry for the isham research home page:

<url>
<loc>http://www.isham-research.co.uk/index.html</loc>
<lastmod>2007-12-31</lastmod>
<priority>0.3</priority>
</url>

Note that this priority is set quite low within the allowable range of 0.0 to 1.0 - only 2.9% of visitors to this site actually reach the home page. Over 80% arrive from search engines on specific landing pages.

This entry is for an archived page on the same site:

<url>
<loc>http://www.isham-research.co.uk/19990126.html</loc>
<lastmod>2006-07-03</lastmod>
<changefreq>never</changefreq>
<priority>0.0</priority>
</url>

Sadly, there is no anchor text in an XML sitemap, as there is in an HTML sitemap. And some search engines don't use them at all. So the two are not true alternatives, and every site should still have an HTML sitemap.

HTML Sitemaps

Only HTML sitemaps contain anchor text. If each page's <title> statement is used for this, it is yet another of many reasons to make page titles unique and descriptive. The isham research system adds the ISO 8601 file date to the title - this ensures that the anchor text for each target page changes each time that page is updated.

HTML sitemaps are easily generated, though care is necessary. Tools such as Xenu (ask for the report and reply "Cancel" to the password request) produce HTML code that can be cut'n'pasted to create a sitemap. For small sites up to around fifty pages, this is really enough and an XML sitemap is too much effort. HTML sitemaps can also 'level' a site so that no page is more than two levels down from the home page. Problems arise with large sites, because there are supposedly limits to the number of links on any page. A hundred was once suggested as a limit, but this was plucked from the air on the spur of the moment - it is a competitive issue and if others support more, Google will have to do the same.

The isham research system produces HTML sitemaps as above, using the <title> tag as link anchor text for each page, but instead of reflecting the tree structure, it sorts the URIs into descending priority. This means that any search engine crawler that does limit the number of links will still get the top ones.
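
A sketch of that approach - the title-grabbing regex, file names and priorities below are assumptions for illustration, not the production code:

# Sketch: build an HTML sitemap using each page's <title> as anchor text,
# sorted by descending priority so a crawler that truncates still sees the top pages.
import re

def page_title(path):
    with open(path, encoding="utf-8", errors="replace") as f:
        match = re.search(r"<title>(.*?)</title>", f.read(), re.I | re.S)
    return match.group(1).strip() if match else path

pages = [                       # (local file, public URL, priority) - example data
    ("index.html", "http://www.example.com/index.html", 0.3),
    ("19990126.html", "http://www.example.com/19990126.html", 0.0),
]

pages.sort(key=lambda p: p[2], reverse=True)

with open("sitemap.html", "w", encoding="utf-8") as out:
    out.write("<ul>\n")
    for path, url, _ in pages:
        out.write('<li><a href="%s">%s</a></li>\n' % (url, page_title(path)))
    out.write("</ul>\n")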

Plain Text Sitemaps

Plain text sitemaps are simply lists of URIs, one per line, with no extra data. They were originally designed for batch submission of pages to search engines, but explicit submission of URIs is no longer needed - it is better, and some say necessary, for the search engines to find each page organically. Once they find a site, good navigation should take them to every page. Javascript-based menus may not do this.
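
For completeness, a minimal sketch of the format, using invented URLs:

# Sketch: a plain text sitemap is just one absolute URI per line, nothing else.
urls = ["http://www.example.com/index.html",
        "http://www.example.com/19990126.html"]        # example data

with open("sitemap.txt", "w") as f:
    f.write("\n".join(urls) + "\n")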

Discovery

Discovery is a benefit of all sitemaps: they give every search engine a way to find pages that normal links would not take them to. As Google puts it:

"Sitemaps are particularly beneficial when users can't reach all areas of a website through a browseable interface. (Generally, this is when users are unable to reach certain pages or regions of a site by following links). For example, any site where certain pages are only accessible via a search form would benefit from creating a Sitemap and submitting it to search engines."

https://www.google.com/webmasters/tools/docs/en/protocol.html

The above means, of course, that sitemaps can have an effect on crawling. Many sites have pages reached only via JavaScript menus, and in the absence of a plain link to such pages a sitemap can make a search engine aware of them. Most search engines now read XML sitemaps, so once this system is set up a single change to the one sitemap reaches them all.

robots.txt (lower case is important) comes into this in two ways: a Disallow record can keep crawlers out of pages even if they appear in the sitemap, and a Sitemap record can tell every search engine where the sitemap itself lives (see the example near the end of this page).

Maintenance

Writing, uploading and submitting a sitemap is only the start. Once adopted, the sitemap should at all times reflect the status of the site. Every change to the site requires a matching change to the sitemap.

Google has its own sitemap FAQ which covers most of these issues. Code is available in .aspx and .php form to generate XML sitemaps on request, but the server has to have enough cycles available to do this quickly - the Googlebot is not very patient.

Meta Data

Sitemap meta data is a way of storing data about pages so that it does not have to be re-entered every time a page is changed or a new sitemap is created. In the design of the isham research system, three approaches were considered; the one adopted is described below.

The isham research system uses codified SGML comment statements such as:

<!-- priority 0.5 -->
<!-- changefreq never -->

These lines are added to the page immediately after the !DOCTYPE statement. The most important is priority - no other mechanism can convey the site's goals correctly to a search engine. The idea that it can be calculated automatically is fatuous - the search engines do that already, considering hundreds of parameters. Changefreq is of little value and the isham research system has no default - any value used should match the change rate the search engine spiders actually see. Lastmod is taken from the actual file date on the development system - the last date the page was edited and stored - since this date is not available from the server.

One extra tag:

<!-- ignore -->

This is used on pages that shouldn't be indexed - such pages might also include a robots noindex meta tag.
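
A sketch of how a generator might read these comments back - the regular expression, the 2 KB read limit and the example file name are assumptions, not the isham research REXX code:

# Sketch: read the codified SGML comments from the top of an HTML page.
# Recognised keys: priority, changefreq, ignore.
import re

COMMENT = re.compile(r"<!--\s*(priority|changefreq|ignore)\s*([^\s>]*)\s*-->", re.I)

def sitemap_meta(path):
    meta = {"priority": 0.1}                # site default - deliberately low
    with open(path, encoding="utf-8", errors="replace") as f:
        head = f.read(2048)                 # the comments sit just after !DOCTYPE
    for key, value in COMMENT.findall(head):
        key = key.lower()
        if key == "ignore":
            return None                     # page is left out of the sitemap entirely
        if key == "priority" and value:
            meta["priority"] = float(value)
        elif key == "changefreq" and value:
            meta["changefreq"] = value
    return meta

print(sitemap_meta("index.html"))           # e.g. {'priority': 0.5, 'changefreq': 'never'}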

Mobile Sitemaps

Mobile sitemaps describe those pages on a site that are handheld friendly. Mobile devices may dominate web use in a few years' time, and mobiles are rapidly converging with desktops. Although the USA lags in the use of handheld browsers - witness the raving about the very normal iPhone - mobile browsing is now common in Europe, with a quarter of users having surfed the web on the move. For over a tenth of young users, their mobile is their main means of accessing the web. Early attempts to adapt web content to handhelds (WAP, .mobi) are falling away as handheld browsers catch up. Google now accepts mobile sitemaps. Rather than define its own meta data, isham research uses the AvantGo browser extension meta tag:

<meta name="HandHeldFriendly" content="true">

This permits the controlled creation of mobile sitemaps. It has no effect on traffic - the AvantGo offline browser is now defunct. For backward compatibility reasons, the system also processes the equivalent PalmOS meta tag:

<meta name="palmcomputingplatform" content="true">

And once the XML sitemap is created, don't forget to add it to your robots.txt file, leaving a blank line between it and the last User-agent: record.

User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml

This will help other search engines find it without any further effort.

The isham research Sitemap Generator is written in REXX, which has now been open-sourced. There is a list of Sitemap Generators maintained by Google.

Contact by email or use mobile/SMS 07833 654800