April 22, 2004

Feed Analysis: The Age of Feeds

There are two sets of timestamps associated with feeds: the timestamp reported by HTTP in its response, and the timestamps inside the feed file. The first can be seen as an accurate measure of when the feed file was last changed. You cannot conclude from it what type of update was made to the feed, but it is important because this is likely the timestamp the aggregator will use in its next HTTP request to initiate a conditional GET. These timestamps appear to be reliable, although some servers return times in the future, and some content management systems that generate RSS/Atom feeds dynamically always respond with the current server time. Some files are updated rather frequently, for example those generated by a CMS that adds a 'comment count' to each item.
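As a rough illustration of how the HTTP timestamp can be turned into a feed 'age' and reused for a conditional GET, here is a minimal Python sketch. The function names (parse_http_date, feed_file_age, conditional_get_headers) are hypothetical and not taken from any aggregator; the sketch assumes the server sent a Last-Modified header, and it only flags timestamps that lie in the future. Detecting a CMS that always answers with the current server time would need more than one response.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def parse_http_date(value):
    """Parse an HTTP date header value; return an aware UTC datetime or None."""
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    if parsed.tzinfo is None:
        # Assumption: a date without a usable zone is treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def feed_file_age(last_modified, now):
    """Return (age, in_future) for a Last-Modified header value.

    age is a timedelta, or None when the header cannot be parsed.
    in_future is True when the server reports a time later than `now`.
    """
    modified = parse_http_date(last_modified)
    if modified is None:
        return None, False
    age = now - modified
    return age, age < timedelta(0)


def conditional_get_headers(last_modified):
    """Build the request headers for the next conditional GET."""
    # The header value is echoed back unchanged, as the server sent it.
    return {"If-Modified-Since": last_modified}
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the age calculation repeatable when a stored crawl is re-analysed later.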

The timestamps inside feeds are a different beast altogether. <lastBuildDate> and <pubDate> are optional elements in almost all of the RSS feed specs, and although they should be RFC 822 compliant, it appears many CMS implementers interpret this rather liberally. The RSS 1.0 spec, for example, doesn't define these elements, but a CMS developer is allowed to add them, either by using the Dublin Core spec or by adding his/her own module. Of the more than 23,000 feeds I used for the analysis, only 5402 had timestamps associated with the individual items, and of those more than 300 had obvious errors (timestamps in the future, formatting errors). The usable feeds produced statistics that, in my eyes, need further analysis before I report on them (e.g. 45% of the feeds had items that were older than 6 weeks).
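The kind of check described above can be sketched as follows. This is not the code used for the analysis; classify_timestamp and summarize are hypothetical names, the category labels are my own, and only RFC 822 style dates (as in <pubDate>) are handled. Dublin Core dates use a different format and would need a separate parser.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

SIX_WEEKS = timedelta(weeks=6)


def classify_timestamp(value, now):
    """Classify one item timestamp as malformed, future, old or ok."""
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Not parseable as an RFC 822 date.
        return "malformed"
    if parsed.tzinfo is None:
        # Assumption: dates without a usable zone are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    if parsed > now:
        return "future"
    if now - parsed > SIX_WEEKS:
        return "older_than_6_weeks"
    return "ok"


def summarize(timestamps, now):
    """Count how many timestamps fall into each category."""
    return Counter(classify_timestamp(value, now) for value in timestamps)
```

A call such as `summarize(item_dates, now)` returns a Counter, so a feed can be marked as unusable as soon as the "malformed" or "future" count is non-zero.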

One of the reasons for using the feeds from the 'share your opml' project was to get the feeds that people are actually subscribed to. Of these feeds, more than 15% had not been updated in the past 6 weeks, and 25% had not been updated in the past 2 weeks or more. This means that there is a significant number of 'dead' or slowly updating feeds in people's aggregators.

This post will only report on the HTTP times, which indicate the 'age' of the feed file. I am not sure how useful the stats are at a granularity of less than a day, as that depends on the distance between my time zone and that of the feed. The first two graphs give the histogram and the CDF for the age of the feed files, with decreasing granularity in the feed files. The third graph shows the CDF with uniform granularity.
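For readers who want to reproduce this kind of graph from their own data, a histogram and CDF over feed ages can be computed roughly as below. This is a sketch only: the bin edges (1 day, 1 week, 2 weeks, 6 weeks) are an assumption chosen to illustrate bins that get coarser with age, not the bins used for the graphs, and the function names are hypothetical.

```python
from bisect import bisect_right

# Assumed bin edges in days: under 1 day, 1 to 7 days, 7 to 14 days,
# 14 to 42 days, and 42 days or more.
EDGES_DAYS = [1, 7, 14, 42]


def age_histogram(ages_days, edges=EDGES_DAYS):
    """Count ages per bin; the last bin is open-ended."""
    counts = [0] * (len(edges) + 1)
    for age in ages_days:
        # bisect_right puts an age equal to an edge into the older bin.
        counts[bisect_right(edges, age)] += 1
    return counts


def age_cdf(counts):
    """Turn bin counts into cumulative fractions of all feeds."""
    total = sum(counts)
    running = 0
    fractions = []
    for count in counts:
        running += count
        fractions.append(running / total if total else 0.0)
    return fractions
```

For the uniform-granularity variant, the same functions can be called with evenly spaced edges, for example `list(range(1, 43))` for one-day bins up to six weeks.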

