These are the old pages from the weblog as they were published at Cornell. Visit www.allthingsdistributed.com for up-to-date entries.

March 30, 2004

Feed Analysis: Readers per Feed

I am slowly continuing to collect and process data that will eventually feed into questions about the scalability of different access methods. One of the points where I stopped in the earlier analysis was that the effectiveness of different polling mechanisms depends on the rate of change at the producer of the data. To build a model for this I needed access to a reasonably large set of feeds, so I decided to use the information registered at the 'Share your OPML' project.

I'll get back to my particular analysis later, but I first want to spend a posting or two on the particular data from this site. People have uploaded their aggregator reading lists to this site, which is different from the information at technorati or blog ecosystem, which centers around weblogs linking to each other. I assume that the folks at centralized readers such as bloglines will have similar or probably much better information. Caveat: I am not sure how representative the data is for the 'general' public.

First some basic information on the data set: 846 people have made their reading lists available, and 833 of those are 'valid enough' XML that the XML reader does not crash on them. These 833 reading lists contain 96,187 subscriptions, which averages to about 115 subscriptions per user. The subscriptions reference about 28,000 unique feeds. I am a bit vague on the uniqueness here because sometimes multiple feeds reference the same source in different formats. For example, if we combine the atom and rss feeds of Mark Pilgrim, or the 0.91, 2.0 and atom feeds of Sam Ruby, they end up at different places than in the ranking scheme currently used at the site. Sites such as computerworld, whose feed names change over time but deliver the same data, are particularly difficult to track.
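The counts above can be reproduced with a small script along the following lines. This is only a sketch of the idea, not the code I actually ran: the function names are made up for illustration, it assumes each subscription is an `outline` element carrying an `xmlUrl` attribute, and it counts a feed at most once per reading list. It makes no attempt to merge different formats of the same source.

```python
import xml.etree.ElementTree as ET


def feeds_in_opml(opml_text):
    """Return the set of feed URLs in one reading list, or None if the XML does not parse."""
    try:
        root = ET.fromstring(opml_text)
    except ET.ParseError:
        return None  # not 'valid enough' XML, skip this reading list
    urls = set()
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl")  # assumption: subscriptions carry an xmlUrl attribute
        if url:
            urls.add(url.strip())
    return urls


def summarize(opml_texts):
    """Count valid reading lists, subscriptions, and unique feeds."""
    parsed = [feeds_in_opml(text) for text in opml_texts]
    valid = [feeds for feeds in parsed if feeds is not None]
    subscriptions = sum(len(feeds) for feeds in valid)
    unique_feeds = set().union(*valid)
    average = subscriptions / len(valid) if valid else 0.0
    return {
        "uploaded": len(parsed),
        "valid": len(valid),
        "subscriptions": subscriptions,
        "unique_feeds": len(unique_feeds),
        "average_per_user": average,
    }
```

Feeding it the raw text of every uploaded file gives the same kind of summary as in the paragraph above; with the numbers reported there, 96,187 / 833 comes out at roughly 115 subscriptions per user.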

The distribution of readers over feeds is not really relevant for my particular research project, but it is somewhat interesting because of the discussion at several places after Clay Shirky posted his essay on Power Laws and Weblogs (which is a sociology discussion, not a technology one). Do we really see a power law in the popularity of feeds to read? It certainly looks that way from this data: only a very small percentage (70 of the 28K feeds) are read by more than 10% of the readers. The most popular feed is read by about 50% of the readers and feed number 500 is read by about 3%.
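For those who want to repeat the ranking on their own data, the computation is roughly the following. Again this is a minimal sketch with hypothetical names; it assumes the reading lists are already available as collections of feed URLs, one collection per reader.

```python
from collections import Counter


def readers_per_feed(reading_lists):
    """Count, for every feed, how many reading lists contain it."""
    counts = Counter()
    for feeds in reading_lists:
        counts.update(set(feeds))  # a reader counts once per feed
    return counts


def ranked_share(reading_lists):
    """Return (feed, fraction of readers) pairs, most popular feed first."""
    reading_lists = list(reading_lists)
    total_readers = len(reading_lists)
    if total_readers == 0:
        return []
    counts = readers_per_feed(reading_lists)
    return [(feed, n / total_readers) for feed, n in counts.most_common()]


def feeds_above(ranked, threshold):
    """Number of feeds read by more than the given fraction of readers."""
    return sum(1 for _, share in ranked if share > threshold)
```

Calling `feeds_above(ranked_share(lists), 0.10)` gives the number of feeds read by more than 10% of the readers, and the share at position 0 or 499 of the ranked list corresponds to the most popular feed and feed number 500.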

This graph shows only the 500 most popular feeds, but the extension of the graph is easy to imagine: from about feed #11,000 on, a feed has only 1 reader. This latter is actually more remarkable than the graph about the most popular weblogs in this particular population. The majority of the feeds tracked by this community of readers have only a few readers. A histogram of this data reveals only the massive dominance of the feeds with one or two readers, and even the CDF graph shows the usual trouble with a meaningful visual representation of the outliers.

Below is a log-log plot of the histogram data, which basically shows that even though the top 500 feeds rise above the noise, they have limited significance in the overall absolute number of feeds read.
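The histogram behind such a plot, and a crude check on how straight the log-log line is, can be sketched as follows. The names are hypothetical, the input is assumed to be a mapping from feed to its number of readers, and a least-squares fit on log-log data is only a rough indication, not a proper test for a power law.

```python
import math
from collections import Counter


def reader_count_histogram(readers_per_feed):
    """Map k to the number of feeds that have exactly k readers."""
    return Counter(readers_per_feed.values())


def loglog_slope(histogram):
    """Least-squares slope of log(number of feeds) against log(number of readers)."""
    points = [
        (math.log(readers), math.log(feeds))
        for readers, feeds in histogram.items()
        if readers > 0 and feeds > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two histogram bins to fit a slope")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx
```

A strongly negative slope over most of the range reflects the same picture as the plot: very many feeds with one or two readers, and very few feeds with many.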

Information on feeds per user can be found in the next posting.

Posted by Werner Vogels at March 30, 2004 04:30 PM