These are the old pages from the weblog as they were published at Cornell. Visit www.allthingsdistributed.com for up-to-date entries.

March 18, 2004

Continuing the Feed Analysis

I finally got the chance to do some more processing on the feed data. I am going to spread the graphs over a few postings that will follow this one.

There are always a few complications with data mining server logs that I guess everyone runs into. The most important issue is identifying unique users, so that you can correlate visits over time. Using the IP address as a unique identifier has always been wrong, but it is the only identifier that you can work with here. I don't believe multi-user machines are that pervasive anymore, but hiding multiple machines behind a firewall/gateway that does network address translation is common now. Laptops also travel places: in the simplest case a laptop has a different address at night than during the day. And not all DHCP servers do their best to re-assign the same address to a node when it re-registers.

I have chosen to make IP address + type of feed + software agent represent a unique user. This is of course only an approximation, and I am interested to hear if others have better solutions to this issue. Please keep that in mind when looking at the graphs that include the notion of 'users'.
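As a rough illustration of that approximation, here is a minimal Python sketch of grouping log hits by the (IP address, feed type, software agent) triple. The `Hit` record, its field names, and the sample values are hypothetical; the actual log format and processing code are not shown in this post.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One feed request; the field names are assumptions for this sketch."""
    ip: str
    feed_type: str   # e.g. "rss" or "atom"
    agent: str       # the User-Agent string of the aggregator
    timestamp: float # seconds since some epoch


def user_key(hit):
    """Approximate a unique user as IP address + feed type + software agent."""
    return (hit.ip, hit.feed_type, hit.agent)


def group_by_user(hits):
    """Map each approximate user to the sorted list of its request times."""
    users = defaultdict(list)
    for hit in hits:
        users[user_key(hit)].append(hit.timestamp)
    for times in users.values():
        times.sort()
    return dict(users)


# Hypothetical sample: two aggregators behind the same NAT address
# are kept apart only because their agent strings differ.
sample_hits = [
    Hit("192.0.2.10", "rss", "AggregatorA/1.0", 300.0),
    Hit("192.0.2.10", "rss", "AggregatorA/1.0", 100.0),
    Hit("192.0.2.10", "rss", "AggregatorB/2.1", 200.0),
]
users = group_by_user(sample_hits)
```

Two people behind the same gateway running the same aggregator on the same feed still collapse into one 'user' here, and a traveling laptop still splits into several, which is exactly the limitation described above.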

The other problem is a more traditional analysis problem. Almost all of the dimensions over which you can look at the data exhibit (negative) exponential distribution behavior, often with a heavy tail, which means that traditional statistics such as the average are useless as representatives of the data set. For example, the average feed poll interval for a user who sometimes switches her aggregator off at night will be dominated by this nightly gap. That is why you will hardly see any mention of averages in the graphs and other data; sometimes a median will be used instead. There are one or two graphs that do mention an average (e.g. a histogram of the average number of pulls per day per unique user), but only where the data has been vetted to ensure that it matches a distribution for which the average is a reasonable predictor.
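To make the nightly-gap example concrete, here is a small Python sketch comparing the average and the median poll interval. The numbers are invented for illustration and are not taken from the real logs: a hypothetical aggregator polls every 30 minutes for 8 hours, is switched off for 16 hours, and then does the same the next day.

```python
from statistics import mean, median

MINUTE = 60
HOUR = 60 * MINUTE


def poll_intervals(timestamps):
    """Return the gaps, in seconds, between consecutive polls."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical user: 17 polls spanning 8 hours on each of two days,
# with the aggregator switched off in between.
day_one = [i * 30 * MINUTE for i in range(17)]
day_two = [24 * HOUR + i * 30 * MINUTE for i in range(17)]

intervals = poll_intervals(day_one + day_two)

average_interval = mean(intervals)   # pulled upward by the single 16-hour gap
typical_interval = median(intervals) # stays at the configured 30 minutes

print(f"average: {average_interval / MINUTE:.1f} min")
print(f"median:  {typical_interval / MINUTE:.1f} min")
```

With these made-up numbers, 32 of the 33 intervals are exactly 30 minutes, yet the single overnight gap nearly doubles the average, while the median still reports the interval the aggregator is actually configured for.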

Posted by Werner Vogels at March 18, 2004 12:21 PM
TrackBacks

Take Outs for 18 March 2004
Excerpt: You have been Taken Out! Comments about your posting in this link. Thanks!
Weblog: Enjoy Every Sandwich
Tracked: March 18, 2004 11:03 PM