An RSS Conversion Site

I'm toying with building a site that would help people who use RSS entirely for things they scan but don't need to read (and don't care if they miss parts of) transition away from RSS.

The notion is that every site (excluding personal blogs, really) with an RSS feed that updates regularly almost certainly also has a Twitter feed, possibly an App.net feed or a mailing list, or some combination. Email lists stink for frequently delivered routine stuff, but for many of the RSS feeds I follow, I'd be better off getting a monthly newsletter than daily headlines.


The Latest Glenn Blog: Regular Sucking Schedule

Sure, the title of this new blog I started could apply to my little boy, a regular sucker, but I'm really talking about RSS and aggregator behavior and the bandwidth, scaling, and cost behind it.

I've started Regular Sucking Schedule to try to pull together information from many sources and report on my own experiments. I'll also post code there under a Creative Commons license (and hope that others improve and contribute to it) as I write it and clean it up.

Please write me with RSS issues that I can link to, or respond to my posts on the matter!

Throttling RSS Seems to Work

On Nov. 13, I posted a graph showing the fast growth in requested bytes for RSS and similar feeds from my Wi-Fi Networking News site and a few (much smaller) other sites. Bandwidth usage had grown from the mid-200 MB per day range to an average of about 350 MB per day. During the same period, I wasn't seeing an increase in visitors on that scale--maybe 10 to 20 percent, not 75 percent.

After analyzing logs, I discovered that a small percentage of aggregation sites and servers were unnecessarily requesting as much as 20 to 30 percent of the bandwidth through aggressive downloads that didn't use If-Modified-Since headers or other mechanisms to avoid retrieving a page that hadn't changed.
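
For contrast, here's roughly what a well-behaved aggregator does on every poll--a minimal sketch of a conditional GET (the URL and date below are placeholders, not values from my logs):

    import urllib.error
    import urllib.request

    FEED_URL = "https://example.com/index.xml"        # placeholder feed URL

    # Last-Modified value saved from the previous successful fetch (placeholder date).
    last_modified = "Sat, 13 Nov 2004 08:00:00 GMT"

    request = urllib.request.Request(FEED_URL)
    # A polite aggregator sends If-Modified-Since so the server can answer with
    # a tiny 304 instead of the full feed when nothing has changed.
    request.add_header("If-Modified-Since", last_modified)

    try:
        with urllib.request.urlopen(request) as response:
            body = response.read()                    # full feed: something changed
            last_modified = response.headers.get("Last-Modified", last_modified)
    except urllib.error.HTTPError as err:
        if err.code != 304:
            raise
        # 304 Not Modified: only a handful of bytes crossed the wire.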

I built a simple program running via Apache that throttles RSS downloads: a given IP and user agent combination can only request a given RSS feed file if it's changed since they last retrieved it. Pretty simple. But the effects are profound, as this graph shows.

[Chart: RSS bandwidth, November through December]

As you can see, I threw the switch on Nov. 20, just before Thanksgiving, but I haven't seen a real decline in readership at my Wi-Fi site or the other sites--just a decline in bandwidth. The average (with lots of posts over the last week or so, meaning more RSS retrievals because of the updates) is back to about 200 MB.

This reveals a lot about the sloppiness of some of the aggregators out there. Right now, my top aggregator is Mozilla (Firefox, primarily), which makes perfect sense: a lot of people use the RSS button in Firefox to subscribe to my feed, and if it's the top agent, that's because of many unique users.

Since I pay by the gigabyte for overages above my minimum (which I've hit), this change will save me a reasonable pittance: probably $10 or $12 per month. Sounds like someone needs to build a master site for testing aggregation competence, so that aggregator software developers can test against it and users and Web site operators can report results back to developers.

Throttling RSS

I've put into place a throttling mechanism for my RSS feed from Wi-Fi Networking News. I'll post the code soon. I use an Apache server and am essentially forcing several RSS, RDF, and Atom feed files to be retrieved seamlessly through a script.

The script uses a MySQL database to record the agent name and IP address of each request. If the RSS feed hasn't changed in the last hour--or, if it's been longer, since the last time the same IP and agent requested the feed--the aggregator gets a 304 (Not Modified) instead of a full dump.
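
I'll post the real code soon; in the meantime, here's a rough sketch of the logic in Python, with an invented table and column names and without the hour-long window, just to show the shape of the thing:

    #!/usr/bin/env python3
    """Rough sketch of the throttle idea, not the script I actually run.

    Serves a feed file through CGI, answering 304 to any IP + user-agent
    pair that has already received the current version of the file. The
    feed_hits table (ip, agent, feed, served_mtime) is invented for this
    sketch and assumes a unique key on (ip, agent, feed).
    """
    import os
    import sys

    import MySQLdb  # assumes the MySQLdb (mysqlclient) driver is installed

    FEED_PATH = "/var/www/feeds/index.xml"   # stand-in path to the feed file

    ip = os.environ.get("REMOTE_ADDR", "unknown")
    agent = os.environ.get("HTTP_USER_AGENT", "NONE")
    mtime = os.path.getmtime(FEED_PATH)

    db = MySQLdb.connect(db="rssthrottle")
    cur = db.cursor()
    cur.execute(
        "SELECT served_mtime FROM feed_hits WHERE ip=%s AND agent=%s AND feed=%s",
        (ip, agent, FEED_PATH),
    )
    row = cur.fetchone()

    if row is not None and mtime <= float(row[0]):
        # This client already has the current version: send a tiny 304.
        sys.stdout.write("Status: 304 Not Modified\r\n\r\n")
    else:
        # New client, or the feed changed since their last fetch: full dump,
        # and remember what they got so the next request can be throttled.
        with open(FEED_PATH, "rb") as fh:
            body = fh.read()
        cur.execute(
            "REPLACE INTO feed_hits (ip, agent, feed, served_mtime) "
            "VALUES (%s, %s, %s, %s)",
            (ip, agent, FEED_PATH, mtime),
        )
        db.commit()
        sys.stdout.write("Content-Type: application/rss+xml\r\n")
        sys.stdout.write("Content-Length: %d\r\n\r\n" % len(body))
        sys.stdout.flush()
        sys.stdout.buffer.write(body)

The select-compare-serve pattern is the whole trick; everything else is bookkeeping.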

I'm willing to take a small hit in testing this--losing some RSS aggregators that don't interpret the behavior correctly--in order to see whether I reduce my overall RSS feed suck.

As I noted a few days ago, my RSS feeds--the vast majority of the traffic coming from Wi-Fi Networking News--average nearly 400 MB per day. A substantial minority of that comes from stupid aggregators that don't check for modifications but always request the full feed.

This is my way of fooling them. We'll see if it breaks anything, or just makes it more efficient.

Later: I have some early observations about which aggregators are really, really stinky at understanding what "please don't retrieve a page because it hasn't changed" means. I'm only intercepting GET requests, not HEAD requests, given how Apache works as I understand it, so I'm only recording hits from aggregators that keep taking and taking and taking bandwidth.

The top villains are listed below (and I'd be glad to get more information about them -- please drop me a line). The fact that some of these appear multiple times means that they are being run from different IP addresses.

Great news/update (11/22)! The folks at Xmission, whose Xmission RPC Agent was one of my top offenders, responded to some email I wrote asking whether they could take a look at how their engine works, and they said a bug was causing this kind of repetition, which they've fixed. What a win-win: they use less bandwidth and computational time, and I don't lose readers! I've written the SmartBarXP people and hope to get a response from them, too.

Another update (later on 11/22): Greg from NewsGator wrote to find out why I was seeing such high usage from NewsGatorOnline. It's a well-behaved 'gator, it turns out: my script captures all GET requests, and NewsGator makes all the right moves to avoid retrieving a non-modified page, but those requests are recorded in my logs as zero-byte 200 (OK) HTTP transactions. Thus NewsGatorOnline shows up with a lot of requests but isn't pulling down traffic. Scratch 'em off the list!

Agent name | Requests over a few hours
XMission RPC Agent (Fixed! 11/22) | 253
NewsFire/0.28 | 36
SmartBarXP WinInet | 27
SmartBarXP WinInet | 22
NewsGatorOnline/2.0 (Not a problem, turns out) | 19
NewsFire/0.28 | 17
SmartBarXP WinInet | 16
SmartBarXP WinInet | 16
SmartBarXP WinInet | 16
SmartBarXP WinInet | 16
SmartBarXP WinInet | 15
SmartBarXP WinInet | 15
curl/7.9.8 (i386-portbld-freebsd4.6.2) libcurl 7.9.8 (OpenSSL 0.9.6g) (ipv6 enabled) | 15
SmartBarXP WinInet | 15
lwp-trivial/1.35 | 15
SharpReader/0.9.4.1 (.NET CLR 1.1.4322.2032; WinNT 5.1.2600.0) | 15
SmartBarXP WinInet | 15
SmartBarXP WinInet | 15
SmartBarXP WinInet | 14
Oddbot/1.0 (+http://oddpost.com/oddbot.html) | 13
NONE | 12
IdeareNews/0.8 | 12
NewsFire/0.28 | 12
FeedOnFeeds/0.1.7 (+http://minutillo.com/steve/feedonfeeds/) | 11

It looks like my next plan may be to entirely block certain aggregators by replying with an XML "pllllllllhhhhbbbbtt" and an item encoding a note that says, "Please ask your aggregator's software developer to correct behavior in using requests to determine changed syndication feeds. You will then be allowed to use this feed again." I might offend some readers, but it looks like I could save a number of gigabytes a month now, and much more in the future as usage grows. If you use RSS with Wi-Fi Networking News, please let me know if you're seeing errors, by the way.
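
If I do go that route, the blocked reply would look something like this hypothetical sketch--a valid RSS 2.0 feed whose only item carries the note (the title and link are placeholders, not my real feed's values):

    # Hypothetical sketch of the "you're blocked" reply: a minimal RSS 2.0 feed
    # whose only item asks the reader to get their aggregator fixed. The channel
    # title and link are placeholders.
    import sys

    BLOCKED_FEED = """<?xml version="1.0" encoding="UTF-8"?>
    <rss version="2.0">
      <channel>
        <title>Feed temporarily blocked</title>
        <link>http://example.com/</link>
        <description>pllllllllhhhhbbbbtt</description>
        <item>
          <title>Please update your aggregator</title>
          <description>Please ask your aggregator's software developer to correct
            behavior in using requests to determine changed syndication feeds.
            You will then be allowed to use this feed again.</description>
        </item>
      </channel>
    </rss>
    """

    def respond_blocked() -> None:
        # Emits the placeholder feed as a normal CGI response.
        sys.stdout.write("Content-Type: application/rss+xml\r\n\r\n")
        sys.stdout.write(BLOCKED_FEED)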

What Bandwidth RSS Uses

I ran the calculations on how much bandwidth RSS aggregators are sucking from my Web server by scanning for retrievals of files named index.xml, index.rdf, rss.xml, atom.xml, and scriptingnews2.xml. I looked at just the HTTP 200 transactions, not the 304 (Not Modified) responses, which are just a handful of bytes each. The chart is below:
[Chart: rss_bandwidth_041113]
Most of this RSS traffic is for Wi-Fi Networking News; a tiny fraction is for blog.glennf.com and a few other blogs. You can see the growth and the weekends pretty obviously--the weekend dips make sense, since I'm least likely to post updates then, so well-behaved RSS aggregators are least likely to get changed files, while ill-behaved ones are more likely to be running on computers that are turned off for the weekend. In early October, the weekday average was about 275 MB; in mid-November, we're up to 375 MB. (BoingBoing linked to this post with their own stats: they feed 50 GB per month to news aggregators--that's more than they ship in HTML!)
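
For anyone who wants to run the same numbers, the tally works roughly like this--a sketch, not the exact script I used, and the log path is a placeholder:

    # Bytes per day served for the feed files above, counting only status-200
    # GETs from an Apache combined-format access log.
    import re
    from collections import defaultdict

    LOG_PATH = "/var/log/apache/access_log"   # placeholder
    FEEDS = ("index.xml", "index.rdf", "rss.xml", "atom.xml", "scriptingnews2.xml")

    # e.g.: 10.0.0.1 - - [13/Nov/2004:06:25:01 -0800] "GET /index.xml HTTP/1.1" 200 45123 "-" "SomeReader/1.0"
    line_re = re.compile(
        r'\[(?P<day>[^:]+):[^\]]*\] "GET (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\d+)'
    )

    bytes_per_day = defaultdict(int)
    with open(LOG_PATH) as log:
        for line in log:
            m = line_re.search(line)
            if m and m.group("status") == "200" and m.group("path").endswith(FEEDS):
                bytes_per_day[m.group("day")] += int(m.group("size"))

    for day, total in bytes_per_day.items():   # log order, i.e. chronological
        print(f"{day}  {total / 1_048_576:.1f} MB")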

Now my co-location host, digital.forest, has great bandwidth pricing: a buck a gig over the 80 GB per machine that I have co-located. I transfer about 2 GB per day in Web site traffic from the machine, and it's now pushing out nearly half a gig per day in RSS traffic alone. I may have to build a custom RSS Apache doohickey that will force a 304 (no change) on an RSS aggregator if it doesn't include an If-Modified-Since header in its request.

I did a quick look at which aggregators represent the most traffic, and a very small number of users employing lwp-trivial, a Perl-based HTTP client, appear to be using over 10 percent of my RSS bandwidth! Time to fix their wagons, to be sure. It makes sense that the various Mozilla browsers with RSS support account for about 15 percent. NetNewsWire makes a very strong showing of 10 percent of usage lately. (Click the image to see the full-sized chart; I dropped out days on which an aggregator retrieved less than 7 MB, which is why you see some gaps. You can also see NetNewsWire's beta 6 adoption curve. If you have a better way to graph this, I'm open to it: click here to download the Excel file that generated this chart.)
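
The per-agent breakdown comes from the same kind of pass over the log, grouped on the User-Agent string instead of by day--again just a sketch with a placeholder log path:

    # Same pass over the log, grouped by User-Agent string, to see which
    # aggregators account for the biggest share of feed bytes.
    import re
    from collections import defaultdict

    LOG_PATH = "/var/log/apache/access_log"   # placeholder
    FEEDS = ("index.xml", "index.rdf", "rss.xml", "atom.xml", "scriptingnews2.xml")
    line_re = re.compile(
        r'"GET (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\d+) "[^"]*" "(?P<agent>[^"]*)"'
    )

    bytes_per_agent = defaultdict(int)
    with open(LOG_PATH) as log:
        for line in log:
            m = line_re.search(line)
            if m and m.group("status") == "200" and m.group("path").endswith(FEEDS):
                bytes_per_agent[m.group("agent")] += int(m.group("size"))

    total = sum(bytes_per_agent.values()) or 1
    for agent, size in sorted(bytes_per_agent.items(), key=lambda kv: -kv[1])[:20]:
        print(f"{100 * size / total:5.1f}%  {agent}")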

I can tell that Mozilla derivatives like Firefox, along with NetNewsWire, are well behaved, because the bandwidth-abusing aggregators don't drop their traffic much on weekends, while the well-behaved ones drop by about 80 percent. This could also indicate that the poorly behaved ones are more likely to be running on servers instead of personal computers.