I've put a throttling mechanism into place for my RSS feed from Wi-Fi Networking News. I'll post the code soon. I use an Apache server and essentially force requests for several RSS, RDF, and Atom feed files through a script that serves them seamlessly.
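For readers wondering how feed files can be routed through a script without changing their URLs, here's a hypothetical `.htaccess` sketch; the filenames and script path are my illustrative assumptions, not the actual Wi-Fi Networking News configuration:

```apache
# Hypothetical sketch: route requests for the static feed URLs
# through a throttling CGI script (paths are placeholders).
RewriteEngine On
RewriteCond %{REQUEST_METHOD} =GET
RewriteRule ^(index\.xml|index\.rdf|atom\.xml)$ /cgi-bin/feed-throttle.cgi?feed=$1 [L]
```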
The script uses a MySQL database to record the agent name and IP address of each request. If the RSS feed hasn't changed in the last hour (or, if longer ago, since the last time the same IP and agent requested the feed), the aggregator gets a 304 (Not Modified) response instead of a full dump.
I'm willing to take a small hit in trying this (losing some RSS aggregators that don't handle this behavior correctly) in order to see whether I reduce my overall RSS feed suck.
As I noted a few days ago, my RSS traffic, the vast majority of it from Wi-Fi Networking News, averages nearly 400 MB per day. A substantial minority of that is from stupid aggregators that never check for modifications but always request the full feed.
This is my way of fooling them. We'll see if it breaks anything, or just makes it more efficient.
Later: I have some early observations about which aggregators are really, really stinky at understanding what "please don't retrieve a page because it hasn't changed" means. As I understand how Apache works, I'm only intercepting GET requests, not HEAD requests, so I'm only recording hits from aggregators that keep taking and taking and taking bandwidth.
Here are the top villains (and I'd be glad to get more information about them; please drop me a line). Some of them appear multiple times because they are being accessed from different IP addresses.
Great news/update (11/22)! The folks at XMission, whose XMission RPC Agent was one of my top offenders, responded to an email in which I asked whether they could take a look at how their engine works; they said a bug was causing this kind of repetition, and they've fixed it. What a win-win: they use less bandwidth and computational time, and I don't lose readers! I've written the SmartBarXP people and hope to get a response from them, too.
Another update (later on 11/22): Greg from NewsGator wrote to find out why I was seeing such high usage from NewsGatorOnline. It's a well-behaved 'gator, it turns out: my script captures all GET requests, and NewsGator makes all the right moves to avoid retrieving an unmodified page, but those requests are recorded in my logs as zero-byte 200 (OK) HTTP transactions. Thus NewsGatorOnline shows up with a lot of requests, but isn't pulling down traffic. Scratch 'em off the list!
| Agent name | Requests over a few hours |
| --- | --- |
| XMission RPC Agent (Fixed! 11/22) | 253 |
| NewsGatorOnline/2.0 (Not a problem, it turns out) | 19 |
| curl/7.9.8 (i386-portbld-freebsd4.6.2) libcurl 7.9.8 (OpenSSL 0.9.6g) (ipv6 enabled) | 15 |
| SharpReader/0.9.4.1 (.NET CLR 1.1.4322.2032; WinNT 5.1.2600.0) | 15 |
It looks like my next plan may be to block certain aggregators entirely by replying with an XML "pllllllllhhhhbbbbtt": a feed with a single item saying, "Please ask your aggregator's software developer to correct its behavior in using requests to determine changed syndication feeds. You will then be allowed to use this feed again." I might offend some readers, but it looks like I could save a number of gigabytes a month now, and much more in the future as usage grows. If you use RSS with Wi-Fi Networking News, please let me know if you're seeing errors, by the way.