Getting Hammered--Not the Fun Kind

I'm sure it's not just me, but my Web sites are getting hammered with weird retrievals. I only notice this pattern when things bog down, and I had to install some extra monitoring tools because of how bad it had become. A few days ago, my Movable Type system went insane because someone was trying to post a massive number of comments. This didn't work--I have moderation enabled--but it did slow the entire server down. I wrote a little script to figure out how many non-robot hosts were pounding on my machines, and found that the number was pretty frightening.
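A script along these lines is one way to get that count. This is only a rough sketch, not the script I actually ran: it assumes the server writes Apache-style Combined Log Format, and the `ROBOT_MARKERS` list and the `count_non_robot_hosts` name are made up for illustration.

```python
import re
from collections import Counter

# Combined Log Format (assumed):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical substrings that mark a self-identified crawler.
ROBOT_MARKERS = ("bot", "crawler", "spider", "slurp")


def count_non_robot_hosts(lines):
    """Return a Counter of requests per host, skipping self-identified robots."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not look like Combined Log Format
        agent = match.group("agent").lower()
        if any(marker in agent for marker in ROBOT_MARKERS):
            continue  # the client says it is a robot; leave it out
        counts[match.group("host")] += 1
    return counts
```

Passing an open log file to `count_non_robot_hosts` and calling `.most_common(20)` on the result lists the heaviest hitters. Note that this only filters robots that identify themselves; a badly written script with a generic user agent still counts as a "non-robot" host, which is rather the point.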

Some out-of-control automatic retrievals--including one from a Wi-Fi firm--were engaged in very ugly behavior. The Wi-Fi firm was retrieving an RSS feed for its internal news site by performing a full request every two minutes! That's right: no HEAD or ETag-based request to see if the RSS feed had changed. Instead, it was pulling down 30K nearly 1,000 times a day for a feed that changes just a few times a day at most.
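For anyone writing a feed poller, the polite alternative is a conditional GET: send back the ETag and Last-Modified values from the previous response, and the server can answer 304 Not Modified with no body. Here is a minimal sketch using Python's standard `urllib`; the function names are my own, and it assumes the server actually emits one or both of those headers.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request carrying the validators saved from the last fetch."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None if the feed is unchanged."""
    request = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            body = response.read()
            return (
                body,
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as error:
        if error.code == 304:
            # Not modified: keep the old validators, download nothing.
            return None, etag, last_modified
        raise
```

The caller stores the returned validators between polls and passes them back in on the next call. An unchanged feed then costs a few hundred bytes of headers instead of the full 30K.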

Other behavior is more baffling. People are using Wget or lwp-trivial (the user-agent string sent by Perl's LWP::Simple module) with what must be poor scripting, retrieving the same page thousands of times. A number of incidents look like copying behavior for splogs. I'm guessing that there are automated engines that retrieve the entire contents of well-linked Web logs and algorithmically repurpose that content on automatically created fake Web logs (splogs). This is the only good explanation I can come up with.

This demonstrates part of the problem with the asymmetry of bandwidth costs. I get about 150 GB of included bandwidth each month and pay a buck a gig after that, but I think that a good 50 to 75 percent of the 500 MB of RSS retrieved each day, and of the gigs of other data, is either unnecessary (bad programming on someone's part) or illegitimate. Nobody retrieving that data is paying for it, as most of the accounts on which these activities occur are unlimited, whether DSL or T-1.
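To put rough numbers on the RSS portion alone, here is a back-of-the-envelope calculation. It assumes a 30-day month and decimal units (1,000 MB to the GB), and it covers only the 500 MB a day of feed traffic, not the gigs of other data.

```python
RSS_MB_PER_DAY = 500   # figure from the post
DAYS_PER_MONTH = 30    # assumption
MB_PER_GB = 1000       # assumption: decimal units

rss_gb_per_month = RSS_MB_PER_DAY * DAYS_PER_MONTH / MB_PER_GB

# The estimated share that is unnecessary or illegitimate: 50 to 75 percent.
wasted_low_gb = rss_gb_per_month * 0.50
wasted_high_gb = rss_gb_per_month * 0.75

print(f"RSS per month: {rss_gb_per_month} GB")
print(f"Wasted: {wasted_low_gb} to {wasted_high_gb} GB")
```

Under those assumptions the feeds account for about 15 GB a month, of which 7.5 to 11.25 GB is waste. That is a dollar a gig out of my pocket whenever the total runs past the included 150 GB.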