How to Beat Bayes

Spam is an adaptive virus: we only see the successes, as more and more layers of filtering wipe out the less adaptive versions. Lately, I've been seeing an increasing amount of spam that has passed through three layers of filtering, two of them involving Bayesian notions of word frequency. This new spam contains a bunch of randomly generated, word-length text strings. The subject lines have punctuation inserted in strange places, so that the words are legible to a person but don't "read" as words to a filter. (Of course, an easy parsing solution is to normalize words and then run filters against them.)

Obviously, this is the latest end-run around the latest anti-spam innovation. It shows that Bayesian filtering, while a wonderful idea, has its limits because of spammers' cleverness and adaptability. Ultimately, these exercises show that no matter what algorithm we use, some spam will still slip through. (I'm still seeing Nigerian variants, which amazes me.)

The next approach is going to be based on digital certificates: they can't practically be forged, and they let you keep untrusted sources from connecting. If you put certificates on the mail servers -- and make sure that VeriSign isn't the only company controlling the issuing of these certificates, but that non-profits and other organizations can be root certificate authorities -- then only mail servers configured with them will be able to exchange email with other servers. It'll be tricky, but I believe the next change in the net will come that way.

Current technology and legislation aren't stopping spam. Digital certificates could dramatically reduce it because certificates can be revoked, which removes an entire mail server from the system without requiring a blacklist. (Yeah, and then who decides to revoke certificates? And on and on.)
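
For the curious, here is roughly what I mean by normalizing words before filtering. This is a minimal Python sketch, not anybody's production filter: the function names `normalize` and `tokens` are my own inventions, and the rule "drop any punctuation sandwiched between two letters or digits" is a crude assumption that will also glue together hyphenated words and domain names.

```python
import re
import unicodedata

# Punctuation (or underscores) sitting between two letters/digits.
# Assumption for illustration: anything like this inside a word is obfuscation.
_INTRA_WORD_NOISE = re.compile(r"(?<=[a-z0-9])[^a-z0-9\s]+(?=[a-z0-9])")


def normalize(text: str) -> str:
    """Fold case and look-alike characters, then strip punctuation inside words."""
    # NFKC maps compatibility forms (e.g. full-width letters) to plain ones.
    text = unicodedata.normalize("NFKC", text).lower()
    # "fr.ee" -> "free", "m-o-n-e-y" -> "money"; trailing punctuation is left alone.
    return _INTRA_WORD_NOISE.sub("", text)


def tokens(text: str) -> list[str]:
    """Split normalized text into the word tokens a frequency filter would count."""
    return re.findall(r"[a-z0-9]+", normalize(text))


if __name__ == "__main__":
    print(tokens("Fr.ee m-o-n-e-y, n0w!!"))
```

The tokens that come out of this are what you would hand to the Bayesian word-frequency stage in place of the raw subject line. It does nothing about the random filler strings, which is rather the point of the rest of this post.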