## Tue, 12 Dec 2006

## Statistical SpamAssassin

The increased amount of spam in the last few months (this month I had over 100 000 spams) made me reconsider my antispam strategy. I use an approach similar to the one described by Milan Zamazal in his article at LinuxZone: two statistical filters (in my case BogoFilter and CRM114), each of them trained by a different method. However, with spams constructed as an excerpt from the FreeBSD mailing list, accompanied by an image with the (OCR-obfuscated) text of the spam message, this is sometimes not sufficient.

I have discovered that I can classify those "unsure" messages just by looking at their line in mutt's inbox view: they have a different color, telling me that my statistical filters did not agree with each other, and from the color, the sender's name and the Subject line, I can classify them. So I had an idea: add a third statistical filter, trained on the `From:` and `Subject:` lines, MIME-decoded and written in UTF-8. So far it seems that all messages classified by the previous filters as "unsure" were correctly classified by my Subject+From filter.
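
A minimal sketch of how the input for such a Subject+From filter could be prepared, assuming Python's standard `email` package. The function name `headers_for_filter` is hypothetical, and this is an illustration rather than the script actually used here.

```python
from email import message_from_bytes
from email.errors import HeaderParseError
from email.header import decode_header, make_header


def headers_for_filter(raw_message: bytes) -> str:
    """Return the From: and Subject: lines with MIME encoded-words decoded.

    The result is a Python string; encode it as UTF-8 before piping it
    to a statistical filter.
    """
    msg = message_from_bytes(raw_message)
    lines = []
    for name in ("From", "Subject"):
        value = str(msg.get(name, ""))
        try:
            # decode_header splits the value into (bytes, charset) parts,
            # make_header joins them back into one Unicode header.
            decoded = str(make_header(decode_header(value)))
        except (HeaderParseError, LookupError, UnicodeError):
            # Spam often carries broken or unknown encodings;
            # fall back to the raw header text in that case.
            decoded = value
        lines.append(f"{name}: {decoded}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    import sys

    text = headers_for_filter(sys.stdin.buffer.read())
    sys.stdout.buffer.write(text.encode("utf-8", "replace"))
```

Run as a script, this reads one message on standard input and writes the two decoded lines as UTF-8, so its output could be piped to whatever filter is trained on them.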

However, there is a problem: I need to change the whole system of how my statistical filters work. I now have three filters instead of two, and I have to compute the final value from all of them (and probably automatically retrain the one which did not agree with the rest). And I can imagine that in the future I will add more and more heuristics the same way I have added my Subject+From filter (filtering based on the first few lines of the message only, or a Bayesian filter with a different tokenization, making a token from two adjacent words instead of one, to defend against text composed of randomly chosen words). So how do I add more and more heuristics, some of them trainable?

SpamAssassin would probably be the first choice. However, manually adjusting the weights of SA's rules is not easy, and the result is not immediately visible. Also, I have no idea how good my statistical filters are. And the last problem: there surely is a difference between Bogofilter saying "This message has a 47.9% probability of being spam" and the probability being 0.01%. So the best way would probably be to retrain the particular statistical filters, and probably adjust the total weights of those statistical filters against each other. SpamAssassin is not good for this, because it allows binary rules only, and it allows rules with an insane amount of weight (such as blacklists/whitelists).

Maybe I should use some kind of *likelihood ratio* function, like in the Naive Bayes classifier, to evaluate the different heuristics or statistical methods together. [ Taken as a logarithm, the likelihood ratio is a function with values from -∞ to ∞, where zero means "unsure". These values can be summed to compute the total likelihood, and multiplied by a constant to give them a different weight. ] Using this would allow me to "meta-train" the whole filter, adjusting the weights of the different heuristics.
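
The summing and weighting described above could be sketched as follows. This is only an illustration under stated assumptions: each filter is assumed to report a spam probability between 0 and 1, that probability is turned into log-odds (which equals a log-likelihood ratio only if spam and ham are taken as equally likely beforehand), and the filters are treated as independent, as in Naive Bayes. The names `log_odds`, `combine` and `verdict`, the filter keys and the threshold of 2.0 are all hypothetical.

```python
import math
from typing import Mapping


def log_odds(p: float, eps: float = 1e-6) -> float:
    """Map a spam probability to (-inf, inf); 0.5 maps to 0, i.e. "unsure"."""
    # Clamp so that a filter reporting exactly 0 or 1 cannot produce infinity.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def combine(probabilities: Mapping[str, float],
            weights: Mapping[str, float]) -> float:
    """Weighted sum of the per-filter log-odds; missing weights default to 1."""
    return sum(weights.get(name, 1.0) * log_odds(p)
               for name, p in probabilities.items())


def verdict(score: float, unsure_band: float = 2.0) -> str:
    """Turn the total score into spam / ham / unsure."""
    if score > unsure_band:
        return "spam"
    if score < -unsure_band:
        return "ham"
    return "unsure"


if __name__ == "__main__":
    # Bogofilter is nearly undecided here, the Subject+From filter is not.
    scores = {"bogofilter": 0.479, "crm114": 0.5, "subject_from": 0.99}
    weights = {"bogofilter": 1.0, "crm114": 1.0, "subject_from": 1.0}
    print(verdict(combine(scores, weights)))
```

With this shape, a 47.9% answer contributes almost nothing while a 0.01% answer pulls strongly towards ham, and "meta-training" would amount to adjusting the entries of `weights` on a corpus of already classified mail.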

What do you think about these two ideas (Subject+From filter, and likelihood-based evaluation of different heuristics or statistical filters), my dear lazyweb? How should future spam detection software work?

## 3 replies for this story:

### Milan Zamazal wrote:

I'd say your ideas make sense, but you should be careful if you want to rely on them absolutely (what if a ham sender uses "hi" or "advice" in the subject, or if one of the classifiers makes a horrible mistake?). Anyway, the whole system looks overengineered. I abandoned the two-filter approach a long time ago. I use crm114 as the only statistical filter now and I'm satisfied with it. If your filters disagree so often that it is necessary to do something about it, then there is probably something wrong. I'd suggest trying to tune crm114 instead of expanding the filtering system. Don't forget there are several different classifying methods in crm114 and that the way the classifier learns matters a lot. Build some corpus, try various crm114 methods of classification and learning on it, and then choose the best one.

### Yenya wrote:

To get a false positive, all three classifiers would have to fail (one of them saying "unsure" instead of "ham"). So this is not a problem. I think crm114 is one of the weakest points of my filter wrt. false positives: with TOE (or TUNE) it often makes a mistake just after being trained on a new message. With 100_000+ spams over the last months, I cannot even read my spambox effectively. There should be no false positives at all, except maybe some web-site registration requests, which are strangely formatted anyway :-)

### Milan Zamazal wrote:

Achieving no false positives together with a manageable amount of unsure messages is a difficult goal. If you reduced your requirement to "almost no false positives", you could fight false negatives much more aggressively (this is what I do, and I can handle my ~10000 incoming spams per month relatively easily). Or you can try to classify incoming messages according to their importance, and handle important messages more carefully and less important messages more aggressively. Perhaps the effects of the various voting systems you try to implement can be evaluated using your statistical data and probability theory -- exact ways may sometimes lead to better results than experiments, although we often tend to consume CPU cycles rather than use math :-). Also, to reduce the amount of spam, don't subscribe to public mailing lists and use Gmane instead. As for crm114, there is more to it than just TOE/TUNE -- e.g. different classifying methods (they can be very different in their principle!), setting proper score bounds for learning, etc. I use a simple one-time learning of unsure messages and errors. BTW, my observation is that my crm114 setup tends to make mistakes in groups, i.e. it usually either makes no errors for some time or it makes several errors in a short time, like a pupil who gets confused if you correct his mistake. And finally another idea: thinking about the recent spam floods (a lot of spams with similar subjects), how about automatically splitting the mails to be reviewed into several groups of similar messages? Then you could review/kill them much more easily.