Tue, 21 Feb 2006
Per-list spam filters
In January I received more than 40,000 spam messages. Most of them were dropped by my spam filter, but the number of messages which went to my inbox is still high. I have found that my spam filter is not working efficiently especially on messages sent through the mailing lists or aliases. I think the range of message formats, languages, encodings and so on is too broad for my spam filters.
For example, in the CRM114 Mailfilter HOWTO the author writes, that when comparing the spam and non-spam database using the cssdiff utility, the databases are quite different:
Note that there's a big difference between the two files; in this case there are about 10 times as many differences between the two files as there are similarities. That's pretty much typical.
Well, I have tried to run cssdiff on my CRM114 databases, and I have about the same number of differences as the number of similarities, not ten times more differencies than similarities, as the CRM114 author had. This means that my spam is too similar to the non-spam. Or maybe some spam going through a particular mail alias is too similar to the legitimate mail from some other alias or mailing list.
I am subscribed to many mailing lists, and I am a member of some well-known mail aliases at the University. I think some of these addresses receive mail with unique features. For example, the linux-kernel mailing list receives almost no legal mail in HTML or in Czech but occasionally somebody has a signature in Spanish or Portuguese. On the other hand, the mail alias info(a)fi.muni.cz gets many messages in Czech, Slovak, HTML-encoded, containing "suspicious" words like "account number" (for an admission fee) etc. But no Spanish almost no English messages.
It would probably make sense to have a special spam classifier database for each mailing list or alias I am member of. The drawback of this approach is that each of these databases would have to be taught the new types of spam separately. Or maybe the spam corpus for each of those addresses could be shared, and only the non-spam corpus could be separate for each address. This would probably also require some special handling such as removal the mailing list headers/footers before classification and before learning. On the positive side, the per-mailing list spam corpus could be used for filtering the mail before it enters the listserver queue (for lists which I administrate).
What do you think about it? Does anybody use a separate spam filter database for each e-mail source?