Tue, 21 Feb 2006
Per-list spam filters
In January I received more than 40,000 spam messages. Most of them were dropped by my spam filter, but the number of messages that still reached my inbox is high. I have found that my spam filter works poorly especially on messages sent through mailing lists or aliases. I think the range of message formats, languages, encodings and so on is too broad for my spam filters.
For example, in the CRM114 Mailfilter HOWTO the author writes that when the spam and non-spam databases are compared using the cssdiff utility, they turn out to be quite different:
Note that there's a big difference between the two files; in this case there are about 10 times as many differences between the two files as there are similarities. That's pretty much typical.
Well, I have tried running cssdiff on my CRM114 databases, and I get about the same number of differences as similarities, not ten times more differences than similarities, as the CRM114 author had. This means that my spam is too similar to my non-spam. Or maybe some spam going through a particular mail alias is too similar to the legitimate mail from some other alias or mailing list.
I am subscribed to many mailing lists, and I am a member of some well-known mail aliases at the University. I think some of these addresses receive mail with unique features. For example, the linux-kernel mailing list receives almost no legitimate mail in HTML or in Czech, but occasionally somebody has a signature in Spanish or Portuguese. On the other hand, the mail alias info(a)fi.muni.cz gets many messages in Czech or Slovak, HTML-encoded, containing "suspicious" words like "account number" (for an admission fee) etc., but no Spanish and almost no English messages.
It would probably make sense to have a special spam classifier database for each mailing list or alias I am a member of. The drawback of this approach is that each of these databases would have to be taught the new types of spam separately. Or maybe the spam corpus could be shared by all of those addresses, and only the non-spam corpus could be separate for each address. This would probably also require some special handling, such as removal of the mailing list headers/footers before classification and before learning. On the positive side, the per-mailing-list non-spam corpus could be used for filtering the mail before it enters the listserver queue (for lists which I administer).
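To make the idea a bit more concrete, here is a minimal sketch in Python of how the database pair could be selected per source and how a list footer could be stripped before classification or learning. It is only an illustration under several assumptions: the file names (spam.css, nonspam-*.css), the choice of the List-Id and Delivered-To headers as the source key, and the rule that a footer starts at a line of underscores are all hypothetical, and the sketch does not call CRM114 itself.

```python
import re
from email import message_from_string
from email.message import Message

# Hypothetical layout: one shared spam database and one non-spam
# database per mailing list or alias. The file names are assumptions.
SHARED_SPAM_DB = "spam.css"
DEFAULT_NONSPAM_DB = "nonspam-default.css"

# Assumption: the list footer starts at a line made only of underscores.
FOOTER_RE = re.compile(r"^_{10,}\s*$", re.MULTILINE)


def source_key(msg: Message) -> str:
    """Return a file-name-safe key for the list or alias, or '' if unknown."""
    list_id = str(msg.get("List-Id", ""))
    match = re.search(r"<([^>]+)>", list_id)
    if match:
        raw = match.group(1)
    else:
        # Fall back to the delivery address (e.g. a mail alias).
        raw = str(msg.get("Delivered-To", ""))
    return re.sub(r"[^a-z0-9.-]+", "_", raw.strip().lower())


def databases_for(msg: Message) -> tuple[str, str]:
    """Pick the (spam, non-spam) database pair for this message."""
    key = source_key(msg)
    nonspam = f"nonspam-{key}.css" if key else DEFAULT_NONSPAM_DB
    return SHARED_SPAM_DB, nonspam


def strip_list_footer(body: str) -> str:
    """Drop everything from the last underscore delimiter line onwards."""
    matches = list(FOOTER_RE.finditer(body))
    if not matches:
        return body
    return body[: matches[-1].start()].rstrip() + "\n"


if __name__ == "__main__":
    raw = (
        "List-Id: Linux Kernel <linux-kernel.vger.kernel.org>\n"
        "Subject: test\n"
        "\n"
        "Patch attached.\n"
        "_______________________________________________\n"
        "To unsubscribe, see the list page.\n"
    )
    msg = message_from_string(raw)
    print(databases_for(msg))
    print(strip_list_footer(msg.get_payload()))
```

The selected pair of file names would then be handed to whatever classifier is in use, both for classification and for learning, so that the spam side stays shared while the non-spam side is specific to the source.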
What do you think about it? Does anybody use a separate spam filter database for each e-mail source?
4 replies for this story:
Milan Zamazal wrote:
Personally, I wouldn't bother with separate databases. Spam is spam, regardless of its source, and the classifier should recognize it. I'd suggest rebuilding your databases, possibly using different classification and/or learning methods. Note that CRM114 implements several classification methods, they are often improved, and the recent (January) release contains a new mailtrainer script. Just make doubly sure that you make no mistake when rebuilding the database (learning spam as non-spam and vice versa); that may confuse the classifier a lot.
Yenya wrote: Spam is spam, but ...
... but non-spam differs between various sources (and is often more consistent inside one source, such as a mailing list). So I think it may help: the classifier would then have a bigger "distance" between (general) spam and (specialized) non-spam. I'll have a look at the new CRM114. I rebuilt the databases ~2 months ago (finding a few errors inside my spam corpus, of course).
Milan Zamazal wrote:
I guess the idea may work well for better detection of ordinary messages, but it may not help with special cases. Hard to say without more knowledge about the misclassified mails. BTW, if you'd like to employ your sed skills, you may analyse the misclassifications by running crm with the `-T' option :-). In any case, I'm interested in the results, so please write about them if you have any.
Yenya wrote: Special cases
Well, it seems that some sources generate almost exclusively "special" messages. And I feel bad when teaching CRM114 that such a message is not spam, because I know that if it came through another alias/mailing list, it would definitely be spam. These sources are simply different from my other mail: consider prihlaska @ our domain - there are questions about admissions, account numbers, multipart/alternative mails (which are banned from almost every serious mailing list), etc.