Mon, 08 Jan 2007

Spam in 2007

I've come across an interesting case of mail misclassified by our DSpam filter. Some of the reasons given by DSpam were the following:

Date*2007+11, 0.99000,
Received*Jan+2007, 0.99000,
Date*2007, 0.99000,
Received*2007+11, 0.99000,
Received*2007, 0.99000,

It seems that when DSpam was initially trained, all mail that contained the string "2007" in its Date: or Received: headers was spam (obviously, since only spam or severely misconfigured mail servers had the system date set that far in the future).
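
To illustrate why a token seen only in spam ends up pinned at 0.99, here is a minimal sketch of a Graham-style per-token estimate. It is an assumption that DSpam's own calculation resembles this; the function name, the message counts, and the neutral 0.4 value for unseen tokens are all hypothetical choices for the example.

```python
def token_spam_probability(spam_hits, ham_hits, spam_total, ham_total):
    """Graham-style spam probability for one token, clamped to [0.01, 0.99].

    Hypothetical sketch, not DSpam's actual code.
    """
    spam_freq = min(1.0, spam_hits / spam_total)
    # Ham occurrences are counted double to bias against false positives.
    ham_freq = min(1.0, 2 * ham_hits / ham_total)
    if spam_freq + ham_freq == 0:
        return 0.4  # token never seen: mildly "innocent" default (assumed)
    p = spam_freq / (spam_freq + ham_freq)
    return max(0.01, min(0.99, p))


# "Date*2007" seen in 12 spam messages and no ham during training:
before = token_spam_probability(12, 0, 1000, 1000)
# The same token after 30 legitimate January 2007 messages were learned:
after = token_spam_probability(12, 30, 1000, 1000)
```

With these made-up counts the token stays at the 0.99 ceiling until some legitimate mail carrying it is learned, after which the estimate drops quickly.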

The question is, what is the correct solution to this problem: should the four-digit number in those two headers be a hard-coded exception? Should DSpam use higher-level information (like SpamAssassin does), such as "Date: is more than 36 hours in the future"? Or should users send a few messages to the DSpam training address every year on January 1st?
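
The "higher-level information" option could look roughly like the sketch below, which compares the Date: header with the time the message was received. This is not how SpamAssassin or DSpam implements anything; the function name, the fallback for unparsable dates, and the sample message are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def date_too_far_in_future(raw_message, received_at, limit=timedelta(hours=36)):
    """Return True if the Date: header is more than `limit` after `received_at`.

    `received_at` must be a timezone-aware datetime.
    """
    header = message_from_string(raw_message)["Date"]
    if header is None:
        return False  # no Date: header, nothing to compare
    try:
        sent = parsedate_to_datetime(header)
    except (TypeError, ValueError):
        return False  # unparsable date: left to other checks (assumed policy)
    if sent.tzinfo is None:
        # A "-0000" zone yields a naive datetime; treat it as UTC.
        sent = sent.replace(tzinfo=timezone.utc)
    return sent - received_at > limit


sample = "Date: Thu, 11 Jan 2007 10:00:00 +0000\nSubject: test\n\nbody\n"
mid_2006 = datetime(2006, 6, 1, 12, 0, tzinfo=timezone.utc)
same_day = datetime(2007, 1, 11, 9, 0, tzinfo=timezone.utc)
flag_in_2006 = date_too_far_in_future(sample, mid_2006)
flag_in_2007 = date_too_far_in_future(sample, same_day)
```

A feature like this depends on the gap between the two timestamps rather than on the literal year, so it would not go stale when the calendar rolls over.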

Milan Zamazal wrote:

I think this is mostly a matter of learning strategy. First, it seems your spam database is overtrained: it's unlikely that many of the spam messages that required training were future-dated. Changes in the learning strategy may also prevent such problems. How about automated (re)learning of a "ham message of the day" every day, i.e. the ham message most different from the other messages received that day? I wouldn't like the other proposed solutions (hard-coded exceptions and higher-level information); they are complicated, human-assisted, and out of the scope of the classifier.
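
One possible reading of the "ham message of the day" idea is sketched below: among the day's messages already known to be ham, pick the one with the largest average Jaccard distance to the rest. The comment does not say how "most different" should be measured or how ham is identified, so the distance measure, the whitespace tokenization, and the function names are all assumptions.

```python
def jaccard_distance(a, b):
    """1 minus the Jaccard similarity of two token sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def ham_of_the_day(ham_messages):
    """Return the index of the message most unlike the others, or None if empty."""
    token_sets = [set(text.lower().split()) for text in ham_messages]
    count = len(token_sets)
    if count == 0:
        return None
    if count == 1:
        return 0

    def mean_distance(i):
        # Average distance from message i to every other message of the day.
        total = sum(
            jaccard_distance(token_sets[i], token_sets[j])
            for j in range(count)
            if j != i
        )
        return total / (count - 1)

    return max(range(count), key=mean_distance)


todays_ham = [
    "meeting tomorrow at ten",
    "meeting tomorrow at noon",
    "invoice for server hardware 2007",
]
chosen = ham_of_the_day(todays_ham)
```

The chosen message would then be fed back to the filter as ham, so that unfamiliar but legitimate tokens get learned soon after they first appear.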

