Mon, 08 Jan 2007
Spam in 2007
I've come across an interesting case of mail misclassified by our DSpam filter. Some of the reasons given by DSpam were the following:
Date*2007+11, 0.99000, Received*Jan+2007, 0.99000, Date*2007, 0.99000, Received*2007+11, 0.99000, Received*2007, 0.99000,
It seems that when DSpam was initially trained, all mail which
contained the string "2007" in the Date: or Received: headers was spam
(obviously, since only spam or severely misconfigured mail servers
had a system date that far in the future).
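To see how a token such as Date*2007 can end up pinned at 0.99, here is a minimal sketch of a Graham-style per-token estimate. This is an assumption about the general technique, not DSpam's actual code: the function name, the counts, and the clamping bounds of 0.01 and 0.99 are all hypothetical and chosen only to match the values in the output above.

```python
def token_spam_probability(spam_hits, ham_hits, total_spam, total_ham,
                           floor=0.01, ceiling=0.99):
    """Graham-style estimate of P(spam | token), clamped to [floor, ceiling].

    Hypothetical helper for illustration; DSpam's real formula may differ.
    """
    spam_freq = spam_hits / total_spam if total_spam else 0.0
    ham_freq = ham_hits / total_ham if total_ham else 0.0
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: no evidence either way
    p = spam_freq / (spam_freq + ham_freq)
    return min(ceiling, max(floor, p))


# A token seen only in spam (e.g. "2007" in Date: during 2006) hits the ceiling.
only_in_spam = token_spam_probability(spam_hits=12, ham_hits=0,
                                      total_spam=1000, total_ham=1000)

# Once the same token shows up equally often in ham, it becomes neutral.
after_new_year = token_spam_probability(spam_hits=12, ham_hits=12,
                                        total_spam=1000, total_ham=1000)
```

Under these assumptions the token only stops counting against a message once enough legitimate mail containing it has been learned as ham.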
The question is, what is the correct solution to this problem: should the
four-digit number in those two headers be a hard-coded exception?
Should DSpam use higher-level information (like SpamAssassin does),
such as "Date: is more than 36 hours in the future"?
Or maybe users should send a few messages to the
DSpam training address every year on January 1st?
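The higher-level rule mentioned above ("Date: is more than 36 hours in the future") could be sketched roughly as follows. This is only an illustration using Python's standard library, not SpamAssassin's implementation; the function name and the choice to treat unparseable or zone-less dates leniently are my assumptions.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def date_too_far_in_future(date_header, now=None,
                           tolerance=timedelta(hours=36)):
    """Return True if the Date: header lies more than `tolerance` after `now`.

    Hypothetical helper; unparseable dates are treated as "not in the future".
    """
    if now is None:
        now = datetime.now(timezone.utc)
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return False
    if sent.tzinfo is None:
        # Header carried no usable zone information; assume UTC.
        sent = sent.replace(tzinfo=timezone.utc)
    return sent - now > tolerance


# Seen from mid-2006, a January 2007 date is suspicious...
mid_2006 = datetime(2006, 6, 1, tzinfo=timezone.utc)
suspicious = date_too_far_in_future("Mon, 08 Jan 2007 10:00:00 +0100", now=mid_2006)

# ...but on the day itself the very same header is fine.
that_day = datetime(2007, 1, 8, 12, 0, tzinfo=timezone.utc)
normal = date_too_far_in_future("Mon, 08 Jan 2007 10:00:00 +0100", now=that_day)
```

Unlike a learned "2007" token, a rule like this is relative to the current time, so it does not go stale when the year changes.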
1 reply for this story:
Milan Zamazal wrote:
I think this is mostly a matter of learning strategy. First, it seems your spam database is overtrained: it's unlikely that many of the spam messages that required training were future-dated. Changes in learning strategy may also prevent such problems. How about automated (re)learning of a "ham message of the day" every day, i.e. the ham message most different from the other messages received that day? I wouldn't like the other proposed solutions (hard-coded exceptions and higher-level information); they are complicated, human-assisted, and outside the scope of the classifier.
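One possible reading of the "ham message of the day" idea, sketched under stated assumptions: the comment does not say how "most different" should be measured, so this example picks the message with the highest average Jaccard distance between word sets. The tokenizer, the metric, and all function names are hypothetical.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric words in a message."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def ham_of_the_day(messages):
    """Return the message least similar, on average, to the day's other ham."""
    if not messages:
        return None
    token_sets = [tokens(m) for m in messages]

    def mean_distance(i):
        others = [jaccard_distance(token_sets[i], t)
                  for j, t in enumerate(token_sets) if j != i]
        return sum(others) / len(others) if others else 0.0

    best = max(range(len(messages)), key=mean_distance)
    return messages[best]


todays_ham = [
    "meeting at noon tomorrow",
    "meeting at noon today",
    "invoice for server hosting 2007",
]
picked = ham_of_the_day(todays_ham)
```

The selected message would then be fed to the filter's ham training, so that tokens new to legitimate mail are learned shortly after they first appear.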