Yenya's World

Mon, 08 Jan 2007

Spam in 2007

I've come across an interesting case of mail misclassified by our DSpam filter. Some of the reasons given by DSpam were the following:

Date*2007+11, 0.99000,
Received*Jan+2007, 0.99000,
Date*2007, 0.99000,
Received*2007+11, 0.99000,
Received*2007, 0.99000,

It seems that when DSpam was initially trained, all mail that contained the string "2007" in the Date: or Received: headers was spam (obviously: only spam or severely misconfigured mail servers had the system date that far in the future).
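To illustrate how a token can end up pinned at 0.99, here is a minimal sketch of a Graham-style per-token estimate. It is not DSpam's actual formula; the function name, the message counts and the clamping range are assumptions chosen only to reproduce the effect described above.

```python
def token_spam_probability(spam_hits, ham_hits, total_spam, total_ham):
    """Graham-style estimate of P(spam | token), clamped to [0.01, 0.99].

    Hypothetical helper for illustration; DSpam's real computation differs.
    """
    if spam_hits + ham_hits == 0:
        return 0.5  # token never seen: treat as neutral (an assumption)
    spam_freq = spam_hits / max(total_spam, 1)
    ham_freq = ham_hits / max(total_ham, 1)
    p = spam_freq / (spam_freq + ham_freq)
    return min(0.99, max(0.01, p))


# "Date*2007" seen in 40 spam messages and no ham: pinned at the upper clamp.
before = token_spam_probability(40, 0, 1000, 1000)

# After 40 legitimate January 2007 messages have been learned as ham.
after = token_spam_probability(40, 40, 1000, 1000)
```

With these made-up counts, the token is as spammy as the clamp allows until enough ham containing "2007" has been learned, at which point it becomes neutral.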

The question is, what is the correct solution to this problem: should the four-digit number in those two headers be a hard-coded exception? Should DSpam use higher-level information (like SpamAssassin does), such as "Date: is more than 36 hours in the future"? Or maybe users should send a few messages to the DSpam training address every year on January 1st?
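The "Date: is more than 36 hours in the future" idea could look roughly like the following sketch. It is neither SpamAssassin's nor DSpam's code; the function name and the choice to treat an unparsable or zone-less Date: header conservatively are assumptions.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def date_too_far_in_future(date_header, received_at, limit=timedelta(hours=36)):
    """Return True if the Date: header is more than `limit` after `received_at`.

    `received_at` must be a timezone-aware datetime (the local receipt time).
    Hypothetical helper for illustration only.
    """
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return False  # unparsable date: this rule does not fire (an assumption)
    if sent.tzinfo is None:
        # No usable zone information in the header: assume UTC.
        sent = sent.replace(tzinfo=timezone.utc)
    return sent - received_at > limit


received = datetime(2006, 6, 1, 12, 0, tzinfo=timezone.utc)
future_dated = date_too_far_in_future("Thu, 11 Jan 2007 09:00:00 +0000", received)
normal = date_too_far_in_future("Thu, 01 Jun 2006 13:00:00 +0000", received)
```

A rule like this depends on the relative distance between the two timestamps rather than on the literal year, so it would not go stale on January 1st the way a learned "2007" token does.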


1 reply to this story:

Milan Zamazal wrote:

I think this is largely a matter of learning strategy. First, your spam database seems overtrained: it's unlikely that many of the spam messages that required training were future-dated. Changes in learning strategy may also prevent such problems. How about automated (re)learning of a "ham message of the day" every day, i.e. the ham message most different from the other messages received that day? I wouldn't like the other proposed solutions (hard-coded exceptions and higher-level information); they are complicated, human-assisted, and outside the scope of the classifier.
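One possible reading of the "ham message of the day" suggestion, as a rough sketch: pick the ham message whose set of tokens has the largest mean Jaccard distance to the other ham messages received that day. The distance measure, the whitespace tokenizer and the function name are all assumptions; the comment does not specify how "most different" should be measured.

```python
def most_atypical_ham(messages):
    """Return the index of the message least similar to the others, or None.

    `messages` is a list of message texts already classified as ham.
    Hypothetical helper for illustration only.
    """
    token_sets = [set(m.lower().split()) for m in messages]

    def jaccard_distance(a, b):
        union = a | b
        if not union:
            return 0.0  # two empty messages are considered identical
        return 1.0 - len(a & b) / len(union)

    best_index, best_score = None, -1.0
    for i, a in enumerate(token_sets):
        distances = [jaccard_distance(a, b)
                     for j, b in enumerate(token_sets) if j != i]
        score = sum(distances) / len(distances) if distances else 0.0
        if score > best_score:
            best_index, best_score = i, score
    return best_index


todays_ham = [
    "meeting at noon today",
    "meeting at noon tomorrow",
    "invoice for server hardware 2007",
]
picked = most_atypical_ham(todays_ham)
```

The selected message would then be fed to the training address automatically, so that new tokens such as a new year number are learned as ham without user intervention.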

About:

Yenya's World: Linux and beyond - Yenya's blog.

Jan "Yenya" Kasprzak