Yenya's World

Tue, 12 Dec 2006

Statistical SpamAssassin

The increased amount of spam in the last few months (this month I had over 100 000 spams) made me reconsider my antispam strategy. I use an approach similar to the one described by Milan Zamazal in his article at LinuxZone: two statistical filters (in my case BogoFilter and CRM114), each of them trained by a different method. However, with spams constructed as an excerpt from the FreeBSD mailing list, accompanied by an image with the (OCR-obfuscated) text of the spam message, this is sometimes not sufficient.

I have discovered that I can classify those "unsure" messages just by looking at their line in mutt's inbox view - they have a different color, telling me that my statistical filters did not agree with each other, and from the color, the sender's name, and the Subject line I can classify them. So I had an idea: add a third statistical filter, trained on the From: and Subject: lines, MIME-decoded and written in UTF-8. So far it seems all messages classified by the previous filters as "unsure" were correctly classified by my Subject+From filter.
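Preparing the input for such a Subject+From filter could look roughly like the sketch below. It uses Python's standard `email` package to MIME-decode the two headers and emit them as UTF-8; the function names are made up for illustration, and how the result is fed to the third filter is left open.

```python
from email import message_from_string
from email.header import decode_header, make_header


def subject_from_text(raw_message: str) -> str:
    """Return the From: and Subject: headers, MIME-decoded, one per line."""
    msg = message_from_string(raw_message)
    lines = []
    for name in ("From", "Subject"):
        raw_value = msg.get(name, "")
        # decode_header() splits the value into (chunk, charset) pairs;
        # make_header() joins them back into one Unicode header.
        decoded = str(make_header(decode_header(raw_value)))
        lines.append(f"{name}: {decoded}")
    return "\n".join(lines)


def subject_from_utf8(raw_message: str) -> bytes:
    """Same text as UTF-8 bytes, ready to be piped to a statistical filter."""
    return subject_from_text(raw_message).encode("utf-8")
```

A real setup would also have to cope with malformed headers and unknown charsets, which spam contains quite often; this sketch does not handle them.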

However, there is a problem: I need to change the whole system of how my statistical filters work. I now have three filters instead of two, and I have to compute the final value from all of them (and probably automatically re-train the one which did not agree with the rest). And I can imagine that in the future I will add more and more heuristics the same way I have added my Subject+From filter (filtering based on the first few lines of the message only, or a Bayesian filter with a different tokenization - making a token from two adjacent words instead of one, as a defense against text composed of randomly chosen words). So how to add more and more heuristics, some of them with the possibility of training?
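The two-adjacent-words tokenization mentioned above can be sketched in a few lines. This is only an illustration of the idea, not how BogoFilter or CRM114 tokenize; the function name and the choice of `\w+` as the word pattern are assumptions.

```python
import re


def word_pair_tokens(text: str) -> list:
    """Return tokens made of each pair of adjacent words, lowercased."""
    words = re.findall(r"\w+", text.lower())
    # zip(words, words[1:]) yields (w0, w1), (w1, w2), ...
    return [f"{first} {second}" for first, second in zip(words, words[1:])]
```

Randomly chosen words rarely form pairs that the filter has already seen in ham, so a filter trained on such tokens should be harder to fool with word salad than one trained on single words.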

SpamAssassin would probably be the first choice. However, manually adjusting the weights of SA's rules is not easy, and the result is not immediately visible. Also I have no idea how good my statistical filters are. And the last problem - there surely is a difference between Bogofilter saying "This message has probability 47.9% of being a spam" and the probability being 0.01%. So the best way would probably be to re-train the particular statistical filters, and probably adjust the total weights of those statistical filters against each other. SpamAssassin is not good for this, because it allows binary rules only, and it allows rules with an insane amount of weight (such as blacklists/whitelists).

Maybe I should use some kind of likelihood ratio function, like in the Naive Bayes classifier, to evaluate the different heuristics or statistical methods together. [ The logarithm of the likelihood ratio is a function with values from -∞ to ∞, where zero means "unsure". These values can be summed to compute the total log-likelihood ratio, and multiplied by a constant to give them a different weight. ] Using this would allow me to "meta-train" the whole filter, adjusting the weights of the different heuristics.
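A minimal sketch of this kind of evaluation follows. It assumes every filter reports a spam probability between 0 and 1 and treats that probability as if it were calibrated, so its log-odds stand in for the log-likelihood ratio (which is exact only under equal priors). The filter names, the weights, and the threshold of 2.0 are made-up values for illustration, not tuned ones.

```python
import math


def log_odds(p: float, eps: float = 1e-6) -> float:
    """Map a spam probability to (-inf, inf); 0.5 maps to 0, i.e. "unsure"."""
    p = min(max(p, eps), 1.0 - eps)  # clamp so that 0 and 1 stay finite
    return math.log(p / (1.0 - p))


def combined_score(probabilities: dict, weights: dict) -> float:
    """Weighted sum of the per-filter log-odds; a missing weight counts as 1."""
    return sum(
        weights.get(name, 1.0) * log_odds(p)
        for name, p in probabilities.items()
    )


def verdict(score: float, threshold: float = 2.0) -> str:
    """Turn the summed score into spam / ham / unsure."""
    if score > threshold:
        return "spam"
    if score < -threshold:
        return "ham"
    return "unsure"
```

With this scheme a Bogofilter result of 47.9% contributes almost nothing (about -0.08), while 0.01% contributes about -9.2, so a single confident filter can outvote an undecided one. "Meta-training" would then mean adjusting the entries of `weights` against a corpus of already classified mail.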

What do you think about these two ideas (the Subject+From filter, and the likelihood-based evaluation of different heuristics or statistical filters), my dear lazyweb? How should future spam detection software work?

Section: /computers

3 replies for this story:

Milan Zamazal wrote:

I'd say your ideas make sense, but you should be careful if you want to rely on them absolutely (what if a ham sender uses "hi" or "advice" in the subject, or if one of the classifiers makes a horrible mistake?). Anyway, the whole system looks overengineered. I abandoned the two-filter approach a long time ago. I use crm114 as the only statistical filter now and I'm satisfied with it. If your filters disagree so often that something has to be done about it, then there is probably something wrong. I'd suggest trying to tune crm114 instead of expanding the filtering system. Don't forget there are several different classifying methods in crm114 and that the way the classifier learns matters a lot. Build some corpus, try various crm114 methods of classification and learning on it, and then choose the best one.

Yenya wrote:

To get a false positive, all three classifiers would have to fail (one of them saying "unsure" instead of "ham"). So this is not a problem. I think crm114 is one of the weakest points of my filter wrt. false positives - with TOE (or TUNE) it often makes a mistake just after being trained on a new message. With 100 000+ spams over the last months, I cannot even read my spambox effectively. There should be no false positives at all, except maybe some web-site registration requests, which are strangely formatted anyway :-)

Milan Zamazal wrote:

Achieving no false positives together with a manageable amount of unsure messages is a difficult goal. If you reduced your requirement to "almost no false positives", you could fight false negatives much more aggressively (this is what I do, and I can handle my ~10000 incoming spams per month relatively easily). Or you can try to classify incoming messages according to their importance and handle important messages more carefully and less important ones more aggressively. Perhaps the effects of the various voting systems you try to implement can be evaluated using your statistical data and probability theory -- exact ways may sometimes lead to better results than experiments, although we often tend to consume CPU cycles rather than use math :-). Also, to reduce the amount of spam, don't subscribe to public mailing lists and use Gmane instead. As for crm114, there is more to it than just TOE/TUNE -- e.g. different classifying methods (they can be very different in their principle!), setting proper score bounds for learning, etc. I use a simple one-time learning of unsure messages and errors. BTW, my observation is that my crm114 setup tends to make mistakes in groups, i.e. it usually either makes no errors for some time or it makes several errors in a short time, like a pupil who gets confused if you correct his mistake. And finally another idea: thinking about the recent spam floods (a lot of spams with similar subjects), how about automatically splitting the mails to be reviewed into several groups of similar messages? Then you could review/kill them much more easily.


Mon, 11 Dec 2006

3D Desktop

The Linux weekend was interesting even for me, although the intended audience was people who are not familiar with Linux (I think this was an organizational mistake). One of the most interesting presentations was about the window manager named Beryl. At first I thought it was interesting but useless eye candy, but after discovering that there are Beryl packages in Fedora Extras, I decided to give Beryl a try.

I have installed it on my laptop, which has a GPU supported by X.org even with 3D acceleration. It took me a nontrivial amount of time to configure it to do exactly what I want, but I was pretty surprised that the things I want from a window manager are either doable with Beryl, or even are Beryl's default behaviour. For example, I want to have a "maximize" button in the window decoration which maximizes the window when pressed with the left mouse button, does a vertical maximize with the middle button, and does a horizontal maximize with the right button - this is exactly what Beryl does by default.

Beryl surely needs further development: with the virtual desktop plane (as opposed to the desktop cube) there is no keyboard shortcut for "switch to the virtual desktop on the left and bring the currently focused window with me". Also, the communication with the GNOME desktop switcher (the panel applet) is weak with both the desktop cube and the desktop plane.

I find it hard to think about the virtual desktops as "desktop #1", "desktop #2", etc. I have a 3x3 plane instead, and I think about the desktops as "the upper left desktop", or "the desktop to the left of the current one". So the desktop cube is not very usable for me, as I need many desktops (in fact I use virtual desktops instead of minimizing windows). On my primary workstation, I have 3x3 virtual desktops on each of my two monitors. However, on my home computer and on my laptop, where I don't work permanently, the desktop cube with four sides is pretty usable, and I went for Beryl on these two computers. On my primary workstation, Sawfish remains the WM of choice.

So if you don't need many virtual desktops and have a supported GPU, give Beryl a try. I find it to be more than eye candy. Animated menus, for example, can simplify navigating the desktop - it is immediately clear (from the animation) where the !@$# pop-up menu came from.

Section: /computers/desktops

1 reply for this story:

Vasek Stodulka wrote:

I told you it was interesting when Fedora was released. :) I have compiz directly connected to the binary nVidia drivers, which is quite stable and all applications work (compared to crappy XGL), but when the CPU load reaches 100% even for a single moment, the desktop temporarily freezes (until the load drops) and even the mouse cursor does not move. This is bad and forces me to "Disable desktop effects" when I do some compilation or something else which uses the CPU. Fortunately, switching to metacity (and then back to compiz) is done with three clicks in Fedora. I want to give Beryl a try soon, maybe it is better than compiz in this respect...


About:

Yenya's World: Linux and beyond - Yenya's blog.

Jan "Yenya" Kasprzak
