Yenya's World

Tue, 12 Dec 2006

Statistical SpamAssassin

The increased amount of spam in the last few months (this month I received over 100 000 spam messages) made me reconsider my antispam strategy. I use an approach similar to the one described by Milan Zamazal in his article at LinuxZone: two statistical filters (in my case BogoFilter and CRM114), each of them trained by a different method. However, with spam constructed as an excerpt from the FreeBSD mailing list, accompanied by an image with the (OCR-obfuscated) text of the spam message, this is sometimes not sufficient.

I have discovered that I can classify those "unsure" messages just by looking at their line in mutt's inbox view - they have a different color telling me that my statistical filters did not agree with each other, and from the color, the sender's name and the Subject line, I can classify them. So I had an idea: add a third statistical filter, trained on the From: and Subject: lines, MIME-decoded and written in UTF-8. So far it seems all messages classified by the previous filters as "unsure" were correctly classified by my Subject+From filter.
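The preprocessing step for such a Subject+From filter could look roughly like the sketch below. It only uses Python's standard email package; the function name subject_from_text is made up for illustration, and how the resulting text is fed to the third filter is left open, since that depends on the filter used.

```python
from email import message_from_string
from email.errors import HeaderParseError
from email.header import decode_header, make_header


def subject_from_text(raw_message: str) -> str:
    """Return the MIME-decoded From: and Subject: headers as one Unicode string.

    Hypothetical helper: the output is meant to be the only input
    of a separate statistical filter.
    """
    msg = message_from_string(raw_message)
    lines = []
    for name in ("From", "Subject"):
        value = msg.get(name, "")
        try:
            # decode_header splits RFC 2047 encoded words,
            # make_header joins them back into a Unicode header.
            decoded = str(make_header(decode_header(value)))
        except (LookupError, UnicodeError, HeaderParseError):
            # Unknown charset or broken encoding: keep the raw value,
            # which is itself a useful signal for the filter.
            decoded = value
        lines.append(f"{name}: {decoded}")
    return "\n".join(lines)


if __name__ == "__main__":
    raw = (
        "From: =?utf-8?q?Jan?= <jan@example.org>\n"
        "Subject: =?utf-8?q?=C4=8Cau?=\n"
        "\n"
        "body is ignored\n"
    )
    # Encode as UTF-8 when writing to a file or piping to the filter.
    print(subject_from_text(raw))
```

Falling back to the raw header on a decoding error is a design choice, not a requirement: spam often carries malformed headers, and dropping them would throw away evidence.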

However, there is a problem: I need to change the whole system of how my statistical filters work. I now have three filters instead of two, and I have to compute the final value from all of them (and probably automatically retrain the one which did not agree with the rest). And I can imagine that in the future I will add more and more heuristics the same way I have added my Subject+From filter (filtering based on the first few lines of the message only, or a Bayesian filter with a different tokenization - making a token from two adjacent words instead of one, as a defense against text composed of randomly chosen words). So how do I add more and more heuristics, some of them with the possibility of training?
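The two-adjacent-words tokenization mentioned above can be sketched in a few lines. This is only an illustration of the idea, not how BogoFilter or CRM114 tokenize; the function name word_pairs and the simple \w+ word pattern are assumptions.

```python
import re


def word_pairs(text: str) -> list[str]:
    """Split text into tokens made of two adjacent words each."""
    words = re.findall(r"\w+", text.lower())
    # Pair every word with its successor: n words give n - 1 tokens.
    return [f"{first} {second}" for first, second in zip(words, words[1:])]


if __name__ == "__main__":
    print(word_pairs("Buy cheap pills now"))
```

Randomly chosen words rarely form pairs that have been seen in legitimate mail, so such tokens should carry less "ham" evidence than the individual words do.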

SpamAssassin would probably be the first choice. However, manually adjusting the weights of SA's rules is not easy, and the result is not immediately visible. Also, I have no idea how good my statistical filters are. And the last problem - there surely is a difference between Bogofilter saying "This message has a 47.9% probability of being spam" and saying that the probability is 0.01%. So the best way would probably be to retrain the particular statistical filters, and to adjust the total weights of those statistical filters against each other. SpamAssassin is not good for this, because it allows binary rules only, and it allows rules with an insane amount of weight (such as blacklists/whitelists).

Maybe I should use some kind of likelihood ratio function, like in the Naive Bayes classifier, to evaluate the different heuristics or statistical methods together. [ The logarithm of the likelihood ratio is a function with values from -∞ to ∞, where zero means "unsure". These values can be summed to compute the total likelihood, and multiplied by a constant to give them a different weight. ] Using this would allow me to "meta-train" the whole filter, adjusting the weights of the different heuristics.
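A minimal sketch of this kind of evaluation, assuming each filter reports a spam probability between 0 and 1: convert every probability to log-odds (zero means "unsure"), multiply by a per-filter weight, and sum. The function names, the filter names used as dictionary keys and the weights are made up for illustration. Summing the outputs of several filters this way also treats them as independent, which real filters are not.

```python
import math


def log_odds(p: float, eps: float = 1e-6) -> float:
    """Map a probability to (-inf, inf); 0.5 maps to 0, i.e. "unsure"."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0) and division by zero
    return math.log(p / (1.0 - p))


def combine(probabilities: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-filter log-odds; positive means spam."""
    return sum(
        weights.get(name, 1.0) * log_odds(p)
        for name, p in probabilities.items()
    )


def to_probability(score: float) -> float:
    """Inverse of log_odds, for reporting the combined verdict."""
    return 1.0 / (1.0 + math.exp(-score))


if __name__ == "__main__":
    # Hypothetical verdicts: two nearly undecided filters, one confident.
    verdicts = {"bogofilter": 0.479, "crm114": 0.55, "subject_from": 0.98}
    weights = {"subject_from": 0.5}  # unlisted filters get weight 1.0
    score = combine(verdicts, weights)
    print(f"score={score:.3f} probability={to_probability(score):.3f}")
```

With this representation a 47.9% verdict contributes almost nothing, while a 0.01% verdict contributes a large negative value, and "meta-training" reduces to fitting the weights on already classified mail.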

What do you think about these two ideas (Subject+From filter, and likelihood-based evaluation of different heuristics or statistical filters), my dear lazyweb? How should future spam detection software work?

Section: /computers

Mon, 11 Dec 2006

3D Desktop

The Linux weekend was interesting even for me, although the intended audience was people who are not familiar with Linux (I think this was an organizational mistake). One of the most interesting presentations was about the window manager named Beryl. At first I thought it was interesting but useless eye candy, but after discovering that there are Beryl packages in Fedora Extras, I decided to give Beryl a try.

I have installed it on my laptop, whose GPU is supported, even with 3D acceleration. It took me a nontrivial amount of time to configure it to do exactly what I want, but I was pretty surprised that the things I want from a window manager are either doable with Beryl, or are even Beryl's default behaviour. For example, I want to have a "maximize" button in the window decoration, which maximizes the window when pressed with the left mouse button, does a vertical maximize with the middle button, and does a horizontal maximize with the right button - this is exactly what Beryl does by default.

Beryl surely needs further development: with the virtual desktop plane (as opposed to the desktop cube) there is no keyboard shortcut for "switch to the virtual desktop on the left and bring the currently focused window with me". Also, the communication with the GNOME desktop switcher (the panel applet) is weak with both the desktop cube and the desktop plane.

I find it hard to think about the virtual desktops as "desktop #1", "desktop #2", etc. I have a 3x3 plane instead, and I think about the desktops as "the upper left desktop", or "the desktop to the left of the current one". So the desktop cube is not very usable for me, as I need many desktops (in fact I use virtual desktops instead of minimizing windows). On my primary workstation, I have 3x3 virtual desktops on each of my two monitors. However, on my home computer and on my laptop, where I don't work permanently, the desktop cube with four sides is pretty usable, and I went for Beryl on these two computers. On my primary workstation, Sawfish remains the WM of choice.

So if you don't need many virtual desktops and have a supported GPU, give Beryl a try. I find it to be more than eye candy. Animated menus, for example, can simplify navigating the desktop - it is immediately clear (from the animation) where the !@$# pop-up menu came from.

Section: /computers/desktops


Yenya's World: Linux and beyond - Yenya's blog.


Jan "Yenya" Kasprzak
