I've only just started playing with this, but I already like it. The documentation is crap, but the general idea is that every message is in one of three states: unknown, junk, or non-junk. Every message starts out as unknown, then you classify it by hand. Sadly, the GUI does not differentiate between unknown and non-junk, so to make sense of it you have to read the relevant bug on Bugzilla, which seems to indicate that tri-state vs. dual-state is a topic of internal debate.
But, once you get past all that, it just starts working. The places you can take this sort of technology are limitless. After all, why not have one classification per folder to which you refile messages, and have Mozilla figure out what the messages in these folders have in common with each other?
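To make the per-folder idea concrete, here is a minimal sketch, assuming a plain multi-class naive Bayes with add-one smoothing. This is not something Mozilla does today; the names `file_message` and `suggest_folder` are hypothetical and exist only for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical in-memory model: one word table per folder.
folder_words = defaultdict(Counter)  # folder -> word -> number of messages containing it
folder_sizes = Counter()             # folder -> number of messages filed there


def file_message(folder, text):
    """Record that a message with this text was refiled into `folder`."""
    folder_sizes[folder] += 1
    folder_words[folder].update(set(text.lower().split()))


def suggest_folder(text):
    """Return the folder whose past messages best match `text`, or None if untrained."""
    words = set(text.lower().split())
    total = sum(folder_sizes.values())
    best, best_score = None, float("-inf")
    for folder, size in folder_sizes.items():
        # Log prior for the folder, plus a smoothed log-likelihood per word
        score = math.log(size / total)
        for word in words:
            score += math.log((folder_words[folder][word] + 1) / (size + 2))
        if score > best_score:
            best, best_score = folder, score
    return best
```

After filing a handful of messages by hand with `file_message`, calling `suggest_folder` on a new message returns the name of the closest-matching folder. Junk vs. non-junk is then just the two-folder special case.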
Meanwhile, here's an amusing statistic: my 'training.dat' file, which I built using about two months' worth of my inbox, is currently 1.5MB. It's a binary file. If you read it anyway, you see a list of (text string, 32-bit number, 32-bit number) tuples -- no doubt, the counts of how often each word occurs in junk and in non-junk messages. If I run 'strings | wc -w' on it, I get 51701 words. If you read it over, you see that they're not very bright yet about different forms of whitespace like tabs, and it seems that they're lowercasing all the words, which might lessen their chances of noticing MAKE MONEY FAST spams. Also, they're throwing away any context of where they found a given word (subject, from, to, body, etc.), which I'd normally think would be worth keeping around.
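For the curious, here is a minimal sketch of how a table of (word, junk count, non-junk count) entries could be turned into a junk score. It assumes a plain naive-Bayes combination with add-one smoothing, which is a guess at the general technique and not Mozilla's actual code. It does not read 'training.dat'; the table lives in memory, and the names `train` and `junk_probability` are made up for illustration. The tokenizer lowercases everything, as they appear to, but splits on any whitespace, tabs included.

```python
import math
from collections import defaultdict

# word -> [junk count, non-junk count], mirroring the tuples seen in the file
counts = defaultdict(lambda: [0, 0])
totals = [0, 0]  # number of junk / non-junk messages trained on


def tokenize(text):
    # split() with no arguments treats spaces, tabs and newlines alike
    return text.lower().split()


def train(text, is_junk):
    """Count each distinct word of a hand-classified message once."""
    idx = 0 if is_junk else 1
    totals[idx] += 1
    for word in set(tokenize(text)):
        counts[word][idx] += 1


def junk_probability(text):
    """Estimate the probability that `text` is junk (assumed naive Bayes)."""
    # Smoothed log priors for each class
    log_junk = math.log((totals[0] + 1) / (sum(totals) + 2))
    log_ham = math.log((totals[1] + 1) / (sum(totals) + 2))
    for word in set(tokenize(text)):
        junk, ham = counts.get(word, (0, 0))
        log_junk += math.log((junk + 1) / (totals[0] + 2))
        log_ham += math.log((ham + 1) / (totals[1] + 2))
    # Clamp the difference so exp() cannot overflow on long messages
    diff = max(-700.0, min(700.0, log_ham - log_junk))
    return 1.0 / (1.0 + math.exp(diff))
```

Keeping the case of each word, or prefixing tokens with the header they came from (say, `subject:money`), would only require changing `tokenize`, at the cost of a larger table.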
In another week or two, I should be able to have some false-positive / false-negative rates. Right now, at least for the trickle of e-mail that showed up this afternoon, it's flawless.