Understanding & Improving Spam Filtering


Understanding Spam Filtering with SpamAssassin

Whether you are a recipient of bulk email, someone who receives too much spam, or a legitimate bulk mail sender whose mail is being blocked by mistake, you'll want to understand how email is blocked. For this discussion, we're going to dig into what is arguably the most popular spam filtering tool on the planet: SpamAssassin. Technically, SpamAssassin is a spam identification tool rather than a spam removal tool; it works in conjunction with other utilities that remove the spam so that individual users don't have to deal with it. For purposes of this discussion, however, this technicality is not important, since virtually all email systems have tools to delete or move identified spam.

When you establish an email account with an external hosting service, or internally with most companies, you automatically make use of any server-wide spam blocking tools in use at the time. Some companies download and install SpamAssassin directly; most, however, receive it bundled with the web hosting control panel software that they use.

SpamAssassin is often augmented at the server level with additional company or third-party rule sets. The server-wide implementation must stay conservative on the spam-blocking side, however, to ensure that no legitimate email (often called ham) is blocked. Additional configuration options are therefore available to individual email account owners. Depending on the version and vendor of the user software for managing SpamAssassin, users generally have control over one or more of the following parameters: trigger or threshold score, whitelists, blacklists, training, and score modification. Before we explain these, we need to get a little deeper into how SpamAssassin works.

SpamAssassin attempts to identify spam using a variety of mechanisms, including text analysis, Bayesian filtering, DNS blocklists, and collaborative filtering databases, and it applies a wide range of heuristic tests to mail headers and body text. Put in plain English, SpamAssassin reviews all content in an email message, including the message header (which most users never look at), the sender, the receiver(s), and the message itself. By applying some common-sense spam tests, and other less obvious tests learned from experience or training, it determines the likelihood that a given message is spam. Some of the rules relate to home-grown observations (e.g. HTML messages are more likely to be spam than text messages), and some rules simply check the databases of third-party organizations (called Blacklisting or Blocklisting companies) for known spammer addresses and lists of known spam messages.

Each test or rule in SpamAssassin is associated with a particular pattern that may be consistent with spam, and carries a numerical score. When a given message is analyzed, SpamAssassin adds up the scores of every rule whose pattern appears in the email. If the total meets or exceeds your spam threshold (e.g. 5.0), the message is marked as spam.
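
The arithmetic can be sketched in a few lines of Python. This is only an illustration of the add-and-compare idea, not SpamAssassin's actual code, and the rule names and scores are made up for the example.

```python
# Illustration only: these rule names and scores are hypothetical,
# not SpamAssassin's real rules or default values.
RULE_SCORES = {
    "HTML_ONLY_MESSAGE": 1.2,
    "SENDER_ON_BLOCKLIST": 3.0,
    "SUBJECT_ALL_CAPS": 1.5,
}


def total_score(matched_rules):
    """Add up the scores of every rule the message matched."""
    return sum(RULE_SCORES[name] for name in matched_rules)


def is_spam(matched_rules, threshold=5.0):
    """A message is flagged when its total meets or exceeds the threshold."""
    return total_score(matched_rules) >= threshold


hits = ["HTML_ONLY_MESSAGE", "SENDER_ON_BLOCKLIST", "SUBJECT_ALL_CAPS"]
print(round(total_score(hits), 1), is_spam(hits))
```

With these sample values the three matched rules total 5.7, which is above a threshold of 5.0, so the message would be flagged. A message matching only the first rule would not.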

So, what does a user do if they are receiving too much spam even though SpamAssassin is used on the server? Let's get back to the parameters that are user configurable.

Threshold Score: Most software that interfaces with SpamAssassin allows a user to change the threshold for declaring an email as spam. By default, many systems set the threshold in the range of 5.0-8.0. If a user feels that too much obvious spam is getting through, they can reduce the threshold score.

The proper way to do this is to reduce the score a little, then weigh the false positives (good email, or ham, that ends up in the junk folder) against the additional spam caught. The user can then continue to carefully reduce the threshold score until they feel that they've gone a little too far, meaning too much ham is being filtered into the junk folder. They can then adjust the threshold back up to the lowest level they are comfortable with.
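
The step-down procedure can be sketched as follows. The sample data is hypothetical: it assumes the user has looked through recent mail by hand and noted, for each message, its spam score and whether it really was spam.

```python
# Hypothetical sample: (spam score, really spam?) for hand-reviewed messages.
reviewed = [
    (7.9, True), (6.1, True), (4.6, True), (4.2, True), (3.8, False),
    (3.1, True), (2.5, False), (1.0, False), (0.4, False),
]


def evaluate(threshold, messages):
    """Return (spam caught, ham misfiled as spam) at a given threshold."""
    caught = sum(1 for score, spam in messages if spam and score >= threshold)
    misfiled = sum(1 for score, spam in messages if not spam and score >= threshold)
    return caught, misfiled


def lowest_safe_threshold(candidates, messages, max_misfiled=0):
    """Step the threshold down and stop before ham starts landing in junk."""
    best = None
    for threshold in sorted(candidates, reverse=True):
        _, misfiled = evaluate(threshold, messages)
        if misfiled > max_misfiled:
            break  # gone too far: back up to the previous setting
        best = threshold
    return best


print(lowest_safe_threshold([5.0, 4.5, 4.0, 3.5, 3.0], reviewed))
```

On this sample, lowering the threshold from 5.0 to 4.0 catches two more spam messages without misfiling any ham, while 3.5 would send one good message to the junk folder, so the sketch settles on 4.0.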

One problem legitimate bulk emailers face is that users often reduce the threshold score too much, all at once, and then neglect to check their spam folder for legitimate messages, or worse yet, instruct the system to automatically delete suspected spam, rather than sending it to a spam or junk folder. So it's highly recommended that users create a filter that sends spam to a specific folder rather than just deleting it - at least until they are comfortable with their spam settings. Commercial emailers should make their customers aware of the care needed in setting these parameters.

Whitelists: Most software that interfaces with SpamAssassin allows a user to add trusted addresses that are stored in a database. Users can put specific addresses in the whitelist, or they can use wildcards, e.g. *, to whitelist larger groups. For example, if you trust all email coming from widgets.com, then you would put *@widgets.com in your whitelist. Legitimate bulk emailers should add a step to their signup process asking new customers to add the sender's domain (e.g. *@widgets.com) to their whitelist, to avoid numerous headaches down the road related to email that customers "didn't receive."

Blacklists: Most software that interfaces with SpamAssassin allows a user to add bad addresses (known spammers) to their personal database. Users can put specific addresses in the blacklist, or they can use wildcards, e.g. *, to blacklist larger groups. For example, if you distrust all email coming from widgets.com, then you would put *@widgets.com in your blacklist. Spammers often use randomly generated email addresses, so from a user perspective, blocking individual addresses is not generally effective. For a company only doing business in one geographic area, however, blacklisting an entire top-level domain can be helpful.
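
To show how wildcard entries such as *@widgets.com behave for both lists, here is a minimal Python sketch using the standard fnmatch module. It is an approximation for simple patterns, not SpamAssassin's own matcher. The function names, the sample blacklist entries, and the choice to let a whitelist match win over a blacklist match are all assumptions made for the example.

```python
from fnmatch import fnmatchcase


def matches_any(address, patterns):
    """True if the address matches at least one wildcard pattern, ignoring case."""
    address = address.lower()
    return any(fnmatchcase(address, pattern.lower()) for pattern in patterns)


def classify_sender(address, whitelist, blacklist):
    """Assumption for this sketch: a whitelist match takes priority."""
    if matches_any(address, whitelist):
        return "trusted"
    if matches_any(address, blacklist):
        return "blocked"
    return "scored normally"


whitelist = ["*@widgets.com"]
# Hypothetical entries: one whole top-level domain and one single address.
blacklist = ["*@*.example", "offers@bulk-sender.test"]

for sender in ["News@Widgets.com", "x9f3@mail.spam.example", "friend@other.test"]:
    print(sender, "->", classify_sender(sender, whitelist, blacklist))
```

The second sample sender illustrates the point about random addresses: the local part (x9f3) could be anything, and only the domain-wide pattern catches it.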

Training: Some software that interfaces with SpamAssassin (e.g. later versions of Plesk) allows a user to train the software on what is and isn't spam, in addition to the rule sets that are offered server-wide. This takes advantage of an advanced statistical process called Bayesian filtering. The best approach is for the user to collect, in a junk folder, messages that are known spam but weren't caught by SpamAssassin. When the sample is large enough for training (usually around 200 messages), the user selects the messages and issues the training command (often a button in a graphical user interface).
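
For readers curious about what training actually does, the toy classifier below sketches the statistical idea: count how often each word appears in known spam and known ham, then use those counts to estimate how likely a new message is to be spam. It is a simplified naive Bayes illustration with a hypothetical class name and a tiny made-up sample. SpamAssassin's real Bayes engine is considerably more elaborate, and this sketch needs at least one example of each kind before it can score anything.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy word-count classifier, for illustration only."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the words of one message known to be 'spam' or 'ham'."""
        self.words[label].update(text.lower().split())
        self.messages[label] += 1

    def spam_probability(self, text):
        """Estimate the probability that the text is spam."""
        vocab = len(set(self.words["spam"]) | set(self.words["ham"]))
        total_msgs = sum(self.messages.values())
        log_p = {}
        for label in ("spam", "ham"):
            total_words = sum(self.words[label].values())
            # Start from how common each class is, then add word evidence.
            score = math.log(self.messages[label] / total_msgs)
            for word in text.lower().split():
                # Add-one smoothing so unseen words do not zero out the result.
                score += math.log((self.words[label][word] + 1) / (total_words + vocab))
            log_p[label] = score
        return 1 / (1 + math.exp(log_p["ham"] - log_p["spam"]))


clf = TinyBayes()
clf.train("cheap pills buy now", "spam")
clf.train("win money now click here", "spam")
clf.train("meeting agenda for monday", "ham")
clf.train("invoice attached for your order", "ham")
print(clf.spam_probability("buy cheap pills") > 0.5)
```

Four messages are far too few for real use; the sample size mentioned above exists because the word counts only become reliable once a few hundred messages have been seen.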

Score Modification: Some software that interfaces with SpamAssassin (e.g. cPanel) allows a user to modify the scores for certain rules. One simple way to significantly reduce the amount of spam received is to raise, up to the threshold level, the scores of the rules that fire when an email's content or origin is listed by the major blacklisting organizations that SpamAssassin checks, e.g. SORBS, Spamhaus, NJABL, RBL, etc. For legitimate bulk emailers, this is yet another reason to make sure that your organization is not in any of the major blacklisting databases.

Another excellent method of reducing spam using score modification is to manually check the message headers of spam that gets through SpamAssassin. Within the header, you will generally see the total spam score for the email, the rules that it tested positive for, and the SpamAssassin threshold. Increasing the score for any or all of the listed rules, so that the total meets the spam threshold, helps ensure that these types of messages do not make it through to the user's inbox in the future.
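
Reading these values by eye works fine, but the check can also be sketched in Python. The example assumes the status line follows the common X-Spam-Status layout (score=, required=, tests=); the sample header value is made up, and the function name and the returned "shortfall" field are hypothetical helpers for this illustration.

```python
import re


def parse_spam_status(header_value):
    """Pull score, threshold, and matched rules from an X-Spam-Status style value."""
    flat = " ".join(header_value.split())  # undo header line folding
    flat = re.sub(r",\s+", ",", flat)      # rejoin a rule list split across lines
    score = float(re.search(r"\bscore=(-?[\d.]+)", flat).group(1))
    required = float(re.search(r"\brequired=(-?[\d.]+)", flat).group(1))
    tests_field = re.search(r"\btests=(\S*)", flat)
    names = tests_field.group(1) if tests_field else ""
    tests = [] if names in ("", "none") else names.split(",")
    return {
        "score": score,
        "required": required,
        "tests": tests,
        # How many more points the matched rules would need to reach the threshold.
        "shortfall": round(max(0.0, required - score), 2),
    }


# Made-up sample value for a spam message that slipped through.
sample = "No, score=3.4 required=5.0 tests=HTML_MESSAGE,\n\tMISSING_DATE,SUBJ_ALL_CAPS autolearn=no"
result = parse_spam_status(sample)
print(result["tests"], result["shortfall"])
```

Here the message scored 3.4 against a threshold of 5.0, so the three listed rules would need another 1.6 points between them for similar messages to be caught.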

To modify a score in cPanel, you simply go to the SpamAssassin configuration panel, and list the rule followed by a space and the new score you wish to give it. Then you simply save the new configuration. Note that this technique will change the score for all email accounts for the domain, not just a single user. For other control panel types, please consult your user documentation. Obviously, one must take care in which rules they choose for increased scores, so that they do not impact legitimate email.
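
As a small aid for preparing those entries, the sketch below turns a set of chosen rules and scores into the "rule, space, score" lines described above. The rule names are placeholders, the function name is hypothetical, and the name check (capital letters, digits, and underscores) is an assumption meant only to catch typing mistakes before anything is pasted into the panel.

```python
import re

# Assumed shape of a rule name, used here only to catch obvious typos.
RULE_NAME = re.compile(r"^[A-Z][A-Z0-9_]*$")


def render_score_overrides(overrides):
    """Format {rule: score} as one 'RULE score' line per rule, sorted by name."""
    lines = []
    for rule, score in sorted(overrides.items()):
        if not RULE_NAME.match(rule):
            raise ValueError(f"unexpected rule name: {rule!r}")
        lines.append(f"{rule} {score:.1f}")
    return "\n".join(lines)


# Placeholder rule names; substitute the rules seen in your own message headers.
print(render_score_overrides({"SAMPLE_RULE_B": 5, "SAMPLE_RULE_A": 2.5}))
```

Keeping the list of changes in one place like this also makes it easier to undo a score increase that turns out to affect legitimate email.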

Note that forwarders, which are commonly used by companies to forward mail addressed to a generic department like "marketing" or "abuse" to specific people, generally DO NOT run through SpamAssassin prior to delivery. To ensure that forwarders pass through SpamAssassin, users must create a mail account for each forwarder.

So why aren't these methods deployed at the server level, instead of leaving individual users or domain owners to do the customization? Because different users have different email patterns, depending on their usage, clients, recipients, and preferences. The server implementations must therefore stay on the conservative side, with the aggressive filtering performed at the domain or user level.

Will these modifications stop all spam? No. The battle between spammers and email recipients is an ongoing issue, and as new filtering methods become available, spammers make changes to avoid them. Unfortunately, this also affects legitimate bulk email senders, as the increase in blacklisting, filtering, and other spam reduction schemes often impacts a company's ability to reach its customers. So bulk emailers must stay aware of tools such as SpamAssassin, and construct their mailings in a manner that avoids errant filtering.









