<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Tuning bogofilter</title>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
</head>
<body>
TUNING BOGOFILTER'S ROBINSON-FISHER METHOD -- an updated HOWTO

<address>(Greg Louis, September 2004)</address>

NB: Bogotune is a tool (shipped with bogofilter) that automates
the tuning process. Its "full search" mode performs a
five-dimensional grid search over possible values of the parameters
to be described below, and comes up with recommendations for
optimal settings. There's also a "partial search" mode that is only
three-dimensional. If you have enough spam and nonspam messages (at
least 2,500 of each), using bogotune is highly recommended for
optimizing bogofilter's accuracy.

CONTENTS

Introduction
Robinson's x
Robinson's s
The minimum deviation
Effective size factors
The spam and nonspam cutoffs
Overview of bogotune
How often to tune
A note on training

INTRODUCTION

The bogofilter program has evolved through three classification
methods: the original as proposed by Paul Graham and implemented in
bogofilter by Eric S. Raymond; a variation proposed by Gary
Robinson and implemented in bogofilter by Greg Louis; and a further
variation, also proposed by Gary Robinson, which uses Fisher's
method (Fisher, R. A., 1950: Statistical Methods for Research
Workers, pp. 99ff. London: Oliver and Boyd) of
combining probabilities; for bogofilter, your author has
implemented this one too. Recently, Gary Robinson described a
further improvement, the application of effective size factors in
the scoring process; this is available by default in bogofilter,
but since it's relatively new, an option exists in both bogofilter
and bogotune to switch it off.

Each of Gary Robinson's proposed classification methods works
better than the earlier versions. For optimal results, they
(and the original) require some tuning. As distributed,
bogofilter attempts to supply good starting values for the tunable
parameters. Since the optimum values depend on the size and
content of the wordlists at your installation, the best
values can only be found by some experimentation using
your wordlists. After several thousand each of spam
and nonspam messages have been fed to bogofilter for training, this
experimentation can be well worthwhile: you may be able to cut the
number of spams that are still getting through by more than
half.

The purpose of this document is to explain how to adjust
bogofilter's parameters for best spam filtering. A manual
tuning process is described; though you'll be wise to use bogotune
to help find your parameter values, understanding the manual
process will help you make sense of what bogotune does.

With Robinson's changes as implemented in bogofilter, there are
seven (five without effective size factors) things to tune: x, s,
the minimum deviation, the two effective size factors, and the spam
and nonspam cutoffs. Six (or four) of them are highly
interdependent, as explained in the sections below.

ROBINSON'S x

First off, you should determine the value of x appropriate to
your training database.

The way bogofilter works, summarizing briefly, is that the
message being classified is separated into "tokens" -- words, IP
addresses and other logical units of information. Each token
is looked up in the wordlist that makes up the training
database. The number of times it's been seen in a spam
message is divided by the total number of times it's been seen, and
that gives an indication of the likelihood that the token is in a
spam message. The likelihood estimates for all the tokens are
combined to give a score between 0 and 1 -- 0 means the message is
not likely to be a spam, 1 means it is. That's fine, but what
happens if a token's never been seen before? That's where x comes
in: it's a "first guess" at what the presence of an unknown token
means, in terms of its contribution to the score. It's the
value used as the likelihood estimate when a new token is
found.

Obtaining a value for x is easy: bogoutil will calculate it for
you.

Assuming your bogofilter wordlist is in ~/.bogofilter, run

    bogoutil -r ~/.bogofilter

This will print out an x value. If it's in the range of 0.4 to
0.6, you can run

    bogoutil -R ~/.bogofilter

to install the calculated value so bogofilter will use it from then on.

The value of x that bogoutil calculates is just the average
of

    p(w) = badcount / (goodcount + badcount)

for every word in your training set whose spam and nonspam counts
in your wordlist total at least 10 (i.e.
badcount + goodcount >= 10).

To be honest, that's an oversimplification, for the sake of
explaining the basic concept clearly. In real life, you have
to scale the counts somehow. If you had exactly the same
number of spam and nonspam messages contributing to your wordlist,
the formula for p(w) given above would be ok; but we actually have
to use

    p(w) = (badcount / badlist_msgcount) /
           (badcount / badlist_msgcount + goodcount / goodlist_msgcount)

where badlist_msgcount is the number of spam messages that were
fed into the training set, and goodlist_msgcount is the number of
nonspams used in training.

An equivalent way of calculating p(w), which is a little easier to
read, is:

    scalefactor = badlist_msgcount / goodlist_msgcount
    p(w) = badcount / (badcount + goodcount * scalefactor)

In either case, x is the average of those p(w) values.

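As a rough illustration of the averaging just described, the following Python sketch computes x from a handful of made-up token counts. The dictionary layout, the function name robinson_x and all of the counts are assumptions made for the example; bogoutil reads the real wordlist and may differ in detail.

```python
def robinson_x(wordlist, bad_msgs, good_msgs, min_count=10):
    """Average the scaled p(w) over tokens seen at least min_count times.

    wordlist maps token -> (badcount, goodcount); this layout is
    hypothetical and is not bogofilter's on-disk format.
    """
    scalefactor = bad_msgs / good_msgs
    values = []
    for badcount, goodcount in wordlist.values():
        if badcount + goodcount < min_count:
            continue  # too rare to contribute to the average
        values.append(badcount / (badcount + goodcount * scalefactor))
    return sum(values) / len(values) if values else None


# Made-up counts: 2000 spam and 1000 nonspam messages used in training.
wordlist = {
    "valium": (40, 0),    # seen only in spam
    "meeting": (2, 30),   # mostly nonspam
    "free": (60, 30),     # equally common per message in both lists
    "rare": (1, 2),       # below the threshold of 10, ignored
}
x = robinson_x(wordlist, bad_msgs=2000, good_msgs=1000)
print(round(x, 4))
```

With these counts the scale factor is 2, the three qualifying tokens give p(w) values of 1, about 0.032 and 0.5, and x comes out at about 0.51.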

The calculated x is just a first guess, and it may be
worthwhile experimenting (after tuning s and the minimum deviation as
described below) to see what happens if you adjust it upward or
downward within a range of about 0.1 either way.

ROBINSON'S s

With a count of zero for a given token, we have only x to go on,
so that's what we use as the likelihood estimate for that
token. The question then arises, what if we have seen the
token before, but only a few times? Statistical variation will
result in the ratio of two small numbers being rather unreliable,
so perhaps we ought to compromise between that ratio and our "first
guess" value. This is what the Robinson method does.

The compromise works like this: a parameter s is defined that
serves as a weighting factor; the larger the value of s, the more
importance is given to x in the presence of low token counts.
The token count is
n = badcount + goodcount and
the p(w) value for the token is modified as follows:

    f(w) = (s * x + n * p(w))/(s + n)

As you see, if n is zero, f(w) is just x, as we have been saying
all along; but if n is nonzero and small, the s * x term
will pull f(w) away from p(w) and toward x.

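The formula for f(w) is short enough to try directly. The Python sketch below uses a hypothetical helper named f_w and arbitrary example values for x and s; it is meant only to show how the weight shifts from x to p(w) as the token count n grows.

```python
def f_w(p_w, n, s, x):
    """Robinson's blend of the observed ratio p(w) and the first guess x.

    s must be greater than zero, otherwise an unseen token (n == 0)
    would cause a division by zero.
    """
    return (s * x + n * p_w) / (s + n)


x, s = 0.5, 1.0  # example values, not recommendations
unseen = f_w(0.0, 0, s, x)      # n == 0: the result is x itself
twice = f_w(1.0, 2, s, x)       # seen twice, both times in spam
often = f_w(1.0, 200, s, x)     # seen 200 times in spam: close to p(w)
print(unseen, twice, often)
```

With s = 1, two sightings in spam move the estimate only part of the way from 0.5 toward 1, while 200 sightings leave almost nothing of the first guess.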

Parameters s and x are only important when the counts are small,
and the value of s reflects what we think of as "small." If s is
large, then when counts are small we trust our x value more than we
do the p(w); if s is small, we give more weight to p(w) and less to
x.

So how big should s be? Gary Robinson suggested, on a
theoretical basis, that we start with a value of 1; it might be
worth trying values in the range of 0.01 to 10, though I've never
had good results with values larger than 1. Making s smaller
than 0.01 or so is a bad idea, because of what happens when a token
is encountered that's been seen in spam before, but never in
nonspam, or vice versa. In that case, p(w) is exactly 1 or 0,
and f(w) will vary greatly as the value of s diminishes. As
an example, one spam that had about ten such tokens, out of 78 that
contributed to the spam score, scored 0.999 with s set to 0.001,
and was found to score 0.505 when s was 1.0e-8!

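To get a feel for how extreme a one-sided token becomes as s shrinks, the sketch below evaluates f(w) and ln(1 - f(w)) for a token seen once, only in spam. The helper names, the choice of n = 1 and x = 0.5 are assumptions for illustration; the logarithm is shown because the scoring formulas given later in this document sum such terms.

```python
import math


def f_w(p_w, n, s, x):
    """Robinson's f(w) = (s * x + n * p(w)) / (s + n)."""
    return (s * x + n * p_w) / (s + n)


def spam_only_terms(s, n=1, x=0.5):
    """f(w) and ln(1 - f(w)) for a token seen n times, only in spam."""
    f = f_w(1.0, n, s, x)
    return f, math.log(1.0 - f)


results = {s: spam_only_terms(s) for s in (1.0, 0.1, 0.01, 1.0e-8)}
for s, (f, log_term) in results.items():
    print(f"s={s:g}  f(w)={f:.10f}  ln(1-f(w))={log_term:.2f}")
```

As s falls from 1 to 1.0e-8, f(w) creeps toward 1 and the logarithmic term grows from about -1.4 to about -19, so a few such tokens can dominate the sums.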

Choosing a value for s is a matter of trial and error.
Experience suggests that 0.1 might be a better starting value than
1, at least if your training database is moderately large.

THE MINIMUM DEVIATION

MIN_DEV is another thing you might need to vary. Paul
Graham's original method was based on looking only at the fifteen
tokens with p(w) values farthest from 0.5 (nearest to 0 or
1). We don't do it that way; instead, we set a minimum
deviation from 0.5 (Gary Robinson coined the term "exclusion
radius" for this parameter), and look at all the tokens with f(w)
values farther away than that. If the minimum is set to zero,
every token in the message contributes its spammishness value f(w)
to the final calculation. It might save time, and perhaps
improve discrimination, to ignore tokens with f(w) values near 0.5,
since those tokens obviously make less difference to the outcome of
the calculation. It seems helpful, at least once the training
database is a good size, to use a MIN_DEV value between 0.3 and
0.46. You might try 0.35 initially; one experiment suggests
that even 0.44 might be a good value, but that may not work for
everyone. In fact, some people are likely to find that quite
a small value of MIN_DEV (around 0.05) works best.

Note that higher values of MIN_DEV accentuate the distortion
caused by small s, because tokens appearing in only one of the spam
and nonspam counts will (if present) be a higher proportion of the
total number of tokens considered.

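The exclusion radius amounts to a simple filter on f(w). The following Python sketch uses made-up f(w) values and a hypothetical function name; whether a token lying exactly on the boundary is kept is an assumption here, not something taken from bogofilter's source.

```python
def tokens_beyond_min_dev(f_values, min_dev):
    """Keep tokens whose f(w) lies at least min_dev away from 0.5.

    f_values maps token -> f(w). Treating the boundary as inclusive is
    an assumption of this sketch.
    """
    return {tok: f for tok, f in f_values.items() if abs(f - 0.5) >= min_dev}


# Made-up f(w) values for four tokens of one message.
f_values = {"valium": 0.99, "meeting": 0.03, "the": 0.52, "offer": 0.80}

kept = tokens_beyond_min_dev(f_values, 0.35)
print(sorted(kept))
```

With MIN_DEV at 0.35 only "valium" and "meeting" survive; with MIN_DEV at 0 all four tokens would contribute.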

EFFECTIVE SIZE FACTORS (ESF)

Tokens tend to appear more than once in a given message.
If a spam contains the word "valium", the word is likely to occur
several times. Bogofilter only uses any given token once in
calculating the message score, no matter how many times it appears
in the message; but since "spammy" and "non-spammy" tokens may
occur with very different frequency in the populations of spam and
nonspam messages respectively, it helps (as Gary Robinson pointed
out) to take this difference in redundancy into account. This
is done as follows: Without effective size factors, the score
is calculated with the inverse chi-squared function
prbx, thus:

    P = prbx(-2 * sum(ln(1-f(w))), 2*N)
    Q = prbx(-2 * sum(ln(f(w))), 2*N)
    S = (1 + Q - P) / 2

and to apply ESF, we instead calculate:

    P = prbx(-2 * ln(prod(1-f(w))^y), 2*N*y)
    Q = prbx(-2 * ln(prod(f(w))^z), 2*N*z)
    S = Q / (Q + P)

where N is the number of tokens contributing to the score, y and z
are the spam and nonspam ESF values, the
prod function returns the product of all its
arguments, and S is forced to 0.5 if Q and P are both very near
zero.

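The score without effective size factors can be sketched in a few lines of Python. This is a minimal illustration, not bogofilter's code: it assumes that prbx is the upper-tail probability of the chi-squared distribution (the reading under which spammy tokens drive S toward 1), and it uses the closed-form series that is valid only for an even number of degrees of freedom, which is why the ESF variant with its fractional degrees of freedom is not covered.

```python
import math


def prbx(chi2, df):
    """Upper-tail chi-squared probability for even, positive df.

    Assumption: this is what the text calls the inverse chi-squared
    function. The series exp(-m) * sum(m**i / i!) holds for even df only.
    """
    if df <= 0 or df % 2:
        raise ValueError("this sketch handles even, positive df only")
    m = chi2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_score(f_values):
    """S = (1 + Q - P) / 2 for f(w) values strictly between 0 and 1."""
    n = len(f_values)
    if n == 0:
        return 0.5  # assumption: no evidence either way
    p = prbx(-2.0 * sum(math.log(1.0 - f) for f in f_values), 2 * n)
    q = prbx(-2.0 * sum(math.log(f) for f in f_values), 2 * n)
    return (1.0 + q - p) / 2.0


spammy = fisher_score([0.99] * 5)
hammy = fisher_score([0.01] * 5)
neutral = fisher_score([0.5] * 4)
print(spammy, hammy, neutral)
```

Five strongly spammy tokens give a score very close to 1, five strongly nonspammy ones a score very close to 0, and tokens sitting at 0.5 leave the score at 0.5.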

Determining the correct values to use for y and z is, like
choosing the value for s, a matter of trial and error. A
significant corpus of messages is required and users who don't have
access to large collections of their spam and nonspam messages
should probably opt to do without ESF. Useful values to try
seem to be in the range of 0.75 raised to powers between 1 and
20 (that is, from 0.75 down to about 0.003).

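Because the logarithm of a power is the exponent times the logarithm, the first arguments of prbx in the ESF formulas can be rewritten as scaled sums, which may make the role of y and z easier to see. The following is only an algebraic restatement of the formulas given in this section, with the product and sum taken over the N contributing tokens:

```latex
\begin{align*}
-2\ln\Bigl[\Bigl(\prod_{w}\bigl(1-f(w)\bigr)\Bigr)^{y}\Bigr]
  &= -2\,y\sum_{w}\ln\bigl(1-f(w)\bigr), \\
-2\ln\Bigl[\Bigl(\prod_{w} f(w)\Bigr)^{z}\Bigr]
  &= -2\,z\sum_{w}\ln f(w).
\end{align*}
```

In other words, y and z scale both the test statistic and the degrees of freedom (2*N*y and 2*N*z). With y = z = 1 the arguments passed to prbx are the same as in the calculation without ESF; only the final combination of P and Q into S differs.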

THE SPAM AND NONSPAM CUTOFFS

Most spam messages will have scores very close to 1 when the
Fisher method of combining the likelihood estimates is applied, and
most nonspam will have scores very close to 0. In between,
there is a grey area, into which fall messages that have both
spammish and nonspammish characteristics. The spam and
nonspam cutoffs are thresholds: bogofilter classes messages with
scores below the nonspam cutoff as nonspam, and those with scores
at or above the spam cutoff as spam. Messages with scores between
the two cutoff values are classed as uncertain. (Usually,
mail administrators will want to deliver mail classed as uncertain,
even though some of it may well be spam.)

The best value for the spam cutoff depends strongly on the
values of the x, s, MIN_DEV and ESF parameters that are being used
in the calculation. For that reason, the way to test a given
parameter set is the following:

1. Determine a suitable level of false positives (nonspams
   classified as spam); this will probably need to fall somewhere in
   the range of 0.05 to 0.3 percent (the lower the better, of course,
   except that lowering the false-positive target too much gives too
   many false negatives).

2. Apply the parameters to classify a number of known nonspam
   messages. From the distribution of scores, pick a spam cutoff value
   that will give the selected proportion of false positives.

3. Apply the parameters and the chosen spam cutoff to classify a
   number of spam messages; the parameter set should be judged on how
   few false negatives (spam messages classed as uncertain or nonspam)
   are obtained.

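The three steps above can be sketched in Python as follows. The score lists are invented stand-ins for real bogofilter output, the function names are hypothetical, and scanning a fixed grid of candidate cutoffs is just one simple way of reading a cutoff off the nonspam score distribution.

```python
import math


def count_at_or_above(scores, cutoff):
    """Number of messages that would be classed as spam at this cutoff."""
    return sum(1 for score in scores if score >= cutoff)


def pick_spam_cutoff(nonspam_scores, fp_target, steps=1000):
    """Lowest cutoff on a grid of steps + 1 values that meets fp_target.

    fp_target is a fraction (0.002 means 0.2 percent of nonspam).
    """
    allowed = math.floor(fp_target * len(nonspam_scores) + 1e-9)
    for i in range(steps + 1):
        cutoff = i / steps
        if count_at_or_above(nonspam_scores, cutoff) <= allowed:
            return cutoff
    return None  # even a cutoff of 1.0 flags too many nonspams


def false_negative_rate(spam_scores, spam_cutoff):
    """Fraction of known spam scoring below the cutoff (unsure or nonspam)."""
    missed = sum(1 for score in spam_scores if score < spam_cutoff)
    return missed / len(spam_scores)


# Made-up score distributions: 1000 nonspams and 100 spams.
nonspam_scores = [0.0] * 995 + [0.2, 0.4, 0.6, 0.8, 0.97]
spam_scores = [1.0] * 90 + [0.99] * 5 + [0.7, 0.65, 0.55, 0.5, 0.3]

cutoff = pick_spam_cutoff(nonspam_scores, fp_target=0.002)
fn_rate = false_negative_rate(spam_scores, cutoff)
print(cutoff, fn_rate)
```

A target of 0.2 percent allows two of the 1000 nonspams to be misclassified, so the cutoff lands just above the third-highest nonspam score (0.6); three of the 100 spams then fall below it. Different parameter sets would be compared by repeating this with the same target and looking at the false-negative rate.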

The value for the nonspam cutoff should be such that no more
than about one in ten thousand spams is classified as
nonspam. A value of 0.2 to 0.25 might be a good starting
point for use with the recommended s and MIN_DEV (0.1 and 0.35
respectively).

AN OVERVIEW OF BOGOTUNE

As already mentioned, bogotune attempts to automate the above
process by doing a grid search over useful ranges of s, MIN_DEV, x,
y and z (an option exists to disable ESF, i.e. leave the ESF values
[y and z] at 1.0). Bogotune validates the test inputs,
calculates a suitable cache size for the training database,
calculates the starting x value as described above, and picks a
false-positive target with which to run the grid search.
There is an option to force a specific target (more about this
shortly). If not coerced, the target is calculated based on
the number of nonspam messages in the test set, and then adjusted
downward to give a cutoff between 0.5 and 0.975, above 0.55 if
possible. It's important to note that this false-positive
target is chosen to facilitate the grid search, and is usually very
much larger than one would want to see in production; there's no
relation between the two. At the end of its run, bogotune
attempts to suggest a reasonable production target.

With these preliminaries completed, bogotune performs two
scans: The first is a coarse grid search over the entire
useful range of each parameter, which should locate an approximate
optimum. A finer grid, centered on the optimum derived from
the coarse search, is then scanned to produce the final
recommendations. At the end of each scan, outliers --
apparently good parameter sets from which even slight deviation
degrades performance significantly -- are rejected. Bogotune
usually manages to find a robust parameter set that gives good
discrimination between spam and nonspam messages, provided that
sufficient training and test messages are supplied.

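The coarse-then-fine idea can be shown with a generic Python sketch. It is not bogotune's algorithm: the function names, the number of grid points, the shrink factor and the toy cost function are all assumptions, and the outlier rejection described above is left out. In a real run the cost would be something like the false-negative count at the chosen false-positive target.

```python
import itertools


def linspace(lo, hi, count):
    """count (at least 2) evenly spaced values from lo to hi inclusive."""
    return [lo + (hi - lo) * i / (count - 1) for i in range(count)]


def grid_search(evaluate, ranges, points):
    """Return (lowest cost, parameter tuple) over a regular grid."""
    axes = [linspace(lo, hi, points) for lo, hi in ranges]
    return min((evaluate(*params), params) for params in itertools.product(*axes))


def coarse_then_fine(evaluate, ranges, points=5, shrink=0.25):
    """Scan the full ranges, then a narrower grid centred on the coarse optimum."""
    _, centre = grid_search(evaluate, ranges, points)
    fine = []
    for c, (lo, hi) in zip(centre, ranges):
        half = (hi - lo) * shrink / 2.0
        fine.append((max(lo, c - half), min(hi, c + half)))
    return grid_search(evaluate, fine, points)


def toy_cost(s, min_dev):
    """Stand-in for a real measurement; its minimum is at s=0.3, min_dev=0.4."""
    return (s - 0.3) ** 2 + (min_dev - 0.4) ** 2


best_cost, best_params = coarse_then_fine(toy_cost, [(0.0, 1.0), (0.0, 0.5)])
print(best_params, best_cost)
```

The coarse scan settles on the grid point (0.25, 0.375); the fine scan around it moves to (0.3125, 0.40625), closer to the true minimum of the toy function.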

There are two reasons why one might want to force a bogotune run
to use a specific false-positive target. One is that
sometimes bogotune's target calculation algorithm is overly
optimistic, i.e. it occasionally sets the target too low. The
symptom of this problem is that many parameter combinations in the
coarse grid scan simply can't deliver that few false positives from
the test message corpus. Manually increasing the target by 20
to 30 percent usually fixes this. The other reason to coerce
the target to a specific value is that one might want to compare
two bogotune runs -- with and without ESF, for example -- and
comparisons aren't valid unless both runs use the same target.

Bogotune may produce very different recommendations for very
similar sets of spam and nonspam messages. That's not
necessarily a defect. For many message populations, the
scoring algorithm doesn't depend heavily on precise values of any
of the parameters except the spam cutoff. In such cases it's
common to see bogotune's coarse scan settle on one of several local
optima that may have quite different parameter values. In
general, the parameter values have more influence when a single
training database is used for a large number of users, and are less
crucial when the wordlist is for just one or a few users.

HOW OFTEN TO TUNE

It's probably wise to review the spam cutoff frequently.
If the false-negative count is gratifyingly low, or if false
positives are occurring, increasing the cutoff will reduce the
false-positive rate. If, on the other hand, there are absolutely no
false positives but the false-negative rate is high, lowering the
cutoff a bit may improve discrimination.

How often to tune the other parameters depends on how fussy you
are about optimizing performance. If you're eager to get the
best discrimination you can, here are my recommendations: In
the early stage, while the training database is still small and
growing rapidly, it's probably wise to experiment with tuning x, s
and MIN_DEV once a month or so. Once the training database is
a good size (over 5000 spams and 5000 nonspams), this can be
reduced to quarterly or half-yearly. If you use ESF (which I
recommend you do), retuning should probably happen a bit more
often, especially if you see that the characteristics of spam
you're receiving seem to be changing.

FWIW my own practice, after two years' experience, is to review
the spam cutoff monthly and do a bogotune run about quarterly.

A NOTE ON TRAINING

Bogofilter's ability to distinguish accurately between spam and
nonspam messages depends on the quality of its training
database. Here is a way of maximizing that quality with
relatively little effort.

When starting afresh, feed every spam and every nonspam you get
into bogofilter. Do not use bogofilter's -u option to do
this: there will be far too many errors when your training database
is small. Instead, classify messages manually and train
bogofilter with the -n and -s options appropriately. You can
do it in batches: if you work with standard mbox files, use a mail
reader to move spam and nonspam into separate files, then do
bogofilter -s < spambox
and bogofilter -n < nonspambox
to register the messages.

To find out how many spam and nonspam messages have gone into
your wordlist, assuming it's kept in ~/.bogofilter, do

    bogoutil -w ~/.bogofilter .MSG_COUNT

Once you've accumulated about 5,000 spam or nonspam messages in
the list, you need to let the other count catch up unless they're
growing at about the same rate. Stop adding every message to
the larger of the two counts, and instead, add only messages that
bogofilter got wrong or was unsure about. To do this, you
need to start classifying messages into 4 sets instead of 2: spam,
nonspam, unsures that were actually spam, and unsures that were
nonspam.

Once the counts are both over 5,000 and fairly similar, you can
train only on errors and unsures. By this time there should
be very few errors (spams classed as nonspam or vice versa), but
there will still be a proportion of unsure spam and unsure nonspam
messages. Training on these will keep bogofilter working well, as
you're telling it what it needs to learn, and not so much of what
it already knows. If one list grows faster than the other, extra
(correctly classified) messages may be added from time to time to
equalize them again; try to keep the smaller list's message count
at least two thirds of the larger's.

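The two-thirds rule of thumb is easy to check mechanically. The Python sketch below takes the two message counts (for instance as read from the .MSG_COUNT entry shown earlier) and says which list, if either, could do with extra correctly classified messages. The function name and return strings are invented for the example.

```python
def balance_advice(spam_count, nonspam_count, min_ratio=2 / 3):
    """Say which list, if any, needs extra correctly classified messages."""
    smaller, larger = sorted((spam_count, nonspam_count))
    if larger == 0 or smaller / larger >= min_ratio:
        return "balanced"
    return "add spam" if spam_count < nonspam_count else "add nonspam"


print(balance_advice(6000, 5000))
print(balance_advice(9000, 5000))
print(balance_advice(4000, 7000))
```

The first pair is within the two-thirds limit; the second calls for more nonspam and the third for more spam.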

The use of bogofilter's -u option is convenient but
dangerous. Even when well trained, bogofilter will
misclassify a small number of messages. With the -u option,
these mistakes are entered into the database and decrease its
accuracy. More mistakes result and the process
snowballs. You therefore need to review and correct the
databases frequently, and in the interval between such reviews,
bogofilter's effectiveness will keep falling off. The method
described above has the advantage that (human error excepted) no
wrongly classified messages are ever entered into the database, and
if you have to leave it for a time without updating it, its
effectiveness doesn't diminish.

Much of the advice given here arises out of experiments reported
on the author's bogofilter web pages; in particular, the report at
http://www.bgl.nu/bogofilter/smindev3.html
might interest those who'd like more information about the basis of
bogofilter tuning.

Thanks go to David Relson for reviewing this document in draft
and suggesting several improvements.

<address>[© Greg Louis, 2004; last modified 2004-09-09]</address>
</body>
</html>