<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Tuning bogofilter</title>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
</head>
<body>

TUNING BOGOFILTER'S ROBINSON-FISHER METHOD -- an updated HOWTO

<address>(Greg Louis, September 2004)</address>

NB: Bogotune is a tool (shipped with bogofilter) that automates the
tuning process. Its "full search" mode performs a five-dimensional
grid search over possible values of the parameters described below
and recommends optimal settings. There is also a "partial search"
mode that is only three-dimensional. If you have enough spam and
nonspam messages (at least 2,500 of each), using bogotune is highly
recommended for optimizing bogofilter's accuracy.

CONTENTS

  • Introduction
  • Robinson's x
  • Robinson's s
  • The minimum deviation
  • Effective size factors
  • The spam and nonspam cutoffs
  • Overview of bogotune
  • How often to tune
  • A note on training

INTRODUCTION

The bogofilter program has evolved through three classification
methods: the original as proposed by Paul Graham and implemented in
bogofilter by Eric S. Raymond; a variation proposed by Gary Robinson
and implemented in bogofilter by Greg Louis; and a further variation,
also proposed by Gary Robinson, which uses Fisher's method (Fisher,
R. A., 1950: Statistical Methods for Research Workers, pp. 99ff.
London: Oliver and Boyd) of combining probabilities; for bogofilter,
your author has implemented this one too. Recently, Gary Robinson
described a further improvement, the application of effective size
factors in the scoring process; this is enabled by default in
bogofilter, but since it's relatively new, an option exists in both
bogofilter and bogotune to switch it off.

Each of Gary Robinson's proposed classification methods works better
than the earlier versions. For optimal results, they (and the
original) require some tuning. As distributed, bogofilter attempts to
supply good starting values for the tunable parameters. Since the
optimum values depend on the size and content of the wordlists at
your installation, the best settings can only be found by some
experimentation using your own wordlists. After several thousand each
of spam and nonspam messages have been fed to bogofilter for
training, this experimentation can be well worthwhile: you may be
able to cut the number of spams that are still getting through by
more than half.

The purpose of this document is to explain how to adjust bogofilter's
parameters for best spam filtering. A manual tuning process is
described; though you'll be wise to use bogotune to help find your
parameter values, understanding the manual process will help you make
sense of what bogotune does.

With Robinson's changes as implemented in bogofilter, there are seven
things to tune (five without effective size factors): x, s, the
minimum deviation, the two effective size factors, and the spam and
nonspam cutoffs. Six of them (or four) are highly interdependent, as
explained below.

ROBINSON'S x

First off, you should determine the value of x appropriate to your
training database.

The way bogofilter works, summarizing briefly, is that the message
being classified is separated into "tokens" -- words, IP addresses
and other logical units of information. Each token is looked up in
the wordlist that makes up the training database. The number of times
it's been seen in a spam message is divided by the total number of
times it's been seen, and that gives an indication of the likelihood
that a message containing the token is spam. The likelihood estimates
for all the tokens are combined to give a score between 0 and 1 -- 0
means the message is not likely to be a spam, 1 means it is. That's
fine, but what happens if a token's never been seen before? That's
where x comes in: it's a "first guess" at what the presence of an
unknown token means, in terms of its contribution to the score. It's
the value used as the likelihood estimate when a new token is found.

Obtaining a value for x is easy: bogoutil will calculate it for you.

Assuming your bogofilter wordlist is in ~/.bogofilter, run

  bogoutil -r ~/.bogofilter

This will print out an x value. If it's in the range of 0.4 to 0.6,
you can run

  bogoutil -R ~/.bogofilter

to install the calculated value so bogofilter will use it from then
on.

The value of x that bogoutil calculates is just the average of

  p(w) = badcount / (goodcount + badcount)

for every word in your training set whose spam and ham counts in your
wordlist add up to at least 10 (i.e. badcount + goodcount >= 10).

To be honest, that's an oversimplification, for the sake of
explaining the basic concept clearly. In real life, you have to scale
the counts somehow. If you had exactly the same number of spam and
nonspam messages contributing to your wordlist, the formula for p(w)
given above would be ok; but we actually have to use

  p(w) = (badcount / badlist_msgcount) /
         (badcount / badlist_msgcount + goodcount / goodlist_msgcount)

where badlist_msgcount is the number of spam messages that were fed
into the training set, and goodlist_msgcount is the number of
nonspams used in training.

An equivalent way of calculating p(w), that's a little easier to
read, is:

  scalefactor = badlist_msgcount / goodlist_msgcount
  p(w) = badcount / (badcount + goodcount * scalefactor)

In either case, x is the average of those p(w) values.
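
The arithmetic above can be sketched in a few lines of Python. This
is only an illustration of the formulas, not bogoutil's code: the
tokens and counts are invented, and the function names (p_w_ratio,
p_w_scaled, robinson_x) are hypothetical.

```python
def p_w_ratio(bad, good, bad_msgs, good_msgs):
    """p(w) from per-message rates (the first scaled formula)."""
    bad_rate = bad / bad_msgs
    good_rate = good / good_msgs
    return bad_rate / (bad_rate + good_rate)


def p_w_scaled(bad, good, bad_msgs, good_msgs):
    """p(w) using scalefactor (the equivalent second formula)."""
    scalefactor = bad_msgs / good_msgs
    return bad / (bad + good * scalefactor)


def robinson_x(counts, bad_msgs, good_msgs, min_count=10):
    """Average p(w) over tokens with badcount + goodcount >= min_count."""
    values = [
        p_w_scaled(bad, good, bad_msgs, good_msgs)
        for bad, good in counts.values()
        if bad + good >= min_count
    ]
    if not values:
        raise ValueError("no token meets the minimum count")
    return sum(values) / len(values)


# Invented example data: token -> (badcount, goodcount).
counts = {
    "valium": (40, 0),
    "meeting": (1, 29),
    "the": (500, 1000),
    "rare": (2, 1),  # ignored: seen fewer than 10 times in total
}
x = robinson_x(counts, bad_msgs=1000, good_msgs=2000)
print(f"x = {x:.4f}")
```

With twice as many nonspam as spam messages in this made-up
wordlist, "the" comes out at exactly 0.5 even though its raw nonspam
count is double its spam count; that is the effect of the scaling.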

The calculated x is just a first guess, and it may be worthwhile to
experiment (after tuning s and the minimum deviation as described
below) to see what happens if you adjust it upward or downward by up
to about 0.1.

ROBINSON'S s

With a count of zero for a given token, we have only x to go on, so
that's what we use as the likelihood estimate for that token. The
question then arises, what if we have seen the token before, but only
a few times? Statistical variation will result in the ratio of two
small numbers being rather unreliable, so perhaps we ought to
compromise between that ratio and our "first guess" value. This is
what the Robinson method does.

The compromise works like this: a parameter s is defined that serves
as a weighting factor; the larger the value of s, the more importance
is given to x in the presence of low token counts. The token count is
n = badcount + goodcount, and the p(w) value for the token is
modified as follows:

  f(w) = (s * x + n * p(w))/(s + n)

As you see, if n is zero, f(w) is just x, as we have been saying all
along; but if n is nonzero and small, the s * x term will influence
the f(w) value.
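
As a small illustration, here is the f(w) formula as a Python
function. The name f_w and the sample values (s = 1, x = 0.5, a token
seen only in spam) are chosen for the example and are not taken from
bogofilter's source.

```python
def f_w(p_w, n, s, x):
    """Blend the observed ratio p(w) with the first guess x.

    n is badcount + goodcount for the token; s (which must be
    positive) is the weight given to x.
    """
    return (s * x + n * p_w) / (s + n)


# A token seen only in spam, so p(w) = 1, at increasing counts.
for n in (0, 1, 5, 100):
    print(n, round(f_w(1.0, n, s=1.0, x=0.5), 4))
```

The printed values start at x for n = 0 and move toward p(w) as the
count grows, which is the behaviour described above.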

Parameters s and x are only important when the counts are small, and
the value of s reflects what we think of as "small." If s is large,
then when counts are small we trust our x value more than we do the
p(w); if s is small, we give more weight to p(w) and less to x.

So how big should s be? Gary Robinson suggested, on a theoretical
basis, that we start with a value of 1; it might be worth trying
values in the range of 0.01 to 10, though I've never had good results
with values larger than 1. Making s smaller than 0.01 or so is a bad
idea, because of what happens when a token is encountered that's been
seen in spam before, but never in nonspam, or vice versa. In that
case, p(w) is exactly 1 or 0, and f(w) is pushed ever closer to that
extreme as the value of s diminishes. As an example, one spam that
had about ten such tokens, out of 78 that contributed to the spam
score, scored 0.999 with s set to 0.001, and was found to score 0.505
when s was 1.0e-8!
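
To see what happens to one-sided tokens, the following sketch
evaluates f(w) for a token seen only in spam (p(w) = 1) and for one
seen only in nonspam (p(w) = 0). The count of 2, the x value of 0.5
and the list of s values are chosen purely for illustration.

```python
def f_w(p_w, n, s, x):
    """Robinson's f(w) = (s * x + n * p(w)) / (s + n)."""
    return (s * x + n * p_w) / (s + n)


S_VALUES = (1.0, 0.1, 0.001, 1e-8)
N = 2    # hypothetical: the token has been seen only twice
X = 0.5  # hypothetical first-guess value

spam_only = [f_w(1.0, N, s, X) for s in S_VALUES]
ham_only = [f_w(0.0, N, s, X) for s in S_VALUES]

for s, high, low in zip(S_VALUES, spam_only, ham_only):
    print(f"s={s:g}  spam-only f(w)={high:.10f}  nonspam-only f(w)={low:.10f}")
```

As s shrinks, the two values are pushed toward exactly 1 and 0, so
the logarithms of f(w) and 1 - f(w) become very large in magnitude
and a handful of such tokens can dominate the score calculation.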

Choosing a value for s is a matter of trial and error. Experience
suggests that 0.1 might be a better starting value than 1, at least
if your training database is moderately large.

THE MINIMUM DEVIATION

MIN_DEV is another thing you might need to vary. Paul Graham's
original method was based on looking only at the fifteen tokens with
p(w) values farthest from 0.5 (nearest to 0 or 1). We don't do it
that way; instead, we set a minimum deviation from 0.5 (Gary Robinson
coined the term "exclusion radius" for this parameter), and look at
all the tokens with f(w) values farther away than that. If the
minimum is set to zero, every token in the message contributes its
spammishness value f(w) to the final calculation. It might save time,
and perhaps improve discrimination, to ignore tokens with f(w) values
near 0.5, since those tokens obviously make less difference to the
outcome of the calculation. It seems helpful, at least once the
training database is a good size, to use a MIN_DEV value between 0.3
and 0.46. You might try 0.35 initially; one experiment suggests that
even 0.44 might be a good value, but that may not work for everyone.
In fact, some people are likely to find that quite a small value of
MIN_DEV (around 0.05) works best.

Note that higher values of MIN_DEV accentuate the distortion caused
by small s, because tokens appearing in only one of the spam and
nonspam counts will (if present) be a higher proportion of the total
number of tokens considered.

EFFECTIVE SIZE FACTORS (ESF)

Tokens tend to appear more than once in a given message. If a spam
contains the word "valium" it's likely to occur several times.
Bogofilter uses any given token only once in calculating the message
score, no matter how many times it appears in the message; but since
"spammy" and "non-spammy" tokens may occur with very different
frequency in the populations of spam and nonspam messages
respectively, it helps (as Gary Robinson pointed out) to take this
difference in redundancy into account. This is done as follows:
Without effective size factors, the score is calculated with the
inverse chi-squared function prbx thus (N being the number of tokens
that contribute to the score):

  P = prbx(-2 * sum(ln(1-f(w))), 2*N)
  Q = prbx(-2 * sum(ln(f(w))), 2*N)
  S = (1 + Q - P) / 2

and to apply ESF, we instead calculate:

  P = prbx(-2 * ln(prod(1-f(w))^y), 2*N*y)
  Q = prbx(-2 * ln(prod(f(w))^z), 2*N*z)
  S = Q / (Q + P)

where y and z are the spam and nonspam ESF values, the prod function
returns the product of all its arguments, and S is forced to 0.5 if Q
and P are both very near zero.
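
For readers who like to see formulas run, here is a rough Python
sketch of both calculations. It is not bogofilter's code. It assumes
that prbx is the upper-tail probability of the chi-squared
distribution, evaluated here with a textbook incomplete-gamma routine
so that non-integer degrees of freedom (2*N*y) work; the helper
names, the iteration limits and the near_zero threshold are all
choices made for this example. Every f(w) must lie strictly between
0 and 1.

```python
import math


def _gamma_p_series(a, x):
    """Regularized lower incomplete gamma P(a, x) by power series."""
    term = 1.0 / a
    total = term
    denom = a
    for _ in range(10000):
        denom += 1.0
        term *= x / denom
        total += term
        if abs(term) < abs(total) * 1e-14:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def _gamma_q_fraction(a, x):
    """Regularized upper incomplete gamma Q(a, x) by continued fraction."""
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 10000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-14:
            break
    return h * math.exp(-x + a * math.log(x) - math.lgamma(a))


def prbx(x, df):
    """Upper-tail chi-squared probability of statistic x with df degrees of freedom."""
    if x <= 0.0 or df <= 0.0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    if half_x < a + 1.0:
        return 1.0 - _gamma_p_series(a, half_x)
    return _gamma_q_fraction(a, half_x)


def score_plain(f_values):
    """S = (1 + Q - P) / 2, without effective size factors."""
    n = len(f_values)
    p = prbx(-2.0 * sum(math.log(1.0 - f) for f in f_values), 2 * n)
    q = prbx(-2.0 * sum(math.log(f) for f in f_values), 2 * n)
    return (1.0 + q - p) / 2.0


def score_esf(f_values, y, z, near_zero=1e-300):
    """S = Q / (Q + P), with ESF values y (spam) and z (nonspam)."""
    n = len(f_values)
    # ln(prod(v)^y) is the same as y * sum(ln(v)).
    p = prbx(-2.0 * y * sum(math.log(1.0 - f) for f in f_values), 2 * n * y)
    q = prbx(-2.0 * z * sum(math.log(f) for f in f_values), 2 * n * z)
    if p < near_zero and q < near_zero:
        return 0.5
    return q / (q + p)


spammy = [0.99] * 8 + [0.4, 0.6]   # invented f(w) values
print(score_plain(spammy), score_esf(spammy, y=0.75, z=0.75))
```

Working with sums of logarithms rather than the raw product avoids
floating-point underflow when a message has many tokens.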

Determining the correct values to use for y and z is, like choosing
the value for s, a matter of trial and error. A significant corpus of
messages is required, and users who don't have access to large
collections of their spam and nonspam messages should probably opt to
do without ESF. Useful values to try seem to be in the range of 0.75
raised to powers between 1 and 20.
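
If you want to see what that range looks like in numbers, the
candidates can be listed as below. Using only whole-number exponents
is an assumption made for this example.

```python
# 0.75 raised to the powers 1 through 20 (integer exponents assumed).
ESF_CANDIDATES = [0.75 ** k for k in range(1, 21)]

for k, value in enumerate(ESF_CANDIDATES, start=1):
    print(f"0.75^{k:<2d} = {value:.6f}")
```

The same list can serve for both y and z; the text above does not
say that the two need to be equal.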

THE SPAM AND NONSPAM CUTOFFS

Most spam messages will have scores very close to 1 when the Fisher
method of combining the likelihood estimates is applied, and most
nonspam will have scores very close to 0. In between, there is a grey
area, where messages will fall that have both spammish and
nonspammish characteristics. The spam and nonspam cutoffs are
thresholds: bogofilter classifies messages with scores below the
nonspam cutoff as nonspam, and those with scores at or above the spam
cutoff as spam. Messages with scores between the two cutoff values
are classed as uncertain. (Usually, mail administrators will want to
deliver mail classed as uncertain, even though some of it may well be
spam.)
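
The three-way decision amounts to two comparisons. In this sketch the
function name and the cutoff values used in the example calls are
illustrative only.

```python
def classify(score, spam_cutoff, nonspam_cutoff):
    """Return 'spam', 'nonspam' or 'uncertain' for a score in [0, 1]."""
    if score >= spam_cutoff:
        return "spam"
    if score < nonspam_cutoff:
        return "nonspam"
    return "uncertain"


# Hypothetical cutoffs, chosen only to show the three outcomes.
for score in (0.999, 0.6, 0.01):
    print(score, classify(score, spam_cutoff=0.95, nonspam_cutoff=0.2))
```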

The best value for the spam cutoff depends strongly on the values of
the x, s, MIN_DEV and ESF parameters that are being used in the
calculation. For that reason, the way to test a given parameter set
is the following:

1. Determine a suitable level of false positives (nonspams
   classified as spam); this will probably need to fall somewhere in
   the range of 0.05 to 0.3 percent (the lower the better, of course,
   except that lowering the false-positive target too much gives too
   many false negatives).
2. Apply the parameters to classify a number of known nonspam
   messages. From the distribution of scores, pick a spam cutoff
   value that will give the selected proportion of false positives.
3. Apply the parameters and the chosen spam cutoff to classify a
   number of spam messages; the parameter set should be judged on how
   few false negatives (spam messages classed as uncertain or
   nonspam) are obtained.
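
Steps 2 and 3 can be sketched as follows, given lists of scores that
have already been computed for known nonspam and known spam. The
function names and the score lists are invented for the example, and
math.nextafter needs Python 3.9 or later.

```python
import math


def pick_spam_cutoff(nonspam_scores, fp_target):
    """Lowest cutoff that flags at most fp_target of the nonspam scores.

    A message counts as spam when its score is >= the cutoff.
    """
    if not nonspam_scores:
        raise ValueError("need at least one nonspam score")
    ranked = sorted(nonspam_scores, reverse=True)
    allowed = int(fp_target * len(ranked))  # false positives tolerated
    if allowed >= len(ranked):
        return 0.0
    # The cutoff must sit just above the first score we may not flag.
    return math.nextafter(ranked[allowed], math.inf)


def false_negative_rate(spam_scores, spam_cutoff):
    """Fraction of known spam scoring below the spam cutoff."""
    misses = sum(1 for score in spam_scores if score < spam_cutoff)
    return misses / len(spam_scores)


# Invented scores: 1,000 nonspam messages and 4 spam messages.
nonspam_scores = [0.01] * 997 + [0.6, 0.8, 0.97]
spam_scores = [0.99, 0.5, 0.999, 1.0]

cutoff = pick_spam_cutoff(nonspam_scores, fp_target=0.0025)
print(cutoff, false_negative_rate(spam_scores, cutoff))
```

Different parameter sets are then compared on the false-negative
rate each achieves at its own cutoff for the same false-positive
target.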

The value for the nonspam cutoff should be such that no more than
about one in ten thousand spams is classified as nonspam. A value of
0.2 to 0.25 might be a good starting point for use with the
recommended s and MIN_DEV (0.1 and 0.35 respectively).

AN OVERVIEW OF BOGOTUNE

As already mentioned, bogotune attempts to automate the above process
by doing a grid search over useful ranges of s, MIN_DEV, x, y and z
(an option exists to disable ESF, i.e. leave the ESF values [y and z]
at 1.0). Bogotune validates the test inputs, calculates a suitable
cache size for the training database, calculates the starting x value
as described above, and picks a false-positive target with which to
run the grid search. There is an option to force a specific target
(more about this shortly). If not coerced, the target is calculated
based on the number of nonspam messages in the test set, and then
adjusted downward to give a cutoff between 0.5 and 0.975, above 0.55
if possible. It's important to note that this false-positive target
is chosen to facilitate the grid search, and is usually very much
larger than one would want to see in production; there's no relation
between the two. At the end of its run, bogotune attempts to suggest
a reasonable production target.

With these preliminaries completed, bogotune performs two scans: The
first is a coarse grid search over the entire useful range of each
parameter, which should locate an approximate optimum. A finer grid,
centered on the optimum derived from the coarse search, is then
scanned to produce the final recommendations. At the end of each
scan, outliers -- apparently good parameter sets from which even
slight deviation degrades performance significantly -- are rejected.
Bogotune usually manages to find a robust parameter set that gives
good discrimination between spam and nonspam messages, provided that
sufficient training and test messages are supplied.

There are two reasons why one might want to force a bogotune run to
use a specific false-positive target. One is that sometimes
bogotune's target calculation algorithm is overly optimistic, i.e. it
occasionally sets the target too low. The symptom of this problem is
that many parameter combinations in the coarse grid scan simply can't
deliver that few false positives from the test message corpus.
Manually increasing the target by 20 to 30 percent usually fixes
this. The other reason to coerce the target to a specific value is
that one might want to compare two bogotune runs -- with and without
ESF, for example -- and comparisons aren't valid unless both runs use
the same target.

Bogotune may produce very different recommendations for very similar
sets of spam and nonspam messages. That's not necessarily a defect.
For many message populations, the scoring algorithm doesn't depend
heavily on precise values of any of the parameters except the spam
cutoff. In such cases it's common to see bogotune's coarse scan
settle on one of several local optima that may have quite different
parameter values. In general, the parameter values have more
influence when a single training database is used for a large number
of users, and are less crucial when the wordlist is for just one or a
few users.

HOW OFTEN TO TUNE

It's probably wise to review the spam cutoff frequently. If the
false-negative count is gratifyingly low, or if false positives are
occurring, increasing the cutoff will reduce the false-positive rate.
If, on the other hand, there are absolutely no false positives but
the false-negative rate is high, lowering the cutoff a bit may
improve discrimination.

How often to tune the other parameters depends on how fussy you are
about optimizing performance. If you're eager to get the best
discrimination you can, here are my recommendations: In the early
stage, while the training database is still small and growing
rapidly, it's probably wise to experiment with tuning x, s and
MIN_DEV once a month or so. Once the training database is a good size
(over 5,000 spams and 5,000 nonspams), this can be reduced to
quarterly or half-yearly. If you use ESF (which I recommend you do),
retuning should probably happen a bit more often, especially if you
see that the characteristics of the spam you're receiving seem to be
changing.

FWIW my own practice, after two years' experience, is to review the
spam cutoff monthly and do a bogotune run about quarterly.

A NOTE ON TRAINING

Bogofilter's ability to distinguish accurately between spam and
nonspam messages depends on the quality of its training database.
Here is a way of maximizing that quality with relatively little
effort.

When starting afresh, feed every spam and every nonspam you get into
bogofilter. Do not use bogofilter's -u option to do this: there will
be far too many errors when your training database is small. Instead,
classify messages manually and train bogofilter with the -n and -s
options appropriately. You can do it in batches: if you work with
standard mbox files, use a mail reader to move spam and nonspam into
separate files, then do

  bogofilter -s < spambox
  bogofilter -n < nonspambox

to register the messages.

To find out how many spam and nonspam messages have gone into your
wordlist, assuming it's kept in ~/.bogofilter, do

  bogoutil -w ~/.bogofilter .MSG_COUNT

Once you've accumulated about 5,000 spam or nonspam messages in the
list, you need to let the other count catch up unless the two are
growing at about the same rate. Stop adding every message to the
larger of the two counts, and instead, add only messages that
bogofilter got wrong or was unsure about. To do this, you need to
start classifying messages into 4 sets instead of 2: spam, nonspam,
unsures that were actually spam, and unsures that were nonspam.

Once the counts are both over 5,000 and fairly similar, you can train
only on errors and unsures. By this time there should be very few
errors (spams classed as nonspam or vice versa), but there will still
be a proportion of unsure spam and unsure nonspam messages. Training
on these will keep bogofilter working well, as you're telling it what
it needs to learn, and not so much of what it already knows. If one
list grows faster than the other, extra (correctly classified)
messages may be added from time to time to equalize them again; try
to keep the smaller list's message count at least two thirds of the
larger's.
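
The two-thirds rule of thumb is easy to check once you have the two
message counts. The function name is made up for this sketch, and the
counts in the example calls are invented.

```python
def lists_balanced(spam_count, nonspam_count):
    """True if the smaller count is at least two thirds of the larger."""
    smaller, larger = sorted((spam_count, nonspam_count))
    # Integer arithmetic avoids rounding trouble right at the boundary.
    return 3 * smaller >= 2 * larger


print(lists_balanced(5000, 7000))  # smaller list is about 71% of the larger
print(lists_balanced(4000, 7000))  # smaller list is about 57% of the larger
```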

The use of bogofilter's -u option is convenient but dangerous. Even
when well trained, bogofilter will misclassify a small number of
messages. With the -u option, these mistakes are entered into the
database and decrease its accuracy. More mistakes result and the
process snowballs. You therefore need to review and correct the
database frequently, and in the interval between such reviews,
bogofilter's effectiveness will keep falling off. The method
described above has the advantage that (human error excepted) no
wrongly classified messages are ever entered into the database, and
if you have to leave it for a time without updating it, its
effectiveness doesn't diminish.

Much of the advice given here arises out of experiments reported on
the author's bogofilter web pages; in particular, the report at
http://www.bgl.nu/bogofilter/smindev3.html
might interest those who'd like more information about the basis of
bogofilter tuning.

Thanks go to David Relson for reviewing this document in draft and
suggesting several improvements.

<address>[© Greg Louis, 2004; last modified 2004-09-09]</address>
</body>
</html>