Blame doc/bogofilter-tuning.HOWTO.html


Packit Service	8f0814
Packit Service	8f0814	`"http://www.w3.org/TR/html4/strict.dtd">`
Packit Service	8f0814	`<html>`
Packit Service	8f0814	`<head>`
Packit Service	8f0814	`<title>Tuning bogofilter</title>`
Packit Service	8f0814	`<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">`
Packit Service	8f0814	`</head>`
Packit Service	8f0814	`<body>`
Packit Service	8f0814	`TUNING BOGOFILTER'S ROBINSON-FISHER METHOD -- an updated`
Packit Service	8f0814	`HOWTO`
Packit Service	8f0814
Packit Service	8f0814	`<address>(Greg Louis,`
Packit Service	8f0814	`September 2004)</address>`
Packit Service	8f0814
Packit Service	8f0814	`NB: Bogotune is a tool (shipped with bogofilter) that automates`
Packit Service	8f0814	`the tuning process. Its "full search" mode performs a`
Packit Service	8f0814	`five-dimensional grid search over possible values of the parameters`
Packit Service	8f0814	`to be described below, and comes up with recommendations for`
Packit Service	8f0814	`optimal settings. There's also a "partial search" mode that is only`
Packit Service	8f0814	`three-dimensional. If you have enough spam and nonspam messages (at`
Packit Service	8f0814	`least 2,500 of each), using bogotune is highly recommended for`
Packit Service	8f0814	`optimizing bogofilter's accuracy.`
Packit Service	8f0814
Packit Service	8f0814	`CONTENTS`
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`Introduction`
Packit Service	8f0814
Packit Service	8f0814	`Robinson's x`
Packit Service	8f0814
Packit Service	8f0814	`Robinson's s`
Packit Service	8f0814
Packit Service	8f0814	`The minimum deviation`
Packit Service	8f0814
Packit Service	8f0814	`Effective size factors`
Packit Service	8f0814
Packit Service	8f0814	`The spam and nonspam cutoffs`
Packit Service	8f0814
Packit Service	8f0814	`Overview of bogotune`
Packit Service	8f0814
Packit Service	8f0814	`How often to tune`
Packit Service	8f0814
Packit Service	8f0814	`A note on training`
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`INTRODUCTION`
Packit Service	8f0814
Packit Service	8f0814	`The bogofilter program has evolved through three classification`
Packit Service	8f0814	`methods: the original as proposed by Paul Graham and implemented in`
Packit Service	8f0814	`bogofilter by Eric S. Raymond; a variation proposed by Gary`
Packit Service	8f0814	`Robinson and implemented in bogofilter by Greg Louis; and a further`
Packit Service	8f0814	`variation, also proposed by Gary Robinson, which uses Fisher's`
Packit Service	8f0814	`method (Fisher, R. A., 1950: Statistical Methods for Research`
Packit Service	8f0814	`Workers, pp. 99ff. London: Oliver and Boyd) of`
Packit Service	8f0814	`combining probabilities; for bogofilter, your author has`
Packit Service	8f0814	`implemented this one too. Recently, Gary Robinson described a`
Packit Service	8f0814	`further improvement, the application of effective size factors in`
Packit Service	8f0814	`the scoring process; this is available by default in bogofilter,`
Packit Service	8f0814	`but since it's relatively new, an option exists in both bogofilter`
Packit Service	8f0814	`and bogotune to switch it off.`
Packit Service	8f0814
Packit Service	8f0814	`Each of Gary Robinson's proposed classification methods works`
Packit Service	8f0814	`better than the earlier versions. For optimal results, they`
Packit Service	8f0814	`(and the original) require some tuning. As distributed,`
Packit Service	8f0814	`bogofilter attempts to supply good starting values for the tunable`
Packit Service	8f0814	`parameters. Since the optimum values depend on the size and`
Packit Service	8f0814	`content of the wordlists at your installation, the best`
Packit Service	8f0814	`results can only be determined by some experimentation using`
Packit Service	8f0814	`your wordlists. After several thousand each of spam`
Packit Service	8f0814	`and nonspam messages have been fed to bogofilter for training, this`
Packit Service	8f0814	`experimentation can be well worthwhile: you may be able to cut the`
Packit Service	8f0814	`number of spams that are still getting through by more than`
Packit Service	8f0814	`half.`
Packit Service	8f0814
Packit Service	8f0814	`The purpose of this document is to explain how to adjust`
Packit Service	8f0814	`bogofilter's parameters for best spam filtering. A manual`
Packit Service	8f0814	`tuning process is described; though you'll be wise to use bogotune`
Packit Service	8f0814	`to help find your parameter values, understanding the manual`
Packit Service	8f0814	`process will help you make sense of what bogotune does.`
Packit Service	8f0814
With Robinson's changes as implemented in bogofilter, there are
seven things to tune (five without effective size factors): x, s,
the minimum deviation, the two effective size factors, and the spam
and nonspam cutoffs. Six of them (or four) are highly
interdependent, as explained below.
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`ROBINSON'S x`
Packit Service	8f0814
Packit Service	8f0814	`First off, you should determine the value of x appropriate to`
Packit Service	8f0814	`your training database.`
Packit Service	8f0814
Packit Service	8f0814	`The way bogofilter works, summarizing briefly, is that the`
Packit Service	8f0814	`message being classified is separated into "tokens" -- words, IP`
Packit Service	8f0814	`addresses and other logical units of information. Each token`
Packit Service	8f0814	`is looked up in the wordlist that makes up the training`
Packit Service	8f0814	`database. The number of times it's been seen in a spam`
Packit Service	8f0814	`message is divided by the total number of times it's been seen, and`
Packit Service	8f0814	`that gives an indication of the likelihood that the token is in a`
Packit Service	8f0814	`spam message. The likelihood estimates for all the tokens are`
Packit Service	8f0814	`combined to give a score between 0 and 1 -- 0 means the message is`
Packit Service	8f0814	`not likely to be a spam, 1 means it is. That's fine, but what`
Packit Service	8f0814	`happens if a token's never been seen before? That's where x comes`
Packit Service	8f0814	`in: it's a "first guess" at what the presence of an unknown token`
Packit Service	8f0814	`means, in terms of its contribution to the score. It's the`
Packit Service	8f0814	`value used as the likelihood estimate when a new token is`
Packit Service	8f0814	`found.`
Packit Service	8f0814
Packit Service	8f0814	`Obtaining a value for x is easy: bogoutil will calculate it for`
Packit Service	8f0814	`you.`
Packit Service	8f0814
Packit Service	8f0814	`Assuming your bogofilter wordlist is in ~/.bogofilter, run`
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`bogoutil -r`
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`This will print out an x value. If it's in the range of 0.4 to`
Packit Service	8f0814	`0.6, you can run`
Packit Service	8f0814	`bogoutil -R ~/.bogofilter to install`
Packit Service	8f0814	`the calculated value so bogofilter will use it from then on.`
Packit Service	8f0814
Packit Service	8f0814	`The value of x that bogoutil calculates is just the average`
Packit Service	8f0814	`of`
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`p(w) = badcount / (goodcount + badcount)`
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
for every word in your training set whose combined spam and
nonspam count in your wordlist is at least 10 (i.e.
badcount + goodcount >= 10).
Packit Service	8f0814
Packit Service	8f0814	`To be honest, that's an oversimplification, for the sake of`
Packit Service	8f0814	`explaining the basic concept clearly. In real life, you have`
Packit Service	8f0814	`to scale the counts somehow. If you had exactly the same`
Packit Service	8f0814	`number of spam and nonspam messages contributing to your wordlist,`
Packit Service	8f0814	`the formula for p(w) given above would be ok; but we actually have`
Packit Service	8f0814	`to use`
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`p(w) = (badcount / badlist_msgcount) /`
Packit Service	8f0814	`(badcount / badlist_msgcount + goodcount / goodlist_msgcount)`
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`where badlist_msgcount is the number of spam messages that were`
Packit Service	8f0814	`fed into the training set, and goodlist_msgcount is the number of`
Packit Service	8f0814	`nonspams used in training.`
Packit Service	8f0814
An equivalent way of calculating p(w), that's a little easier to
read, is:
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`scalefactor = badlist_msgcount / goodlist_msgcount`
Packit Service	8f0814	`p(w) = badcount / (badcount + goodcount * scalefactor)`
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`In either case, x is the average of those p(w) values.`
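
The calculation can be sketched in Python as follows. This is only an illustration of the formulas above, not bogoutil's actual code; the `wordlist` dictionary, its layout, the sample counts and the function names are all assumptions made for the example.

```python
def p_w(badcount, goodcount, badlist_msgcount, goodlist_msgcount):
    """Scaled spam likelihood p(w) for a single token."""
    scalefactor = badlist_msgcount / goodlist_msgcount
    return badcount / (badcount + goodcount * scalefactor)


def robinson_x(wordlist, badlist_msgcount, goodlist_msgcount, min_count=10):
    """Average p(w) over tokens with badcount + goodcount >= min_count."""
    values = [
        p_w(bad, good, badlist_msgcount, goodlist_msgcount)
        for bad, good in wordlist.values()
        if bad + good >= min_count
    ]
    if not values:
        raise ValueError("no token has enough occurrences to estimate x")
    return sum(values) / len(values)


# Hypothetical counts: token -> (badcount, goodcount)
wordlist = {
    "valium": (40, 0),
    "meeting": (2, 38),
    "free": (30, 10),
    "rare": (1, 2),  # fewer than 10 occurrences, so it is ignored
}
x = robinson_x(wordlist, badlist_msgcount=1000, goodlist_msgcount=2000)
print(round(x, 3))
```

Because twice as many nonspams as spams were used in this made-up training set, each good count is halved before it is compared with the bad count.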
Packit Service	8f0814
The calculated x is just a first guess, and it may be worthwhile
to experiment (after tuning s and the minimum deviation as
described below) to see what happens if you adjust it upward or
downward within a range of about 0.1 either way.
Packit Service	8f0814
Packit Service	8f0814	`ROBINSON'S s`
Packit Service	8f0814
Packit Service	8f0814	`With a count of zero for a given token, we have only x to go on,`
Packit Service	8f0814	`so that's what we use as the likelihood estimate for that`
Packit Service	8f0814	`token. The question then arises, what if we have seen the`
Packit Service	8f0814	`token before, but only a few times? Statistical variation will`
Packit Service	8f0814	`result in the ratio of two small numbers being rather unreliable,`
Packit Service	8f0814	`so perhaps we ought to compromise between that ratio and our "first`
Packit Service	8f0814	`guess" value. This is what the Robinson method does.`
Packit Service	8f0814
Packit Service	8f0814	`The compromise works like this: a parameter s is defined, that`
Packit Service	8f0814	`serves as a weighting factor; the larger the value of s, the more`
Packit Service	8f0814	`importance is given to x in the presence of low token counts.`
Packit Service	8f0814	`The token count is`
Packit Service	8f0814	`n = badcount + goodcount and`
Packit Service	8f0814	`the p(w) value for the token is modified as follows:`
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`f(w) = (s * x + n * p(w))/(s + n)`
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`As you see, if n is zero, f(w) is just x, as we have been saying`
Packit Service	8f0814	`all along; but if n is nonzero and small, the s * x term`
Packit Service	8f0814	`will influence the f(w) value.`
Packit Service	8f0814
Packit Service	8f0814	`Parameters s and x are only important when the counts are small,`
Packit Service	8f0814	`and the value of s reflects what we think of as "small." If s is`
Packit Service	8f0814	`large, then when counts are small we trust our x value more than we`
Packit Service	8f0814	`do the p(w); if s is small, we give more weight to p(w) and less to`
Packit Service	8f0814	`x.`
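
A short Python sketch may make the effect of s easier to see. It simply evaluates the f(w) formula given above; the function name and the sample values (x = 0.5, a token seen once and only in spam) are assumptions chosen for illustration.

```python
def f_w(p, n, s, x):
    """Robinson's f(w) = (s * x + n * p(w)) / (s + n)."""
    return (s * x + n * p) / (s + n)


x = 0.5

# A token seen once, and only in spam: n = 1, p(w) = 1.0.
# The smaller s is, the closer f(w) gets to the raw p(w) of 1.0.
for s in (1.0, 0.1, 0.001):
    print(s, f_w(1.0, 1, s, x))

# A token never seen before: n = 0, so f(w) falls back to x.
print(f_w(0.0, 0, 1.0, x))
```

With s = 1 the single sighting only moves the estimate halfway from x towards 1; with s = 0.001 the estimate is almost exactly 1, which is the behaviour the text above describes for small s.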
Packit Service	8f0814
Packit Service	8f0814	`So how big should s be? Gary Robinson suggested, on a`
Packit Service	8f0814	`theoretical basis, that we start with a value of 1; it might be`
Packit Service	8f0814	`worth trying values in the range of 0.01 to 10, though I've never`
Packit Service	8f0814	`had good results with values larger than 1. Making s smaller`
Packit Service	8f0814	`than 0.01 or so is a bad idea, because of what happens when a token`
Packit Service	8f0814	`is encountered that's been seen in spam before, but never in`
Packit Service	8f0814	`nonspam, or vice versa. In that case, p(w) is exactly 1 or 0,`
Packit Service	8f0814	`and f(w) will vary greatly as the value of s diminishes. As`
Packit Service	8f0814	`an example, one spam that had about ten such tokens, out of 78 that`
Packit Service	8f0814	`contributed to the spam score, scored 0.999 with s set to 0.001,`
Packit Service	8f0814	`and was found to score 0.505 when s was 1.0e-8!`
Packit Service	8f0814
Packit Service	8f0814	`Choosing a value for s is a matter of trial and error.`
Packit Service	8f0814	`Experience suggests that 0.1 might be a better starting value than`
Packit Service	8f0814	`1, at least if your training database is moderately large.`
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`THE MINIMUM DEVIATION`
Packit Service	8f0814
Packit Service	8f0814	`MIN_DEV is another thing you might need to vary. Paul`
Packit Service	8f0814	`Graham's original method was based on looking only at the fifteen`
Packit Service	8f0814	`tokens with p(w) values farthest from 0.5 (nearest to 0 or`
Packit Service	8f0814	`1). We don't do it that way; instead, we set a minimum`
Packit Service	8f0814	`deviation from 0.5 (Gary Robinson coined the term "exclusion`
Packit Service	8f0814	`radius" for this parameter), and look at all the tokens with f(w)`
Packit Service	8f0814	`values farther away than that. If the minimum is set to zero,`
Packit Service	8f0814	`every token in the message contributes its spammishness value f(w)`
Packit Service	8f0814	`to the final calculation. It might save time, and perhaps`
Packit Service	8f0814	`improve discrimination, to ignore tokens with f(w) values near 0.5,`
Packit Service	8f0814	`since those tokens obviously make less difference to the outcome of`
Packit Service	8f0814	`the calculation. It seems helpful, at least once the training`
Packit Service	8f0814	`database is a good size, to use a MIN_DEV value between 0.3 and`
Packit Service	8f0814	`0.46. You might try 0.35 initially; one experiment suggests`
Packit Service	8f0814	`that even 0.44 might be a good value, but that may not work for`
Packit Service	8f0814	`everyone. In fact, some people are likely to find that quite`
Packit Service	8f0814	`a small value of MIN_DEV (around 0.05) works best.`
Packit Service	8f0814
Packit Service	8f0814	`Note that higher values of MIN_DEV accentuate the distortion`
Packit Service	8f0814	`caused by small s, because tokens appearing in only one of the spam`
Packit Service	8f0814	`and nonspam counts will (if present) be a higher proportion of the`
Packit Service	8f0814	`total number of tokens considered.`
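
The selection of tokens by minimum deviation can be sketched in Python like this. The token names and f(w) values are invented, and treating a deviation exactly equal to MIN_DEV as "far enough" is an assumption of the sketch rather than a statement about bogofilter's code.

```python
def significant_tokens(f_values, min_dev):
    """Keep only tokens whose f(w) is at least min_dev away from 0.5."""
    return {
        token: f
        for token, f in f_values.items()
        if abs(f - 0.5) >= min_dev
    }


# Hypothetical f(w) values for the tokens of one message
f_values = {"valium": 0.99, "the": 0.52, "meeting": 0.08, "offer": 0.80}

print(sorted(significant_tokens(f_values, 0.35)))  # only the extreme tokens
print(sorted(significant_tokens(f_values, 0.0)))   # every token contributes
```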
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`EFFECTIVE SIZE FACTORS (ESF)`
Packit Service	8f0814
Tokens tend to appear more than once in a given message.
If a spam contains the word "valium", the word is likely to occur
several times. Bogofilter uses any given token only once in
calculating the message score, no matter how many times it appears
in the message; but since "spammy" and "non-spammy" tokens may
occur with very different frequency in the populations of spam and
nonspam messages respectively, it helps (as Gary Robinson pointed
out) to take this difference in redundancy into account. This
is done as follows. Without effective size factors, the score
is calculated with the inverse chi-squared function
prbx thus:
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`P = prbx(-2 * sum(ln(1-f(w))), 2*N)`
Packit Service	8f0814	`Q = prbx(-2 * sum(ln(f(w))), 2*N)`
Packit Service	8f0814	`S = (1 + Q - P) / 2`
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`and to apply ESF, we instead calculate:`
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
P = prbx(-2 * ln(prod(1-f(w))^y), 2*N*y)
Q = prbx(-2 * ln(prod(f(w))^z), 2*N*z)
S = Q / (Q + P)
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`where y and z are the spam and nonspam ESF values, the`
Packit Service	8f0814	`prod function returns the product of all its`
Packit Service	8f0814	`arguments, and S is forced to 0.5 if Q and P are both very near`
Packit Service	8f0814	`zero.`
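
Both scoring formulas above can be tried out with the following Python sketch. It is not bogofilter's implementation: `prbx` is assumed here to be the upper tail probability of the chi-squared distribution, computed with a textbook series and continued fraction for the regularized incomplete gamma function (needed because 2*N*y is generally not an integer), and the threshold `tiny` used for "very near zero" is likewise an assumption.

```python
import math


def gammaincc(a, x, eps=1e-12, max_iter=500):
    """Regularized upper incomplete gamma function Q(a, x), for a > 0."""
    if x <= 0.0:
        return 1.0
    log_prefix = -x + a * math.log(x) - math.lgamma(a)
    if x < a + 1.0:
        # Series expansion of the lower function P(a, x); Q = 1 - P.
        n = a
        term = 1.0 / a
        total = term
        for _ in range(max_iter):
            n += 1.0
            term *= x / n
            total += term
            if abs(term) < abs(total) * eps:
                break
        return 1.0 - total * math.exp(log_prefix)
    # Continued fraction for Q(a, x), evaluated with the modified Lentz method.
    floor = 1e-300
    b = x + 1.0 - a
    c = 1.0 / floor
    d = 1.0 / b
    h = d
    for i in range(1, max_iter + 1):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < floor:
            d = floor
        c = b + an / c
        if abs(c) < floor:
            c = floor
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h * math.exp(log_prefix)


def prbx(x2, df):
    """Probability that a chi-squared variable with df degrees of freedom exceeds x2."""
    return gammaincc(df / 2.0, x2 / 2.0)


def score_plain(f_values):
    """S = (1 + Q - P) / 2, without ESF. Every f(w) must lie strictly between 0 and 1."""
    n = len(f_values)
    p = prbx(-2.0 * sum(math.log(1.0 - f) for f in f_values), 2 * n)
    q = prbx(-2.0 * sum(math.log(f) for f in f_values), 2 * n)
    return (1.0 + q - p) / 2.0


def score_esf(f_values, y, z, tiny=1e-200):
    """S = Q / (Q + P) with ESF values y and z; ln(prod(...)^y) is y * sum(ln(...))."""
    n = len(f_values)
    p = prbx(-2.0 * y * sum(math.log(1.0 - f) for f in f_values), 2 * n * y)
    q = prbx(-2.0 * z * sum(math.log(f) for f in f_values), 2 * n * z)
    if p < tiny and q < tiny:
        return 0.5  # both tails vanish: no evidence either way
    return q / (q + p)


spammy = [0.99, 0.98, 0.95]  # hypothetical f(w) values
print(score_plain(spammy), score_esf(spammy, 0.75 ** 3, 0.75 ** 3))
```

A message whose tokens all have f(w) = 0.5 scores exactly 0.5 under either formula when y equals z, which is a quick sanity check for any implementation.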
Packit Service	8f0814
Packit Service	8f0814	`Determining the correct values to use for y and z is, like`
Packit Service	8f0814	`choosing the value for s, a matter of trial and error. A`
Packit Service	8f0814	`significant corpus of messages is required and users who don't have`
Packit Service	8f0814	`access to large collections of their spam and nonspam messages`
Packit Service	8f0814	`should probably opt to do without ESF. Useful values to try`
Packit Service	8f0814	`seem to be in the range of 0.75 raised to powers between 1 and`
Packit Service	8f0814	`20.`
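
As a small illustration, the candidate values in that range can be listed in Python. Restricting the exponents to whole numbers is an assumption of this sketch; the text above only gives the range.

```python
# Candidate ESF values: 0.75 raised to the powers 1 through 20
esf_candidates = [0.75 ** k for k in range(1, 21)]

for k, value in enumerate(esf_candidates, start=1):
    print(k, round(value, 5))
```

The list runs from 0.75 down to roughly 0.003, so the later candidates discount redundancy very heavily.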
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`THE SPAM AND NONSPAM CUTOFFS`
Packit Service	8f0814
Packit Service	8f0814	`Most spam messages will have scores very close to 1 when the`
Packit Service	8f0814	`Fisher method of combining the likelihood estimates is applied, and`
Packit Service	8f0814	`most nonspam will have scores very close to 0. In between,`
Packit Service	8f0814	`there is a grey area, where messages will fall that have both`
Packit Service	8f0814	`spammish and nonspammish characteristics. The spam and`
Packit Service	8f0814	`nonspam cutoffs are thresholds: bogofilter classes messages with`
Packit Service	8f0814	`scores below the nonspam cutoff as nonspam, and those with scores`
Packit Service	8f0814	`at or above the spam cutoff as spam. Messages with scores between`
Packit Service	8f0814	`the two cutoff values are classed as uncertain. (Usually,`
Packit Service	8f0814	`mail administrators will want to deliver mail classed as uncertain,`
Packit Service	8f0814	`even though some of it may well be spam.)`
Packit Service	8f0814
Packit Service	8f0814	`The best value for the spam cutoff depends strongly on the`
Packit Service	8f0814	`values of the x, s, MIN_DEV and ESF parameters that are being used`
Packit Service	8f0814	`in the calculation. For that reason, the way to test a given`
Packit Service	8f0814	`parameter set is the following:`
Packit Service	8f0814
Packit Service	8f0814
1. Determine a suitable level of false positives (nonspams
classified as spam); this will probably need to fall somewhere in
the range of 0.05 to 0.3 percent (the lower the better, of course,
except that lowering the false-positive target too much gives too
many false negatives).

2. Apply the parameters to classify a number of known nonspam
messages. From the distribution of scores, pick a spam cutoff value
that will give the selected proportion of false positives.

3. Apply the parameters and the chosen spam cutoff to classify a
number of spam messages; the parameter set should be judged on how
few false negatives (spam messages classed as uncertain or nonspam)
are obtained.
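
The three steps above can be sketched in Python, assuming the scores of the known nonspam and spam messages have already been collected into lists. The function names and the sample scores are hypothetical; `math.nextafter` requires Python 3.9 or later.

```python
import math


def pick_spam_cutoff(nonspam_scores, fp_target):
    """Lowest cutoff that lets at most fp_target (a proportion) of nonspams through as spam."""
    # The small epsilon guards against floating-point rounding in the product.
    allowed = math.floor(fp_target * len(nonspam_scores) + 1e-9)
    ranked = sorted(nonspam_scores, reverse=True)
    if allowed >= len(ranked):
        return 0.0
    # Scores at or above the cutoff count as spam, so the cutoff must sit
    # just above the highest nonspam score that is not allowed to be an error.
    return math.nextafter(ranked[allowed], math.inf)


def count_false_negatives(spam_scores, spam_cutoff):
    """Spams scoring below the spam cutoff (classed as uncertain or nonspam)."""
    return sum(1 for score in spam_scores if score < spam_cutoff)


# Made-up scores: 1000 nonspams, four of which look rather spammy
nonspam_scores = [0.01] * 996 + [0.6, 0.7, 0.9, 0.99]
spam_scores = [0.999, 0.95, 0.72, 0.65, 0.3]

cutoff = pick_spam_cutoff(nonspam_scores, fp_target=0.002)  # 0.2 percent
false_positives = sum(1 for score in nonspam_scores if score >= cutoff)
false_negatives = count_false_negatives(spam_scores, cutoff)
print(cutoff, false_positives, false_negatives)
```

Two parameter sets are then compared by running this with the same false-positive target and seeing which yields fewer false negatives.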
Packit Service	8f0814
Packit Service	8f0814
The value for the nonspam cutoff should be such that no more
than about one in ten thousand spams is classified as
nonspam. A value of 0.2 to 0.25 might be a good starting
point for use with the recommended s and MIN_DEV (0.1 and 0.35
respectively).
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`AN OVERVIEW OF BOGOTUNE`
Packit Service	8f0814
Packit Service	8f0814	`As already mentioned, bogotune attempts to automate the above`
Packit Service	8f0814	`process by doing a grid search over useful ranges of s, MIN_DEV, x,`
Packit Service	8f0814	`y and z (an option exists to disable ESF, i.e. leave the ESF values`
Packit Service	8f0814	`[y and z] at 1.0). Bogotune validates the test inputs,`
Packit Service	8f0814	`calculates a suitable cache size for the training database,`
Packit Service	8f0814	`calculates the starting x value as described above, and picks a`
Packit Service	8f0814	`false-positive target with which to run the grid search.`
Packit Service	8f0814	`There is an option to force a specific target (more about this`
Packit Service	8f0814	`shortly). If not coerced, the target is calculated based on`
Packit Service	8f0814	`the number of nonspam messages in the test set, and then adjusted`
Packit Service	8f0814	`downward to give a cutoff between 0.5 and 0.975, above 0.55 if`
Packit Service	8f0814	`possible. It's important to note that this false-positive`
Packit Service	8f0814	`target is chosen to facilitate the grid search, and is usually very`
Packit Service	8f0814	`much larger than one would want to see in production; there's no`
Packit Service	8f0814	`relation between the two. At the end of its run, bogotune`
Packit Service	8f0814	`attempts to suggest a reasonable production target.`
Packit Service	8f0814
Packit Service	8f0814	`With these preliminaries completed, bogotune performs two`
Packit Service	8f0814	`scans: The first is a coarse grid search over the entire`
Packit Service	8f0814	`useful range of each parameter, which should locate an approximate`
Packit Service	8f0814	`optimum. A finer grid, centered on the optimum derived from`
Packit Service	8f0814	`the coarse search, is then scanned to produce the final`
Packit Service	8f0814	`recommendations. At the end of each scan, outliers --`
Packit Service	8f0814	`apparently good parameter sets from which even slight deviation`
Packit Service	8f0814	`degrades performance significantly -- are rejected. Bogotune`
Packit Service	8f0814	`usually manages to find a robust parameter set that gives good`
Packit Service	8f0814	`discrimination between spam and nonspam messages, provided that`
Packit Service	8f0814	`sufficient training and test messages are supplied.`
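
The coarse-then-fine idea can be illustrated with a generic Python sketch. This is not bogotune's code: the grids, the step sizes and the `toy_cost` function (a stand-in for "false negatives obtained at the false-positive target") are invented for the example, and the rejection of outliers is left out.

```python
import itertools


def grid_search(evaluate, grids):
    """Return (best_params, best_cost) over the Cartesian product of the grids."""
    names = list(grids)
    best = None
    for combo in itertools.product(*(grids[name] for name in names)):
        params = dict(zip(names, combo))
        cost = evaluate(params)
        if best is None or cost < best[1]:
            best = (params, cost)
    return best


def refine(center, step, points=5):
    """A small grid of values centered on a previous optimum."""
    half = points // 2
    return [center + (i - half) * step for i in range(points)]


def toy_cost(params):
    # Hypothetical cost surface with its minimum at s = 0.1, min_dev = 0.35
    return (params["s"] - 0.1) ** 2 + (params["min_dev"] - 0.35) ** 2


# Coarse scan over the whole useful range of each parameter
coarse = {"s": [0.01, 0.1, 1.0], "min_dev": [0.05, 0.2, 0.35, 0.45]}
best_coarse, _ = grid_search(toy_cost, coarse)

# Finer scan centered on the coarse optimum
fine = {
    "s": refine(best_coarse["s"], 0.02),
    "min_dev": refine(best_coarse["min_dev"], 0.02),
}
best_fine, best_cost = grid_search(toy_cost, fine)
print(best_fine, best_cost)
```

In a real run each evaluation means scoring the whole test corpus, which is why the number of dimensions and grid points matters so much for running time.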
Packit Service	8f0814
Packit Service	8f0814	`There are two reasons why one might want to force a bogotune run`
Packit Service	8f0814	`to use a specific false-positive target. One is that`
Packit Service	8f0814	`sometimes bogotune's target calculation algorithm is overly`
Packit Service	8f0814	`optimistic, i.e. it occasionally sets the target too low. The`
Packit Service	8f0814	`symptom of this problem is that many parameter combinations in the`
Packit Service	8f0814	`coarse grid scan simply can't deliver that few false positives from`
Packit Service	8f0814	`the test message corpus. Manually increasing the target by 20`
Packit Service	8f0814	`to 30 percent usually fixes this. The other reason to coerce`
Packit Service	8f0814	`the target to a specific value is that one might want to compare`
Packit Service	8f0814	`two bogotune runs -- with and without ESF, for example -- and`
Packit Service	8f0814	`comparisons aren't valid unless both runs use the same target.`
Packit Service	8f0814
Packit Service	8f0814	`Bogotune may produce very different recommendations for very`
Packit Service	8f0814	`similar sets of spam and nonspam messages. That's not`
Packit Service	8f0814	`necessarily a defect. For many message populations, the`
Packit Service	8f0814	`scoring algorithm doesn't depend heavily on precise values of any`
Packit Service	8f0814	`of the parameters but the spam cutoff. In such cases it's`
Packit Service	8f0814	`common to see bogotune's coarse scan settle on one of several local`
Packit Service	8f0814	`optima that may have quite different parameter values. In`
Packit Service	8f0814	`general, the parameter values have more influence when a single`
Packit Service	8f0814	`training database is used for a large number of users, and are less`
Packit Service	8f0814	`crucial when the wordlist is for just one or a few users.`
Packit Service	8f0814
Packit Service	8f0814	`HOW OFTEN TO TUNE`
Packit Service	8f0814
Packit Service	8f0814	`It's probably wise to review the spam cutoff frequently.`
Packit Service	8f0814	`If the false-negative count is gratifyingly low, or if false`
Packit Service	8f0814	`positives are occurring, increasing the cutoff will reduce the`
Packit Service	8f0814	`false-positive rate. If, on the other hand, there are absolutely no`
Packit Service	8f0814	`false positives but the false-negative rate is high, lowering the`
Packit Service	8f0814	`cutoff a bit may improve discrimination.`
Packit Service	8f0814
Packit Service	8f0814	`How often to tune the other parameters depends on how fussy you`
Packit Service	8f0814	`are about optimizing performance. If you're eager to get the`
Packit Service	8f0814	`best discrimination you can, here are my recommendations: In`
Packit Service	8f0814	`the early stage, while the training database is still small and`
Packit Service	8f0814	`growing rapidly, it's probably wise to experiment with tuning x, s`
Packit Service	8f0814	`and MIN_DEV once a month or so. Once the training database is`
Packit Service	8f0814	`a good size (over 5000 spams and 5000 nonspams), this can be`
Packit Service	8f0814	`reduced to quarterly or half-yearly. If you use ESF (which I`
Packit Service	8f0814	`recommend you do), retuning should probably happen a bit more`
Packit Service	8f0814	`often, especially if you see that the characteristics of spam`
Packit Service	8f0814	`you're receiving seem to be changing.`
Packit Service	8f0814
Packit Service	8f0814	`FWIW my own practice, after two years' experience, is to review`
Packit Service	8f0814	`the spam cutoff monthly and do a bogotune run about quarterly.`
Packit Service	8f0814
Packit Service	8f0814	`A NOTE ON TRAINING`
Packit Service	8f0814
Packit Service	8f0814	`Bogofilter's ability to distinguish accurately between spam and`
Packit Service	8f0814	`nonspam messages depends on the quality of its training`
Packit Service	8f0814	`database. Here is a way of maximizing that quality with`
Packit Service	8f0814	`relatively little effort.`
Packit Service	8f0814
Packit Service	8f0814	`When starting afresh, feed every spam and every nonspam you get`
Packit Service	8f0814	`into bogofilter. Do not use bogofilter's -u option to do`
Packit Service	8f0814	`this: there will be far too many errors when your training database`
Packit Service	8f0814	`is small. Instead, classify messages manually and train`
Packit Service	8f0814	`bogofilter with the -n and -s options appropriately. You can`
Packit Service	8f0814	`do it in batches: if you work with standard mbox files, use a mail`
Packit Service	8f0814	`reader to move spam and nonspam into separate files, then do`
Packit Service	8f0814	`bogofilter -s < spambox`
Packit Service	8f0814	`and bogofilter -n < nonspambox`
Packit Service	8f0814	`to register the messages.`
Packit Service	8f0814
Packit Service	8f0814	`To find out how many spam and nonspam messages have gone into`
Packit Service	8f0814	`your wordlist, assuming it's kept in ~/.bogofilter, do`
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`bogoutil -w ~/.bogofilter .MSG_COUNT`
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`Once you've accumulated about 5,000 spam or nonspam messages in`
Packit Service	8f0814	`the list, you need to let the other count catch up unless they're`
Packit Service	8f0814	`growing at about the same rate. Stop adding every message to`
Packit Service	8f0814	`the larger of the two counts, and instead, add only messages that`
Packit Service	8f0814	`bogofilter got wrong or was unsure about. To do this, you`
Packit Service	8f0814	`need to start classifying messages into 4 sets instead of 2: spam,`
Packit Service	8f0814	`nonspam, unsures that were actually spam, and unsures that were`
Packit Service	8f0814	`nonspam.`
Packit Service	8f0814
Packit Service	8f0814	`Once the counts are both over 5,000 and fairly similar, you can`
Packit Service	8f0814	`train only on errors and unsures. By this time there should`
Packit Service	8f0814	`be very few errors (spams classed as nonspam or vice versa), but`
Packit Service	8f0814	`there will still be a proportion of unsure spam and unsure nonspam`
Packit Service	8f0814	`messages. Training on these will keep bogofilter working well, as`
Packit Service	8f0814	`you're telling it what it needs to learn, and not so much of what`
Packit Service	8f0814	`it already knows. If one list grows faster than the other, extra`
Packit Service	8f0814	`(correctly classified) messages may be added from time to time to`
Packit Service	8f0814	`equalize them again; try to keep the smaller list's message count`
Packit Service	8f0814	`at least two thirds of the larger's.`
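
One way to express this training policy as a rule is sketched below in Python. It is an interpretation of the advice above, not something bogofilter does for you: the function, its arguments and the label strings are hypothetical, and the proviso about the two counts growing at the same rate is ignored for simplicity.

```python
def should_train(true_label, verdict, spam_count, nonspam_count, limit=5000):
    """Decide whether a manually classified message should be registered.

    true_label: "spam" or "nonspam", as judged by a human
    verdict:    "spam", "nonspam" or "unsure", as reported by bogofilter
    """
    if true_label == "spam":
        own, other = spam_count, nonspam_count
    else:
        own, other = nonspam_count, spam_count

    # Errors and unsures are always worth training on.
    if verdict != true_label:
        return True
    # While this list is still small, register everything.
    if own < limit:
        return True
    # Otherwise add correctly classified messages only to top up a list
    # that has fallen below two thirds of the other one.
    return 3 * own < 2 * other


print(should_train("spam", "unsure", 6000, 6000))      # unsure: train
print(should_train("spam", "spam", 6000, 6000))        # already known: skip
print(should_train("nonspam", "nonspam", 9000, 5500))  # smaller list: top up
```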
Packit Service	8f0814
Packit Service	8f0814	`The use of bogofilter's -u option is convenient but`
Packit Service	8f0814	`dangerous. Even when well trained, bogofilter will`
Packit Service	8f0814	`misclassify a small number of messages. With the -u option,`
Packit Service	8f0814	`these mistakes are entered into the database and decrease its`
Packit Service	8f0814	`accuracy. More mistakes result and the process`
Packit Service	8f0814	`snowballs. You therefore need to review and correct the`
Packit Service	8f0814	`databases frequently, and in the interval between such reviews,`
Packit Service	8f0814	`bogofilter's effectiveness will keep falling off. The method`
Packit Service	8f0814	`described above has the advantage that (human error excepted) no`
Packit Service	8f0814	`wrongly classified messages are ever entered into the database, and`
Packit Service	8f0814	`if you have to leave it for a time without updating it, its`
Packit Service	8f0814	`effectiveness doesn't diminish.`
Packit Service	8f0814
Packit Service	8f0814
Packit Service	8f0814	`Much of the advice given here arises out of experiments reported`
Packit Service	8f0814	`on the author's bogofilter web pages; in particular, the report at`
Packit Service	8f0814	`http://www.bgl.nu/bogofilter/smindev3.html`
Packit Service	8f0814	`might interest those who'd like more information about the basis of`
Packit Service	8f0814	`bogofilter tuning.`
Packit Service	8f0814
Packit Service	8f0814	`Thanks go to David Relson for reviewing this document in draft`
Packit Service	8f0814	`and suggesting several improvements.`
Packit Service	8f0814
Packit Service	8f0814	`<address>[© Greg`
Packit Service	8f0814	`Louis, 2004; last modified 2004-09-09]</address>`
Packit Service	8f0814	`</body>`
Packit Service	8f0814	`</html>`
Packit Service	8f0814
