<html>
<head>
<meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type">
<title>Bogotune FAQ</title>
<style type="text/css">
h2 {
  margin-top: 1em;
  font-size: 125%;
}
h3 {
  margin-top: 1em;
  font-size: 110%;
}
p {
  margin-top: 0.5em;
  margin-bottom: 0.5em;
}
ul {
  margin-top: 1.5em;
  margin-bottom: 0.5em;
}
ul ul {
  margin-top: 0.25em;
  margin-bottom: 0;
}
li {
  margin-top: 0;
  margin-bottom: 1em;
}
li li {
  margin-bottom: 0.25em;
}
dt {
  margin-top: 0.5em;
  margin-bottom: 0;
}
hr {
  margin-top: 1em;
  margin-bottom: 1em;
}
</style>
</head>
<body>
Bogotune FAQ

Official Versions: In
bogotune-faq
Maintainer: David Relson <relson@osagesoftware.com>

This document is intended to answer frequently asked questions about bogotune.

Where did bogotune come from?
What's the message count format?
How does bogotune work?
How does bogotune ensure the messages it works with are numerous enough, and well enough classified, to deliver useful recommendations?
Can I tell bogotune to do its work even though it doesn't like the data?

Where did bogotune come from?

Greg Louis wrote the original Robinson geometric-mean and Robinson-Fisher algorithm code for bogofilter. He wrote bogotune to determine the optimal parameters for the Robinson-Fisher algorithm. The initial implementation was written in the R programming language, and a Perl implementation followed. Both were slow because bogofilter had to be run for each message being scored. David Relson translated bogotune from Perl to C to make it faster.

What's the message count format?

Parsing a message takes bogofilter some time, and finding the spam and non-spam counts for each token afterwards takes additional time. Having to repeat these steps every time bogotune needed a score was slow. It was realized that parsing and look-up could be done once, with the results saved in a special format. Initially this was called the bogolex format because the work was done by piping bogolexer output to bogoutil and formatting the result. Since each processed message begins with the .MSG_COUNT token, the format became known as the message count format. The convention is to use a .mc extension for these files.
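
The idea of doing the parsing and look-up once can be sketched as follows. This is only an illustration: the line layout ("token spam_count ham_count"), the function names, and the dictionary-based wordlist are assumptions made for the example, not bogotune's actual file format or code. The one detail taken from the text above is that each saved message starts with the .MSG_COUNT token.

```python
def to_message_count(tokens, wordlist):
    """Serialize one parsed message: a .MSG_COUNT line, then per-token counts.

    `wordlist` maps token -> (spam_count, ham_count); this layout is hypothetical.
    """
    spam_total, ham_total = wordlist[".MSG_COUNT"]
    lines = [f".MSG_COUNT {spam_total} {ham_total}"]
    for tok in tokens:
        spam_n, ham_n = wordlist.get(tok, (0, 0))  # unknown tokens count as 0/0
        lines.append(f"{tok} {spam_n} {ham_n}")
    return "\n".join(lines)


def split_messages(text):
    """Split saved text back into messages, starting a new one at each .MSG_COUNT."""
    messages = []
    for line in text.splitlines():
        name, spam_n, ham_n = line.rsplit(" ", 2)
        if name == ".MSG_COUNT":
            messages.append([])
        else:
            messages[-1].append((name, int(spam_n), int(ham_n)))
    return messages
```

Once messages are stored this way, a scoring pass only needs the saved counts; it never has to re-parse the mail or query the wordlist again.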

How does bogotune work?

First it reads all the files into memory: the wordlist, the ham messages, and the spam messages. From the wordlist tokens, it computes an initial robx value, which is used in the initial scan of the messages to ensure they're usable.

Given the total number of messages in the test set, a target number of false positives is selected for use in determining spam cutoff values in the individual scans.

Then comes the coarse scan. For each of 225 combinations of values chosen to span the potentially useful ranges for robs, robx, and min_dev, all the ham messages are scored and the target value is used to find a spam_cutoff score. Then the spam messages are scored and the false negatives are counted. The scan finishes with a listing of the ten best sets of parameters and their scores (false negative and false positive counts and percentages).
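
A minimal sketch of such a coarse scan is shown below. It is not bogotune's code: the grid values are hypothetical (chosen only so that 5 x 9 x 5 = 225), messages are assumed to be lists of (spam_count, ham_count) pairs, and a plain mean of Robinson-style token probabilities stands in for the Robinson-Fisher combination that bogofilter actually uses.

```python
from itertools import product

# Hypothetical grids; only their product (5 * 9 * 5 = 225) matches the text.
ROBS_VALS = [1.0, 0.3, 0.1, 0.03, 0.01]
ROBX_VALS = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70]
MD_VALS = [0.02, 0.1, 0.2, 0.3, 0.4]


def token_prob(spam_n, ham_n, robs, robx):
    """Robinson-style smoothed spam probability for one token."""
    n = spam_n + ham_n
    p = spam_n / n if n else robx
    return (robs * robx + n * p) / (robs + n)


def score(message, robs, robx, min_dev):
    """Mean of token probabilities at least min_dev away from 0.5.

    The mean is a simplification standing in for Fisher's method.
    """
    probs = [token_prob(s, h, robs, robx) for s, h in message]
    kept = [f for f in probs if abs(f - 0.5) >= min_dev]
    return sum(kept) / len(kept) if kept else 0.5


def cutoff_for(ham_scores, target_fp):
    """Cutoff such that at most target_fp ham messages score strictly above it."""
    ranked = sorted(ham_scores, reverse=True)
    return ranked[min(target_fp, len(ranked) - 1)]


def coarse_scan(ham, spam, target_fp, robs_vals, robx_vals, md_vals):
    """Try every parameter combination; return the ten with the fewest false negatives."""
    results = []
    for robs, robx, md in product(robs_vals, robx_vals, md_vals):
        cutoff = cutoff_for([score(m, robs, robx, md) for m in ham], target_fp)
        # A spam message that does not score above the cutoff is a false negative.
        fn = sum(score(m, robs, robx, md) <= cutoff for m in spam)
        results.append((fn, cutoff, robs, robx, md))
    results.sort(key=lambda r: r[0])
    return results[:10]
```

Each result tuple holds the false negative count, the cutoff derived from the ham scores, and the three parameter values, so the best candidates can be compared directly.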

From the results, the best non-outlying result is picked, and its parameters become the starting point for the fine scan.

The fine scan, as the name suggests, scans the region (range of values of robs, robx, and min_dev) surrounding the optimum found in the coarse scan, using smaller intervals so as to determine the optimum values more precisely.
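
One simple way to picture the refinement is to build a denser set of values around each coarse optimum. The sketch below assumes evenly spaced values within a fixed half-width of the optimum; the function names, the half-widths, and the number of steps are all hypothetical, and bogotune's real fine scan may space and bound its values differently.

```python
def fine_values(center, half_width, steps):
    """Return `steps` evenly spaced values covering center +/- half_width."""
    if steps < 2:
        return [center]
    step = 2 * half_width / (steps - 1)
    return [center - half_width + i * step for i in range(steps)]


def fine_grid(best, half_widths, steps=5):
    """Build one list of candidate values per parameter around the coarse optimum.

    `best` and `half_widths` map parameter names (e.g. "robx") to numbers.
    """
    return {name: fine_values(best[name], half_widths[name], steps) for name in best}
```

For example, fine_grid({"robx": 0.5}, {"robx": 0.1}) yields five robx candidates between roughly 0.4 and 0.6, a much tighter spacing than a coarse grid spanning the whole useful range.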

How does bogotune ensure the messages it works with are numerous enough, and well enough classified, to deliver useful recommendations?

It has certain minimum requirements that it checks as it starts up. It will complain (and halt) if there are fewer than 2,000 ham or 2,000 spam in the wordlist, or if there are fewer than 500 ham or 500 spam in the set of test messages. It will warn, but not halt, if there's too little scoring variation in the ham messages or the spam messages, or if too many of the ham messages score as spam (or vice versa) on the initial pass. There are additional checks, but these examples should give the idea. For details, use the source :)
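
The halting checks described above amount to a few threshold comparisons. The sketch below uses the thresholds quoted in the text; the function name, its arguments, and the message wording are hypothetical, and the warning-only checks (score variation, misclassified messages) are left out because the text gives no limits for them.

```python
MIN_WORDLIST = 2000  # minimum ham and spam counts in the wordlist (from the text)
MIN_TEST = 500       # minimum ham and spam counts in the test messages (from the text)


def startup_errors(wordlist_ham, wordlist_spam, test_ham, test_spam):
    """Return a list of fatal problems; an empty list means the scan may proceed."""
    checks = [
        ("wordlist ham", wordlist_ham, MIN_WORDLIST),
        ("wordlist spam", wordlist_spam, MIN_WORDLIST),
        ("test ham", test_ham, MIN_TEST),
        ("test spam", test_spam, MIN_TEST),
    ]
    return [
        f"too few {label} messages: {count} < {minimum}"
        for label, count, minimum in checks
        if count < minimum
    ]
```

A caller would halt when the returned list is non-empty, which mirrors the "complain (and halt)" behaviour rather than the softer warnings.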

Can I tell bogotune to do its work even though it doesn't like the data?

No. At one time we had a -F option to force bogotune to run with unsuitable message data, but we realized that this could be misleading and had little chance of being helpful. Bogotune will warn the operator if its conclusions are untrustworthy due to marginal input, and will not run if its input data are detectably inadequate.
</body>
</html>