<html>
  <head>
    <meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type">
    <title>Bogotune FAQ</title>
    <style type="text/css">
      h2 {
        margin-top: 1em;
        font-size: 125%;
      }
      h3 {
        margin-top: 1em;
        font-size: 110%;
      }
      p {
        margin-top: 0.5em;
        margin-bottom: 0.5em;
      }
      ul {
        margin-top: 1.5em;
        margin-bottom: 0.5em;
      }
      ul ul {
        margin-top: 0.25em;
        margin-bottom: 0;
      }
      li {
        margin-top: 0;
        margin-bottom: 1em;
      }
      li li {
        margin-bottom: 0.25em;
      }
      dt {
        margin-top: 0.5em;
        margin-bottom: 0;
      }
      hr {
        margin-top: 1em;
        margin-bottom: 1em;
      }
    </style>
  </head>
  <body>

Bogotune FAQ

Official Versions: In bogotune-faq
Maintainer: David Relson <relson@osagesoftware.com>

This document is intended to answer frequently asked questions
about bogotune.

  • Where did bogotune come from?
  • What's the message count format?
  • How does bogotune work?
  • How does bogotune ensure the messages it works with are numerous
    enough, and well enough classified, to deliver useful
    recommendations?
  • Can I tell bogotune to do its work even though it doesn't like
    the data?

    Where did bogotune come from?

    Greg Louis wrote the original Robinson geometric-mean and
    Robinson-Fisher algorithm code for bogofilter.  To determine the
    optimal parameters for the Robinson-Fisher algorithm, he wrote
    bogotune.  The initial implementation was written in the R
    programming language and was followed by a Perl implementation.
    Both of these implementations were slow because bogofilter had to
    be run for each message being scored.  David Relson translated
    bogotune from Perl to C to provide more speed.

    What's the message count format?

    The parsing of a message by bogofilter takes some time.  After
    parsing, finding the spam and non-spam counts for each token takes
    additional time.  Having to repeat these steps every time
    bogotune needed a score was slow.  It was realized that parsing
    and look-up could be done once, with the results saved in a
    special format.  Initially this was called the bogolex format
    because the work was done by piping bogolexer output to bogoutil
    and formatting the result.  Since each processed message begins
    with the .MSG_COUNT token, the format became known as the message
    count format.  The convention is to use a .mc extension for these
    files.
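
    The saving from doing the parse and look-up only once can be
    sketched as below.  This is an illustration in Python, not
    bogotune's code, and it does not reproduce the actual .mc file
    layout.  The TokenCounts tuple, the function names, and the
    sample counts are assumptions made for the example, and the score
    is a plain mean of Robinson's f(w) rather than the Robinson-Fisher
    combination.

```python
from typing import NamedTuple


class TokenCounts(NamedTuple):
    """Per-token counts looked up once and then kept (hypothetical layout)."""
    spam: int  # times the token was seen in spam
    ham: int   # times the token was seen in ham


def robinson_f(tc, spam_msgs, ham_msgs, robs, robx):
    """Robinson's smoothed spam probability f(w) for one token."""
    n = tc.spam + tc.ham
    if n == 0:
        return robx  # unknown token: fall back to the assumed prior
    bad = tc.spam / spam_msgs
    good = tc.ham / ham_msgs
    p = bad / (bad + good)
    return (robs * robx + n * p) / (robs + n)


def score(message, spam_msgs, ham_msgs, robs, robx):
    """Simplified score: mean of f(w) over the cached tokens."""
    values = [robinson_f(tc, spam_msgs, ham_msgs, robs, robx) for tc in message]
    return sum(values) / len(values)


# One message, parsed and looked up once; the counts are invented.
cached_message = [TokenCounts(40, 2), TokenCounts(1, 30), TokenCounts(0, 0)]

# Rescoring with different parameters needs no further parsing or look-up.
scores = {
    robx: score(cached_message, 1000, 1000, robs=0.1, robx=robx)
    for robx in (0.4, 0.5, 0.6)
}
```

    Only the arithmetic is repeated for each parameter set, which is
    the part that has to change from one trial to the next.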

    How does bogotune work?

    First it reads all the files into memory, i.e. the wordlist,
    the ham messages, and the spam messages.  From the wordlist tokens,
    it computes an initial robx value, which is used in the initial
    scan of the messages to ensure they're usable.

    Given the total number of messages in the test set, a target
    number of false positives is selected for use in determining spam
    cutoff values in the individual scans.
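
    One way to turn a false-positive target into a cutoff is sketched
    below.  It is a minimal Python illustration, not bogotune's code:
    the function names and sample scores are made up, the target is
    simply passed in (this FAQ does not say how it is derived from the
    message count), and a message is treated as spam only when its
    score is strictly above the cutoff.

```python
def cutoff_for_target(ham_scores, target_fp):
    """Return a cutoff that at most target_fp ham scores exceed."""
    if not 0 <= target_fp < len(ham_scores):
        raise ValueError("target_fp must be smaller than the number of hams")
    ranked = sorted(ham_scores, reverse=True)
    # Only the target_fp highest-scoring hams can lie strictly above this.
    return ranked[target_fp]


def count_false_positives(ham_scores, cutoff):
    """Hams that would be classified as spam."""
    return sum(1 for s in ham_scores if s > cutoff)


def count_false_negatives(spam_scores, cutoff):
    """Spams that would be classified as ham."""
    return sum(1 for s in spam_scores if s <= cutoff)


# Invented scores for illustration only.
ham = [0.01, 0.02, 0.05, 0.30, 0.62, 0.08]
spam = [0.99, 0.95, 0.25, 0.70]

cutoff = cutoff_for_target(ham, target_fp=1)
false_positives = count_false_positives(ham, cutoff)
false_negatives = count_false_negatives(spam, cutoff)
```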

    Then comes the coarse scan.  It uses 225 combinations of values
    chosen to span the potentially useful ranges for robs, robx, and
    min_dev.  For each combination, all the ham messages are scored
    and the target value is used to find a spam_cutoff score.  Then
    the spam messages are scored and the false negatives are counted.
    The scan finishes with a listing of the ten best sets of
    parameters and their scores (false negative and false positive
    counts and percentages).
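
    The shape of such a scan can be sketched in Python as follows.
    This is an assumption-laden illustration rather than bogotune's
    code.  The FAQ gives only the total of 225 combinations, so the
    5 x 5 x 9 split and every grid value below are placeholders, and
    score_fn stands for whatever scoring routine is being tuned.

```python
from itertools import product

# Placeholder grids (assumed): 5 * 5 * 9 = 225 combinations.
ROBS_VALUES = [1.0, 0.3162, 0.1, 0.0316, 0.01]
ROBX_VALUES = [0.40, 0.45, 0.50, 0.55, 0.60]
MIN_DEV_VALUES = [0.02, 0.07, 0.12, 0.17, 0.22, 0.27, 0.32, 0.37, 0.42]


def coarse_scan(ham, spam, score_fn, target_fp, keep=10):
    """Try every grid point and return the `keep` best by false negatives.

    score_fn(message, robs, robx, min_dev) must return a score in [0, 1].
    """
    if not 0 <= target_fp < len(ham):
        raise ValueError("target_fp must be smaller than the number of hams")
    results = []
    for robs, robx, min_dev in product(ROBS_VALUES, ROBX_VALUES, MIN_DEV_VALUES):
        ham_scores = sorted(
            (score_fn(m, robs, robx, min_dev) for m in ham), reverse=True
        )
        # At most target_fp hams score strictly above this cutoff.
        cutoff = ham_scores[target_fp]
        false_neg = sum(
            1 for m in spam if score_fn(m, robs, robx, min_dev) <= cutoff
        )
        results.append((false_neg, robs, robx, min_dev, cutoff))
    results.sort(key=lambda r: r[0])  # fewest false negatives first
    return results[:keep]
```

    Each returned tuple holds the false-negative count followed by
    the parameters and cutoff that produced it.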

    From the results, the best non-outlying result is picked and
    these parameters become the starting point for the fine scan.

    The fine scan, as the name suggests, scans the region (range of
    values of robs, robx and min_dev) surrounding the optimum found in
    the coarse scan, with smaller intervals so as to determine the
    optimum values more precisely.
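
    The refinement idea can be sketched in Python as below.  This is
    a hedged illustration, not bogotune's implementation: fine_grid,
    fine_scan, the half-widths, and the number of steps are all
    hypothetical, and objective stands for a function that returns
    the false-negative count (or any other cost) for one parameter
    set.

```python
from itertools import product


def fine_grid(center, half_width, steps):
    """Evenly spaced values in [center - half_width, center + half_width]."""
    if steps < 2:
        return [center]
    step = 2 * half_width / (steps - 1)
    return [center - half_width + i * step for i in range(steps)]


def fine_scan(best, half_widths, objective, steps=5):
    """Search a small grid around `best` and return the lowest-cost point.

    best        -- (robs, robx, min_dev) picked from the coarse scan
    half_widths -- how far to look on each side of each parameter
    objective   -- callable(robs, robx, min_dev) returning a cost
    """
    grids = [fine_grid(c, w, steps) for c, w in zip(best, half_widths)]
    return min(product(*grids), key=lambda params: objective(*params))
```

    Because the grid spacing is smaller than in the coarse scan, the
    returned point can land between two coarse grid values.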

    How does bogotune ensure the messages it works with are numerous
    enough, and well enough classified, to deliver useful
    recommendations?

    It has certain minimum requirements that it checks for as it
    starts up.  It will complain (and halt) if there are fewer than
    2,000 ham or 2,000 spam in the wordlist, or if there are fewer
    than 500 ham or 500 spam in the set of test messages.  It will
    warn, but not halt, if there's too little scoring variation in the
    ham messages or the spam messages or if too many of the ham
    messages score as spam (or vice versa) on the initial pass.  There
    are additional checks, but I'm sure you get the idea from these
    examples.  For details, use the source :)
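
    The halting checks on message counts can be sketched in Python
    as below.  Only the thresholds (2,000 and 500) come from the text
    above.  The function name, its arguments, and the message wording
    are hypothetical, and the warning-only checks are left out
    because their criteria are not given here.

```python
MIN_WORDLIST_MSGS = 2000  # minimum ham and minimum spam in the wordlist
MIN_TEST_MSGS = 500       # minimum ham and minimum spam in the test set


def check_inputs(wordlist_ham, wordlist_spam, test_ham, test_spam):
    """Return a list of fatal problems; an empty list means the run may go on."""
    checks = [
        ("wordlist ham", wordlist_ham, MIN_WORDLIST_MSGS),
        ("wordlist spam", wordlist_spam, MIN_WORDLIST_MSGS),
        ("test ham", test_ham, MIN_TEST_MSGS),
        ("test spam", test_spam, MIN_TEST_MSGS),
    ]
    return [
        f"{label}: {count} messages, need at least {minimum}"
        for label, count, minimum in checks
        if count < minimum
    ]


problems = check_inputs(
    wordlist_ham=1999, wordlist_spam=2000, test_ham=500, test_spam=499
)
```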

    Can I tell bogotune to do its work even though it doesn't like
    the data?

    No.  At one time we had a -F option to force bogotune to run
    with unsuitable message data, but it was realized that this could
    be misleading and had little chance of being helpful.  Bogotune
    will warn the operator if its conclusions are untrustworthy due to
    marginal input, and will not run if its input data are detectably
    inadequate.

    </body>
    </html>