<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
  <head>
    <meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type">
    <title>Bogofilter FAQ</title>
    <style type="text/css">
      h2 {
	margin-top: 1em;
	font-size: 125%;
      }
      h3 {
	margin-top: 1em;
	font-size: 110%;
      }
      p {
	margin-top: 0.5em;
	margin-bottom: 0.5em;
      }
      ul {
	margin-top: 1.5em;
	margin-bottom: 0.5em;
      }
      ul ul {
	margin-top: 0.25em;
	margin-bottom: 0;
      }
      li {
	margin-top: 0;
	margin-bottom: 1em;
      }
      li li {
	margin-bottom: 0.25em;
      }
      dt {
	margin-top: 0.5em;
	margin-bottom: 0;
      }
      hr {
	margin-top: 1em;
	margin-bottom: 1em;
      }
    </style>
  </head>

  <body>
    <h1>Bogofilter FAQ</h1>

    <p>Official Versions: In
    <a href="http://bogofilter.sourceforge.net/faq.shtml">English</a> or
    <a href="http://bogofilter.sourceforge.net/faq_fr.shtml">French</a> or
    <a href="http://bogofilter.sourceforge.net/faq_it.shtml">Italian</a> or
    <a href="http://bogofilter.sourceforge.net/faq_bg.shtml">Bulgarian</a><br>
    Current maintainer since 2013: Matthias Andree &lt;m-a@users.sf.net&gt;<br>
    Previous maintainer of ten years: David Relson &lt;relson@osagesoftware.com&gt;</p>

    <p>This document is intended to answer frequently asked
    questions about bogofilter.</p>

    <h2>Typographic conventions</h2>
    <ul><li>If we show example commands that start with a
    dollar sign ($), this means that these commands should be executed
    by an unprivileged user, NOT the root user.</li>
    <li>If we show example commands that start with a hash mark (#),
    this means that these commands need to be executed by the root
    user.</li>
    </ul>

    <h1>Frequently asked questions and their answers</h1>

    <ul>
      <li>
        General Information
        <ul>
          <li><a href="#what-is-bogofilter">What is bogofilter?</a></li>
          <li><a href="#bogo-what">Bogo what?</a></li>
          <li><a href="#bogo-how">How does bogofilter work?</a></li>
          <li><a href="#lists">Bogofilter Mailing Lists</a></li>
        </ul>
      </li>

      <li>
        Operational Questions
        <ul>
          <li><a href="#training">How do I start my bogofilter training?</a></li>
          <li><a href="#training-maildirs">How do I train using maildirs?</a></li>
          <li><a href="#production">How can I keep the scoring accuracy high?</a></li>
          <li><a href="#mboxformats">What mailbox (file) formats does bogofilter understand?</a></li>
          <li><a href="#vvv">What does bogofilter's verbose output mean?</a></li>
          <li><a href="#unsure">What is <i>Unsure</i> mode?</a></li>
          <li><a href="#train-on-error">What are "training on error" and
          "training to exhaustion"?</a></li>
          <li><a href="#autoupdate">What does the '-u' (autoupdate) switch do?</a></li>
          <li><a href="#spamassassin">How can I use SpamAssassin to train Bogofilter?</a></li>
          <li><a href="#asian-spam">What can I do about Asian spam?</a></li>
        </ul>
      </li>

      <li>
        Database Questions
        <ul>
          <li><a href="#compact-database">How can I compact my database?</a></li>
          <li><a href="#query-database">How do I manually query the database?</a></li>
          <li><a href="#multiple">Can I use multiple wordlists?</a></li>
          <li><a href="#ignore">Can I tell bogofilter to ignore certain tokens?</a></li>
          <li><a href="#update">How do I upgrade from separate word
          databases to the combined wordlist format?</a></li>
          <li><a href="#unicode">How can I convert my wordlist to/from unicode?</a></li>
          <li><a href="#rescue">How can I tell if my wordlists are corrupted?</a></li>
        </ul>
      </li>

      <li>
        Berkeley DB Questions
        <ul>
          <li><a href="#enable-transactions">How can I switch from
          non-transaction to transaction mode?</a></li>
          <li><a href="#disable-transactions">How can I switch from
          transaction to non-transaction mode?</a></li>
          <li><a href="#locksize">Why does bogofilter die after printing<br>
              "Lock table is out of available locks" or<br>
              "Lock table is out of available object entries"?</a></li>
          <li><a href="#page-notfound">Why am I getting DB_PAGE_NOTFOUND messages?</a></li>
          <li><a href="#db-private">Why am I getting &quot;Berkeley DB
              library configured to support only DB_PRIVATE
              environments&quot; or<br>
              &quot;Berkeley DB library configured to support only
              private environments&quot;?</a></li>
        </ul>
      </li>

      <li>
        Technical problems
        <ul>
          <li><a href="#multi-user">Can bogofilter be used in a multi-user environment?</a></li>
          <li><a href="#nfs">Can I share wordlists over NFS?</a></li>
          <li><a href="#return-codes">Why does bogofilter give return codes like 0 and
          256 when it's run from inside a program?</a></li>
          <li><a href="#changed-options">Now that I've upgraded why are my scripts broken?</a></li>
          <li><a href="#changed-tagging">Now that I've upgraded why is bogofilter working less well?</a></li>
          <li><a href="#remove-spam-or-nonspam">How can I delete all the spam (or non-spam) tokens?</a></li>
        </ul>
      </li>

      <li>
        Build and Portability Problems
        <ul>
          <li><a href="#port-notes">How do I get bogofilter working on Solaris, BSD, etc.?</a></li>
          <li><a href="#make-notes">Can I use the make command on my operating system?</a></li>
          <li><a href="#build">How do I build bogofilter as non-root
          user or for a non-standard installation prefix?</a></li>
          <li><a href="#patch">How do I build bogofilter with patches?</a></li>
          <li><a href="#smaller">How do I make the executables smaller?</a></li>
          <li><a href="#relativepath">datastore_db.c does not compile!</a></li>
        </ul>
      </li>

      <li>
        Using Bogofilter with different mail programs
        <ul>
          <li><a href="#which-muas">With which mail programs does
          bogofilter work?</a></li>

          <li><a href="#with-mutt">How do I use bogofilter with
          mutt?</a></li>

          <li><a href="#with-sc">How do I use bogofilter with Sylpheed
          Claws?</a></li>

          <li><a href="#with-vm">How do I use bogofilter with VM (an
          Emacs Mail tool)?</a></li>

          <li><a href="#with-mh-e">How do I use bogofilter with MH-E
          (the Emacs interface to the MH mail system)?</a></li>

        </ul>
      </li>
    </ul>

    <hr>

    <h2 id="what-is-bogofilter">What is bogofilter?</h2>

    <p>Bogofilter is a fast Bayesian spam filter along the lines
    suggested by <a href="http://www.paulgraham.com/">Paul Graham</a>
    in his article
    <a href="http://www.paulgraham.com/spam.html">A Plan For Spam</a>.
    bogofilter uses
    <a href="http://radio-weblogs.com/0101454/stories/2002/09/16/spamDetection.html">Gary Robinson</a>'s
    geometric-mean algorithm with the
    <a href="http://www.linuxjournal.com/article/6467">Fisher's method modification</a>
    to classify email as spam or non-spam.</p>

    <p>The bogofilter
    <a href="http://bogofilter.sourceforge.net/">home page</a> at
    SourceForge is the central clearinghouse for bogofilter
    resources.</p>

    <p>Bogofilter was started by
    <a href="http://catb.org/~esr/">Eric S. Raymond</a> on August 19,
    2002. It gained popularity in September 2002, and a number of other
    authors have started to contribute to the project.</p>

    <p>The <a href="http://bogofilter.sourceforge.net/NEWS">NEWS file</a>
    describes bogofilter's version history starting with version 1.0.0.
    Older news (before release 1.0.0) is in the <a
	href="http://bogofilter.sourceforge.net/NEWS.0">NEWS.0 file</a>.</p>

    <hr>

    <h2 id="bogo-what">Bogo-what?</h2>

    <p>Bogofilter is a kind of
    <a href="http://www.catb.org/~esr/jargon/html/B/bogometer.html">bogometer</a> or
    <a href="http://www.catb.org/~esr/jargon/html/B/bogon-filter.html">bogon filter</a>,
    i.e., it tries to identify
    <a href="http://www.catb.org/~esr/jargon/html/B/bogus.html">bogus</a>
    mail by measuring the
    <a href="http://www.catb.org/~esr/jargon/html/B/bogosity.html">bogosity</a>.</p>

    <hr>

    <h2 id="bogo-how">How does bogofilter work?</h2>

    <p>See the man page's
    <a href="http://bogofilter.sourceforge.net/man_page.shtml#theory">THEORY OF OPERATION</a>
    section for an introduction. The main source for understanding
    this is Gary Robinson's Linux Journal article
    <a href="http://www.linuxjournal.com/article/6467">"A Statistical Approach to the Spam Problem"</a>.</p>

    <p>After you read all this you might ask some questions. The first
    could be "Is bogofilter really a Bayesian spam filter?"
    Bogofilter is based on Bayes' theorem, which it uses in the initial
    calculations; later stages use other statistical methods.  Without
    doubt it is a statistical spam filter with a Bayesian flavor.</p>

    <p>Other questions might concern the basic
    assumptions of Bayes' theory. Two short answers are: "No, they are
    not satisfied" and "We don't care as long as it works". A longer
    answer will mention that the basic assumption that "an e-mail is a
    random collection of words, each independent of the others" is
    violated.  There are several places where practice doesn't follow
    theory.  Some are always present, and some depend on
    the way you use bogofilter:</p>

    <ul>
    <li>Words in an e-mail are by no means independent.  In all languages,
    the opposite is true.</li>
    <li>The words used are not random, though some spammers include random
    words.</li>
    <li>Full training, or training with a random sample, follows Bayes'
    principles. Choosing which messages to use for training violates
    the assumption that the training messages are a random sample of
    the messages received.  This principle is also violated by
    bogofilter's auto-update function (with the thresh_update
    parameter), <a href="#train-on-error">training on error</a>, or
    anything similar to these approaches.</li>
    <li>The same applies if you train with the same message more than once.</li>
    <li>Other problems arise if you modify your database by removing tokens
    (like using bogoutil with -a or -c).</li>
    <li>Undoubtedly there are more.</li>
    </ul>

    <p>As the man page explains, bogofilter tries to understand how
    badly the null hypothesis fails. Some people argue that "those
    departures from reality usually work in our favor" (from Gary's
    article). Others argue that, even then, we should not violate the
    assumptions too much. Nobody really <em>knows</em>. Just keep in mind that
    problems might occur if you push too hard. The key to bogofilter's
    approach is: What matters most is simply what works in the real
    world.</p>

    <p>Now that you have been warned, have fun and use bogofilter as
    suits you best.</p>

    <hr>

    <h2 id="lists">Mailing Lists</h2>

    <p>There are currently four mailing lists for bogofilter:</p>

    <table border="1" width="100%" summary="[You need a table-capable
        browser to read this overview of mailing lists]">
      <tr>
        <th>List Address</th>
        <th>Links</th>
        <th>Description</th>
      </tr>

      <tr>
        <td>bogofilter-announce@bogofilter.org</td>
        <td><a href="http://www.bogofilter.org/mailman/listinfo/bogofilter-announce">[subscribe]</a>
        [archives: <a
        href="http://www.bogofilter.org/pipermail/bogofilter-announce/">mailman</a>]
        </td>
        <td>An announcement-only list where new versions are
        announced.</td>
      </tr>

      <tr>
        <td>bogofilter@bogofilter.org</td>
        <td><a href="http://www.bogofilter.org/mailman/listinfo/bogofilter">[subscribe]</a>
        [archives: <a
        href="http://www.bogofilter.org/pipermail/bogofilter/">mailman</a>]
        </td>
        <td>A discussion list where any conversation about
        bogofilter may take place.</td>
      </tr>

      <tr>
        <td>bogofilter-dev@bogofilter.org</td>
        <td><a href="http://www.bogofilter.org/mailman/listinfo/bogofilter-dev">[subscribe]</a>
        [archives: <a
        href="http://www.bogofilter.org/pipermail/bogofilter-dev/">mailman</a>]
        </td>
        <td>A list for sharing patches, development, and technical
        discussions.</td>
      </tr>

      <tr>
        <td>bogofilter-cvs@lists.sourceforge.net</td>
        <td><a href="https://lists.sourceforge.net/lists/listinfo/bogofilter-cvs">[subscribe]</a>
        <a href="http://sourceforge.net/p/bogofilter/mailman/bogofilter-cvs/">[archive]</a></td>
        <td>Mailing list for announcing code changes to the SVN
        archive. (The CVS name is a leftover from before the migration,
        kept for our users' convenience.)</td>
      </tr>
    </table>

    <p>The bogofilter-announce list is moderated and is used only for
    important announcements (e.g., new versions).  It is low traffic.
    If you have subscribed to the users' list or the developers' list,
    you don't need to subscribe to the announce list.  Messages posted
    to the announce list are also distributed to the others.</p>

    <hr>

    <h2 id="training">How do I start my bogofilter training?</h2>

    <p>To classify messages as ham (non-spam) or spam, bogofilter
    needs to learn from your mail. To start with, it is best to have
    collections (as large as possible) of messages you know
    for sure are ham or spam. Errors here will cause problems later,
    so try hard <code>;-)</code>. Warning: Only use your own mail; using other
    collections (like a spam collection found on the web) might cause
    bogofilter to draw the wrong conclusions. After all, you want it to
    understand <em>your</em> mail.</p>

    <p>Once you have the spam and ham collections, you have basically
    four choices. In all cases it works better if your training base
    (the above collections) is bigger, rather than smaller. The
    smaller your training collection is, the higher the number of
    errors bogofilter will make in production. Let's assume your
    collection is two mbox files:  ham.mbox and spam.mbox.</p>

    <ul>
    <li><p>Method 1) Full training. Train bogofilter with all your messages. In
    our example:</p>

    <pre>    bogofilter -s &lt; spam.mbox
    bogofilter -n &lt; ham.mbox</pre></li>
    </ul>

    <p>Note: Bogofilter's contrib directory includes two scripts that
    both use a train-on-error technique. This technique scores each
    message and adds to the database only those messages that were
    scored incorrectly (messages scored as uncertain, ham scored as
    spam, or spam scored as ham). The goal is to build a database of
    those words <em>needed</em> to correctly classify messages. The
    resulting database is smaller than the one built using full
    training.</p>

    <ul>
    <li><p>Method 2) Use the script bogominitrain.pl (in the contrib
    directory). It checks the messages in the same order as your
    mailbox files. You can use the <code>-f</code> option which will
    repeat this until all messages in your training collection are
    classified correctly (you can even adjust the level of
    certainty). Since the script makes sure the database understands
    your training collection &quot;exactly&quot; (with your chosen
    precision), it works very well. You can use <code>-o</code> to
    create a security margin around your spam_cutoff. Assuming
    spam_cutoff=0.6 you might want to score all ham in your
    collection below 0.3 and all spam above 0.9. Our example is:</p>

    <pre>    bogominitrain.pl -fnv ~/.bogofilter ham.mbox spam.mbox '-o 0.9,0.3'</pre></li>

    <li><p>Method 3) Use the script randomtrain (in the contrib
    directory). The script generates a list of all the messages in the
    mailboxes, randomly shuffles the list, and then scores each
    message, with training as needed. In our example:</p>

    <pre>    randomtrain -s spam.mbox -n ham.mbox</pre>

    <p>As with method 4, it works better if you start with full
    training using several thousand messages.  This will give a
    database that is more comprehensive and significantly
    bigger.</p></li>

    <li><p>Method 4) If you have enough spams and non-spams in your
    training collection, separate out some 10,000 spams and 10,000
    non-spams into separate mbox files, and train as in method 1. Then
    use bogofilter to classify the remaining spams and non-spams. Take
    any messages that it classifies as unsure or classifies
    incorrectly, and train with those. Here are two little scripts you
    can use to classify the train-on-error messages:</p>

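    <pre>    #! /bin/sh
    #  class3 -- classify one message as bad, good or unsure
    #  usage: class3 [bogofilter options] &lt;message
    # save the message from stdin so it can be scored and then filed
    cat &gt;msg.$$
    bogofilter "$@" &lt;msg.$$
    res=$?
    # bogofilter exit status: 0 = spam, 1 = ham, 2 = unsure
    if [ "$res" -eq 0 ]; then
        cat msg.$$ &gt;&gt;corpus.bad
    elif [ "$res" -eq 1 ]; then
        cat msg.$$ &gt;&gt;corpus.good
    elif [ "$res" -eq 2 ]; then
        cat msg.$$ &gt;&gt;corpus.unsure
    fi
    rm msg.$$</pre>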

    <pre>    #! /bin/sh
    # classify -- put all messages in mbox through class3
    # usage: classify mbox [bogofilter options]; class3 must be in your PATH
    src=$1
    shift
    formail -s class3 "$@" &lt;"$src"</pre>

    <p>In our example (after the initial full training):</p>

    <pre>    classify spam.mbox [bogofilter options]
    bogofilter -s &lt; corpus.good
    rm -f corpus.*
    classify ham.mbox [bogofilter options]
    bogofilter -n &lt; corpus.bad
    rm -f corpus.*</pre></li>
    </ul>
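
    <p>The class3 script in method 4 relies on bogofilter's exit
    status.  As a minimal sketch of that mapping, the helper below
    (<code>corpus_for_status</code> is a hypothetical name, not part
    of bogofilter) returns the corpus file that class3 would append
    to.  Treating any other status, such as the error status 3
    documented in the man page, as a failure is an assumption of this
    sketch.</p>

```shell
#!/bin/sh
# corpus_for_status -- hypothetical helper, not part of bogofilter:
# map a bogofilter exit status to the corpus file used by class3.
corpus_for_status() {
    case "$1" in
        0) echo corpus.bad ;;      # 0: message classified as spam
        1) echo corpus.good ;;     # 1: message classified as ham
        2) echo corpus.unsure ;;   # 2: message classified as unsure
        *) echo "unexpected exit status: $1" >&2   # e.g. 3: error
           return 1 ;;
    esac
}

corpus_for_status 0   # prints corpus.bad
corpus_for_status 2   # prints corpus.unsure
```

    <p>Failing on an unexpected status, rather than ignoring it as
    class3 does, makes it easier to notice when a message was not
    scored at all.</p>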

    <h3>Comparing these methods</h3>

    <p>It is important to understand the consequences of the methods
    just described. Doing full training as in methods 1 and 4 produces
    a larger database than does training with methods 2 or 3.  If your
    database size needs to be small (for example due to quota
    limitations), use methods 2 or 3.</p>

    <p>Full training with method 1 is fastest. Training on error (as
    in methods 2, 3 and 4) is effective, but the initial training takes
    longer.</p>

    <hr>

    <h2 id="training-maildirs">How do I train using maildirs?</h2>

    <h3>Initial training from mbox:</h3>

<pre>    bogofilter -M -s -I ~/mail/Spam
    bogofilter -M -n -I ~/mail/NonSpam</pre>

    <h3>Initial training from maildir:</h3>

<pre>    bogofilter -s -B ~/Maildir/.Spam
    bogofilter -n -B ~/Maildir/.NonSpam</pre>

    <h3>Corrective training from mbox:</h3>

<pre>    bogofilter -M -Ns -I ~/mail/Missed_Spam
    bogofilter -M -Sn -I ~/mail/False_Spam</pre>

    <h3>Corrective training from maildir:</h3>

<pre>    bogofilter -s -B ~/Maildir/.Missed_Spam
    bogofilter -n -B ~/Maildir/.False_Spam</pre>

    <hr>

    <h2 id="production">How can I keep the scoring accuracy high?</h2>

    <p>Bogofilter will make mistakes once in a while. So ongoing
    training is important. There are two main methodologies for doing this.
    First, you can train with every incoming message (using the -u
    option). Second, you can train on error only.</p>

    <p>Since you might want to rebuild your database at some point,
    for example when a major new feature is implemented in bogofilter,
    it can be very useful to update your training collection
    continuously.</p>

    <p>Bogofilter always does the best it can with the information
    available to it.  However, it will make mistakes, i.e., classify
    ham as spam (false positives) or spam as ham (false negatives). To
    reduce the likelihood of repeating the mistake, it is necessary to
    train bogofilter with the errant message.  If a message is
    incorrectly classified as spam, use switch <code>-n</code> to
    train with it as ham.  Use switch <code>-s</code> to train with a
    spam message.</p>

    <p>Bogofilter has a <code>-u</code> switch that automatically
    updates the wordlists after scoring each message.  As bogofilter
    sometimes misclassifies a message, monitoring is necessary to
    correct any mistakes. Corrections can be done using
    <code>-Sn</code> to change a message's classification from spam to
    non-spam and <code>-Ns</code> to change it from non-spam to spam.</p>

    <p>Correcting a misclassified message may affect the classification of
    other messages.  The smaller your database is, the higher the
    likelihood that a training error will cause a misclassification.</p>

    <p>Using a method like #2 or #3 (above) can compensate for this
    effect.  Repeat the training with your complete training
    collection (including all the new messages added since the earlier
    training). This will add the messages that show this adverse
    effect, on both the ham and the spam side, to the database until
    you reach a new equilibrium.</p>

    <p>An alternative strategy, based on method 4 in the previous
    section, is the following: Periodically take blocks of messages
    and use the scripts in method 4 above to classify them. Then
    manually review the good, bad and unsure files, correct any
    errors, and split the unsures into spam and non-spam.  Until you
    have accumulated some 10,000 spam and 10,000 non-spam in your
    training database, train with the good, the bad, and the separated
    errors and unsures; thereafter, train with only the separated
    errors and unsures, discarding the messages that bogofilter already
    classifies correctly.</p>

    <hr>

    <h2 id="mboxformats">What mailbox (file) formats does bogofilter understand?</h2>

    <p>Bogofilter understands the traditional Unix mbox format and the
    Maildir and MH formats. Note, though, that bogofilter does not support
    subfolders; you will have to list the subfolders of MH or Maildir++
    folders explicitly by giving the full path to each subfolder.</p>

    <p>For unsupported formats, you will have to convert the mailbox to
    a format bogofilter understands. Mbox is often convenient because it can
    be piped into bogofilter.</p>

    <p>For example, to convert UW-IMAP/PINE mbx format to mbox:</p>

    <pre>    mailtool copy /full/path/to/mail.mbox '#driver.unix//full/path/to/mbox'</pre>

    <p>or, to convert a directory of single-message files to mbox:</p>

    <pre>    for MSG in /full/path/to/maildir/* ; do
        formail -I Status: &lt; "$MSG" &gt;&gt; /full/path/to/mbox
    done</pre>

    <hr>

    <h2 id="vvv">What does bogofilter's verbose output mean?</h2>

    <p>Bogofilter can be instructed to display information on the
    scoring of a message by running it with flags "-v", "-vv",
    "-vvv", or "-R".</p>

    <ul>
      <li>
        Using "-v" causes bogofilter to generate the "X-Bogosity:"
        header line, e.g.
        <pre>    X-Bogosity: Ham, tests=bogofilter, spamicity=0.500000</pre></li>

      <li>
        Using "-vv" causes bogofilter to generate a histogram, e.g.
<pre>    X-Bogosity: Ham, tests=bogofilter, spamicity=0.500000
      int  cnt    prob   spamicity  histogram
     0.00   29  0.000209  0.000052  #############################
     0.10    2  0.179065  0.003425  ##
     0.20    2  0.276880  0.008870  ##
     0.30   18  0.363295  0.069245  ##################
     0.40    0  0.000000  0.069245
     0.50    0  0.000000  0.069245
     0.60   37  0.667823  0.257307  #####################################
     0.70    5  0.767436  0.278892  #####
     0.80   13  0.836789  0.334980  #############
     0.90   32  0.984903  0.499835  ################################</pre>

        <p>Each row shows an interval, the count of tokens with
        scores in that interval, the average spam probability for
        those tokens, the message's spamicity score (for those
        tokens and all lesser valued tokens), and a bar graph
        corresponding to the token count.</p>

        <p>In the above histogram there are a lot of low scoring
        tokens and a lot of high scoring tokens. They "balance" one
        another to give the spamicity score of 0.500000.</p>
      </li>

      <li>
        Using "-vvv" produces a list of <em>all</em> the tokens in
        the message with information on each one, e.g.
        <pre>    X-Bogosity: Ham, tests=bogofilter, spamicity=0.500000
                          n    pgood     pbad      fw     U
    "which"              10  0.208333  0.000000  0.000041 +
    "own"                 7  0.145833  0.000000  0.000059 +
    "having"              6  0.125000  0.000000  0.000069 +
    ...
    "unsubscribe.asp"     2  0.000000  0.095238  0.999708 +
    "million"             4  0.000000  0.190476  0.999854 +
    "copy"                5  0.000000  0.238095  0.999883 +
    N_P_Q_S_s_x_md      138  0.00e+00  0.00e+00  5.00e-01
                             1.00e-03  4.15e-01  0.100</pre>
        The columns printed contain the following information:

        <dl>
          <dt>"&hellip;"</dt>

          <dd>the token in question</dd>

          <dt>n</dt>

          <dd>number of times this token was encountered in
          training</dd>

          <dt>pgood</dt>

          <dd>proportion of good messages that contained this
          token</dd>

          <dt>pbad</dt>

          <dd>proportion of spam messages that contained this
          token</dd>

          <dt>fw</dt>

          <dd>Robinson's weighted index, which combines pgood and
          pbad to give a value that will be close to zero if a
          message containing this token is likely to be non-spam
          and close to one if it's likely to be spam</dd>

          <dt>U</dt>

          <dd>'<b>+</b>' if this token contributes to the final
          bogosity value, '<b>-</b>' otherwise. A token is excluded
          when its score is closer to 0.5 than min_dev.</dd>
        </dl>

        <p>The final lines show:</p>

        <ul style="margin-top:0;margin-bottom:1em;">
          <li>The cumulative results of the columns</li>

          <li>The values of Robinson's <b>s</b> and <b>x</b>
          parameters and of <b>min_dev</b></li>
        </ul>
      </li>

      <li>
        Using "-R" produces the "-vvv" output described above plus
        two additional columns:

        <dl>
          <dt>invfwlog</dt>

          <dd>logarithm of fw</dd>

          <dt>fwlog</dt>

          <dd>logarithm of (1-fw)</dd>
        </dl>

        <p>The "-R" output is formatted for use with the R language
        for statistical computing. More information is available at
        <a href="http://www.r-project.org/">The R Project for
        Statistical Computing</a>.</p>
      </li>
    </ul>
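
    <p>To make the <b>fw</b> column more concrete, here is a minimal
    sketch that recomputes it from the other columns.  It assumes
    Robinson's formula f(w) = (s*x + n*p) / (s + n) with
    p = pbad / (pgood + pbad), which appears consistent with the
    sample rows above.  The function name <code>fw</code>, its
    argument order, and the fallback to <b>x</b> for a token that was
    never seen are assumptions of this sketch, not bogofilter's
    code.</p>

```shell
#!/bin/sh
# fw -- sketch of Robinson's f(w) for one token.
# Arguments: n pgood pbad [s] [x]; s and x default to the values shown
# in the last line of the sample "-vvv" output (1.00e-03 and 4.15e-01).
fw() {
    awk -v n="$1" -v pgood="$2" -v pbad="$3" \
        -v s="${4:-0.001}" -v x="${5:-0.415}" 'BEGIN {
        total = pgood + pbad
        # p: share of the evidence for this token that points to spam.
        # Assumption: fall back to x when the token was never seen.
        p = (total > 0) ? pbad / total : x
        printf "%.6f\n", (s * x + n * p) / (s + n)
    }'
}

fw 10 0.208333 0.000000   # "which": prints 0.000041
fw 2  0.000000 0.095238   # "unsubscribe.asp": prints 0.999708
```

    <p>Because <b>s</b> is small, a token seen only in ham or only in
    spam is pulled almost all the way to 0 or 1 after just a few
    occurrences.</p>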

    <hr>

    <h2 id="unsure">What is <i>Unsure</i> mode?</h2>

    <p>Bogofilter's default configuration will classify a message as
    spam or non-spam.  The SPAM_CUTOFF parameter is used for this.
    Messages with scores greater than or equal to SPAM_CUTOFF are
    classified as spam.  Other messages are classified as ham.</p>

    <p>There is also a HAM_CUTOFF parameter.  When used, messages must
    have scores less than or equal to HAM_CUTOFF to be classified as
    ham.  Messages with scores between HAM_CUTOFF and SPAM_CUTOFF are
    classified as unsure.  If you look in bogofilter.cf, you will see
    the following lines:</p>

    <pre>    #### CUTOFF Values
    #
    #    both ham_cutoff and spam_cutoff are allowed.
    #    setting ham_cutoff to a non-zero value will
    #    enable tri-state results (Spam/Ham/Unsure).
    #
    #ham_cutoff  = 0.45
    #spam_cutoff = 0.99
    #
    #    for two-state classification:
    #
    ## ham_cutoff = 0.00
    ## spam_cutoff= 0.99</pre>

    <p>To turn on Spam/Ham/Unsure classification, remove the # from the
    <code>ham_cutoff = 0.45</code> and <code>spam_cutoff = 0.99</code>
    lines, so that ham_cutoff has a non-zero value.</p>

    <p>Alternatively, if you'd rather use labels Yes/No/Unsure
    instead of Spam/Ham/Unsure, remove the #'s from the following
    bogofilter.cf line:</p>

    <pre>    ## spamicity_tags = Yes, No, Unsure</pre>

    <p>Once that's done, you may want to set the filtering rules for your mail
    program to include rules like:</p>

    <pre>    if header contains "X-Bogosity: Spam", put in Spam folder
    if header contains "X-Bogosity: Unsure", put in Unsure folder</pre>

    <p>Alternatively, bogofilter.cf has directives for modifying the
    Subject: line, i.e.</p>

    <pre>    #### SPAM_SUBJECT_TAG
    #
    #    tag added to "Subject: " line for identifying spam or unsure
    #    default is to add nothing.
    #
    ##spam_subject_tag=***SPAM***
    ##unsure_subject_tag=???UNSURE???</pre>

    <p>With these subject tags, the filtering rules would look like:</p>

    <pre>    if subject contains "***SPAM***", put in Spam folder
    if subject contains "???UNSURE???", put in Unsure folder</pre>
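
    <p>The way the two cutoffs divide the score range can be sketched
    as follows.  This only illustrates the rules described in this
    section; <code>classify_score</code> is a hypothetical helper and
    not how bogofilter itself is implemented.</p>

```shell
#!/bin/sh
# classify_score -- hypothetical helper, not part of bogofilter.
# Arguments: score ham_cutoff spam_cutoff
classify_score() {
    awk -v score="$1" -v ham="$2" -v spam="$3" 'BEGIN {
        if (score >= spam)
            print "Spam"      # at or above spam_cutoff
        else if (ham == 0 || score <= ham)
            print "Ham"       # two-state mode, or at or below ham_cutoff
        else
            print "Unsure"    # strictly between the two cutoffs
    }'
}

classify_score 0.50 0.45 0.99   # tri-state: prints Unsure
classify_score 0.50 0.00 0.99   # two-state: prints Ham
```

    <p>With ham_cutoff set to zero the Unsure branch can never be
    reached, which is why a non-zero ham_cutoff is what enables
    tri-state results.</p>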

    <hr>

    <h2 id="train-on-error">What are "training on error" and "training to exhaustion"?</h2>

    <p>"Training on error" involves scanning a corpus of known spam
    and non-spam messages; only those that are misclassified, or
    classed as unsure, get registered in the training database.  It's
    been found that sampling just messages prone to misclassification
    is an effective way to train; if you train bogofilter on the hard
    messages, it learns to handle obvious spam and non-spam too.</p>

    <p>This method can be enhanced by using a "security margin".  By
    increasing the spam cutoff value and decreasing the ham cutoff
    value, messages which are close to a cutoff will be used for
    training.  Using security margins improves results when training
    on error.  In general, greater margins help more (although margins
    that are too large aren't optimal either).  As a rule of thumb, spam cutoff +/- 0.3 gives good
    results.  For tristate mode, you might try the middle of the unsure
    interval +/- 0.3 for training.</p>

    <p>Repeating training on error on the same message corpus can
    improve accuracy.  The idea is that messages which were rated
    correctly on the first pass might be rated wrongly after some more
    training; a further pass finds and corrects them.</p>

    <p>"Training to exhaustion" is repeating training on error, with
    the same message corpus, until no errors remain.  This method
    can also be improved with security margins. See
    <a href="http://www.garyrobinson.net/2004/02/spam_filtering_.html">Gary Robinson's Rants</a>
    on this topic for more details.</p>

    <p>Note: <code>bogominitrain.pl</code> has a <code>-f</code> option
    to do "training to exhaustion".  Using <code>-fn</code> avoids
    repeated training for each message.</p>
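
    <p>The security margin can be read as a simple test applied to
    each message of known class.  The sketch below is only an
    illustration: <code>needs_training</code> is a hypothetical
    helper, and treating a score exactly on the margin as acceptable
    is an assumption.</p>

```shell
#!/bin/sh
# needs_training -- hypothetical helper, not part of bogofilter.
# Arguments: kind (spam|ham) score spam_cutoff margin
# Exit status 0 means "train with this message".
needs_training() {
    awk -v kind="$1" -v score="$2" -v cutoff="$3" -v margin="$4" 'BEGIN {
        if (kind == "spam")
            train = (score < cutoff + margin)   # spam not clearly above the cutoff
        else
            train = (score > cutoff - margin)   # ham not clearly below the cutoff
        exit(train ? 0 : 1)
    }'
}

# With spam_cutoff=0.6 and a margin of 0.3, spam must score above 0.9
# and ham below 0.3 to be left out of training.
needs_training spam 0.85 0.6 0.3 && echo "train as spam"
needs_training ham  0.10 0.6 0.3 || echo "ham already scores low enough"
```

    <p>Training to exhaustion would repeat this test over the whole
    corpus until no message needs training any more.</p>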

    <hr>

    <h2 id="autoupdate">What does the '-u' (autoupdate) switch do?</h2>

    <p>The "-u" switch (autoupdate) is used to automatically expand the
    wordlist.  When this switch is used and bogofilter classifies a message
    as Spam or Ham, the message's tokens are added to the wordlist with a
    ham/spam tag (as appropriate).</p>

    <p>As an example, suppose a new "Refinance now - best Mortgage rates"
    message comes in.  It will have some words that bogofilter has seen and
    (probably) some new ones as well.  Using '-u' the new words will be
    added to the wordlist so that bogofilter can better recognize the next,
    related message.</p>

    <p>If you choose to use '-u', you need to be on the lookout for
    classification errors and retrain bogofilter with any messages that have
    been classified incorrectly.  An incorrectly classified message that is
    auto-updated <em>may</em> cause bogofilter to make additional classification
    errors in the future.   This is the same problem as when you (the sys
    admin) incorrectly register a ham message as spam (or vice versa).</p>

    <hr>

    <h2 id="spamassassin">How can I use SpamAssassin to train Bogofilter?</h2>

    <p>If you have a working SpamAssassin installation (or care to
    create one), you can use its return codes to train bogofilter.
    The easiest way is to create a script for your MDA that runs
    SpamAssassin, tests the spam/non-spam return code, and runs
    bogofilter to register the message as spam (or non-spam). The
    sample procmail recipe below shows one way to do this:</p>

    <pre>    BOGOFILTER     = "/usr/bin/bogofilter"
    BOGOFILTER_DIR = "training"
    SPAMASSASSIN  = "/usr/bin/spamassassin"

    :0 HBc
    * ? $SPAMASSASSIN -e
    #spam yields non-zero
    #non-spam yields zero
    | $BOGOFILTER -n -d $BOGOFILTER_DIR
    #else (E)
    :0Ec
    | $BOGOFILTER -s -d $BOGOFILTER_DIR

    :0fw
    | $BOGOFILTER -p -e

    :0:
    * ^X-Bogosity:.Spam
    spam

    :0:
    * ^X-Bogosity:.Ham
    non-spam</pre>

    <hr>

    <h2 id="asian-spam">What can I do about Asian spam?</h2>

    <p>Many people get unsolicited email using Asian language
    charsets. Since they don't know the languages and don't know
    people there, they assume it's spam.</p>

    <p>The good news is that bogofilter does detect them quite
    successfully. The bad news is that this can be expensive. You
    have basically two choices:</p>

    <ul>
      <li>
        <p>You can simply let bogofilter handle it. Just train
        bogofilter with the Asian language messages identified as
        spam. Bogofilter will parse the messages as best it can and
        will add tokens to the spam wordlist. The wordlist will
        contain many tokens which don't make sense to you (since
        the charset cannot be displayed), but bogofilter can work
        with them and successfully identify Asian spam.</p>

        <p>A second method is to use the
        "replace_nonascii_characters" config file option. This will
        replace high-bit characters, i.e. those between 0x80 and
        0xFF, with question marks, '?'. This keeps the database
        much smaller. Unfortunately this conflicts with European
        languages, which have many accented vowels and consonants in
        the high-bit range.</p>
      </li>

      <li>
        <p>If you are sure you will not receive any legitimate
        messages in those languages, you can kill them right away.
        This will keep the database smaller. You can do this with
        an MDA script.</p>

        <p>Here's a procmail recipe that will sideline messages
        written with Asian charsets:</p>
        <pre>    ## Silently drop all Asian language mail
    UNREADABLE='[^?"]*big5|iso-2022-jp|ISO-2022-KR|euc-kr|gb2312|ks_c_5601-1987'
    :0:
    * 1^0 $ ^Subject:.*=\?($UNREADABLE)
    * 1^0 $ ^Content-Type:.*charset="?($UNREADABLE)
    spam-unreadable

    :0:
    * ^Content-Type:.*multipart
    * B ?? $ ^Content-Type:.*^?.*charset="?($UNREADABLE)
    spam-unreadable</pre>

        <p>With the above recipe, bogofilter will <em>never</em>
        see the message.</p>
      </li>
    </ul>
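
    <p>The effect of the "replace_nonascii_characters" option described
    above can be illustrated with a short Python sketch.  The function
    name is hypothetical and the code only mimics the byte replacement
    as the text describes it; it is not bogofilter's implementation.
    It shows why the option is a poor fit for European languages:
    distinct words can collapse into the same token.</p>

```python
def replace_high_bit(data):
    """Replace every byte in the range 0x80-0xFF with b'?'."""
    return bytes(b if b < 0x80 else ord("?") for b in data)


# Two different German words, encoded in ISO-8859-1
first = replace_high_bit("Grüße".encode("iso-8859-1"))
second = replace_high_bit("Größe".encode("iso-8859-1"))

print(first, second, first == second)
```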

    <hr>

    <h2 id="compact-database">How can I compact my database?</h2>

    <p>You can periodically compact the database so it occupies a
    minimum of disk space.  Assuming your wordlist is in directory
    ~/.bogofilter, for bogofilter 0.93.0 (or newer) use:</p>

    <pre>    bf_compact ~/.bogofilter wordlist.db</pre>

    <p>For bogofilter older than 0.93.0, use:</p>

    <pre>    cd ~/.bogofilter
    bogoutil -d wordlist.db | bogoutil -l wordlist.db.new
    mv wordlist.db wordlist.db.prv
    mv wordlist.db.new wordlist.db</pre>

    <p>The bf_compact script is needed to duplicate your database
    environment (in order to support BerkeleyDB transaction
    processing).  Your original directory will be renamed to
    ~/.bogofilter.old and ~/.bogofilter will contain the new database
    environment.</p>

    <p>Since older versions of bogofilter don't use Berkeley DB
    transactions, the database is just a single file (wordlist.db) and
    it isn't necessary to use the script.  The bogoutil commands shown
    above create a new compact database and rename the original file to
    wordlist.db.prv.</p>

    <p>Note: it's O.K. to use the script with old versions of
    bogofilter.</p>

    <hr>

    <h2 id="query-database">How do I manually query the database?</h2>

    <p>To find the spam and ham counts for a token (word) use
    bogoutil's '-w' option. For example, "bogoutil -w
    $BOGOFILTER_DIR/wordlist.db example.com" gives the good and bad
    counts for "example.com".</p>

    <p>If you want the spam score in addition to the spam and ham
    counts for a token (word) use bogoutil's '-p' option. For example,
    "bogoutil -p $BOGOFILTER_DIR/wordlist.db example.com" gives the
    good and bad counts and the spam score for "example.com".</p>

    <p>To find out how many messages are in your wordlists query the
    special token .MSG_COUNT, i.e., run command "bogoutil -w
    $BOGOFILTER_DIR/wordlist.db .MSG_COUNT" to see the counts for the
    spam and ham wordlists.</p>

    <p>To tell how many tokens are in your wordlists pipe the output
    of bogoutil's dump command to command "wc", i.e. use "bogoutil -d
    $BOGOFILTER_DIR/wordlist.db | wc -l" to display the count.</p>

    <hr>

    <h2 id="multiple">Can I use multiple wordlists?</h2>

    <p>Yes.  Bogofilter can be run with multiple wordlists.  For
    example, if you have both <code>user</code> and
    <code>system</code> wordlists, bogofilter can be instructed to
    check the user list and, if the word isn't there, then check the
    system list.  Alternatively, it can be instructed to add together
    the information from the two lists.</p>

    <p>Following are the config file options and some examples:</p>

    <p>A wordlist has several attributes, notably type, name,
    filename, and precedence.</p>

    <ul>

      <li>Type: 'R' and 'I' (for "regular" and "ignore").  Current
      wordlists are of type 'R'. Type 'I' means "don't score the token
      if found in the ignore list".</li>

      <li>Name: a short identifying symbol used when printing error
      messages.  Examples are "global", "user", and "ignore", but you
      can use any identifier you want.</li>

      <li>Filename: the name (path) of the file. When opening the
      wordlist, if the name is fully qualified (with a leading '/'
      or '~'), that name is used.  If the name isn't fully qualified,
      bogofilter will prepend the directory, using the usual search
      order, i.e. $BOGOFILTER_DIR, $BOGODIR, $HOME.</li>

      <li>Precedence: an integer like 1, 2, 3, ...  Wordlists are
      searched in ascending order for the token.  If the search token
      is found, lists with the same precedence number will be checked
      (and counts added together).  Lists with higher precedence
      numbers will not be checked.</li>

    </ul>

    <p>Example 1 - merge user and system lists:</p>

    <pre>    wordlist R,user,~/wordlist.db,1
    wordlist R,system,/var/spool/bogofilter/wordlist.db,1</pre>

    <p>Example 2 - prefer user to system list:</p>

    <pre>    wordlist R,user,~/wordlist.db,2
    wordlist R,system,/var/spool/bogofilter/wordlist.db,3</pre>

    <p>Example 3 - prefer system to user list:</p>

    <pre>    wordlist R,user,~/wordlist.db,5
    wordlist R,system,/var/spool/bogofilter/wordlist.db,4</pre>

    <p>Note 1: bogofilter's registration flags ('-s', '-n', '-u',
    '-S', '-N' ) will apply to the lowest numbered list.</p>

    <p>Note 2: having lists of types 'R' and 'I' of the same
    precedence won't be allowed because the types are
    contradictory.</p>
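
    <p>The precedence rule described above can be sketched in Python
    as follows.  This is only a model of the lookup order as the text
    states it (ascending precedence, counts added within one
    precedence number, higher numbers skipped once the token is
    found); the <code>Wordlist</code> class, the
    <code>lookup</code> function and the sample counts are
    hypothetical and do not reflect bogofilter's internals.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Wordlist:
    name: str
    precedence: int
    counts: dict = field(default_factory=dict)  # token -> (spam, ham)


def lookup(token, wordlists):
    """Return summed (spam, ham) counts for token, or None if not found."""
    found_at = None
    spam = ham = 0
    for wl in sorted(wordlists, key=lambda w: w.precedence):
        if found_at is not None and wl.precedence != found_at:
            break  # lists with higher precedence numbers are not checked
        if token in wl.counts:
            found_at = wl.precedence
            s, h = wl.counts[token]
            spam += s
            ham += h
    return (spam, ham) if found_at is not None else None


# Like Example 1: same precedence, so the counts are merged
user = Wordlist("user", 1, {"mortgage": (3, 0)})
system = Wordlist("system", 1, {"mortgage": (10, 1), "hello": (0, 5)})
print(lookup("mortgage", [user, system]))

# Like Example 2: the user list is consulted first, the system list
# only for tokens the user list does not contain
user.precedence, system.precedence = 2, 3
print(lookup("mortgage", [user, system]))
print(lookup("hello", [user, system]))
```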

    <hr>

    <h2 id="ignore">Can I tell bogofilter to ignore certain tokens?</h2>

    <p>Through the use of an ignore list, bogofilter will ignore the
    listed tokens when scoring the message.</p>

    <p>Example:</p>

    <pre>    wordlist I,ignore,~/ignorelist.db,7
    wordlist R,system,/var/spool/bogofilter/wordlist.db,8</pre>

    <p>Because <code>ignorelist.db</code> has a lower precedence number
    (7) than <code>wordlist.db</code> (8), bogofilter will stop looking
    when it finds a token in <code>ignorelist.db</code>.</p>

    <p>Note: Technically, bogofilter gives a score of ROBX to the
    tokens and expects the min_dev parameter to drop them from the
    scoring.</p>

    <p>There are two main methods for building/maintaining an ignore list.</p>

    <p>First, a text file can be created and maintained using any text
    editor.  Bogoutil can convert the text file to database format,
    e.g. "bogoutil -l ignorelist.db &lt; ignorelist.txt".</p>

    <p>Alternatively, <code>echo ... | bogoutil ...</code> can be used
    to add a single token, for example "ignore.me", as in:</p>

    <pre>  echo ignore.me | bogoutil -l ~/ignorelist.db</pre>

    <hr>

    <h2 id="update">How do I upgrade from separate word databases to
    the combined wordlist format?</h2>

    <p>Run script bogoupgrade.  For more info, run "bogoupgrade -h" to
    see its help message or run "man bogoupgrade" and read its man
    page.</p>

    <hr>

    <h2 id="rescue">How can I tell if my wordlists are corrupted?</h2>

    <p><strong>NOTE:</strong> some distributors rename all the db_
    utilities given below by inserting or appending the version number,
    with or without dot, for instance db4.1_verify or db_verify-4.2.
    There is no standard on the renaming of these utilities.</p>

    <p>If you think your wordlists are hosed, you can see what
    BerkeleyDB thinks by running:</p>
    <pre>    db_verify wordlist.db</pre>

    <p>You may be able to recover some (or all) of the tokens and
    their counts with the following commands:</p>

    <pre>    bogoutil -d wordlist.db | bogoutil -l wordlist.new.db</pre>

    <p>or - if there has been more damage to the token list - with</p>

    <pre>    db_dump -r wordlist.db &gt; wordlist.txt
    db_load wordlist.new.db &lt; wordlist.txt</pre>

    <p>You can also use a text file instead of a pipe, as in:</p>

    <pre>    bogoutil -d wordlist.db &gt; wordlist.txt
    bogoutil -l wordlist.db.new &lt; wordlist.txt</pre>

    <hr>

    <h2 id="unicode">How can I convert my wordlist to/from unicode?</h2>

    <p>Wordlists can be converted from raw storage to unicode using:</p>

    <pre>    bogoutil -d wordlist.db &gt; wordlist.raw.txt
    iconv -f iso-8859-1 -t utf-8 &lt; wordlist.raw.txt &gt; wordlist.utf8.txt
    bogoutil -l wordlist.db.new &lt; wordlist.utf8.txt</pre>

    <p>or:</p>

    <pre>    bogoutil --unicode=yes -m wordlist.db</pre>

    <p>Wordlists can be converted from unicode to raw storage using:</p>

    <pre>    bogoutil -d wordlist.db &gt; wordlist.utf8.txt
    iconv -f utf-8  -t iso-8859-1 &lt; wordlist.utf8.txt &gt; wordlist.raw.txt
    bogoutil -l wordlist.db.new &lt; wordlist.raw.txt</pre>

    <p>or:</p>

    <pre>    bogoutil --unicode=no -m wordlist.db</pre>

    <p>The above methods work best when the wordlist is based on the
    iso-8859-1 charset.  If your wordlist is based on a different
    charset, for example CP866 or KOI8-R, use that charset in the
    above commands.</p>

    <p>For a wordlist containing tokens from multiple languages,
    particularly non-European languages, the conversion methods
    described above may not work well.  Building a new wordlist (from
    scratch) will likely work better as the new wordlist will be based
    solely on unicode.</p>
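
    <p>For illustration, the re-encoding step that iconv performs in
    the commands above can be sketched in Python.  The function name
    and the sample dump lines are made up for the example; the point
    is only that the source charset must match the charset the
    wordlist was built with.</p>

```python
def convert_dump(raw, source_charset="iso-8859-1"):
    """Re-encode dumped wordlist bytes from source_charset to UTF-8."""
    return raw.decode(source_charset).encode("utf-8")


# Made-up dump line containing an ISO-8859-1 encoded token
latin1_line = "café 3 1 20040101\n".encode("iso-8859-1")
print(convert_dump(latin1_line))

# The same idea for a wordlist based on KOI8-R
koi8_line = "спам 7 0 20040101\n".encode("koi8-r")
print(convert_dump(koi8_line, source_charset="koi8-r").decode("utf-8"))
```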

    <hr>

    <h2 id="enable-transactions">How can I switch from non-transaction
    to transaction mode?</h2>

    <p>How to do this is fully documented in file doc/README.db section
    2.2.1.  We suggest you read the whole section.</p>

    <p>In brief, use these commands:
    <pre>    cd ~/.bogofilter
    bogoutil -d wordlist.db &gt; wordlist.txt
    mv wordlist.db wordlist.db.old
    bogoutil --db-transaction=yes -l wordlist.db &lt; wordlist.txt</pre>
    <p>If everything went well, you can remove the backup files:</p>
    <pre>    rm wordlist.db.old wordlist.txt</pre>
    <hr>

    <h2 id="disable-transactions">How can I switch from transaction to
    non-transaction mode?</h2>

    <p>How to do this is fully documented in file doc/README.db section
    2.2.2.  We suggest you read the whole section.</p>

    <p>In brief, you can use bogoutil to dump/load the wordlist, for example:
    <pre>    cd ~/.bogofilter
    bogoutil -d wordlist.db &gt; wordlist.txt
    mv wordlist.db wordlist.db.old
    rm -f log.?????????? __db.???
    bogoutil --db-transaction=no -l wordlist.db &lt; wordlist.txt</pre>

    <hr>

    <h2 id="locksize">Why does bogofilter die after printing
    "Lock table is out of available locks" or
    "Lock table is out of available object entries"?</h2>

    <p>The transactional and concurrent modes of BerkeleyDB require a
    lock table that corresponds to the database in size. See the
    <samp>README.db</samp> file for a detailed explanation and a
    remedy.</p>

    <p>The size of the lock table can be set in bogofilter.cf or in
    DB_CONFIG.  Bogofilter.cf uses the db_lk_max_locks and
    db_lk_max_objects directives, while DB_CONFIG uses the
    set_lk_max_objects and set_lk_max_locks directives.</p>

    <p>After changing these values in DB_CONFIG, run command
    <pre>  bogoutil --db-recover /your/bogofilter/directory</pre>
    <p>to rebuild the lock tables.</p>

    <hr>

    <h2 id="page-notfound">Why am I getting DB_PAGE_NOTFOUND messages?</h2>

    <p>You have a problem with your BerkeleyDB database.  There are
    two likely causes: either you've hit a max size limit or the
    database is corrupt.</p>

    <p>Some mail transfer agents, such as Postfix, impose file size
    limits. When bogofilter's database reaches that limit, write
    problems will occur.

    <p>To show the database size use:</p>
    <pre>    ls -lh $BOGOFILTER_DIR/wordlist.db</pre>

    <p>To show the postfix setting:</p>
    <pre>    postconf | grep mailbox_size_limit</pre>

    <p>To set the limit to 73MB (or whatever size is right for you):</p>
    <pre>    postconf -e mailbox_size_limit=73000000</pre>

    <p>If you think your database may be corrupt, read the
    <a href="#rescue">How can I tell if my wordlists are corrupted?</a>
    FAQ entry.</p>

    <hr>

      <h2 id="db-private">Why am I getting &quot;Berkeley DB
          library configured to support only DB_PRIVATE
          environments&quot; or<br>
          &quot;Berkeley DB library configured to support only
          private environments&quot;?</h2>

      <p>Some distributors (for instance the Fedora Project) package
      Berkeley DB with support for POSIX threading and hence POSIX
      mutexes, but your system does not support POSIX mutexes
      (whether it does depends on the kernel version and exact
      processor type).</p>

      <p>To work around this problem:
      <ol>
          <li>download, compile and install <a
          href="http://www.sleepycat.com/products/db.shtml">Berkeley
          DB</a> on your own:
          <ol>
          <li><kbd>cd build_unix</kbd></li>
          <li><kbd>../dist/configure --enable-cxx</kbd></li>
          <li><kbd>make</kbd></li>
          <li><kbd>make install</kbd></li>
          </ol>
          <li>recompile and install bogofilter:
          <ol>
          <li><kbd>./configure
              --with-libdb-prefix=/usr/local/BerkeleyDB.4.3</kbd>
          <em>(substitute your Berkeley DB version number)</em></li>
          <li><kbd>make &amp;&amp; make check</kbd></li>
          <li><kbd>make install</kbd> <em>(if space is at a
              premium, use <kbd>make install-strip</kbd>)</em></li>
          </ol>
      </ol>

    <hr>

    <h2 id="multi-user">Can bogofilter be used in a multi-user environment?</h2>

    <p>Yes, it can.  There are multiple, distinct strategies for doing
    this.  The two extremes are:</p>

    <ul>
        <li>Having a bogofilter administrator who maintains a global
            wordlist that everybody uses.</li>
        <li>Having each user maintain his/her own wordlist.</li>
    </ul>

    <p>As a middle ground, the bogofilter administrator can create and
    maintain the global wordlists and each user can be given the
    choice of using the global wordlist or a private wordlist.  An
    MDA, such as procmail, can be programmed to first apply the global
    wordlist (with a very stringent spam cutoff) and then (if
    necessary) apply the user's wordlist.</p>

    <hr>

    <h2 id="nfs">Can I share wordlists over NFS?</h2>

    <p>If you're just reading from them, there are no problems.
    When you're updating them, you need to use the correct file
    locking to avoid data corruption. When you compile bogofilter, you
    will need to verify that the configure script has set "#define
    HAVE_FCNTL 1" in your config.h file. Popular UNIX operating
    systems will all support this. If you are running an unusual, or
    an older version of an operating system, make sure it supports
    fcntl().  If your system does not
    support fcntl(), then you will not be able to share wordlist
    files over NFS without the risk of data corruption.</p>

    <p>Next, make sure you have NFS set up properly, with "lockd"
    running. Refer to your NFS documentation for more information
    about running "lockd" or "rpc.lockd". Most operating systems
    with NFS turn this on by default.</p>

    <p>For shared directories (NFS directories used by multiple
    machines, for instance, Sparc/Itanium/Alpha and x86), the
    architecture-specific parts can be installed separately by giving
    a different <code>--exec-prefix</code> (it will default to
    <code>--prefix</code>).</p>

    <hr>

    <h2 id="return-codes">Why does bogofilter give return codes
    like 0 and 256 when it's run from inside a program?</h2>

    <p>Likely the return codes are being reformatted by waitpid(2).
    In C use WEXITSTATUS(status) in sys/wait.h, or a comparable macro,
    to get the correct value.  In Perl you can just use
    'system("bogofilter $input") &gt;&gt; 8'.  If you want more info, run
    <code>"man waitpid"</code>.</p>
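
    <p>The same decoding can be shown in Python, whose
    <code>os</code> module exposes the WIFEXITED and WEXITSTATUS
    macros on Unix systems.  The helper name below is hypothetical and
    the raw value 256 is simply the figure from the question; no
    bogofilter process is started.</p>

```python
import os


def exit_code_from_wait_status(status):
    """Return the child's exit code from a raw waitpid()-style status."""
    if not os.WIFEXITED(status):
        raise ValueError("child did not exit normally")
    return os.WEXITSTATUS(status)


raw_status = 256  # what a caller sees when the child exited with code 1
print(exit_code_from_wait_status(raw_status))
```

    <p>On such systems the exit code sits in the high byte of the
    status word, which is why shifting right by 8 bits, as in the Perl
    example, gives the same answer.</p>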

    <hr>

    <h2 id="changed-options">Now that I've upgraded why are
    my scripts broken?</h2>

    <p>Over time bogofilter accumulated a large number of functions.
    Some of those were discontinued or changed. Please read the
    <a href="http://bogofilter.sourceforge.net/NEWS">NEWS</a> file
    for details.</p>

    <hr>

    <h2 id="changed-tagging">Now that I've upgraded why is
    bogofilter working less well?</h2>

    <p>The lexer, i.e., that part of bogofilter which extracts tokens
    from a message, evolves. This results in different readings of messages
    with the consequence that some tokens in the database can no longer be
    used.</p>

    <p>If you encounter this problem, you are strongly advised to rebuild your
    database. If this is not an option for you, you might want to use version
    <a href="http://sourceforge.net/project/showfiles.php?group_id=62265&amp;package_id=59357">0.15.13</a>
    and read the documentation which comes with it for how to migrate your
    database.</p>

    <hr>

    <h2 id="remove-spam-or-nonspam">How can I
    delete all the spam (or non-spam) tokens?</h2>

    <p>Bogoutil lets you dump a wordlist and load the tokens into a
    new wordlist.  With the added use of awk and grep, counts can be
    zeroed and tokens with zero counts for both spam and non-spam can be
    deleted.</p>

    <p>The following commands will delete the tokens from spam messages:</p>

    <pre>    bogoutil -d wordlist.db | \
    awk '{print $1 " " $2 " 0"}' | grep -v " 0 0" | \
    bogoutil -l wordlist.new.db</pre>

    <p>The following commands will delete the tokens from non-spam messages:</p>

    <pre>    bogoutil -d wordlist.db | \
    awk '{print $1 " 0 " $3}' | grep -v " 0 0" | \
    bogoutil -l wordlist.new.db</pre>
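
    <p>What the awk and grep stages of these pipelines do can be
    mirrored in a few lines of Python.  This is a sketch only: the
    function name and the sample lines are invented, and the
    <code>awk_field</code> parameter (2 or 3) names the awk field that
    gets zeroed, so the sketch makes no statement of its own about
    which field holds which count.  Like the awk commands, it keeps
    only the token and the two counts.</p>

```python
def zero_field(lines, awk_field):
    """Zero one count field and drop tokens whose counts are both zero.

    awk_field is 2 or 3, matching $2 and $3 in the awk commands.
    """
    if awk_field not in (2, 3):
        raise ValueError("awk_field must be 2 or 3")
    kept = []
    for line in lines:
        fields = line.split()
        token = fields[0]
        counts = [int(fields[1]), int(fields[2])]
        counts[awk_field - 2] = 0
        if counts != [0, 0]:  # same effect as grep -v " 0 0"
            kept.append(f"{token} {counts[0]} {counts[1]}")
    return kept


# Made-up dump lines: token, two counts, date
dump = [
    "alpha 4 0 20040101",
    "beta 0 7 20040101",
    "gamma 2 5 20040101",
]
print(zero_field(dump, 3))  # like: awk '{print $1 " " $2 " 0"}'
print(zero_field(dump, 2))  # like: awk '{print $1 " 0 " $3}'
```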

    <hr>

    <h2 id="port-notes">How do I get bogofilter working on Solaris, BSD, etc?</h2>

    <p>If you don't already have a v3.0+ version of
    <a href="http://www.sleepycat.com/">BerkeleyDB</a>, then
    <a href="http://www.sleepycat.com/download/db/">download it (take
        one of the 4.4.X, 4.3.X or 4.2.X versions)</a>,
    unpack it, and do these commands in the db directory:</p>
    <pre>    $ cd build_unix
    $ sh ../dist/configure
    $ make
    # make install</pre>

    <p>Next, download a
    <a href="http://sourceforge.net/projects/bogofilter/files/">portable version</a>
    of bogofilter.</p>

    <h3>On Solaris</h3>

    <p>Be sure that your PATH environment variable begins with
    /usr/xpg6/bin:/usr/xpg4/bin:/usr/ccs/bin (/usr/xpg6/bin is only
    present on Solaris 10 and can be omitted on Solaris 9 and older
    versions). That is required for POSIX compliance.</p>

    <p>Unpack it, and then do:</p>
    <pre>    $ ./configure --with-libdb-prefix=/usr/local/BerkeleyDB.4.4
    $ make
    # make install-strip</pre>

    <p>You will either want to put a symlink to libdb.so in
    /usr/lib, or use a modified LD_LIBRARY_PATH environment
    variable before you start bogofilter. On newer systems, the most
    convenient way is probably to use the crle(1) tool to set the path
    permanently so BerkeleyDB is available to all applications.</p>
    <pre>    $ LD_LIBRARY_PATH=/usr/lib:/usr/local/lib:/usr/local/BerkeleyDB.4.4
    $ export LD_LIBRARY_PATH</pre>

    <p>Note that some "make" versions shipped with older Solaris versions
    break when you try to build bogofilter outside of its source
    directory.  Either build in the source directory (as suggested
    above) or use GNU make (gmake).</p>

    <p>If your Solaris GCC complains with "ld: fatal: file values-Xa.o:
    open failed: No such file or directory", install the SUNWarc
    package.</p>

    <h3>On FreeBSD</h3>

    <p>The FreeBSD ports collection carries the latest stable versions of
    bogofilter to be compiled from source. The bogofilter ports are also auto-built and provided as binary packages for you to install.</p>

    <p>The <a href="https://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/pkgng-intro.html">binary
    packages approach</a> uses default installed software. To install
    bogofilter from a binary package, type, as the privileged user:</p>
    <pre>    pkg install -y bogofilter</pre>

    <p>The ports from-source approach uses the highly recommended
    portmaster and portsnap software packages. To install portmaster,
    type (you need to do this only once), as root:</p>
    <pre>    pkg install -y portmaster</pre>

    <p>To install or upgrade bogofilter, just
    <a
        href="https://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/ports-using.html">upgrade
        your ports tree using portsnap</a>,
    then type, as root:</p>
    <pre>    portmaster mail/bogofilter</pre>

    <p><em>Note: This assumes you are root.</em> If not, read through
    the remainder of this FreeBSD section and then see how you can
    <a href="#build">build</a> if you haven't got root privileges.</p>

    <h3>On NetBSD and other systems that use "pkgsrc"</h3>

    <p>pkgsrc should be offering a reasonably recent stable bogofilter
    release. See <a
        href="http://www.pkgsrc.org/">http://www.pkgsrc.org/</a> for
    information on pkgsrc.</p>

    <h3>On HP-UX</h3>

    <p>See the file
    <a
        href="http://bogofilter.cvs.sourceforge.net/*checkout*/bogofilter/bogofilter/doc/programmer/README.hp-ux?content-type=text%2Fplain&amp;revision=HEAD">doc/programmer/README.hp-ux</a>
    in the source distribution.</p>

    <hr>

    <h2 id="make-notes">Can I use the make command on my operating system?</h2>

    <p>Bogofilter has been successfully built on many operating
    systems using GNU make and the native make commands.  However,
    bogofilter's Makefile doesn't work with some make commands.</p>

    <p>GNU make is recommended for building bogofilter because we
    know it works.  We cannot support less capable make commands.  If
    your non-GNU make command can successfully build bogofilter,
    that's great.  If you encounter problems, the right thing to do is
    install GNU make.  If your non-GNU make can't build bogofilter,
    we're sorry but you're on your own. If it takes just a minor and clean
    patch to make it compatible, we might take it.</p>

    <hr>

    <h2 id="build">How do I build bogofilter as non-root user or for
    a non-standard installation prefix?</h2>

    <p>To install bogofilter to a non-standard path (as a non-root user
    you don't have write permission to the normal paths), you need to
    provide the installation prefix when you run <code>./configure</code>.</p>

    <p> After downloading and unpacking the <a
    href="http://sourceforge.net/projects/bogofilter/files/">
    source code</a>, run <code>./configure --prefix=PATH</code> where
    PATH is the installation prefix for the generated files (binaries,
    man pages etc.).  Then run the usual build commands &#8212;
    <code>make &amp;&amp; make check &amp;&amp; make install</code>.
    </p>

    <hr>

    <h2 id="patch">How do I build bogofilter with patches?</h2>

    <p>If you need to apply patches, get the <a
    href="http://sourceforge.net/projects/bogofilter/files/">source
    code</a> and unpack it using <code>tar -xzf</code> or <code>gunzip
    | tar -xf -</code> (as appropriate). Change to the
    source directory and run <code>./configure --prefix=PATH</code>
    where PATH is the installation prefix for the generated files
    (binaries, man pages etc.).  Apply your patches, then run
    <code>make &amp;&amp; make install</code>.</p>

    <hr>

    <h2 id="smaller">How do I make the executables smaller?</h2>

    <p>When space is tight, you can use <code>make
    install-strip</code> instead of <code>make install</code>.  Doing
    this will save space, but crashes can't be debugged unless more
    information on reproducing the bug is provided to the
    developers.</p>

    <hr>

    <h2 id="relativepath">datastore_db.c does not compile!</h2>

    <p>If you are configuring a database path, for instance with
    --with-libdb-prefix or via CPPFLAGS and LIBS, be sure to pass in an
    <em>absolute path</em> (with leading slash); a relative path will
    not work. Example: use
    <kbd>--with-libdb-prefix=/usr/local/BerkeleyDB.4.2</kbd>, but
    <em>not</em> <kbd>--with-libdb-prefix=../BerkeleyDB.4.2</kbd>.</p>

    <hr>

    <h2 id="which-muas">With which mail programs does bogofilter work?</h2>

    <p>Bogofilter is known to work with kmail, mozilla-mail, mutt,
    alpine, sylpheed-claws.  A google search will help you
    find more information on using bogofilter with the mail program
    you use.</p>

    <hr>

    <h2 id="with-mutt">How do I use bogofilter with mutt?</h2>

    <p>Use a mail filter (procmail, maildrop, etc.) to filter mail
    into different folders based on bogofilter's return code and set
    mutt key bindings to train bogofilter on errors:</p>

<pre>    macro index S "|bogofilter -s\ns=junkmail"  "Learn as spam and save to junk"
    macro pager S "|bogofilter -s\ns=junkmail"  "Learn as spam and save to junk"
    macro index H "|bogofilter -n\ns="          "Learn as ham and save"
    macro pager H "|bogofilter -n\ns="          "Learn as ham and save"</pre>

    <p>These will pipe the selected message through bogofilter,
    training a false-ham as spam or vice versa, then offer to save the
    message to a different folder.</p>

    <hr>

    <h2 id="with-sc">How do I use bogofilter with Sylpheed Claws?</h2>

    <p>Add a filtering rule to run bogofilter on incoming messages
    and an action to perform if it's spam:</p>

    <pre>    condition:
    * test "bogofilter &lt; %F"
    action:
    * move "#mh/YOUR_SPAM_BOX"</pre>

    <p>Note: this assumes that bogofilter is in your path!</p>

    <p> Create two Claws actions - one for marking messages as spam
    and one for marking messages as ham.  Use the "Mark As Spam"
    action for messages incorrectly classified as ham and use the "Mark As Ham"
    action for messages incorrectly classified as spam.</p>

<pre>    Mark as ham / spam:
    * bogofilter -n -v -B "%f" (mark ham)
    * bogofilter -s -v -B "%f" (mark spam)</pre>

    <p>Another approach is to save incorrectly classified messages in
    a folder (or folders) and run a script like:</p>

<pre>    #!/bin/sh
    CONFIGDIR=~/.bogofilter
    SPAMDIRS="$CONFIGDIR/spamdirs"
    MARKFILE="$CONFIGDIR/lastbogorun"
    for D in `cat "$SPAMDIRS"`; do
        find "$D" -type f -newer "$MARKFILE" -not -name ".sylpheed*"
    done|bogofilter -bNsv
    touch "$MARKFILE"</pre>

    <p>This script can be used as an action and/or made into a toolbar
    button.  It will register as spam the messages in ${SPAMDIRS} that
    are newer than ${MARKFILE}.</p>
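
    <p>The file selection done by the <code>find</code> command in the
    script above can be sketched in Python as well.  The function name
    and the demonstration files are hypothetical; the sketch only
    lists the files that would be selected and does not call
    bogofilter.</p>

```python
import os
import tempfile


def files_newer_than(directories, marker):
    """Return files modified after marker, skipping .sylpheed* entries."""
    cutoff = os.stat(marker).st_mtime
    found = []
    for directory in directories:
        for root, _dirs, names in os.walk(directory):
            for name in names:
                if name.startswith(".sylpheed"):
                    continue
                path = os.path.join(root, name)
                if os.stat(path).st_mtime > cutoff:
                    found.append(path)
    return sorted(found)


# Demonstration with throwaway files and explicit modification times
with tempfile.TemporaryDirectory() as base:
    spam_dir = os.path.join(base, "spam")
    os.mkdir(spam_dir)
    marker = os.path.join(base, "lastbogorun")
    for path, mtime in [
        (marker, 1000),
        (os.path.join(spam_dir, "old.msg"), 500),
        (os.path.join(spam_dir, "new.msg"), 2000),
        (os.path.join(spam_dir, ".sylpheed_cache"), 3000),
    ]:
        open(path, "w").close()
        os.utime(path, (mtime, mtime))
    result = [os.path.basename(p) for p in files_newer_than([spam_dir], marker)]

print(result)
```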

    <p>Additional information is available at the <a
    href="http://www.sylpheed-claws.net/faq/index.php/Using_Sylpheed-Claws_with_other_programs">
    Sylpheed-Claws's wiki</a>.</p>

    <hr>

    <p>Another approach is to run bogofilter from procmail, maildrop,
    etc and have Claws check the X-Bogosity header and filter messages
    into Spam and Unsure folders, e.g.:</p>

    <pre>    Condition:
        header "X-Bogosity" matchcase "Spam"
    Action:
        move "#mh/Mailbox/Spam"
    Condition:
        header "X-Bogosity" matchcase "Unsure"
    Action:
        move "#mh/Mailbox/Unsure"</pre>

    <p>Any messages in the Unsure folder should be used for training,
    as should messages incorrectly classified as ham or spam.  The
    actions below will handle these cases:</p>

    <pre>    Register Spam:
        bogofilter -s &lt; "%f"

    Register Ham:
        bogofilter -n &lt; "%f"

    Unregister Spam:
        bogofilter -S &lt; "%f"

    Unregister Ham:
        bogofilter -N &lt; "%f"</pre>

    <p>To look inside the bogofilter scoring mechanism, the following
    diagnostics are useful:</p>

    <pre>    BogoTest -vv:
        bogofilter -vv &lt; "%f"

    BogoTest -vvv:
        bogofilter -vvv &lt; "%f"</pre>

    <p>Additional information on this approach is available <a
    href="http://www.bogofilter.org/pipermail/bogofilter/2005-March/007815.html">here</a>.</p>

    <hr>

    <h2 id="with-vm">How do I use bogofilter with VM (an Emacs Mail
    tool)?</h2>

    <p>You need to place the separate file vm-bogofilter.el
    (included in bogofilter's contrib directory) in your Emacs load
    path.  The latest version of the file is at
    http://www.cis.upenn.edu/~bjornk/bogofilter/vm-bogofilter.el.</p>

    <p>Then, just add in your ~/.vm configuration file:</p>

<pre>;; load bogofilter capabilities (spam)
;;
(require 'vm-bogofilter)

;; short-key for bogofilter
;; K (shift-k) means spam message
;; C (shift-c) means clean (ham) message
(define-key vm-mode-map "K" 'vm-bogofilter-is-spam)
(define-key vm-mode-map "C" 'vm-bogofilter-is-clean)
</pre>

    <p>All the messages are filtered by bogofilter each time you check
    newly arrived e-mail.  When you change the status of an e-mail,
    the bogofilter header is changed (X-Bogosity: header).</p>

    <p>There is one limitation: you cannot change multiple message
    headers at one time in VM; you have to do it message by message.</p>

    <hr>

    <h2 id="with-mh-e">How do I use bogofilter with MH-E (the Emacs
    interface to the MH mail system)?</h2>

    <p>The default setting of the 'mh-junk-program' option is
    'Auto-detect' which means that MH-E will automatically choose one
    of SpamAssassin, Bogofilter, or SpamProbe in that order. If, for
    example, you have both SpamAssassin and Bogofilter installed and
    you want to use Bogofilter, then you can set this option to
    'Bogofilter'.</p>

    <p>The 'J b' ('mh-junk-blacklist') command trains the spam program
    in use with the content of the range and then handles the
    message(s) as specified by the 'mh-junk-disposition' option. By
    default, this option is set to 'Delete Spam' but you can also
    specify the name of the folder which is useful for building a
    corpus of spam for training purposes.</p>

    <p>In contrast, the 'J w' ('mh-junk-whitelist') command
    reclassifies a range of messages as ham if they were incorrectly
    classified as spam. It then refiles the messages into the '+inbox'
    folder.</p>

    <p>For more information, see the <a
    href="http://mh-e.sourceforge.net/">MH-E home page</a>.</p>

  </body>
</html>