WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING
------------------------------------------------------------------------

             POTENTIAL FOR DATA CORRUPTION DURING UPDATES

If you plan to upgrade your database library, if only as a side effect
of an operating system upgrade, DO HEED the relevant documentation, for
instance, the doc/README.db file. You may need to prepare the upgrade
with the old version of the software. Otherwise, you may cause
irrecoverable damage to your databases.

DO back up your databases before making the upgrade.

------------------------------------------------------------------------
WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING

This file documents changes in bogofilter since version 0.11. In
particular it describes:

    (1) Features, which are significant changes (noteworthy and
        compatible) and
    (2) Incompatibilities, which are changes that require action upon
        update.

Caution: If upgrading from an old version and skipping several
intervening versions of bogofilter, be smart and check all the changes
of the versions you skipped! In particular, read the sections labeled
"Incompat" and "Major".

NOTE: the NEWS document has greater detail on some of these changes.

------------------------------------------------------------------------

[Major 1.2.5] Kyoto Cabinet and LMDB support added.

    Bogofilter, as of release 1.2.5, supports:

    + Kyoto Cabinet databases, courtesy of Denny Lin. Kyoto Cabinet is
      written and maintained by the same author as Tokyo Cabinet, who
      recommends using Kyoto Cabinet instead of Tokyo Cabinet.

    + LMDB databases (Lightning Memory-Mapped Database Manager),
      courteously implemented and contributed by Steffen Nurpmeso.

------------------------------------------------------------------------

[Major 1.1.6] Tokyo Cabinet support (B+-trees with transactions) added

    Bogofilter, as of release 1.1.6, supports Tokyo Cabinet databases,
    courtesy of Pierre Habouzit.
    Tokyo Cabinet is the sequel to QDBM, with support for larger files,
    and is also written by Mikio Hirabayashi. For new installations, if
    you considered using QDBM, consider using Tokyo Cabinet instead.

-----------------------------------------------------------------------

[Incompat 0.96.0] TDB removed

    Support for the TDB database library has been removed.

------------------------------------------------------------------------

[Incompat 0.95.2] Applies to Berkeley DB Transactional ONLY:

    This release gives up on locking the databases at page granularity
    and locks whole environments, to overcome lock sizing requirements
    which are a major issue in unattended setups. This however means
    that a writer (token registration) will lock out readers (message
    scoring) and readers will prevent new writers from starting. This
    may be fixed in a future version.

------------------------------------------------------------------------

[Major 0.95.0] Unicode in UTF-8

    This release supports Unicode (UTF-8). A new meta-token .ENCODING
    has been added to the wordlist so that bogofilter can determine
    whether it is using Unicode. A value of 1 indicates raw storage
    and 2 indicates UTF-8 encoded tokens. Bogofilter checks for this
    meta-token and converts incoming text to UTF-8 as appropriate.

    Command line options "--unicode=yes" and "--unicode=no" can be used.

    - With bogofilter, they control encoding of newly created databases.
    - With bogoutil, --unicode=yes converts the wordlist to Unicode.
    - For bogolexer, they print parser results in new and old modes.

    ./configure options allow bogofilter customization:

    - "./configure --unicode=yes" will _always_ operate in Unicode mode
    - "./configure --unicode=no" will _never_ operate in Unicode mode

    Wordlists can be converted from raw storage to Unicode using the
    three commands below.

    NOTE: Replace iso-8859-1 with the character set and encoding that
    dominates the tokens in your wordlist!

        bogoutil -d wordlist.db > wordlist.raw.txt
        iconv -f iso-8859-1 -t UTF-8 < wordlist.raw.txt > wordlist.UTF-8.txt
        bogoutil -l wordlist.db.new < wordlist.UTF-8.txt

    For a wordlist containing tokens from multiple languages,
    particularly non-European languages, the conversion method
    described above may not work well for you. Building a new wordlist
    (from scratch) will likely work better as the new wordlist will be
    based solely on Unicode.

------------------------------------------------------------------------

[Incompat 0.94.12] Changed Options

    Some options have been added or modified. If you use any of the
    changed options, you will probably need to modify your scripts,
    procmail recipes, etc. As an example, some bogoutil options which
    used to allow either filenames or directory names are now
    restricted to filenames. See the man pages and help messages if
    you have questions.

------------------------------------------------------------------------

[Incompat 0.94.0] Transactions

    The transactional mode now defaults to off because the lock table
    sizing issue is unresolved.

    Bogofilter and bogoutil now let you choose, at build time and at
    run time, whether to operate with or without transaction support.
    They can also auto-detect whether you've been using transactions
    or not.

    Run-time Selection:

    For bogofilter and bogoutil, transactions can be enabled or
    disabled in 2 ways -- by command line options or config file
    options. Command line option "--db-transaction=yes" enables
    transactions and "--db-transaction=no" disables them. Config file
    options "db_transaction=yes" and "db_transaction=no" have the same
    effect.

    Auto-detection:

    If none of the above methods are used to enable/disable
    transactions, bogofilter and bogoutil will query Berkeley DB to
    see if a transaction environment already exists. If so,
    transactions will be enabled. If not, they will be disabled.

    Compile-time selection:

    A default build includes the run-time and auto-detect capabilities.
    If you wish to minimize program size, ./configure can be used to
    create single mode versions of bogofilter and bogoutil, i.e.
    programs that only run transactionally or non-transactionally. Use
    "./configure --enable-transactions" to enable transactions and
    "./configure --disable-transactions" to disable them. These
    programs will be _slightly_ smaller than the default build.

-----------------------------------------------------------------------

[Incompat 0.93] Summary for the hasty

    YOU MUST ADJUST YOUR SCRIPTS EVALUATING "X-Bogosity" HEADERS!

    YOU MAY NEED TO ADJUST YOUR SCRIPTS THAT PARSE 'bogofilter -V'!

    WHEN USING BERKELEY DB (DEFAULT), NFS NO LONGER WORKS AND YOU
    M U S T  READ doc/README.db AND POSSIBLY CONFIGURE THE DATABASE!

-----------------------------------------------------------------------

[Incompat 0.93] Defaults changed

    Bogofilter's defaults have been changed. It now operates in
    tri-state mode and will classify messages as Spam, Ham, or Unsure.

    If you're checking messages for "X-Bogosity: Yes" or
    "X-Bogosity: No", you _need_ to change your checks. Use
    "X-Bogosity: Spam" and "X-Bogosity: Ham" instead of the old forms.

    Also, checking for "X-Bogosity: Unsure" and putting those messages
    in a separate folder (or mailbox) will give you an excellent set
    of messages for training, as "Unsure" messages are messages that
    bogofilter has too little information to classify (with certainty)
    as spam or ham.

-----------------------------------------------------------------------

[Incompat 0.93] Berkeley DB switched to Transactional Data Store

    Bogofilter will now use the Berkeley DB Transactional Data Store
    when compiled with Berkeley DB as the data base engine (the
    default). This means the Berkeley DB directory can no longer
    reside on a networked or otherwise shared file system (such as
    NFS, AFS, Coda).

    When using BerkeleyDB 4.1 - 4.3, it is recommended that you dump
    and load the data bases to add checksums, for enhanced reliability.
    See section 2.2 in doc/README.db for details.

    This means that bogofilter programs now exhibit the A C I D traits:
    changes are atomic (all-or-nothing); the data base is always
    consistent; changes are always isolated from each other; and all
    changes that are acknowledged are durable.

    Bogofilter can support multiple writers at the same time, mixed
    freely with simultaneous readers, and the data base will not be
    corrupted by application or system crashes, except when the disk
    drive gets damaged.

    Note that this requires that the operating system and disk drive
    maintain proper write order on the disk, and that both be honest
    about synchronous I/O completion.

    Note also that this causes bogofilter to write additional "log"
    files to its ~/.bogofilter (or other) home directory. The log
    files need to be archived or deleted periodically. For detailed
    instructions, be sure to _read_ doc/README.db and check the
    BerkeleyDB documentation.

    As a backwards compatibility option, for instance when space and
    I/O bandwidth are tight, it is possible to use the old
    non-transactional, non-concurrent Berkeley DB Data Store, which
    can only register messages when there are NO scoring processes at
    all and which may not be able to recover from application or
    system crashes.

    These benefits are not available when bogofilter is compiled to
    use the TDB or QDBM data bases.

-----------------------------------------------------------------------

[Incompat 0.93] Berkeley DB version strings changed

    Bogofilter will now return the BerkeleyDB's actual
    DB_VERSION_STRING in the output of 'bogofilter -V'.

    The OLD format was:
        Database: BerkeleyDB (4.3.21)
    The NEW format is:
        Database: Sleepycat Software: Berkeley DB 4.3.21: (November 8, 2004)

    You may need to adjust your scripts.

-----------------------------------------------------------------------

[Incompat 0.93] QDBM database format changed to B+ trees

    The QDBM database format has been changed from hash tables to B+
    trees, i.e. from the Depot API to the Villa API.
    This results in significantly better performance. Unfortunately,
    the two formats are incompatible, so upgrading to 0.93 requires
    running a special command to convert the database once:

        bogoQDBMupgrade wordlist.qdbm wordlist.tmp wordlist.qdbm.old

    If this command prints nothing, everything has gone well and it
    has left your old data base in wordlist.qdbm.old.

    NOTE: bogoQDBMupgrade needs qdbm-1.7.23 or newer to build.

-----------------------------------------------------------------------

[Incompat 0.93] Bogotune option parsing changes

    In bogotune 0.93.2 and newer, you must repeat the -n or -s option
    as a prefix for each mailbox.

    Example:       bogotune -n good1 good2 -s bad1 bad2

    ... will be:   bogotune -n good1 -n good2 -s bad1 -s bad2

-----------------------------------------------------------------------

[Major 0.93.3]

    SQLite 3.0.8 (and newer) is now supported. It isn't nearly as fast
    as Berkeley DB but uses only one permanent and one transient file
    (hence less maintenance work) and is supposed to be proof against
    application and system crashes.

-----------------------------------------------------------------------

[Incompat 0.92]

    The formatting parameters have changed:

        '%A' is now the message's IP address.
        '%I' is now the Message-ID.
        '%Q' is now the Queue-ID.

-----------------------------------------------------------------------

[Incompat 0.17]

    Support for --enable-deprecated-code (see the 0.16 release notes)
    has been removed. If you've run 0.16.X without that switch,
    nothing changes for you.

    Support for Berkeley DB 3.0 was removed in bogofilter 0.17.3 as a
    side effect of adding Concurrent Database support.

-----------------------------------------------------------------------

[Incompat 0.16]

    A number of features have been deprecated. The relevant code is
    bracketed by "#ifdef ENABLE_DEPRECATED_CODE" and "#endif"
    statements. The default build will not include the deprecated
    features.
    For those who still need these features, configure option
    "--enable-deprecated-code" exists to allow them to be turned on.

    THIS MAY REQUIRE MAJOR CHANGES TO YOUR CONFIGURATION OR SCRIPTS!

    The following list is supposed to be complete. Let us know if
    we've omitted anything. We shall try to provide workarounds and
    migration paths whenever possible.

    1) Scoring algorithms
    ---------------------

    Bogofilter will support only the Robinson-Fisher algorithm,
    commonly called the "Fisher algorithm". The Graham algorithm and
    Robinson geometric-mean algorithm, a.k.a. Robinson algorithm, have
    been deprecated.

    2) Wordlist support
    -------------------

    Bogofilter will now support only the combined wordlist, i.e.
    wordlist.db, which contains both the ham and spam counts for each
    token. The older, separate wordlists (spamlist.db and goodlist.db)
    are no longer supported. The bogoupgrade program can still be used
    to merge the separate databases for you. Type
    "bogoupgrade -d /your/wordlist/directory/".

    Ignore lists, i.e. ignorelist.db, are also being deprecated. The
    ignore list feature has never been thoroughly tested and is not
    used (as far as we know).

    3) Command line switches
    ------------------------

    Bogofilter will no longer support the switches listed in this
    section. If used, bogofilter will print an error message and exit.

    Scoring related switches:

        -g - select Graham algorithm
        -r - select Robinson Geometric-Mean algorithm
        -f - select Robinson-Fisher algorithm

        See section 1 above.

        -2 - set binary classification mode
        -3 - set ternary classification mode

        Bogofilter will use binary mode if ham_cutoff is zero and will
        use ternary mode (Yes, No, Unsure) if ham_cutoff is non-zero
        and less than spam_cutoff.

    Wordlist modes:

        -W  - use combined wordlist for spam and ham tokens
        -WW - use separate wordlists for spam and ham tokens

        Bogofilter will always operate in combined mode now.
    Backwards compatible token generation switches:

        -Pi and -PI - ignore_case
        -Pt and -PT - tokenize_html_tags
        -Pc and -PC - strict_check
        -Pd and -PD - degen_enabled
        -Pf and -PF - first_match

        Note: Since last May, the default values for these switches
        have been:

            ignore_case          disabled
            tokenize_html_tags   enabled
            strict_check         disabled
            degen_enabled        disabled
            first_match          disabled

        There will be no change in the default values.

    4) Configuration options
    ------------------------

    The following configuration options (for the above switches) are
    deprecated:

        algorithm
        wordlist
        wordlist_mode
        ignore_case
        tokenize_html_tags
        tokenize_html_script
        header_degen
        degen_enabled
        first_match

    The following configuration options (which don't correspond to
    switches) are deprecated:

        thresh_stats
        thresh_rtable

    Note: Bogofilter will print a warning message if it sees any of
    these options, but will run fine anyhow.

    5) Miscellany
    -------------

    The user formatted SPAM_HEADER will no longer support format
    specification "%a" (for algorithm) since bogofilter now has only
    one algorithm.

-----------------------------------------------------------------------

[Incompat 0.15.9]

    Bogofilter no longer allows disabling of algorithms, a feature
    which has never been well supported.

-----------------------------------------------------------------------

[Incompat 0.15.4]

    All header line tokens are now tagged as:

        Subject:       subj:
        To:            to:
        From:          from:
        Return-Path:   rtrn:
        Received:      rcvd:   ***new***
        any other:     head:   ***new***

    Because existing wordlists don't have "head:???" tokens, the new
    tokens won't be found in the wordlist and bogofilter's accuracy
    will go down. To correct this you can do one of the following
    things:

    1 - Use the new "-H" (for header-degen) option when scoring
    messages. This option tells bogofilter to check the wordlist twice
    for each header token - once for "head:xyz" and a second time for
    "xyz". The ham and spam counts are added together to give a
    cumulative result.

    Note that, with bogofilter 0.15.4 and later, during message
    registration, "head:xyz" tokens are added to the wordlist (for the
    header lines). The "-H" option is only applied during scoring.

    The "-H" option is meant for temporary usage to cover the period
    while bogofilter goes from having no "head:xyz" tokens in the
    wordlist to the time when there are enough such tokens to score
    messages effectively. After a few weeks, or perhaps months, of
    registering messages with the new bogofilter, use of the "-H"
    option can end and bogofilter will use the newly added "head:xyz"
    tokens.

    2 - Retrain bogofilter with whatever ham and spam you have
    available. This will create "head:xyz" tokens and allow the new,
    more effective header tagging to be used to fullest advantage.

-----------------------------------------------------------------------

[Major 0.15]

    The code for processing multiple messages has been rewritten. In
    addition to understanding mbox format files, bogofilter now
    understands maildirs and MH folders.

-----------------------------------------------------------------------

[Incompat 0.14]

    The exit codes returned by bogofilter have been expanded. They
    are:

        Spam   = 0  -- unchanged
        Ham    = 1  -- unchanged
        Unsure = 2  -- *NEW*
        Error  = 3  -- *CHANGED*

-----------------------------------------------------------------------

[Major 0.14]

    Bogofilter now supports TDB (Trivial Data base).

    Instead of separate wordlists for spam and ham tokens, bogofilter
    can now use a single, combined wordlist that stores all tokens. In
    the combined wordlist each token has two counts - one for spam and
    one for ham. The name of the new file is wordlist.db.

    However, this change broke the early versions (up to and including
    0.14.2) of bogofilter. You should use at least bogofilter 0.14.3.

    Bogofilter will check in $BOGOFILTER_DIR and use the wordlist(s)
    that are there. If wordlist.db is present, bogofilter will use the
    combined mode.
    If wordlist.db is not present, but both spamlist.db and
    goodlist.db are present, bogofilter will use the separate wordlist
    mode. If no wordlists are present, bogofilter will create
    wordlist.db and use it.

    Command line switches '-W' and '-WW' can be used to tell
    bogofilter the mode you want. Also config file options
    "wordlist_mode=combined" and "wordlist_mode=separate" can be used.

    Upgrading from an old bogofilter environment with its two
    wordlists (spamlist.db and goodlist.db) to the new 0.14.x
    environment with its single, combined wordlist.db involves 3 main
    steps - dumping the current spamlist.db and goodlist.db files,
    formatting that output, and then loading the data into a new file
    wordlist.db.

    The script "bogoupgrade" is included with bogofilter and performs
    the task. Use command "bogoupgrade -d /path/to/your/wordlists" to
    do the upgrade. After running it, your BOGOFILTER_DIR will contain
    all 3 database files. When started, bogofilter checks for
    wordlist.db and will use it.

-----------------------------------------------------------------------

[Incompat 0.13]

    Parsing has changed. As background, Paul Graham has done work to
    improve the results of his Bayesian filter and has published them
    in "Better Bayesian Filtering" at
    http://www.paulgraham.com/better.html. He found the following
    definition of a token to be beneficial:

       1. Case is preserved.

       2. Exclamation points are constituent characters.

       3. Periods and commas are constituents if they occur between
          two digits. This lets me get ip addresses and prices intact.

       4. A price range like $20-25 yields two tokens, $20 and $25.

       5. Tokens that occur within the To, From, Subject, and
          Return-Path lines, or within urls, get marked accordingly.

    Bogofilter has always done #3 and has tagged Subject line tokens
    for a while. Its parser now does all of these things. Several
    command line switches and config file options have been added to
    allow enabling or disabling them.
    Here are the new switches and options:

        -Pi/-PI  ignore_case          default - disabled
        -Ph/-PH  header_line_markup   default - enabled
        -Pt/-PT  tokenize_html_tags   default - enabled

    The options can be enabled using the lower case switch or disabled
    using the upper case switch.

    When header_line_markup is enabled, tokens in To:, From:,
    Subject:, and Return-Path: lines are prefixed by "to:", "from:",
    "subj:", and "rtrn:" respectively.

    When tokenize_html_tags is enabled, tokens in A, IMG, and FONT
    tags are scored while classifying the message.

    NOTE: To take full advantage of these changes, additional training
    of bogofilter is necessary. Here's why:

    With bogofilter's use of upper and lower case, the wordlists won't
    match as many words as before. For example, "From" and "from" both
    used to match "from", but this is no longer the case. As
    additional training is done, words like these will be added to the
    wordlists and bogofilter will have a larger number of distinct
    tokens to use when classifying messages. This will improve its
    classification accuracy.

    Similarly, the use of header_line_markup will tokenize
    "Subject: great p0rn site" as "subj:great", "subj:p0rn", and
    "subj:site". At first these tokens won't be recognized, so
    bogofilter won't use them to score the message. After being
    trained, bogofilter will have these additional tokens to aid in
    the classification process.

-----------------------------------------------------------------------

[Major 0.12]

    Directory bogofilter/tuning has been added and contains scripts
    for running tuning experiments as described in the new HOWTO. See
    file bogofilter/tuning/README for more information.

    Bogofilter's man page and help message describe the many command
    line switches. They have been divided into groups (help,
    classification, registration, general, algorithm, parameter, and
    info) in both places.

    Bogofilter 0.12.0 has three new command line switches for rapidly
    scoring large numbers of messages.
    These "bulk mode" switches are especially useful for the tuning
    process. The new switches are:

    -M - allows scoring all the messages in an mbox formatted file. If
         used with "-v", an X-Bogosity line is printed as each message
         is scored. Using the "-t" (terse) option is recommended to
         reduce the amount of output.

    -B - allows scoring of multiple message files, with each file
         containing a single message. With this option, bogofilter
         expects the file names to be at the end of the command line.
         If used with "-v", the file name is included in each printed
         line. Using "-t" is recommended.

    -b - allows scoring of multiple message files, with each file
         containing a single message. With this option, bogofilter
         reads the file names from stdin. This option can be used with
         maildirs, as in "ls Maildir/* | bogofilter -b ..." If used
         with "-v", the file name is included in each printed line.
         Using "-t" is recommended.

    New script bogolex.sh converts an email to a special file format
    that contains the information needed by bogofilter to score the
    email. Its use speeds up the message scoring done by the tuning
    scripts. The script is described in more detail in
    bogofilter/tuning/README.

-----------------------------------------------------------------------

[Incompat 0.11]

    Command line flags:

    The meaning of command line flags '-S' and '-N' was changed in
    version 0.11.0. Previously '-S' meant to unregister a message from
    the spam wordlist and register the message in the non-spam
    wordlist and '-N' meant to unregister from non-spam and register
    as spam.

    Each of the flags now performs a single action. '-S' unregisters a
    message from the spam wordlist and '-N' unregisters a message from
    the non-spam wordlist. To duplicate the old (compound) actions, it
    is necessary to use two options - an unregister option ('-S' or
    '-N') and a register option ('-s' or '-n'). To duplicate the
    effect of the old '-S' option, use '-N -s'. To duplicate the
    effect of the old '-N' option, use '-S -n'.
    The order of the options doesn't matter and they can be
    concatenated, as in '-Sn' and '-sN'.

    Config file processing
    ----------------------

    The code to process config files now checks numeric values for
    validity. It complains when it detects something wrong. In
    particular, double precision values are no longer allowed to have
    a terminal 'f'. For example "spam_cutoff=0.95f" will generate a
    message.

-----------------------------------------------------------------------

[Major 0.11]

    New parameter query option:

    Using options "-q -v" in a bogofilter command line will run the
    query_config() function and will display bogofilter's various
    parameter values. This can be very useful in finding the reason
    for an unexpected message classification.

-----------------------------------------------------------------------

End of RELEASE.NOTES