README.randomtrain

It seems that training bogofilter on its errors _only_ is a very good
way to train, at least with the Robinson-Fisher or Bayes chain-rule
calculation methods.  The way this works is: messages from the training
corpus are picked at random (without replacement, i.e. no message is
used more than once) and fed to bogofilter for evaluation.  If
bogofilter gets the classification right, nothing further is done.  If
it gets it wrong, or is uncertain when ternary mode is in use, the
message is fed to bogofilter again with the -s or -n option, as
appropriate.

That's all very well, except that it's not an easy process to execute
with just a couple of shell commands.  I've now written a bash script
that does the job [Matthias Andree: this has been changed so it may now
work on a regular POSIX-compliant sh, too; feedback is welcome].  You
give it the directory in which to build the bogofilter database and a
list of files flagged with either -s or -n to indicate spam or nonspam,
and it performs training-on-error using all the messages in all the
files, in random order.  (A sample invocation and a sketch of the core
loop appear at the end of this file.)

My production version of bogofilter returns the following exit codes:

    0  spam
    1  nonspam
    2  uncertain
    3  error

Normal bogofilter returns (I think):

    0  spam
    1  nonspam
    2  error

This script will work with either.  You can use it to build from
scratch: the first message evaluated will return the error exit code,
and randomtrain (as this script is called) will train with that
message, thus creating the databases.

The script needs rather a lot of auxiliary commands (they're listed in
the comments at the top of the file); in particular, perl is called for
the randomization function.  (The embedded perl script is "useful" in
its own right: it takes text on standard input and returns the lines in
random sequence; a minimal standalone equivalent appears at the end of
this file.)

Known portability issue: on HP-UX (10.20 at least), grep -b returns a
block offset instead of a byte offset, so randomtrain won't work unless
GNU grep is substituted for the HP-UX one.

I rebuilt my training lists with randomtrain.  The training corpus
consists of 9878 spams and 7896 nonspams.  The message counts from
bogoutil -w bogodir are 1475 and 408.  The database sizes from full
training were 10 and 4 MB; randomtrain produced .db files of 7 and
1.2 MB.  I don't yet have figures comparing discrimination by
bogofilter with these two training sets, but yesterday's smaller-scale
test (which motivated me to write this script) clearly indicated that
an improvement could be expected.

Greg Louis <glouis@dynamicro.on.ca>
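
Based on the description above, an invocation would look something like
the following.  The -s and -n flags for marking spam and nonspam files
are as described; whether the database directory is given positionally
or via an option is a guess here, so consult the comments at the top of
the script for the actual syntax:

    randomtrain ~/bogodir -s spamcorpus.mbox -n goodcorpus.mbox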
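
To make the procedure concrete, here is a minimal sketch of the
training-on-error loop, not the real randomtrain: it assumes one
message per file in placeholder spam/ and nonspam/ directories, and a
placeholder database directory, whereas the real script also splits
mbox files into individual messages (that is where grep -b comes in).
The case statement covers both sets of exit codes listed above, since
anything other than a correct classification triggers registration.

    #!/bin/sh
    # Sketch only: spam/, nonspam/ and BOGODIR are placeholders.
    BOGODIR=./bogodir

    # Emit "flag filename" pairs, shuffle them (without replacement),
    # then classify each message and train on the errors.
    {
        for f in spam/*;    do printf '%s %s\n' -s "$f"; done
        for f in nonspam/*; do printf '%s %s\n' -n "$f"; done
    } |
    perl -e '@l = <STDIN>; print splice(@l, int(rand(@l)), 1) while @l;' |
    while read -r flag msg; do
        bogofilter -d "$BOGODIR" < "$msg"
        case "$flag:$?" in
            -s:0|-n:1) ;;  # classified correctly: nothing further to do
            # wrong, uncertain, or error (no database yet): register it
            *) bogofilter -d "$BOGODIR" "$flag" < "$msg" ;;
        esac
    done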
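
The embedded randomizer mentioned above can be approximated by a
one-liner; this is a minimal equivalent rather than the exact code in
the script.  It buffers standard input and repeatedly prints a line
chosen at random:

    perl -e '@l = <STDIN>; print splice(@l, int(rand(@l)), 1) while @l;' < somefile

Each splice call deletes the chosen line from @l, so no line is printed
twice: the same without-replacement property the training procedure
relies on.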