Training the Engine
The primary scanning system employed by jASEN is based on a heuristic analysis of the contents of a MIME message, matched against a library of known heuristics and associated probabilities.
This library must be created via a "training" process in which the engine is shown a series of spam (and ham) emails from which the library is constructed.
We recommend that the core data file (library) be regularly updated to reflect the most recent trends in spam emails, and this updating requires re-training the engine.
Fortunately, this task is simplified by a training tool provided in the distributable.
The JasenTrainer class (org.jasen.core.engine.JasenTrainer) provides all the functionality required to train the engine and generate the required data file.
Training the engine requires a training set consisting of two distinct training sources, or corpuses: a spam corpus and a ham corpus.
Each corpus must consist of plain-text, MIME-formatted emails. If any message in a corpus is malformed, it will be ignored and may halt training.
We strongly recommend that the two corpuses be of approximately equal size. Using corpuses of significantly different sizes may lead to inaccurate scanning results.
It is also recommended that each corpus consist of at least 2,500 emails, and preferably over 5,000. When collecting email for training purposes, we also recommend excluding email newsletters from the training set, as these can confuse the engine and may lead to inaccurate scanning.
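The size recommendations above can be checked before training starts. The following is a minimal sketch, not part of jASEN: the class `CorpusCheck`, its method names, and the 1.5 ratio used to interpret "approximately equal" are all assumptions made for illustration, as is the premise that each message is stored as one file directly inside the corpus folder.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Hypothetical pre-flight check for a training set; not part of the jASEN API. */
final class CorpusCheck {

    /** Minimum number of emails per corpus recommended in the text above. */
    static final int MIN_MESSAGES = 2_500;

    /** Assumed threshold for "approximately equal" sizes; tune to taste. */
    static final double MAX_SIZE_RATIO = 1.5;

    /** Counts regular files directly inside the folder (assumes one message per file). */
    static long countMessages(Path corpusDir) throws IOException {
        try (Stream<Path> files = Files.list(corpusDir)) {
            return files.filter(Files::isRegularFile).count();
        }
    }

    /** True when both corpuses meet the minimum and neither exceeds MAX_SIZE_RATIO times the other. */
    static boolean isBalanced(long spamCount, long hamCount) {
        if (spamCount < MIN_MESSAGES || hamCount < MIN_MESSAGES) {
            return false;
        }
        long larger = Math.max(spamCount, hamCount);
        long smaller = Math.min(spamCount, hamCount);
        return larger <= smaller * MAX_SIZE_RATIO;
    }
}
```

A caller might pass the results of `countMessages` for each folder to `isBalanced` and refuse to train when it returns false.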
It is critical that neither corpus is polluted. That is, there MUST NOT be ANY spam in the ham corpus, and vice versa. Take extreme care when compiling each corpus so that pollution is avoided.
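One narrow form of pollution can be detected mechanically: the same message file appearing in both corpuses. The sketch below is an illustration only and not part of jASEN; the class `CorpusOverlap` and its methods are hypothetical, and it assumes one message per file. It catches byte-for-byte duplicates and nothing else, so it does not replace a manual review of the corpuses.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper that flags files present in both corpuses; not part of jASEN. */
final class CorpusOverlap {

    /** Returns the SHA-256 digest of the file contents as a lowercase hex string. */
    static String sha256(Path file) throws IOException {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(Files.readAllBytes(file))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Lists regular files directly inside the folder (assumes one message per file). */
    static List<Path> regularFiles(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files.filter(Files::isRegularFile).collect(Collectors.toList());
        }
    }

    /** Returns the ham files whose contents are identical to some spam file. */
    static List<Path> findOverlap(Path spamDir, Path hamDir) throws IOException {
        Set<String> spamHashes = new HashSet<>();
        for (Path p : regularFiles(spamDir)) {
            spamHashes.add(sha256(p));
        }
        List<Path> overlap = new ArrayList<>();
        for (Path p : regularFiles(hamDir)) {
            if (spamHashes.contains(sha256(p))) {
                overlap.add(p);
            }
        }
        return overlap;
    }
}
```

An empty result only shows that no exact duplicates exist; misfiled messages that occur once will still go unnoticed.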
JasenTrainer takes three required parameters and one optional parameter:
JasenTrainer <spam corpus path> <ham corpus path> <store path> <command> (optional)
|spam corpus path
||The folder path containing the spam corpus
|ham corpus path
||The folder path containing the ham corpus
|store path
||The full path (including the file name) to the data file to be written
|command
||Optionally provides the ability to load an existing data file and append to it. MUST be one of 'new' or 'load'.
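As an illustration of the parameter order described above, the following sketch assembles a command line for launching the trainer in a separate JVM. It assumes that JasenTrainer can be started from the command line as the usage line suggests; the helper class `TrainerCommand` and the classpath value passed to it are hypothetical and not part of jASEN.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that builds a JasenTrainer command line; not part of jASEN. */
final class TrainerCommand {

    /** Fully qualified trainer class name, as given in the text above. */
    static final String TRAINER_CLASS = "org.jasen.core.engine.JasenTrainer";

    /**
     * Builds the argument list in the documented order:
     * spam corpus path, ham corpus path, store path, then the optional command.
     * Pass null for command to omit it.
     */
    static List<String> build(String classpath, Path spamCorpus, Path hamCorpus,
                              Path store, String command) {
        if (command != null && !command.equals("new") && !command.equals("load")) {
            throw new IllegalArgumentException("command must be 'new' or 'load': " + command);
        }
        List<String> args = new ArrayList<>(Arrays.asList(
                "java", "-cp", classpath, TRAINER_CLASS,
                spamCorpus.toString(), hamCorpus.toString(), store.toString()));
        if (command != null) {
            args.add(command);
        }
        return args;
    }
}
```

The resulting list can be handed to `new ProcessBuilder(args).inheritIO().start()`. The classpath must point at wherever the jASEN classes and their dependencies are installed locally.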
Once the data file has been generated (assuming you haven't automatically overwritten the existing one), you can instruct the engine to use your new data file by altering the map-path attribute of the RobinsonScanner plugin.