MachineLearning

MailCleaner Support
Added 7 months ago

MailCleaner is currently in the process of developing a new Anti-Spam filter module called Machine Learning. This module is being selectively deployed and tested on our Cloud service and with some other partners. This document is meant to be a quick reference for those machines where this module is already installed, discussing what to expect from the module.

Enable/Disable

Machines with this module installed have a slightly customized version of the init script for the Filtering Engine (MailScanner). To enable or disable Machine Learning, the simplest way is to comment out or uncomment (respectively) line 66 of the file /usr/mailcleaner/lib/MailScanner/PreFilters/MachineLearning.pm:

  # return 0;

then restart MailScanner by running:

/usr/mailcleaner/etc/init.d/mailscanner restart
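
Before editing, it can help to confirm what line 66 currently contains. The following is a minimal sketch, not part of MailCleaner: the function name ml_switch_state is hypothetical, and it assumes the switch is still the "return 0;" statement on line 66 as described above.

```shell
# Report whether the "return 0;" switch on a given line is commented out.
# Usage: ml_switch_state FILE [LINE]   (LINE defaults to 66, per the text above)
ml_switch_state() {
    file=$1
    line=${2:-66}
    # Print only the requested line of the file.
    text=$(sed -n "${line}p" "$file")
    case $text in
        *'#'*'return 0;'*) echo "enabled" ;;   # switch is commented out
        *'return 0;'*)     echo "disabled" ;;  # switch is active
        *)                 echo "unknown" ;;   # the line is not the switch
    esac
}
```

For example, running ml_switch_state /usr/mailcleaner/lib/MailScanner/PreFilters/MachineLearning.pm prints "enabled" while the line is still commented out. An "unknown" result suggests the file differs from the one described here, in which case it is safer not to edit line 66 blindly.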

Logs

There are a few places where you will see evidence of MachineLearning having run. The primary location is the MailScanner log, /var/mailcleaner/log/mailscanner/infolog.

The MachineLearning module is deployed as a standalone module which writes a dedicated log line for each model that is deployed:

Aug 16 07:50:07 localhost MailScanner[32122]: MachineLearning0 result is spam (98.26251268386841%%) for 1mA2qP-0009vJ-HP (200728-ALL-10L) (en)
Aug 16 07:50:07 localhost MailScanner[32122]: MachineLearning1 result is spam (98.18589687347412%%) for 1mA2qP-0009vJ-HP (200728-ALL-50S) (en)
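
To pull just these per-model results out of a busy log, something like the following may be useful. It is a sketch that assumes the line layout shown in the two samples above; the function name ml_results is hypothetical.

```shell
# Read MailScanner log lines on stdin and print, for each per-model result:
#   MESSAGE_ID MODEL_NUMBER PERCENTAGE VERDICT
# Lines that do not match the per-model result format are ignored.
ml_results() {
    sed -n 's/.*MachineLearning\([0-9][0-9]*\) result is \(.*\) (\([0-9.]*\)%*) for \([^ ]*\) (\([^)]*\)) .*/\4 \1 \3 \2/p'
}
```

It would typically be fed from the log file, for example: ml_results < /var/mailcleaner/log/mailscanner/infolog. The verdict is printed last because it may contain a space.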

This result should also appear in the overall results log with a distinct block for each model:

Aug 16 07:50:07 mc MailScanner[32122]: Message 1mA2qP-0009vJ-HP from 1.1.1.1 (sender@domain.com) to example.org is not spam, MachineLearning (98.26251268386841%% en, position : 2, not decisive), MachineLearning (98.18589687347412%% en, position : 2, not decisive), ...

As shown, during initial testing we are not enabling MachineLearning as a decisive standalone module. However, it is also deployed as a plugin for SpamC, so that it can increase the score of a message without guaranteeing a block. The SpamC result appears on the same line:

... Spamc (score=-1.4, required=5.0, ... MC_ML1_98 0.3, MC_ML0_98 0.3, ..., position : 6, ham decisive)

In order to evaluate the overall increased load, you can also look for the scanning profile line which reports the time spent on each task:

Aug 16 07:50:07 mc MailScanner[32122]: Profiled SpamCheck for message 1mA2qP-0009vJ-HP: (ClamSpam_Check:0.0307s) (MachineLearning_Check:0.0009s) (MessageSniffer_Check:0.2228s) (Newsl_Check:0.0401s) (NiceBayes_Check:0.0884s) (PreRBLs_Check:0.273s) (Prefilters:9.2604s) (SpamCacheCheck:0.001s) (Spamc_Check:7.1935s) (TrustedSources_Check:0.07s) (UriRBLs_Check:1.3382s)

The scan time is unlikely to be as brief as the one reported here (MachineLearning_Check:0.0009s), but it should not average more than 10% of the total (Prefilters:9.2604s). Time can vary significantly based on message size.
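
To check that average against your own log, the share can be computed from the profile lines. This is a rough sketch that assumes the "Profiled SpamCheck" layout shown above; the function name ml_time_share is hypothetical.

```shell
# Read MailScanner log lines on stdin and print the average share (in percent)
# of Prefilters time spent in MachineLearning_Check, or "no data".
ml_time_share() {
    awk '
    /Profiled SpamCheck/ {
        ml = ""; total = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^\(MachineLearning_Check:/) ml = $i
            if ($i ~ /^\(Prefilters:/)            total = $i
        }
        if (ml == "" || total == "") next
        # Keep only the numeric part, e.g. "(Prefilters:9.2604s)" -> "9.2604".
        gsub(/[^0-9.]/, "", ml)
        gsub(/[^0-9.]/, "", total)
        if (total + 0 > 0) { sum += 100 * ml / total; n++ }
    }
    END { if (n > 0) printf "%.2f\n", sum / n; else print "no data" }
    '
}
```

Run it as ml_time_share < /var/mailcleaner/log/mailscanner/infolog. A value that stays above 10 over a representative period would be worth reporting.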

Some other incidental logs will note when MachineLearning is started and stopped in this file. If you do not see the results above and the only results when searching the file are the module initializing, it probably is not starting correctly:

grep MachineLearning /var/mailcleaner/log/mailscanner/infolog
Aug 16 07:31:21 localhost MailScanner[26745]: MachineLearning module initializing...
...

See Troubleshooting for more info on this.
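
The distinction described above (results present versus only "initializing" lines) can be checked mechanically. This is a sketch that assumes the log wording shown in this document; the function name ml_log_state is hypothetical.

```shell
# Read MailScanner log lines on stdin and classify what they show:
#   running    at least one per-model result line was found
#   init-only  only "module initializing" lines were found
#   absent     no MachineLearning startup or result lines at all
ml_log_state() {
    awk '
    /MachineLearning[0-9]+ result is/     { results++ }
    /MachineLearning module initializing/ { inits++ }
    END {
        if (results > 0)    print "running"
        else if (inits > 0) print "init-only"
        else                print "absent"
    }
    '
}
```

Run it as ml_log_state < /var/mailcleaner/log/mailscanner/infolog. An "init-only" result is the case that the Troubleshooting section addresses.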

Headers

The headers are largely a duplicate of each entry shown in the logs. There will be a unique header for each model that is enabled:

X-MailCleaner-MachineLearning0: is not spam (93.7003493309021%) (200728-ALL-50S) (en)
X-MailCleaner-MachineLearning1: is spam (99.39507246017456%) (200728-ALL-10L) (en)

The longer line with the full results is an exact copy of the SpamCheck header with both the dedicated and SpamC results:

X-MailCleaner-SpamCheck: not spam, ...
    MachineLearning (93.7003493309021% en, position : 2, not decisive),
    MachineLearning (99.39507246017456% en, position : 2, not decisive),
    ...
    Spamc (score=1.9, required=5.0, MC_ML1_98 0.3, MC_ML1_99 1.5,
    ..., position : 10, ham decisive)

SpamC

As shown in the Logs section, SpamC rules are configured to add score to the message based on the module's results. These rules are configured in /usr/mailcleaner/share/spamassassin/70_ML_score.sh.

We are currently working on a means to customize the thresholds and score values assigned. For now, if you are unhappy with the rules provided, you can create another file in the same directory which is sorted alpha-numerically after 70_ML_score.sh. In that file you can use:

score MC_ML0_98 0

for example, to disable or change the score of an existing rule. You can also copy one of the entire rules and make tweaks if you would like to change the threshold.
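
Since the override only takes effect if its file is read after 70_ML_score.sh, it may be worth checking the name you picked. This sketch assumes the directory is read in plain byte-wise (C locale) order, which this document does not state. The function name sorts_after_ml_rules and the file name used below are hypothetical placeholders, including the extension.

```shell
# Succeed (exit status 0) if the candidate file name sorts strictly after
# 70_ML_score.sh in C-locale order; fail otherwise.
sorts_after_ml_rules() {
    last=$(printf '%s\n' "70_ML_score.sh" "$1" | LC_ALL=C sort | tail -n 1)
    [ "$last" = "$1" ] && [ "$1" != "70_ML_score.sh" ]
}
```

For example, sorts_after_ml_rules 71_ML_local.cf && echo "ok" prints "ok", whereas a name beginning with 60_ would fail the check.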

Troubleshooting

As discussed in the Logs section, it is possible that the module may not be starting correctly. This is indicated by nothing other than the "initializing..." line in the logs.

At the moment, starting Machine Learning via the MailScanner init script or directly with:

/usr/mailcleaner/etc/init.d/mld start

does not return proper error codes, so it will always report that MachineLearning started even if it did not. This is something which will be fixed once the module is integrated into the unified init system.

For now, if you see evidence that it is not starting correctly, you can try to trace the stack of execution. The first thing to check is that Docker is starting:

Docker Service Status

/etc/init.d/docker status

Near the bottom of this output you should see a line like:

Aug 23 08:31:23 mailcleaner systemd[1]: Started Docker Application Container Engine.

If this is followed by an error or a subsequent stopping of the service, you can try to start it again, or you may need to resolve the issue stated in the output.

Docker Container Status

If Docker is running, it is possible that the container is unable to be started. You can try to start it manually:

/usr/mailcleaner/etc/init.d/mld stop
docker run -d -v /usr/mailcleaner/scripts/ML/bridge:/bridge -p 8000:80 --name ml ml

Note that the first command is included because, if you skip it while the container is currently locked by another process, you will get a Conflict error. It also means that if you try to start mld without previously stopping it, you could get a detached container.

If the above fails, it is possible that your docker image may not have been loaded and tagged. Check to see if there is a mcml_* entry with:

docker images

If none is listed, find your image by looking for the .tar file:

ls /var/mailcleaner/machine-learning/

The name of this file will vary based on your CPU architecture so you will need to substitute your file below. Load and tag the image:

docker load -i /var/mailcleaner/machine-learning/mcml_YOUR_CPU_ARCH.tar
docker tag mcml_YOUR_CPU_ARCH ml

Try to start it again to see if this resolved the issue.
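
To avoid typos when substituting your file name, the two commands can be generated from the .tar path. This sketch only prints the commands for review and does not run Docker. It assumes the loaded image is named after the .tar file without its extension, as the example above implies. The function names are hypothetical.

```shell
# Derive the image name from the archive path:
#   /path/to/mcml_YOUR_CPU_ARCH.tar -> mcml_YOUR_CPU_ARCH
ml_image_name() {
    basename "$1" .tar
}

# Print (do not execute) the load and tag commands for a given archive.
ml_load_commands() {
    tar=$1
    name=$(ml_image_name "$tar")
    printf 'docker load -i %s\n' "$tar"
    printf 'docker tag %s ml\n' "$name"
}
```

Call ml_load_commands with the path found in /var/mailcleaner/machine-learning/, check the printed lines, and then run them. If "docker load" reports a different image name, use that name in the tag command instead.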

Database Configuration

Check that there is an entry for MachineLearning in the 'prefilter' table of the database:

echo "select name, position from prefilter;" | /usr/mailcleaner/bin/mc_mysql -m mc_config
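
The output of that query can be filtered to show only the MachineLearning entry. This sketch assumes the client prints tab-separated columns (name, then position) when fed from a pipe, and that the entry is named exactly "MachineLearning"; neither is confirmed by this document. The function name ml_prefilter_position is hypothetical.

```shell
# Read the query output on stdin, print the position of the MachineLearning
# prefilter, and fail (non-zero exit status) if no such entry exists.
ml_prefilter_position() {
    awk -F '\t' '$1 == "MachineLearning" { print $2; found = 1 } END { exit found ? 0 : 1 }'
}
```

Pipe the query above into ml_prefilter_position. If it prints nothing and fails, the entry is missing from the table.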

MailScanner Configuration

If the Container startup works without error, there must be a problem with MailScanner actually connecting to and executing the container. Check that the following file exists:

/usr/mailcleaner/etc/mailscanner/prefilters/MachineLearning.cf_template

Check that the configuration file is being properly dumped by seeing that it has a date at about the same time that MailScanner was restarted:

ls -l /usr/mailcleaner/etc/mailscanner/prefilters/MachineLearning.cf
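
If you have just restarted MailScanner, the date check can be expressed as "was this file modified in the last few minutes". This sketch relies on the -mmin option of find, which is a GNU/BSD extension rather than POSIX. The function name dumped_recently and the 10 minute default are hypothetical choices.

```shell
# Succeed if FILE exists and was modified less than MINUTES minutes ago.
# Usage: dumped_recently FILE [MINUTES]   (MINUTES defaults to 10)
dumped_recently() {
    file=$1
    minutes=${2:-10}
    # find prints the path only when the modification time is recent enough;
    # errors (such as a missing file) are silenced and yield empty output.
    [ -n "$(find "$file" -mmin "-$minutes" 2>/dev/null)" ]
}
```

For example, right after a restart: dumped_recently /usr/mailcleaner/etc/mailscanner/prefilters/MachineLearning.cf && echo "fresh". If nothing is printed, the configuration file was probably not regenerated.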

Finally, check that you have not disabled the module as discussed in the first section.

If all of these steps pass, there is an unknown error happening and you'll need to reach out to support so that we can add some debugging code to trace the issue.