Spam or Ham: Some Experiments

What is Spam? The term “spam” originated for me at least at NCSA/UIUC in the early 90’s, where one researcher/graduate student I worked with remarked that [sending out unsolicited emails is like throwing a can of Spam into a running fan].

Here, by Spam, we mean unsolicited emails, the kind sent out to millions of internet users every day. In order to build a spam classifier, it is necessary, or at least extremely helpful, to have a large dataset of both real emails (ham) and unwanted emails (spam). Of course, large email providers like Google and Microsoft , have such datasets internally because they can ask their users to mark their email as spam or not and collect a massive archive. Fortunately for us, there are some commonly available datasets that we can use to build our classifier.

The Enron Spam Dataset
There is a vast amount of text data available from the Enron debacle, including millions of email messages. From a small set of these, researchers [espam] collected email from six Enron employees, arranged them chronologically, and combined them with spam from various sources. They are available from:

http://www.iit.demokritos.gr/skel/i-config
, and

http://www.aueb.gr/users/ion/publications.html

The six datasets have roughly 33K emails total, nearly half ham and half spam, broken down by employee as:

Dataset/Label 1 2 3 4 5 6 all (combined)
Spam 1500 (GP) 1496 1500 (BG) 4500 (GP) 3675 4500 (GP) 17171
Ham 3672 4361 4012 1500 1500 1500 16545
total = Spam or Ham 5975 5857 5512 6000 5175 6000 33716

The researchers originally used these datasets to create a spam filter using Bayesian statistical methods. Here, we will show how to apply state-of-the-art SVMs to do the same.

The Trec Spam Track 2005-2007

The Trec Spam Track was an international research competition, run for three years from 2005 to 2007, sponsored to evaluate and compare a wide number of different spam classification methods, including, among other methods, SVMs. The competition originally provided researchers with ten separate datasets (corpora), totally over 750K emails; four of the ten corpora are publicly available on the web (with a usage agreement), with ~205K emails total, which we describe below.

There are four available datasets: trec05p-1, trec06p, trec06c, and trec07p. The first. trec05p-, is the Trec 2005 Public 5.6 Corpus, and consists of 39399 ham emails obtained from the Enron corpus for 150 executives, and augmented with 52790 spam emails from a public source [spam-book], leaving a total of 92189 total email instances to work with. The Trec 2006 and 2007 corpora consist of privately obtained ham and spam emails, collected from web mailing lists and/or a private mail server. The numbers of ham and spam emails are summarized in the table below. (The Trec 2006 dataset also includes a copora of Chinese emails, trec06c, not used here).

Dataset/Label trec05p-1 trec06p trec07p all (combined)
Spam 52790 24912 50199 127901
Ham 39399 12910 25220 77529
total = Spam or Ham 92189 37822 75419 205430

Supervised Machine Learning: Open Source Tools

The machine learning algorithms we describe are data driven; we “train” them using very large text datasets, and preferably with some kind of human assigned labels. For example, in the spam or ham problem, we have nearly 250K emails, labeled either ham or spam by some user. Still, not every problem is so simple, and we don’t always have labels on our data, and sometimes we may need to discover how the data groups or clusters in order to assign any kind of label. Fortunately, we have access to very high performance, open source, state-of-the-art machine learning algorithms which we can apply to a wide variety of labeled, partially labeled, and even unlabeled data sets with great success.

The easiest case is if all our text data is labeled–and the labels are correct– then we can apply Supervised Learning algorithms, namely, just vanilla Support Vector Machines. There are two commonly used, open source, high performance Support Vector Machine packages: Liblinear, and SVMLight.

Liblinear: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
SvmLight: http://svmlight.joachims.org/

We use LibLinear for most of our analysis, however, it is noted that SvmLight would yield similar results with comparable performance. In fact, both use exactly the same input file formats, although the command line parameters and output formats are slightly different.

Vector Space Model of Semantics

In order to apply SVMs to text, we need to convert the text into numerical values because these techniques apply numerical techniques from vector calculus and computational linear algebra. We introduce the Vector Space Model of Semantics, also known as the Bag of Words (BOW) model. The model takes each document, or chunk of text, and places all the “terms” into a bag (an unordered collection), irrespective of word order or linguistic information. The “terms” include all the words, as well as useful features such as numbers, dates, prices, email addresses, etc. The choice is somewhat arbitrary, although the idea is to include everything that could even be slightly useful (and let the algorithm decide what to throw away)

For every term, we assign a decimal “term weight”, and, also a unique hash code, or index.

For example, suppose we have a collection (corpus) of 1,000,000 emails, and we want to represent one of these emails. If the entire corpus has 200,000 unique words and other features, this gives us a vector space with 200,000 dimensions. If our email has 300 unique terms in it, then it is represented as a 200,000 dimensional vector, with (at most) 300 non-zero values corresponding to the BOW term weights.

We say the vector is sparse because only a small fraction of the term weights,or individual components, values are non-zero.

Will you just give me some code already?!

You can find some ruby classes (and Liblinear)  here:  email-svm.zip

This code generates the input and runs the SVM on the trec corpora.  (We will provide a generator for the Enron dataset soon).

To run the  code, download (at least) the trec07p corpora and place it somewhere like ‘~/Data/spam’ or ‘C:/Data/spam’, and modify the files generate.rb and test.rb to reflect where the data is. You also  need to install mocha to run the tests, and a c++ compiler to compile Liblinear.

Let us look at generate.rb:

_________________________________________________________

#!/usr/bin/env ruby

path_to_trec =  File.expand_path('C:/book-svm/email/trec07p')
unless File.exists?(path_to_trec)
  puts "expected TREC07p data in #{path_to_trec}"
  exit 1
end

STDOUT.sync = true # don't buffer

require 'fileutils'
require File.dirname(__FILE__)+'/lib/trec_corpus'
require File.dirname(__FILE__)+'/lib/feature_hash'
require File.dirname(__FILE__)+'/lib/standard_tokenizer'
require File.dirname(__FILE__)+'/lib/benchlog'

trec_corpus = TrecCorpus.new(path_to_trec)
feature_hash = FeatureHash.new

if File.exists?('feature_hash.txt')
  benchlog 'loading existing feature_hash.txt' do
    feature_hash = FeatureHash.load('feature_hash.txt')
  end
else
  benchlog 'generating feature hash' do
    trec_corpus.each do |label, content|
      StandardTokenizer.new(content).each do |feature|
        feature_hash << feature
      end
      print '.'
    end
  end

  benchlog 'pruning feature hash' do
    feature_hash.prune!(10, 10_000)
  end

  benchlog 'dumping feature hash' do
    feature_hash.dump('feature_hash.txt')
  end
end

if File.exists?('svm.input')
  puts "using existing svm.input"
else
  benchlog 'building svm.input file' do
    File.open('svm.input', 'w') do |f|
      trec_corpus.each do |label, content|
        tokens = StandardTokenizer.new(content).to_a
        if vector = feature_hash.vectorize(tokens)
          f.puts [(label == 'spam' ? -1 : 1), vector].join(" ")
          print '.'
        end
      end
    end
  end
end

FileUtils.cd('vendor/liblinear-1.33') do
  if File.exists?('train') && File.exists?('predict')
    puts 'using existing liblinear binaries'
  else
    benchlog 'building liblinear' do
      system('make')
      unless File.exists?('train') && File.exists?('predict')
        puts "liblinear build appears to have failed: could not find train and/or predict."
        exit(1)
      end
    end
  end
end

benchlog 'training (using liblinear)' do
  system('vendor/liblinear-1.33/train -c 0.1 svm.input svm.mod')
end

benchlog 'predicting (using liblinear)' do
  system('vendor/liblinear-1.33/predict svm.input svm.mod svm.predict')
end


_________________________________________________________

First, we set up the data directory that points to the trec spam data. Then. we instantiate the TrecCorpus and FeatureHash classes (not shown).
TrecCorpus iterates over every file in the directory and yields the file content, for each_spam and/or each_ham file.
FeatureHash maps each feature (i.e. token or word) to a unique integer index. It also has a method called vectorize() that maps a set of features to their ids.

Next, generate.rb uses a StanardTokenizer, extracted from ferret, to tokenize the content into features. Each feature is indexed and hashed, in memory, and then dumped to a text file called ‘feature_hash.txt’.  This file can be reused for running different experiments on the same corpus.

Notice that we prune the feature hash, removing the very common and very infrequent tokens. Specifically, a token is removed it only appears in 10 or less emails, or if it appears in 10_000 or more. This pruning reduces the vector space dimensions an order of magnitude without any loss in accuracy. It also simulates the more commonly used approach of removing so-called stop-words. Indeed, unlike methods like Naive Bayesian classifiers, for SVMs it is generally unnecessary to remove stop words because SVMs naturally can handle enormously large feature spaces. Still, there is no need to be wasteful, and we definitely want to remove the most and least frequent words for efficiency.

Next, we generate the first input file for Liblinear SVM:  ’svm.input’ .

Notice that unlike other ruby svm solutions, we do not try to bind the svm into ruby (nor do we use a non-linear svm!). We care about performance, and we want to run at scale. Liblinear is the latest and fastest implementation of a linear SVM, which is what we need for vanilla text classification. It is C++ and reads a massive input text file and generates text output files. If we were to directly bind the SVM to ruby, we would need to generate the SVM input in ruby–in other words, push a giant array up the Ruby VM stack, create a ridiculously large Ruby memory structure to hold that array as bunch of heavyweight objects (~20 bytes each compared to 4 bytes in C++), and then just pass this down the Ruby VM into the low level C-library. Why in the world would we do all of that?! (We’ll discuss why using a non-linear SVM is pointless in a later blog–suffice it to say the reason Liblinear exists is because you do not need polynomial or RBF kernels for text!)

Indeed, our approach is to always use C++ for performance, and to wire things together using Ruby (and SQL) in an efficient and lightweight fashion. For example, in a future blog, we will show to access the low level Ruby bindings to the C++ tokenizer code in ferret.

Finally, we run Liblinear , first to generate a model file, and, second, to make predictions. We can also run it just once to get a measure of the training accuracy.

…  need to work on this, and then discuss results…

Leave a Reply