A Modified Naïve Bayesian-based Spam Filter using Support Vector Machine








Abstract

Spam is an ever-growing problem that threatens current e-mail systems. Spam is unsolicited bulk e-mail, frequently financial in nature, and its volume creates the need for an anti-spam filter. Among the many spam filtering techniques, one of the more advanced, Naïve Bayesian filtering, has been implemented together with a Support Vector Machine (SVM). Spammers pay close attention to filtering techniques and adapt to them, so dynamic filtering is needed, and the proposed method meets that demand. The algorithm splits a received e-mail into tokens and uses Bayes' theorem to calculate the spam probability of each token, then combines these values to determine the total spam probability of the mail. Using an SVM instead of corpora is one of the added features of the algorithm. The most challenging part was feeding both words and whole sentences into the SVM as tokens and feature vectors. Including sentences in the training dataset increased the accuracy of separating spam from ham. The Natural Language Toolkit (NLTK) is used to tokenize sentences and, to some extent, to recognize sentences of the same type that carry a similar meaning. Because a test mail is compared with the training datasets both word by word and sentence by sentence to decide whether it is spam, the performance of the filter improves. With some simple modifications, the filter can be used on both the server and the client side. Its efficiency increases gradually with the number of e-mails it processes.
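
The token-level Bayes step described in the abstract can be sketched as follows. This is a minimal illustration and not the project's actual code: the function names, the regex tokenizer (used here in place of NLTK to keep the sketch dependency-free), the equal spam/ham priors, and the add-one smoothing are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (stand-in for NLTK)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(spam_mails, ham_mails):
    """Count, per token, how many spam mails and how many ham mails contain it."""
    spam_counts = Counter(t for m in spam_mails for t in set(tokenize(m)))
    ham_counts = Counter(t for m in ham_mails for t in set(tokenize(m)))
    return spam_counts, ham_counts, len(spam_mails), len(ham_mails)


def token_spam_probability(token, model):
    """P(spam | token) by Bayes' theorem, assuming equal priors.

    Add-one smoothing (an assumption of this sketch) keeps the result
    strictly between 0 and 1, even for tokens never seen in training.
    """
    spam_counts, ham_counts, n_spam, n_ham = model
    p_token_given_spam = (spam_counts[token] + 1) / (n_spam + 2)
    p_token_given_ham = (ham_counts[token] + 1) / (n_ham + 2)
    return p_token_given_spam / (p_token_given_spam + p_token_given_ham)


def mail_spam_probability(text, model):
    """Combine per-token probabilities under the naive independence assumption."""
    log_spam = 0.0
    log_ham = 0.0
    for token in set(tokenize(text)):
        p = token_spam_probability(token, model)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Same value as prod(p) / (prod(p) + prod(1 - p)), computed in log space;
    # the clamp avoids overflow in exp() for very long mails.
    diff = max(min(log_ham - log_spam, 700.0), -700.0)
    return 1.0 / (1.0 + math.exp(diff))
```

A mail would then be flagged as spam when `mail_spam_probability` exceeds a chosen threshold such as 0.5; the threshold value is likewise an assumption, not something the abstract specifies.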


Modules


Algorithms

Support vector machine
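
As a rough illustration of feeding both words and whole sentences to an SVM, the sketch below builds a feature set from word tokens plus normalized sentences and fits a linear SVM by subgradient descent on the hinge loss. Everything here is assumed for the example: the `w:`/`s:` feature prefixes, the regex-based sentence split (the project itself uses NLTK), the hand-written trainer (a library SVM would normally be used), and the hyperparameter values.

```python
import re


def extract_features(text):
    """Return word tokens plus whole normalized sentences as binary features."""
    sentences = [s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {"w:" + w for w in words} | {"s:" + s for s in sentences}


def train_linear_svm(mails, labels, epochs=50, lam=0.01, lr=0.1):
    """Fit a linear SVM with subgradient descent on the L2-regularized hinge loss.

    labels use +1 for spam and -1 for ham.
    """
    weights, bias = {}, 0.0
    data = [(extract_features(m), y) for m, y in zip(mails, labels)]
    for _ in range(epochs):
        for feats, y in data:
            margin = y * (sum(weights.get(f, 0.0) for f in feats) + bias)
            # L2 regularization: shrink every weight slightly on each step.
            for f in weights:
                weights[f] *= 1.0 - lr * lam
            if margin < 1.0:
                # Hinge loss is active: move the active features toward label y.
                for f in feats:
                    weights[f] = weights.get(f, 0.0) + lr * y
                bias += lr * y
    return weights, bias


def predict(text, model):
    """Return +1 (spam) or -1 (ham) from the sign of the linear score."""
    weights, bias = model
    score = sum(weights.get(f, 0.0) for f in extract_features(text)) + bias
    return 1 if score >= 0 else -1
```

Sentence features only fire on an exact match after normalization, so in this sketch they mainly help with spam that reuses the same phrasing; matching sentences of similar meaning, as the abstract describes, would need additional language processing.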


Software And Hardware

• Hardware:
  Processor: i3 or i5
  RAM: 4 GB
  Hard disk: 16 GB
• Software:
  Operating system: Windows 2000/XP/7/8/10
  Tools: Anaconda, Jupyter, Spyder, Flask
  Frontend: Python
  Backend: MySQL