How to write a spam classifier with PaddlePaddle

The author is good at explaining deep learning and machine learning algorithms in a simple, easy-to-understand way, is familiar with deep learning frameworks such as TensorFlow and PaddlePaddle, and has been responsible for several applied machine learning projects, such as automatic filtering of spam comments, user classification for precision marketing, and building a distributed deep learning platform, all with good results.

Background introduction

In our daily lives we often receive all kinds of spam: merchant advertisements, discount promotions, Macau gambling emails, financial marketing messages, and so on. Email clients usually block such spam with keyword rules or by categorizing messages, but some always slip through the net. Building a spam classifier yourself, however, is not difficult. Traditional machine learning approaches usually filter spam with algorithms such as Naive Bayes or support vector machines. Today we mainly talk about how to write a spam classifier with PaddlePaddle. Before getting to PaddlePaddle, though, let's review how traditional machine learning algorithms classify spam.

Understanding the data set

First, let's take a look at today's dataset: trec06c. trec06c is a public spam corpus provided by the Text REtrieval Conference (TREC), split into an English dataset (trec06p) and a Chinese dataset (trec06c). The messages it contains are all derived from real email and retain the original format and content. File download address: trec06c. File format:

```
trec06c
└───data
│   │   000
│   │   001
│   │   ...
│   └───215
└───delay
│   │   index
└───full
│   │   index
```
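Each line of full/index carries the label for one message. As far as I recall the trec06c index format, a line looks like `spam ../data/000/000`, a label followed by a relative path; treat that two-column format as an assumption and check your copy of the corpus. A minimal reader might look like this:

```python
# Minimal sketch for reading trec06c's full/index file.
# Assumes lines of the form "<label> <relative path>", e.g. "spam ../data/000/000".
def read_index(index_path):
    pairs = []
    with open(index_path) as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) != 2:
                continue
            label, path = parts          # label is "spam" or "ham"
            pairs.append((label, path))
    return pairs

if __name__ == '__main__':
    pairs = read_index('../trec06c/full/index')
    print('total messages: %d' % len(pairs))
```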

document content:

Example of spam: Our company has a number of ordinary invoices (commodity sales invoices), VAT invoices, special VAT payment forms for customs collection, other service-industry invoices, and road and inland waterway transport invoices, which can be issued for your company at a low tax rate. Our company runs genuine domestic and foreign trade business, guaranteeing the authenticity of the invoices it issues. We hope to cooperate and develop together! We respectfully await your call to discuss. Contact: Mr. Li. Apologies for the interruption, and best regards.

Example of a normal email: It tells the story of Confucius' descendants. An old leader returns to his hometown and has an emotional falling-out with his son, but gets along well with his wealthy grandson Kong Ben. Wei Zongwan, the old leader's younger brother, drives the carriage. A foreign girl investigating folklore spends New Year in their home. Kong Ben has always wanted to go abroad and is lectured by his grandfather. In the end the family is basically reconciled. By the way, let me ask about another film, from Beijing Youth Film Studio.

Data preprocessing

After getting the data we can see the email content clearly, but not all of it is what we need. Here we extract only the Chinese text of each email as the training corpus. If you look carefully you will find that not every email can be opened directly, and the encoding also needs to be converted to UTF-8 for later training. So we need to do some preprocessing on the raw data, covering the following steps.

The basic steps

Convert the source data's encoding to UTF-8

Filter characters

Remove all non-Chinese characters, such as punctuation marks, English letters, numbers, website links, and other special characters.

Filter out stop words

Perform word segmentation on the email content

Training code

The following is the specific code, transfer.py:

```python
# -*- coding: utf-8 -*-
# Created by huxiaoman 2018.1.28
# transfer.py: generate spam and ham data
import jieba
import sys
import os
import re

# Determine whether a string contains Chinese characters
def check_contain_chinese(check_str):
    for ch in check_str.decode('utf-8'):
        if u'\u4e00' <= ch <= u'\u9fff':
            return True
    return False
```
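The original post truncates transfer.py at this point; the full script is on the author's GitHub. As a rough sketch of the remaining steps listed above (decoding source files to Unicode, keeping only Chinese characters, segmenting with jieba, and dropping stop words), under the assumptions that the source mails are GBK-encoded, as is typical for trec06c, and that a hypothetical stopwords.txt holds the stop-word list:

```python
# -*- coding: utf-8 -*-
# Illustrative sketch only, not the author's full transfer.py.
# Assumptions: GBK-encoded source mails; stop words in a hypothetical stopwords.txt.
import re
import codecs
import jieba

CHINESE_RE = re.compile(u'[\u4e00-\u9fff]+')  # keep Chinese characters only

def load_stopwords(path='stopwords.txt'):
    with codecs.open(path, 'r', encoding='utf-8') as f:
        return set(line.strip() for line in f)

def clean_and_segment(raw_bytes, stopwords):
    text = raw_bytes.decode('gbk', 'ignore')      # source encoding to Unicode
    chinese = ''.join(CHINESE_RE.findall(text))   # drop all non-Chinese characters
    words = [w for w in jieba.cut(chinese) if w not in stopwords]
    return ' '.join(words)                        # one segmented line per email
```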

Run the script

run.sh:

```bash
if [ "$1" = "test" ]; then
    echo "test"
    python transfer.py ../test/ ../trec06c/full/index
else
    echo "whole"
    python transfer.py ../trec06c/data/ ../trec06c/full/index
fi
```

How to run:

sh run.sh

Result:

ham.txt: positive samples, i.e. normal email. 21,373 entries in total.

Example: As long as you don't mind, I don't understand what this has to do with your parents... First of all, thank you for comforting me. But I really do feel bad; I have my own pains. I dare not tell my mother about this situation; she is the type who worries very easily, and I am not with her, since my home is out of town. If I told her, she would not agree with me. Health matters enormously to mothers: one year my brother-in-law often had nosebleeds, and my mother was particularly worried.

spam.txt: negative samples, i.e. spam. 41,627 entries in total.

Example: Hello, the following are Hong Kong travel pictures, scenic spots, and so on, sent especially to you.

word2vec.txt: the segmented content of all emails, used as the training corpus for Word2Vec. About 63,000 entries in total.

Example: I think the debt itself doesn't matter; what matters most is being able to bear the responsibility. Owing so much money, he should at least give an explanation to the netizens who showed him kindness in the first place. Perhaps he has the intention but really lacks the ability; instead he no longer even logs into his ID, and even his mobile number has been changed... Never mind those out of town, even netizens in Beijing can't find him... He was on the Shuimu forum at the time, and I watched the whole thing unfold.

Generate word vector

Limitations of traditional methods

We know that segmented text cannot be fed into a model directly; we first need to convert words into word vectors so that each word is represented by a multi-dimensional vector. The traditional method is one-hot encoding: each word is represented by a long vector whose length equals the size of the dictionary. Exactly one component of the vector is 1 and all the rest are 0, with the position of the 1 corresponding to the word's position in the dictionary. For example, "computer" might be represented as [0 0 0 0 0 1 0 0 0 0] and "headset" as [0 0 0 0 0 0 0 0 1 0]. With sparse storage this representation is concise and compact, but it has several shortcomings. First, it is prone to the curse of dimensionality, especially when used in deep learning algorithms. Second, it cannot capture similarity between words: any two words are isolated from each other, and nothing in the two vectors tells us whether they are related, so most of the information is lost and the results can deviate badly.
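A tiny illustration over a hypothetical five-word dictionary makes both problems easy to see: the vectors grow with the dictionary, and every pair of distinct words looks equally unrelated.

```python
# One-hot encoding over a toy dictionary (illustrative only).
import numpy as np

dictionary = [u'computer', u'headset', u'spam', u'invoice', u'hello']
word_to_id = {w: i for i, w in enumerate(dictionary)}

def one_hot(word):
    vec = np.zeros(len(dictionary))
    vec[word_to_id[word]] = 1.0   # exactly one component is 1
    return vec

# Any two different words are orthogonal: no notion of similarity survives.
a, b = one_hot(u'computer'), one_hot(u'headset')
print(np.dot(a, b))  # 0.0 for every distinct pair
```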

Advantages of the Word2Vec method

Back in 1986, Hinton proposed distributed representation, which overcomes the shortcomings of one-hot encoding. The basic idea is to represent each word directly as an ordinary dense vector, which generally looks like this: [0.792, −0.177, −0.107, 0.109, −0.542, ...], with dimensions of 50 or 100 being common. How to map a word to such a vector has to be learned through training; there are many training methods, and word2vec is the most common one. Note that the vector obtained for a given word may differ across corpora and training methods. The word-vector dimension is generally modest, far smaller than a dictionary-sized one-hot vector, so the risk of the curse of dimensionality is greatly reduced compared with one-hot representation. Moreover, word vectors produced by a good training algorithm have a spatial meaning: taken together they form a word-vector space in which each vector is a point, and the distance between two word vectors in this space reflects the "distance" between the corresponding words, that is, their grammatical and semantic similarity. One nice application: once word vectors are trained, to find the word most similar to a word A, the computer only needs to compute the Euclidean or cosine distance between A's vector and every other word's vector; the word with the smallest distance is the most similar. So here we choose word2vec to train and generate word vectors. For the principles behind word2vec there is plenty of material online, so I won't repeat it here.
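To make that nearest-word lookup concrete, here is a small sketch of cosine similarity over a few toy vectors; the numbers are made up purely for illustration, and with a trained gensim model the same idea is available through model.wv.most_similar.

```python
# Cosine similarity between word vectors (toy values, illustrative only).
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def most_similar(word, vectors):
    """vectors: dict mapping word -> np.array; returns the nearest other word."""
    target = vectors[word]
    candidates = ((w, cosine(target, v)) for w, v in vectors.items() if w != word)
    return max(candidates, key=lambda pair: pair[1])

vecs = {
    u'computer': np.array([0.79, -0.18, 0.11]),
    u'laptop':   np.array([0.75, -0.20, 0.09]),
    u'invoice':  np.array([-0.50, 0.60, 0.30]),
}
print(most_similar(u'computer', vecs))  # expect ('laptop', ...)
```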

Implementation code

The word2vec.txt file we generated during data preprocessing can now be used to train a word2vec model and produce word vectors. The specific implementation code is as follows (word2vec.py):

```python
# -*- coding: utf-8 -*-
# Created by huxiaoman 2018.1.28
# word2vec.py: generate word2vec model
import os
import sys
import numpy as np
from gensim.models.word2vec import Word2Vec
from gensim.corpora.dictionary import Dictionary
import codecs

reload(sys)
sys.setdefaultencoding("utf-8")

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in codecs.open(os.path.join(self.dirname, fname), "r",
                                    encoding="utf-8", errors="ignore"):
                yield line.strip().split()

# Directory containing the word2vec.txt data
train_path = "rawData/"
# Directory where the trained model is saved
model_path = "modelPath/"
sentences = MySentences(train_path)
# min_count=5 ignores words appearing fewer than 5 times,
# size=100 sets the word-vector dimension, workers=15 uses 15 threads
model = Word2Vec(sentences, min_count=5, size=100, workers=15)
# Save the model
model.save(model_path + 'Word2vec_model.pkl')
```

How to run:

python word2vec.py

Result:

Word2vec_model.pkl
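A quick way to sanity-check the saved model, assuming a gensim version where the model.wv interface is available; the query word 发票 ("invoice") is just an example and must appear at least min_count times in your corpus:

```python
# Load the saved model and inspect it (illustrative check).
from gensim.models.word2vec import Word2Vec

model = Word2Vec.load('modelPath/Word2vec_model.pkl')
print(model.wv[u'发票'].shape)                  # (100,), from size=100 above
print(model.wv.most_similar(u'发票', topn=5))   # nearest words by cosine similarity
```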

Model training

After generating the positive and negative samples and converting all words into word vectors, we can feed the data into a model for training. In this article we use a traditional machine learning algorithm, the SVM, for training.
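One step the text glosses over is how a variable-length email becomes a fixed-length input for the SVM. A common choice, assumed here since the author's loading code is only on GitHub, is to average the word vectors of all words in the email:

```python
# -*- coding: utf-8 -*-
# Sketch: average word vectors to get one fixed-length feature per email.
# The averaging scheme is an assumption, not necessarily the author's code.
import numpy as np
from gensim.models.word2vec import Word2Vec

model = Word2Vec.load('modelPath/Word2vec_model.pkl')
dim = model.vector_size  # 100, matching size=100 above

def email_to_vec(line):
    words = line.strip().split()
    vecs = [model.wv[w] for w in words if w in model.wv]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)
```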

Specific steps

Load the dataset

Split the data into a training set (train), a validation set (val), and a test set (test)

Define the model and train it

Evaluate accuracy on the validation set

Implementation code

```python
# Code for building the SVM model, loading the data, etc.: see GitHub
from sklearn import metrics
from sklearn.metrics import roc_curve
from sklearn.svm import SVC

def get_svm_model(x_train, y_train, x_val, y_val):
    model = SVC(C=1, kernel='rbf', max_iter=10, gamma=1, probability=True)
    model.fit(x_train, y_train)
    pred = model.predict(x_val)
    fpr, tpr, thresholds = roc_curve(y_val, pred, pos_label=2)
    score = metrics.f1_score(y_val, pred)
    print score
```
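The surrounding data loading and splitting are only shown on the author's GitHub. A minimal driver might look like the sketch below, reusing email_to_vec from earlier; the file names match the preprocessing output, but the label values (2 for spam, to match pos_label=2) and the 70/15/15 split are assumptions:

```python
# -*- coding: utf-8 -*-
# Illustrative driver for get_svm_model (assumed labels and split ratios).
import codecs
import numpy as np
from sklearn.model_selection import train_test_split

def load(path, label):
    with codecs.open(path, encoding='utf-8') as f:
        return [(email_to_vec(line), label) for line in f]

data = load('ham.txt', 1) + load('spam.txt', 2)   # 2 = spam, matching pos_label=2
X = np.array([v for v, _ in data])
y = np.array([l for _, l in data])

x_train, x_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
x_val, x_test, y_val, y_test = train_test_split(x_tmp, y_tmp, test_size=0.5, random_state=42)
get_svm_model(x_train, y_train, x_val, y_val)
```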

How to run:

python train_svm.py

Result:

0.73343221

summary

This article serves as a warm-up for the series on handling spam with PaddlePaddle. It mainly covered how to preprocess and filter text data, how to generate word vectors, and how to train a model with a traditional machine learning method, the support vector machine, reaching an F1 score of 0.73343221. The quality of the result depends on the size of the dictionary, the word-vector dimension, and the tuning of the SVM's basic parameters; in practice you need to keep adjusting parameters to get the best result. In the next article we will show how to use PaddlePaddle to classify spam with deep learning, and see whether the results beat the traditional machine learning approach and whether performance and speed improve to some extent.
