Machine Learning Powers Data Leak Detection

Machine Learning Models Power Data Leak Detection

Data Science is a critical cog in CybelAngel’s Data Leak detection. Our Machine Learning models protect customers against data breaches by augmenting the expertise of CybelAngel security analysts. A human-only solution would take an extremely long time to pick out actual leaks from all the false positives generated by the exponential growth of data exchanged on the web. Efficient Machine Learning models allow for effective triage between false positives and potential threats, which reduces the overall feed of alerts analysts have to investigate. But how do we train a mathematical model to act like a human? By training and retraining it. Let’s see how we do it.

Preparing the Training Session

At CybelAngel, we build models to discriminate the “safe” data from the “shouldn’t-be-out-there” data. This is a classification task with only two classes: the negative one, “not critical,” and the positive one, “critical document.” Before we start the training per se, we prepare the dataset. We clean it, meaning we do some deduplication and under-sampling, and there it is, all shiny! The dataset is then split into training, testing and validation subsets, with all the necessary checks to make sure they are similarly distributed. We have also explored the features and chosen the ones we want to use. Now we can move to the training phase, which unfolds into five key components. These steps are not strictly sequential: you can tune your model by starting at any point in the process, as long as you complete each step.
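
The preparation steps above (deduplication, under-sampling, and a split that keeps the subsets similarly distributed) can be sketched in a few lines of plain Python. This is only an illustration on toy data, not CybelAngel’s pipeline: the function names, the 3:1 class ratio, and the 60/20/20 split are all assumptions made for the example.

```python
import random
from collections import defaultdict


def deduplicate(rows):
    """Keep the first occurrence of each (document, label) pair."""
    seen, unique = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def group_by_label(rows):
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)
    return groups


def undersample(rows, max_ratio, seed=0):
    """Randomly drop rows so no class exceeds max_ratio x the smallest class."""
    rng = random.Random(seed)
    groups = group_by_label(rows)
    cap = int(max_ratio * min(len(g) for g in groups.values()))
    kept = []
    for group in groups.values():
        kept += rng.sample(group, cap) if len(group) > cap else group
    return kept


def stratified_split(rows, fractions=(0.6, 0.2), seed=0):
    """Split class by class so every subset keeps the same class proportions."""
    rng = random.Random(seed)
    train, test, valid = [], [], []
    for group in group_by_label(rows).values():
        rng.shuffle(group)
        a = round(fractions[0] * len(group))
        b = a + round(fractions[1] * len(group))
        train += group[:a]
        test += group[a:b]
        valid += group[b:]  # whatever remains goes to validation
    return train, test, valid


# Toy dataset (hypothetical): 90 "not critical" documents, 10 "critical" ones.
rows = [(f"doc-{i}", 0) for i in range(90)] + [(f"leak-{i}", 1) for i in range(10)]
rows += rows[:5]  # inject a few duplicates

clean = undersample(deduplicate(rows), max_ratio=3.0)
train, test, valid = stratified_split(clean)
for name, part in [("train", train), ("test", test), ("valid", valid)]:
    positives = sum(label for _, label in part)
    print(f"{name}: {len(part)} rows, {positives} critical")
```

Splitting each class separately is one simple way to get subsets with matching class proportions; checks on the feature distributions themselves would come on top of that.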

#1 – Benchmarking Algorithms

Building a Machine Learning model starts with choosing the right algorithms. You can use neural networks (NN), but you can also use good standard models. Although NNs are trendy and often presented as better learners, it is worth comparing them with others, such as Random Forest Classifiers (RFC), logistic regression (logreg), or even the k-nearest neighbors algorithm (kNN). At CybelAngel, our data scientists work with all these algorithms to adapt our approach to the datasets. Our target? Interpretability and robustness. Robustness is very important because our data changes month-to-month!
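
A benchmark of this kind boils down to scoring every candidate on the same folds of the same data. The sketch below does that for two of the algorithms named above, kNN and logistic regression, written from scratch so the example has no dependencies. The toy data, the 5-fold setup, and all function names are assumptions for illustration; in practice a library implementation would normally be used.

```python
import math
import random


def fit_knn(X, y, k=5):
    """k-nearest neighbors: majority vote among the k closest training points."""
    def predict(x):
        nearest = sorted(range(len(X)), key=lambda i: math.dist(X[i], x))[:k]
        return 1 if 2 * sum(y[i] for i in nearest) > k else 0
    return predict


def fit_logreg(X, y, epochs=50, lr=0.1):
    """Logistic regression trained with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))
            w = [wj - lr * (p - yi) * xj for wj, xj in zip(w, xi)]
            b -= lr * (p - yi)

    def predict(x):
        return 1 if b + sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0
    return predict


def cross_val_accuracy(fit, X, y, folds=5):
    """Average held-out accuracy over `folds` interleaved folds."""
    scores = []
    for fold in range(folds):
        held_out = set(range(fold, len(X), folds))
        train_idx = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(X[i]) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / folds


# Toy data (hypothetical): two Gaussian blobs, one per class.
rng = random.Random(0)
y = [i % 2 for i in range(200)]
X = [[rng.gauss(3.0 * label - 1.5, 1.0), rng.gauss(3.0 * label - 1.5, 1.0)]
     for label in y]

candidates = {"kNN": fit_knn, "logreg": fit_logreg}
results = {name: cross_val_accuracy(fit, X, y) for name, fit in candidates.items()}
for name, acc in sorted(results.items(), key=lambda item: -item[1]):
    print(f"{name}: mean accuracy {acc:.3f}")
```

Accuracy is used here only because the toy classes are balanced; on a real, imbalanced leak-detection dataset a different metric would probably be a better yardstick.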

#2 – Choosing Hyperparameters

Most models need hyperparameters to complete the algorithms. The trick with fine-tuning hyperparameters is that it’s easy to fall into overfitting. A good example is the RFC we use for one of our models. This model is sensitive to overfitting. Why? RFC is an ensemble method based on trees. You have to choose how deep the trees may go, and how much data one leaf must contain. If you keep splitting branches until there are only a few instances per leaf, you can be sure you will craft a model unable to generalize well from the training data to unseen data. Our target? Optimal goodness of fit with generalization ability.
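
To see why these two hyperparameters matter, the sketch below grows a single decision tree (standing in for a whole forest, to keep the example short) on noisy toy data with two settings: a shallow tree with large leaves, and a tree that keeps splitting until every leaf is pure. The data, the parameter values, and the names `max_depth` and `min_samples_leaf` are assumptions for illustration only.

```python
import random


def gini(labels):
    """Gini impurity of a non-empty list of 0/1 labels."""
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def build_tree(X, y, max_depth, min_samples_leaf, depth=0):
    """Grow a binary decision tree; a leaf is simply the majority label."""
    majority = 1 if 2 * sum(y) >= len(y) else 0
    if depth >= max_depth or len(set(y)) == 1:
        return majority
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            left = [yi for x, yi in zip(X, y) if x[f] <= t]
            right = [yi for x, yi in zip(X, y) if x[f] > t]
            if min(len(left), len(right)) < min_samples_leaf:
                continue  # this split would create a leaf that is too small
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:
        return majority
    _, f, t = best
    lo = [i for i, x in enumerate(X) if x[f] <= t]
    hi = [i for i, x in enumerate(X) if x[f] > t]
    return (f, t,
            build_tree([X[i] for i in lo], [y[i] for i in lo],
                       max_depth, min_samples_leaf, depth + 1),
            build_tree([X[i] for i in hi], [y[i] for i in hi],
                       max_depth, min_samples_leaf, depth + 1))


def predict(tree, x):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree


def accuracy(tree, X, y):
    return sum(predict(tree, x) == yi for x, yi in zip(X, y)) / len(y)


def make_data(n, rng, noise=0.2):
    """Toy data: label is 1 when x0 + x1 > 0, then flipped with probability `noise`."""
    X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    y = [int(a + b > 0) ^ int(rng.random() < noise) for a, b in X]
    return X, y


rng = random.Random(0)
X_train, y_train = make_data(150, rng)
X_valid, y_valid = make_data(150, rng)

# (max_depth, min_samples_leaf) for each hypothetical setting
settings = {"shallow": (2, 20), "deep": (len(X_train), 1)}
results = {}
for name, (max_depth, min_leaf) in settings.items():
    tree = build_tree(X_train, y_train, max_depth, min_leaf)
    results[name] = (accuracy(tree, X_train, y_train), accuracy(tree, X_valid, y_valid))
    print(f"{name}: train {results[name][0]:.2f}, validation {results[name][1]:.2f}")
```

The unconstrained tree memorizes the training set, noise included, so its training score says little about unseen data. The gap between the training and validation scores is the signal to watch when tuning.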

#3 – Implementing the Sample Weight Method

At CybelAngel, our classification goal is risk detection, which means not all documents have the same importance. While it’s troublesome to have an invoice in the wild, it could be devastating to your business to have the blueprints for your headquarters sold on a dark forum! CybelAngel’s classification differentiates between troublesome risks and ones that could wreak havoc on your business. We assign different sample weights to the learning instances: the more severe the data breach, the more important it is for the algorithm to learn to classify it well. Our target: assign a coefficient to each type of data, based on its level of severity.
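
One way to read sample weights is through the loss the algorithm minimizes: each instance’s error is multiplied by its coefficient, so a mistake on a severe document costs more than the same mistake on a mild one. The sketch below shows this with a weighted cross-entropy. The severity categories and their coefficients are invented for the example and are not CybelAngel’s actual values.

```python
import math

# Hypothetical severity levels and coefficients, chosen only for illustration.
SEVERITY_WEIGHT = {"not_critical": 1.0, "invoice": 2.0, "blueprint": 10.0}


def weighted_log_loss(y_true, p_pred, weights):
    """Weighted average of the per-document cross-entropy."""
    total = 0.0
    for y, p, w in zip(y_true, p_pred, weights):
        total -= w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)


doc_types = ["not_critical", "invoice", "blueprint"]
y_true = [0, 1, 1]  # 1 = "critical document"
weights = [SEVERITY_WEIGHT[t] for t in doc_types]
uniform = [1.0] * len(doc_types)

# Predicted probability of "critical" for two imaginary models.
misses_invoice = [0.1, 0.2, 0.9]
misses_blueprint = [0.1, 0.9, 0.2]

for name, preds in [("misses invoice", misses_invoice),
                    ("misses blueprint", misses_blueprint)]:
    plain = weighted_log_loss(y_true, preds, uniform)
    weighted = weighted_log_loss(y_true, preds, weights)
    print(f"{name}: unweighted {plain:.3f}, weighted {weighted:.3f}")
```

Without weights the two models look equally good. With weights, the model that misses the blueprint is penalized far more, which is what steers training toward getting the severe cases right.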

#4 – Using Learning Curves

There’s a perpetual argument around how much data is enough data to ensure the quality and the robustness of the model. We’ve got friends in learning curves. Learning curves represent the performance of the model on a specific metric according to the size of your dataset. As the dataset grows, the performance on the test set increases while the performance on the training set decreases, which reflects how well the training generalizes. Once the performance on the test set stops increasing and comes close to the performance on the training set, that’s it! We can say that we have enough data to build a general model. Learning curves also tell you if you are in an overfitting or under-fitting situation. They tell you “how well you are learning,” that is, whether your model is ready to go to production. If you are overfitting, you might wish to go back to fine-tuning your hyperparameters.
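
Computing a learning curve only requires retraining the same model on growing slices of the training set and scoring it twice each time, once on the slice it saw and once on a fixed test set. The sketch below does this with a deliberately simple nearest-centroid classifier on toy data. The classifier, the data, and the chosen sizes are assumptions for illustration, not the models discussed in this article.

```python
import math
import random


def fit_centroids(X, y):
    """Nearest-centroid classifier: store the mean point of each class."""
    centroids = {}
    for label in (0, 1):
        points = [x for x, yi in zip(X, y) if yi == label]
        centroids[label] = [sum(col) / len(points) for col in zip(*points)]
    return centroids


def accuracy(centroids, X, y):
    hits = 0
    for x, yi in zip(X, y):
        predicted = min(centroids, key=lambda label: math.dist(centroids[label], x))
        hits += predicted == yi
    return hits / len(y)


def make_data(n, rng):
    """Toy data: two overlapping Gaussian blobs, labels alternating 0, 1, 0, 1..."""
    y = [i % 2 for i in range(n)]  # every prefix of the data contains both classes
    X = [[rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0)] for label in y]
    return X, y


def learning_curve(X_train, y_train, X_test, y_test, sizes):
    """Return (size, train accuracy, test accuracy) for each training-set size."""
    curve = []
    for n in sizes:
        model = fit_centroids(X_train[:n], y_train[:n])
        curve.append((n,
                      accuracy(model, X_train[:n], y_train[:n]),
                      accuracy(model, X_test, y_test)))
    return curve


rng = random.Random(0)
X_train, y_train = make_data(400, rng)
X_test, y_test = make_data(500, rng)

sizes = [10, 20, 50, 100, 200, 400]
curve = learning_curve(X_train, y_train, X_test, y_test, sizes)
for n, train_acc, test_acc in curve:
    print(f"n={n:4d}  train={train_acc:.3f}  test={test_acc:.3f}")
```

Reading the printed table from top to bottom gives the curve: when the two columns have flattened and sit close together, adding more data of the same kind is unlikely to help much.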

#5 – Retraining Machine Learning Models

Models never stay at their optimum for long. The data and the behavior change constantly. Online learning is one solution to stay at the edge and to keep your model consistent between what it is trained on and what it processes. The idea is to get feedback on the predictions of the model and make it learn from that feedback.

At CybelAngel, the Data Science team works hand-in-hand with CybelAngel’s cyber analysts. This last filter allows for validation of the models with human expertise. The cyber analysts classify the instances, while the Data Scientists use the feedback to update the model and keep it sharp.

This is how we train Machine Learning models at CybelAngel to copy the behavior of a cyber analyst addressing a potential leak. But Machine Learning models won’t replace human investigation. They can only enhance it. By sorting out billions of true negatives, these models allow cyber analysts to focus their time and skills on critical threats rather than false positives. It’s this unique combination of Machine Learning and Human Expertise that enables CybelAngel to provide comprehensive, scalable, and actionable coverage to our enterprise customers across the globe.
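
As a small illustration of the feedback loop described in this section, the sketch below trains a logistic regression incrementally, lets the data drift, and then feeds the model batches of freshly labelled instances, playing the role of analyst feedback. Everything here is hypothetical: the `OnlineLogReg` class, its `partial_fit` method, the toy drift (the informative feature changes), and the batch sizes.

```python
import math
import random


class OnlineLogReg:
    """Logistic regression that can keep learning from new labelled batches."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(w * v for w, v in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

    def predict(self, x):
        return 1 if self.predict_proba(x) >= 0.5 else 0

    def partial_fit(self, X, y):
        """One SGD pass over a batch, starting from the current weights."""
        for x, label in zip(X, y):
            error = self.predict_proba(x) - label
            self.w = [w - self.lr * error * v for w, v in zip(self.w, x)]
            self.b -= self.lr * error


def make_batch(n, rng, signal_feature):
    """Toy data: only `signal_feature` separates the two classes."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = [rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)]
        x[signal_feature] += 2.0 if label else -2.0
        X.append(x)
        y.append(label)
    return X, y


def accuracy(model, X, y):
    return sum(model.predict(x) == label for x, label in zip(X, y)) / len(y)


rng = random.Random(0)
model = OnlineLogReg(n_features=2)
model.partial_fit(*make_batch(300, rng, signal_feature=0))  # initial training

# The data drifts: the class now depends on the other feature.
X_new, y_new = make_batch(300, rng, signal_feature=1)
acc_before = accuracy(model, X_new, y_new)

# Simulated analyst feedback: six small labelled batches from the new regime.
for _ in range(6):
    X_feedback, y_feedback = make_batch(50, rng, signal_feature=1)
    model.partial_fit(X_feedback, y_feedback)
acc_after = accuracy(model, X_new, y_new)

print(f"accuracy on drifted data: before {acc_before:.2f}, after {acc_after:.2f}")
```

The model is never retrained from scratch here; each feedback batch nudges the existing weights, which is the basic idea behind online learning.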