For any data scientist, a model that overfits the training data is one of the most common challenges. When a model performs perfectly on the training data but fails to generalize to the test data, it has overfit, and the usual remedies are cross-validation or hyperparameter tuning.
Other times the overfitting goes unnoticed because of its subtle nature: sometimes the problem is plainly visible, while at other times it is hard to catch.
In some cases, cross-validation will not fix the problem. This happens when the test data comes from a different source than the training data: cross-validation only ever sees the training set, so it cannot detect the shift and fails.
The solution to these problems is adversarial validation.
What is Adversarial Validation?
Adversarial validation is a method for detecting the kind of distribution mismatch that leads to overfitting. It identifies how similar the test data is to the training data by analysing the distribution of features: a classifier is built whose only job is to predict which set each row came from.
Rows from the training set are labeled 0 and rows from the test set are labeled 1. If the classifier can tell the two apart, the differences between the sets can be identified quickly and easily. This technique is used mostly in Kaggle competitions.
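The labeling step can be sketched in a few lines. This is a minimal example using pandas with made-up feature columns (`f1`, `f2`) standing in for a real dataset:

```python
import pandas as pd

# Hypothetical train/test frames sharing the same feature columns.
train = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [0.1, 0.2, 0.3]})
test = pd.DataFrame({"f1": [10.0, 11.0, 12.0], "f2": [1.1, 1.2, 1.3]})

# Label each row by its origin: 0 = training set, 1 = test set.
train = train.assign(is_test=0)
test = test.assign(is_test=1)

# Stack the two sets into one frame for the adversarial classifier.
combined = pd.concat([train, test], ignore_index=True)
print(combined["is_test"].tolist())  # → [0, 0, 0, 1, 1, 1]
```

The `is_test` column becomes the target of the adversarial classifier; the original features are its inputs.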
Execution and Application of the Adversarial Validation Technique
To apply the technique to a chosen dataset and evaluate how well it separates the two sets, the following steps are followed:
- Download the data and pre-process it into a usable format.
- Drop unnecessary and irrelevant columns while setting up the remaining ones, and fill empty columns with default values.
- Create a separate column for the validation classifier, containing 0's for training rows and 1's for test rows, then combine the two datasets into one.
- Once the categorical features are encoded, write and train the classifier. A gradient-boosting library such as CatBoost makes this step convenient.
- Plot a ROC curve to tell how well the classifier separates the two sets.
- If there is a large difference between the datasets, plot the feature importances to find the features driving it.
- With this information, remove a few of those features and re-check the model.
- The goal of the entire process is to make it as hard as possible for the adversarial classifier to distinguish between the two kinds of points, that is, the training and test points.
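The steps above can be sketched end to end. This is a minimal illustration on synthetic data, assuming scikit-learn; the article mentions CatBoost, but any binary classifier works, so a random forest stands in here. Feature `f1` is deliberately shifted between the two sets while `f2` is not:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic data: "f1" drifts between train and test, "f2" does not.
train = pd.DataFrame({"f1": rng.normal(0, 1, 500), "f2": rng.normal(0, 1, 500)})
test = pd.DataFrame({"f1": rng.normal(3, 1, 500), "f2": rng.normal(0, 1, 500)})

# Combine the sets and label each row by its origin (0 = train, 1 = test).
X = pd.concat([train, test], ignore_index=True)
y = np.array([0] * len(train) + [1] * len(test))

# Cross-validated probabilities give an honest adversarial AUC:
# an AUC near 0.5 means the classifier cannot tell the sets apart.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
probs = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
auc = roc_auc_score(y, probs)
print(f"adversarial AUC: {auc:.2f}")

# Feature importances reveal which columns drive the separation,
# so "f1" is the candidate to drop and re-check.
clf.fit(X, y)
for name, imp in zip(X.columns, clf.feature_importances_):
    print(name, round(imp, 2))
```

With the drift in `f1`, the adversarial AUC comes out well above 0.5 and `f1` dominates the importance ranking; dropping it and re-running the check would push the AUC back toward 0.5.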
Although adversarial validation is a very good method for identifying differences in distribution, it does not offer any measures to mend them. What it does provide is an analysable adversarial model: the features most responsible for distinguishing the two sets can be found, and the analyst can then drop those features.
In conclusion, adversarial validation can help identify the hidden reasons behind a model's inability to perform optimally. The method can be used to build more robust machine learning models, which makes it popular among Kaggle competitors. Its main drawback is that it is still maturing: it diagnoses problems with data distribution but does not provide solutions to mend them.
Machine learning training is a good fit for people looking for a job in data analysis. An analytics and artificial intelligence course would further deepen that knowledge and improve one's prospects in the field.