Improving the accuracy of machine learning models. How?
Fitting a machine learning model can be as simple as A, B, C, D, but making that model predict accurately, with as little error as possible, is the most difficult part of a data scientist's job; hence the motivation to write this.
This blog post aims to enlighten you on the various ways you can improve the accuracy of your machine learning models. For each method a worked example is provided, and at the end of the blog you will find a link to a GitHub repository containing the implementation of some of these strategies. The following methods have proven to greatly improve accuracy and, in turn, reduce the error rate in prediction.
1) More data.
This is the oldest trick in the book. As you know, the quality of a machine learning model's predictions relies heavily on the quality and the amount of data provided to the model. If too little data is provided, the model will not have enough to learn from, and the best way to mitigate this is to gather more data.
2) Feature Selection.
Charles Babbage, who is regarded as the father of computers, is known for coining the mantra "Garbage in, garbage out", abbreviated as GIGO. This principle applies heavily in machine learning when it comes to feature selection.
Given a particular dataset (assuming we are working with a supervised algorithm), each independent variable carries a certain weight in how it affects the dependent variable (the variable to be predicted). Selecting variables that barely affect the target means the model will not have sufficient information to derive insights, and hence the quality of predictions will be poor.
Feature selection is an imperative step in any data science project, and there are specific methods that data scientists can leverage to determine which features to pick from a large dataset. The following are those methods.
a) Filter Methods.
A filter method is a feature selection method where features are dropped based on how they relate to the output, independently of any particular model.
Filter methods include:
i) Information Gain. This refers to the amount of information one variable provides about another. To implement it, the mutual_info_classif function from sklearn is used. The code below is an illustration of how to implement information gain.
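Here is one way this could look in code. It is a minimal sketch: the breast cancer dataset from sklearn is just an illustrative stand-in for your own data.

```python
# Minimal sketch of information gain (mutual information) based selection.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Mutual information between each feature and the target (higher = more informative).
mi_scores = pd.Series(mutual_info_classif(X, y, random_state=42), index=X.columns)
print(mi_scores.sort_values(ascending=False).head(10))
```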
ii) Correlation coefficient. This refers to the strength of the linear relationship between two columns, represented as a value between -1 and 1, where an absolute correlation of 0.5 and above (positive or negative) is generally considered strong. The code below is an illustration.
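A minimal sketch of correlation-based filtering with pandas; the 0.5 cutoff mirrors the rule of thumb above, and the dataset is an illustrative choice.

```python
# Minimal sketch of correlation-based feature selection with pandas.
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame  # features plus a 'target' column

# Absolute Pearson correlation of every feature with the target.
corr_with_target = df.corr()["target"].drop("target").abs()

# Keep features whose absolute correlation is at least 0.5 (illustrative cutoff).
selected = corr_with_target[corr_with_target >= 0.5].index.tolist()
print(selected)
```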
iii) Variance Threshold. This feature selection strategy drops columns / variables whose variance does not meet a certain threshold. The code below is an illustration.
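A minimal sketch using sklearn's VarianceThreshold; the 0.1 threshold is an arbitrary illustrative value that should be tuned to the scale of your data.

```python
# Minimal sketch of variance-threshold feature selection.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import VarianceThreshold

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Drop every feature whose variance is below 0.1 (illustrative threshold).
selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)

print("Features kept:", list(X.columns[selector.get_support()]))
print("Shape before/after:", X.shape, X_reduced.shape)
```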
iv) Fisher’s score. This feature selection method implements Fisher’s score algorithm to rank features by importance. The code below is an illustration.
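Libraries such as skfeature ship a ready-made Fisher score, but a hand-rolled NumPy version keeps the idea visible; this sketch and its dataset are illustrative.

```python
# Minimal hand-rolled sketch of Fisher's score: the between-class variance of a
# feature divided by its within-class variance, computed per feature.
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

def fisher_score(X, y):
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for cls in np.unique(y):
        X_cls = X[y == cls]
        n_cls = X_cls.shape[0]
        between += n_cls * (X_cls.mean(axis=0) - overall_mean) ** 2
        within += n_cls * X_cls.var(axis=0)
    return between / within

scores = fisher_score(X, y)
top5 = np.argsort(scores)[::-1][:5]
print("Top 5 features by Fisher score:", list(data.feature_names[top5]))
```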
v) Chi-square Test. This feature selection method is used for categorical features. The code below is an illustration.
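A minimal sketch with SelectKBest and chi2. Note that chi2 expects non-negative (ideally categorical or count) features, so the continuous breast cancer data is used here purely to show the API.

```python
# Minimal sketch of chi-square feature selection.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# chi2 requires non-negative values; k=5 is an illustrative choice.
selector = SelectKBest(score_func=chi2, k=5)
selector.fit(X, y)

print("Selected features:", list(X.columns[selector.get_support()]))
```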
Other filter methods include Mean Absolute Difference and Dispersion Ratio.
b) Wrapper Methods.
In these methods, the importance of a feature is based on the output performance of the classifier.
Here, the training data is divided into subsets. A model is then trained and, subsequently, features are added or removed depending on the performance of the model. The model is then retrained with the new set of features.
Some of the wrapper methods include: forward feature selection, backward feature elimination, exhaustive feature selection and recursive feature elimination (RFE). The code below is an illustration of recursive feature elimination.
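A minimal sketch of one wrapper method, recursive feature elimination; the decision tree estimator and the number of features to keep are illustrative choices.

```python
# Minimal sketch of recursive feature elimination (a wrapper method).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Repeatedly fit the estimator and drop the weakest feature until 5 remain.
rfe = RFE(estimator=DecisionTreeClassifier(random_state=42), n_features_to_select=5)
rfe.fit(X, y)

print("Selected features:", list(X.columns[rfe.support_]))
```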
c) Intrinsic Methods (Embedded methods).
Intrinsic methods are a hybrid of wrapper and filter methods. They work by embedding feature selection in the modelling process itself. These methods are iterative, meaning they involve fitting and refitting the machine learning model as the features are evaluated. Some of these methods include the following:
i) Lasso Regression.
This method implements the L1 regularization technique. Regularization entails adding a penalty to the parameters being fed into the model to avoid issues such as overfitting. In linear models, the penalty is applied to the coefficients of each independent variable, and an L1 penalty can shrink some coefficients all the way to zero, effectively dropping those features.
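A minimal sketch of Lasso used for feature selection on a regression dataset (diabetes); the alpha value is illustrative and in practice would be tuned, for example with LassoCV.

```python
# Minimal sketch of L1 (Lasso) regularization used for feature selection.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True, as_frame=True)

# Standardize first because the L1 penalty is sensitive to feature scale.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.5))
model.fit(X, y)

# Coefficients shrunk exactly to zero correspond to dropped features.
coefs = model.named_steps["lasso"].coef_
kept = [name for name, c in zip(X.columns, coefs) if c != 0]
print("Features kept by Lasso:", kept)
```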
ii) Random Forest Importance.
Various machine learning algorithms such as random forests and decision trees expose an attribute, clf.feature_importances_ (where clf is the variable used to instantiate the model), that helps ML engineers rank the most important features. These results then guide feature selection based on importance.
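A minimal sketch of reading feature_importances_ from a fitted random forest; the dataset and the number of trees are illustrative.

```python
# Minimal sketch of random forest feature importance.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X, y)

# Importances sum to 1; higher means the feature contributed more to the splits.
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```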
3) Outlier Detection and Removal.
An outlier is any data point within a dataset that falls outside the range of normal observations. An outlier can lie at either extreme, too high or too low, and in either case outliers greatly affect the performance of the model, because the model tries to accommodate these data points even though they are rare cases. The best way to deal with outliers is to get rid of them, so the model learns only what is relevant instead of capturing rare data points, which helps avoid overfitting.
The following are some methods that you can use to detect outliers.
a) Interquartile Range Method.
The IQR is the difference between the third and first quartiles. Statistically, most normal observations within a dataset fall within this range, so the IQR is used as the basis of outlier detection and removal. The code below is an illustration.
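A minimal sketch of IQR-based outlier removal on a single pandas column; the column name and the conventional 1.5 multiplier are illustrative choices.

```python
# Minimal sketch of outlier removal using the interquartile range (IQR).
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame
col = "mean area"  # illustrative column choice

q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # 1.5 is the usual rule of thumb

df_no_outliers = df[(df[col] >= lower) & (df[col] <= upper)]
print("Rows before:", len(df), "after:", len(df_no_outliers))
```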
b) Anomaly Detection Algorithms.
Outliers can also be detected through the use of anomaly detection algorithms such as Isolation Forest.
This approach is best suited to unsupervised machine learning cases. The code below is an illustration.
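A minimal sketch using sklearn's IsolationForest; the contamination value (the expected fraction of outliers) is an assumption you would adjust for your own data.

```python
# Minimal sketch of outlier detection with Isolation Forest.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest

X, _ = load_breast_cancer(return_X_y=True, as_frame=True)

# contamination=0.05 assumes roughly 5% of rows are outliers (illustrative).
iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(X)  # -1 marks outliers, 1 marks normal points

X_clean = X[labels == 1]
print("Rows flagged as outliers:", int((labels == -1).sum()))
```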
4) Hyperparameter tuning.
As mentioned at the beginning of this blog post, fitting a machine learning model is as simple as A, B, C, D, but making the model produce quality predictions is normally the difficult part. Assuming you have added more data, selected the necessary features and successfully removed outliers, that is still not enough. The other important part is choosing specific values for the model's hyperparameters. Before the techniques discussed below were invented, hyperparameter tuning was mostly trial and error, since there is no set of parameter values for a specific algorithm that works optimally on all datasets. Today, the main hyperparameter tuning methods include the following.
a) Grid Search CV
Grid search iterates through the whole set of parameter values provided, generates every possible combination of the hyperparameter values and fits the given model with each of these combinations.
The result is a data frame that contains the accuracies of the different folds and a rank of all hyperparameter combinations according to the accuracy that each set of parameters produces. The code below is an illustration.
The set of hyperparameters that produces the best average accuracy ranks in position 1, and these are the hyperparameters that should be used to fit the model.
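A minimal sketch of GridSearchCV around a random forest; the parameter grid is an illustrative choice, not a recommendation.

```python
# Minimal sketch of hyperparameter tuning with GridSearchCV.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Every combination of these values is tried (2 x 3 = 6 candidates per CV fold).
param_grid = {"n_estimators": [100, 200], "max_depth": [None, 5, 10]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X, y)

results = pd.DataFrame(grid.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]])
print("Best parameters:", grid.best_params_)
```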
b) Randomized Search CV
Randomized search is similar to grid search, but unlike grid search it does not generate every possible combination of the hyperparameters. Instead, it picks a fixed number of combinations at random, fits the model with each, and produces a data frame that shows the accuracy of each sampled set of hyperparameters.
Randomized search is often faster, since it does not have to evaluate all possible combinations of the hyperparameter values; candidate values are sampled randomly, making it quicker than grid search. The code below is an illustration.
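A minimal sketch of RandomizedSearchCV; n_iter caps how many random combinations are sampled, and the distributions below are illustrative.

```python
# Minimal sketch of hyperparameter tuning with RandomizedSearchCV.
import pandas as pd
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_distributions = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(2, 15),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=10,        # only 10 random combinations are evaluated
    cv=5,
    random_state=42,
)
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]])
print("Best parameters:", search.best_params_)
```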
5) Ensemble learning.
This method involves combining two or more models to improve the quality of the output and to mitigate the drawbacks of each individual model. It is similar to building a hybrid that draws inferences from multiple models, which tends to perform better than inferencing from a single model.
The various ensemble techniques include:
a) Bagging.
Also known as bootstrap aggregation. Here, the variance of the model is reduced by training each model on a different bootstrap sample of the data.
Bagging comprises two major parts: aggregating the models, and bootstrapping, which is a statistical sampling method, hence the name bootstrap aggregation.
Therefore, the different data samples obtained through bootstrapping are fitted to the ensemble of models and their predictions are aggregated. The code below is an illustration.
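A minimal sketch of bagging with sklearn's BaggingClassifier, which by default trains decision trees on bootstrap samples of the data; the number of estimators is illustrative.

```python
# Minimal sketch of bagging (bootstrap aggregation).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# By default each of the 100 decision trees is fitted on a bootstrap sample,
# and their predictions are aggregated by majority vote.
bagging = BaggingClassifier(n_estimators=100, random_state=42)
print("Bagging CV accuracy:", cross_val_score(bagging, X, y, cv=5).mean())
```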
b) Boosting.
Boosting methods reduce bias by turning weak learners into strong learners. In this method, models are trained sequentially, with each new learner concentrating on the samples the previous learners misclassified. Examples of boosting algorithms are XGBoost, AdaBoost, Gradient Boosting, etc. The code below is an illustration.
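A minimal sketch of boosting with sklearn's GradientBoostingClassifier; XGBoost and AdaBoost follow the same fit / cross-validate pattern, and the hyperparameter values here are illustrative.

```python
# Minimal sketch of boosting: weak learners are trained sequentially, each one
# correcting the errors of the ensemble built so far.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

boosting = GradientBoostingClassifier(
    n_estimators=200,     # number of sequential weak learners (illustrative)
    learning_rate=0.1,
    random_state=42,
)
print("Boosting CV accuracy:", cross_val_score(boosting, X, y, cv=5).mean())
```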
Here are some resources that can help you gain more knowledge on this topic.
https://github.com/keithmartinkinyua/House-price-prediction — Sample code
https://www.youtube.com/watch?v=X3Wbfb4M33w&t=44s — Implementation of Bagging and Boosting
https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/ — Feature selection
https://www.simplilearn.com/tutorials/machine-learning-tutorial/bagging-in-machine-learning — Bagging