Random forest demystified

Machine Learning Brainweights

What is random forest?

Random forest is a supervised machine learning algorithm that builds multiple decision trees and ensembles (combines together) their outputs to improve predictions.

Random forest is used to solve both regression and classification problems.

Click here to read about decision trees to better understand random forest.

How Random forest works

  • Given sample data, random forest begins by building multiple uncorrelated decision trees.
  • Each tree is trained via the “bagging” or “bootstrap aggregation” method: a random subset of the training data, sampled with replacement, is fitted by each individual tree.
  • Random forest then ensembles the individual uncorrelated decision trees, aggregating their predictions into a single output.

“Wisdom of the crowd” is the concept behind random forest: gather information from multiple independent sources, then amalgamate it to give an overall decision.

This is the main reason why random forest usually predicts better than a single decision tree. The sketch below shows the idea in miniature.
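To make “wisdom of the crowd” concrete, here is a minimal sketch of bootstrap aggregation built by hand from plain decision trees. The synthetic dataset, the tree count, and the max_features choice are illustrative assumptions of mine, not details from this article.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for "sample data"
x, y = make_regression(n_samples=200, n_features=5, random_state=1)

rng = np.random.default_rng(1)
trees = []
for _ in range(10):
    # Bagging: draw a bootstrap sample (rows sampled with replacement)
    idx = rng.integers(0, len(x), size=len(x))
    # max_features="sqrt" helps decorrelate the trees, as random forest does
    trees.append(DecisionTreeRegressor(max_features="sqrt").fit(x[idx], y[idx]))

# Ensemble: average the individual tree predictions into one output
print(np.mean([tree.predict(x[:3]) for tree in trees], axis=0))

Because each tree sees a different bootstrap sample, their errors are partly independent, and averaging cancels some of that noise.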

Implementing a random forest in Python


i) Random Forest Regressor.

Below is sample code showing how to fit a Random Forest regressor model (x holds the feature columns, y the continuous target).

from sklearn.ensemble import RandomForestRegressor

# n_estimators sets the number of trees in the forest
RF = RandomForestRegressor(n_estimators=10)
RF = RF.fit(x, y)
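For a self-contained version you can run directly, here is a sketch on synthetic data; make_regression and the parameter values are my own illustrative choices rather than details from the article.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data: 200 rows, 5 feature columns
x, y = make_regression(n_samples=200, n_features=5, random_state=1)

RF = RandomForestRegressor(n_estimators=10).fit(x, y)
print(RF.predict(x[:3]))  # predictions for the first three rows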

ii) Random Forest Classifier.

Below is sample code showing how to implement a Random Forest classifier (here y holds class labels rather than continuous values).

from sklearn.ensemble import RandomForestClassifier

RF = RandomForestClassifier(n_estimators=10)
RF = RF.fit(x, y)
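And here is a runnable end-to-end sketch of the classifier, using the iris dataset as stand-in example data; the dataset and the train/test split are my additions, not from the article.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Features and class labels
x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

RF = RandomForestClassifier(n_estimators=10).fit(x_train, y_train)
print(RF.score(x_test, y_test))  # mean accuracy on the held-out split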

Random forest parameters

Parameter tuning enhances the quality of predictions and reduces the error rate of the model.

Below are the most common random forest hyperparameters that can markedly improve the quality of your predictions if tweaked well.

a) n_estimators. Refers to the number of trees that the random forest algorithm will build. The default is 100 in recent scikit-learn versions (it was 10 before version 0.22). Choosing a high number slows down training but can give better, more stable results.

b) max_features. Refers to the maximum number of features (columns in the dataset) that random forest is allowed to try at each split in an individual tree.

The following are some of the values you can provide for the max_features hyperparameter.

  1. None. This option considers all the features at every split. (The old “auto” setting meant the same thing for the regressor.)
  2. “sqrt”. This option takes the square root of the total number of features. For example, if the total number of variables is 100, only 10 features will be considered at each split.
  3. 0.2 (or any float). This option allows random forest to consider only 20% of the total features available.

c) min_samples_split. This parameter sets the minimum number of samples required to split a node: if a node contains fewer samples than the minimum, the split is avoided and the node becomes a leaf.

d) min_samples_leaf. This parameter is checked before a node is generated: if a possible split would leave a child with fewer samples than the minimum, the split is avoided (since the child would not have enough samples to be a leaf) and the node is replaced by a leaf.
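To see how these parameters fit together, here is a minimal sketch; the specific values are illustrative starting points of mine, not tuned recommendations.

from sklearn.ensemble import RandomForestClassifier

RF = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features="sqrt",   # features considered at each split
    min_samples_split=4,   # nodes with fewer samples become leaves
    min_samples_leaf=2,    # each child of a split must keep at least 2 samples
)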

Model training

Below are hyperparameters that enhance the training process.

i) n_jobs. This parameter tells the engine how many processors it is allowed to use. For example, if n_jobs equals 1, only one processor is used during training; n_jobs=-1 uses all available processors.

ii) random_state. A fixed random state value, for example random_state=1, will always produce the same result across multiple runs of the algorithm, making it reproducible.
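A short sketch combining both; the values are illustrative.

from sklearn.ensemble import RandomForestClassifier

# n_jobs=-1 uses every available processor; a fixed random_state makes
# repeated runs produce the identical forest
RF = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=1)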

Click here to see a simple implementation of a random forest classifier on an actual dataset.
