Random Forest is one of the most versatile machine learning algorithms in use today, and ensemble learning is built into it. Ensemble learning means combining several machine learning models to predict one final outcome. As the name suggests, the algorithm creates a forest, that is, a group of decision trees, each of which can be seen as a model in its own right. In general, adding trees makes the result more stable and usually more accurate, although the gain levels off once the forest is large enough.
Random Forest can be used for both classification and regression. In this article we look at Random Forest used for classification.
Random Forest Algorithm: Random Forest is a tree-based algorithm that builds several decision trees and then combines their outputs to improve the generalization ability of the model. Combining several decision trees in this way is a form of ensemble learning, where we consult several models to arrive at a final, better prediction. Say you want to watch a movie, but before you buy the ticket you would like some opinions about it. You ask 20 friends, each of whom has different likes and dislikes when it comes to genre. A few like romance, others like action, and someone else likes science fiction. Each of them asks you different questions to understand your tastes and, based on your answers, advises you whether to watch the movie or not. You then go with the advice most of them give.
Similarly, Random Forest builds each tree on a sample of the dataset that contains a subset of the variables as well as a subset of the observations. For example, if our dataset has 20 variables and 1000 observations, the first decision tree might be built on 15 variables and 700 observations. The second sample would have a different set of variables and observations. Because each sample differs in both its variables and its observations, the correlation among the resulting trees is low. This is one of the reasons why Random Forest is at times preferred over plain Bagging, which samples only the observations. (In the standard formulation of the algorithm, observations are drawn with replacement and a fresh random subset of variables is considered at every split rather than once per tree; the per-tree description used here is a simplification.)
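To make the sampling idea concrete, here is a minimal sketch in Python that draws the row and column indices for two trees, using the 20-variable, 1000-observation example above. The function name `draw_sample` and its parameters are illustrative assumptions, not part of any library, and observations are drawn without replacement only to match the "700 of 1000" example.

```python
import random

def draw_sample(n_observations, n_variables, n_rows, n_cols, rng):
    """Return (row_indices, column_indices) for one tree's training sample."""
    # Assumption: rows are drawn without replacement to mirror the example
    # in the text; classic bagging would draw them with replacement.
    rows = rng.sample(range(n_observations), n_rows)
    cols = rng.sample(range(n_variables), n_cols)
    return sorted(rows), sorted(cols)

rng = random.Random(0)  # seeded generator so repeated runs behave the same
rows_a, cols_a = draw_sample(1000, 20, 700, 15, rng)
rows_b, cols_b = draw_sample(1000, 20, 700, 15, rng)

print(len(rows_a), len(cols_a))  # 700 15
print(rows_a == rows_b)          # the two trees see different observations
```

Each tree would then be trained only on the rows and columns it was given, which is what keeps the trees from all making the same mistakes.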
How Random Forest works:
To understand how Random Forest works, it is important to first understand how a decision tree works.
Root Node: The topmost decision node in a tree, which corresponds to the best predictor, is called the root node.
Split Node: A split node divides a node into two or more sub-nodes. It is also known as a decision node.
Leaf Node: Nodes that do not split are known as leaf nodes or terminal nodes.
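The three node types can be illustrated with a small hand-built tree. This is only a sketch: the `Node` class, the feature names `debt_ratio` and `income`, and the thresholds are hypothetical, and for simplicity every split node here has exactly two children.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Split (decision) nodes set feature, threshold and both children;
    # leaf (terminal) nodes set only `label`.
    feature: Optional[str] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None    # branch taken when value <= threshold
    right: Optional["Node"] = None   # branch taken when value > threshold
    label: Optional[str] = None

    def is_leaf(self):
        return self.label is not None

def predict(node, observation):
    """Walk from the root node down to a leaf and return the leaf's label."""
    while not node.is_leaf():
        if observation[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label

# Hypothetical tree: the root splits on debt_ratio, one split node on income.
root = Node(
    feature="debt_ratio", threshold=0.4,
    left=Node(label="Non-Default"),
    right=Node(
        feature="income", threshold=30000,
        left=Node(label="Default"),
        right=Node(label="Non-Default"),
    ),
)

print(predict(root, {"debt_ratio": 0.2, "income": 25000}))  # Non-Default
print(predict(root, {"debt_ratio": 0.6, "income": 25000}))  # Default
```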
The Random Forest algorithm builds several decision trees, each on randomly selected data, and then puts them together to get the final result. We can say that Random Forest takes advantage of the wisdom of the crowd.
Random Forest follows these steps:
Step 1: Pick a random sample from the training data set. It is not just the observations that are picked randomly; a sample of the predictors is picked randomly as well.
Step 2: Build a decision tree based on these data points.
Step 3: Choose the number of trees we want to build and repeat Steps 1 and 2. Every repetition draws a new random set of variables and observations from the full set of variables and observations.
Step 4: To predict an output, the Random Forest algorithm passes the new observation through each of the decision trees built above and collects their predictions.
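The four steps above can be sketched in a few lines of Python. To keep the example short, each "tree" is a one-split decision stump rather than a full decision tree, and the toy data, function names and parameters are all assumptions made for illustration only.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y, rows, cols):
    """Fit a one-split tree on the sampled rows, trying only the sampled columns."""
    best = None  # (errors, column, threshold, left_label, right_label)
    for col in cols:
        for threshold in sorted({X[r][col] for r in rows}):
            left = [y[r] for r in rows if X[r][col] <= threshold]
            right = [y[r] for r in rows if X[r][col] > threshold]
            if not right:
                continue  # threshold at the maximum value: nothing is split off
            left_label, right_label = majority(left), majority(right)
            errors = (sum(l != left_label for l in left)
                      + sum(l != right_label for l in right))
            if best is None or errors < best[0]:
                best = (errors, col, threshold, left_label, right_label)
    if best is None:  # no usable split: always predict the majority class
        label = majority([y[r] for r in rows])
        return (cols[0], float("inf"), label, label)
    return best[1:]

def predict_stump(stump, x):
    col, threshold, left_label, right_label = stump
    return left_label if x[col] <= threshold else right_label

def fit_forest(X, y, n_trees, n_rows, n_cols, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):                          # Step 3: repeat Steps 1 and 2
        rows = rng.sample(range(len(X)), n_rows)      # Step 1: random observations
        cols = rng.sample(range(len(X[0])), n_cols)   # Step 1: random predictors
        forest.append(fit_stump(X, y, rows, cols))    # Step 2: build a tree
    return forest

def tree_votes(forest, x):
    """Step 4: collect one prediction per tree for a new observation."""
    return [predict_stump(stump, x) for stump in forest]

# Toy data (made up): low values go with Non-Default, high values with Default.
X = [[0.10, 0.20, 0.15], [0.15, 0.10, 0.25], [0.20, 0.25, 0.10], [0.25, 0.15, 0.20],
     [0.80, 0.90, 0.75], [0.85, 0.75, 0.90], [0.90, 0.80, 0.85], [0.75, 0.85, 0.80]]
y = ["Non-Default"] * 4 + ["Default"] * 4

forest = fit_forest(X, y, n_trees=5, n_rows=6, n_cols=2)
print(tree_votes(forest, [0.95, 0.95, 0.95]))  # five votes, all 'Default'
```

The toy data is deliberately easy to separate, so every tree agrees; on real data the trees would disagree on some observations, which is why their predictions have to be combined.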
In the case of classification, Random Forest assigns the new data point to the category that wins the majority of votes. For example, if we are trying to predict Default or Non-Default status, we refer to each of the decision trees and take the output with the maximum number of votes. In other words, we choose the mode of the trees' outputs as the final result.
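As a small illustration of the voting step, the sketch below takes a made-up list of per-tree predictions and returns the mode. The helper name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most common label (the mode) and its share of the votes."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

# Hypothetical predictions from five trees for one new data point.
votes = ["Default", "Non-Default", "Default", "Default", "Non-Default"]
label, share = majority_vote(votes)
print(label, share)  # Default 0.6
```

With two classes, using an odd number of trees avoids ties; in this sketch a tie would go to whichever label appears first in the list.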
In classification trees, the splitting decision is based on one of the following measures:
Gini Index: A measure of node impurity. A smaller Gini index means a purer node, and a value of 0 means every observation in the node belongs to the same class. For a split to be worthwhile, the size-weighted average Gini index of the child nodes should be lower than that of the parent node.
Entropy: Another measure of node impurity. For a two-class problem, entropy is at its maximum when the class probability is p = 0.5 and at its minimum (zero) when the probability is 0 or 1.
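Both measures are easy to compute from the class labels in a node. The sketch below uses the usual definitions, Gini = 1 - sum(p_k^2) and entropy = -sum(p_k * log2(p_k)), and compares a parent node with the two child nodes of a candidate split. The labels and helper names are made up for illustration.

```python
import math
from collections import Counter

def class_proportions(labels):
    total = len(labels)
    return [count / total for count in Counter(labels).values()]

def gini(labels):
    """Gini index: 1 - sum(p_k^2); 0 for a pure node."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Entropy in bits: sum(-p_k * log2(p_k)); 0 for a pure node."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))

def weighted_impurity(children, impurity):
    """Size-weighted average impurity of the child nodes of a split."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * impurity(child) for child in children)

parent = ["Default"] * 5 + ["Non-Default"] * 5   # evenly mixed, p = 0.5
left = ["Default"] * 4 + ["Non-Default"] * 1     # mostly Default
right = ["Default"] * 1 + ["Non-Default"] * 4    # mostly Non-Default

print(gini(parent), entropy(parent))             # 0.5 1.0
print(weighted_impurity([left, right], gini))    # roughly 0.32
```

Because the weighted Gini index of the children (roughly 0.32) is lower than the parent's 0.5, this candidate split reduces impurity and would be worth making.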
Advantages and Disadvantages of Random Forest:
- It is robust to correlated predictors
- It can be used for both regression and classification
- It can handle missing data internally, although this depends on the implementation; some libraries require missing values to be imputed beforehand
- Adding more trees does not cause a Random Forest classifier to overfit, although very deep individual trees can still fit noise in the data
- The main limitation of Random Forest is that a large number of trees can make the algorithm slow and impractical for real-time predictions.