Random Forest: Regression

In this article we look at regression trees rather than classification trees. The basic concept behind a regression tree is very similar to the one we covered in Random Forest Classification. As we know, Random Forest is a form of Ensemble Learning.

The basic difference between Random Forest classification and regression:

In classification, we classify a new data entry by majority vote, that is, the mode of the outputs we get from the individual decision trees, whereas in regression we assign the new data point the average of the values predicted by the individual decision trees.
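The difference in how the trees' outputs are combined can be illustrated with a short Python sketch. The function names and the five per-tree outputs are made up for illustration; they are not taken from any library.

```python
from collections import Counter
from statistics import mean

def aggregate_classification(tree_votes):
    """Return the most common class label among the per-tree predictions."""
    return Counter(tree_votes).most_common(1)[0][0]

def aggregate_regression(tree_predictions):
    """Return the average of the per-tree numeric predictions."""
    return mean(tree_predictions)

# Hypothetical outputs of five decision trees for one new data point.
votes = ["yes", "no", "yes", "yes", "no"]
values = [10.0, 12.0, 11.0, 13.0, 14.0]

print(aggregate_classification(votes))  # yes
print(aggregate_regression(values))     # 12.0
```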

Random Forest Algorithm: Random Forest is a tree-based algorithm that builds several decision trees and then combines their output to improve the generalization ability of the model. This method of combining several decision trees is referred to as ensemble learning, where several models are combined into a final model that typically performs better than any single one of them.

Say you want to watch a movie, but before you buy the ticket you would like to hear some reviews of it. You ask 20 friends, each of whom has different likes and dislikes when it comes to movie genres. A few like romance, others may like action, and someone else likes science fiction. Each of them asks you different questions to understand your likes and dislikes and, based on your answers, helps you decide whether or not to watch the movie.

Similarly, Random Forest builds each tree on a sample of the dataset that contains a subset of the variables as well as a subset of the observations. For example, if our dataset has 20 variables and 1000 observations, the first decision tree might be built on 15 variables and 700 observations. The second sample would have a different set of variables and observations. Because every sample has a different set of variables as well as a different set of observations, the correlation between the resulting trees is low. This is one of the reasons why Random Forest is at times preferred over Bagging, which resamples the observations but lets every tree use all of the variables.
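The sampling in the example above (15 of 20 variables, 700 of 1000 observations) can be sketched with Python's standard library. `draw_sample` is a hypothetical helper, and drawing without replacement is an assumption made to match the numbers in the example; many implementations instead draw a bootstrap sample of the observations with replacement and redraw the subset of variables at every split.

```python
import random

def draw_sample(n_observations, n_variables, n_rows, n_cols, rng):
    """Pick the row indices and column indices that one tree will see."""
    rows = sorted(rng.sample(range(n_observations), n_rows))
    cols = sorted(rng.sample(range(n_variables), n_cols))
    return rows, cols

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Two trees, each with its own subset of observations and variables.
rows_a, cols_a = draw_sample(1000, 20, 700, 15, rng)
rows_b, cols_b = draw_sample(1000, 20, 700, 15, rng)

print(len(rows_a), len(cols_a))  # 700 15
print(cols_a)
print(cols_b)
```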

How Random Forest works:

To understand how Random Forest works, it is important to understand how a decision tree works.

Root Node: The topmost decision node in a tree, which corresponds to the best predictor, is called the root node.

Split Node: A split node, also known as a decision node, divides a sub-node into two or more nodes.

Leaf Node: Nodes that do not split are known as leaf nodes or terminal nodes.
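As a rough illustration of these three kinds of node, here is a minimal Python sketch of a binary regression tree. The `Node` class, its field names, and the thresholds are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None      # index of the predictor tested at this node
    threshold: Optional[float] = None  # go left if x[feature] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: Optional[float] = None      # prediction stored at a leaf node

    def is_leaf(self):
        return self.value is not None

def predict(node, x):
    """Walk from the root node down to a leaf node and return its value."""
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value

# The root node splits on predictor 0; its right child is a split node
# on predictor 1; the three nodes holding a value are leaf nodes.
tree = Node(feature=0, threshold=5.0,
            left=Node(value=10.0),
            right=Node(feature=1, threshold=2.0,
                       left=Node(value=20.0),
                       right=Node(value=30.0)))

print(predict(tree, [3.0, 9.0]))  # 10.0
print(predict(tree, [7.0, 1.0]))  # 20.0
print(predict(tree, [7.0, 9.0]))  # 30.0
```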

The Random Forest algorithm builds several decision trees, each one on randomly selected data, and then puts them together to get the final result. We can say the Random Forest algorithm takes advantage of the wisdom of the crowd.

Random Forest follows these steps:

Step 1: Pick a random sample from the training data set. It is not just the observations that are picked randomly; a sample of the predictors is picked randomly as well.

Step 2: Build a decision tree based on these data points.

Step 3: Choose the number of trees we want to build and repeat steps 1 and 2. Every time we repeat steps 1 and 2, the sample randomly picks variables as well as observations from the total set of variables and observations.

Step 4: To predict an output, the Random Forest algorithm passes the new data point through each of the decision trees built above and assigns it the average of all of the predicted Y values.
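The four steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: every function name is hypothetical, the observations are drawn with replacement (a bootstrap sample), each tree gets one random subset of predictors, and the default of one third of the predictors is an assumed rule of thumb.

```python
import random
from statistics import mean

def rss(values):
    """Residual sum of squares of `values` around their mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(X, y, features, depth=0, max_depth=3, min_size=2):
    """Step 2: grow a regression tree using only the predictors in `features`."""
    if depth >= max_depth or len(y) < 2 * min_size or len(set(y)) == 1:
        return mean(y)  # leaf: predict the mean of its observations
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            if len(left) < min_size or len(right) < min_size:
                continue
            score = rss([y[i] for i in left]) + rss([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return mean(y)
    _, f, t, left, right = best
    grow = lambda idx: build_tree([X[i] for i in idx], [y[i] for i in idx],
                                  features, depth + 1, max_depth, min_size)
    return (f, t, grow(left), grow(right))  # split node

def predict_tree(tree, x):
    """Follow the splits until a leaf (a plain number) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

def fit_forest(X, y, n_trees=25, n_features=None, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    n_features = n_features or max(1, p // 3)  # assumed default
    forest = []
    for _ in range(n_trees):                          # Step 3: repeat
        rows = [rng.randrange(n) for _ in range(n)]   # Step 1: random observations
        features = rng.sample(range(p), n_features)   # Step 1: random predictors
        forest.append(build_tree([X[i] for i in rows],
                                 [y[i] for i in rows], features))
    return forest

def predict_forest(forest, x):
    """Step 4: average the predictions of all trees."""
    return mean([predict_tree(tree, x) for tree in forest])

# Toy data: three predictors, target depends on the first one only.
X = [[float(i), float((i * 7) % 10), float((i * 3) % 5)] for i in range(40)]
y = [2.0 * row[0] for row in X]

forest = fit_forest(X, y, n_trees=25, n_features=2)
print(predict_forest(forest, [30.0, 0.0, 0.0]))
```

Because each leaf predicts a mean of training targets and the forest averages the leaves, a prediction always lies between the smallest and largest target seen in training.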

In the case of regression, Random Forest assigns the new data point the average of the values predicted by the decision trees used. For example, if we are trying to predict expected loss, the Random Forest algorithm would refer to each of the decision trees and return the average of all of their predicted values for expected loss.

In regression trees (where the output is predicted using the mean of the observations in the terminal nodes), the splitting decision is based on minimizing the residual sum of squares (RSS). The variable that leads to the greatest possible reduction in RSS is chosen for the split at the root node. The tree splitting takes a top-down greedy approach, also known as recursive binary splitting. We call it “greedy” because the algorithm makes the best split available at the current step rather than saving a split that might give better results at future nodes.
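A single greedy split on one predictor can be worked through in Python. This is a minimal sketch with made-up data; `rss` and `best_split` are hypothetical helper names, and the split rule `x <= threshold` is an assumption of the example.

```python
from statistics import mean

def rss(values):
    """Residual sum of squares of `values` around their mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total RSS) of the split x <= threshold with the lowest RSS."""
    best_threshold, best_rss = None, float("inf")
    # Skip the largest value: it would leave the right-hand side empty.
    for t in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        total = rss(left) + rss(right)
        if total < best_rss:
            best_threshold, best_rss = t, total
    return best_threshold, best_rss

x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

print(rss(y))            # 341.5, the RSS before any split
print(best_split(x, y))  # (3, 4.0)
```

Splitting at `x <= 3` separates the two clusters, so each side is predicted by its own mean (6.0 and 21.0) and the RSS drops from 341.5 to 4.0. A full tree repeats this search over every available predictor and then recurses into each side.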

Advantages of Random Forest:

1. It is robust to correlated predictors
2. It can be used for both regression and classification
3. It can handle missing data internally, depending on the implementation
4. It is much less prone to overfitting than a single decision tree, because averaging many trees reduces variance
