Exploring XG-Boost: Extreme Gradient Boosting

XG-Boost is short for eXtreme Gradient Boosting. It started as a research project by Tianqi Chen as part of the Distributed (Deep) Machine Learning Community (DMLC) group. Since its release in 2014, XG-Boost has become one of the most widely used machine learning algorithms.

XG-Boost is well known for providing better solutions than many other machine learning algorithms. In fact, since its inception, it has become the state-of-the-art machine learning algorithm for structured (tabular) data.

What Makes XG-Boost so popular?

Speed and performance: Originally written in C++, it is comparatively faster than other ensemble classifiers.

Core algorithm is parallelizable: Because the core XG-Boost algorithm is parallelizable, it can harness the power of multi-core computers. It can also run on GPUs and across networks of machines, which makes it feasible to train on very large datasets.

Consistently outperforms other algorithms: It has shown better performance on a variety of machine learning benchmark datasets.

Wide variety of tuning parameters: XG-Boost exposes parameters for cross-validation, regularization, user-defined objective functions, missing-value handling, tree growth, a scikit-learn compatible API, and more.

XG-Boost is a supervised learning algorithm: it is used for supervised learning problems, where we use training data (with multiple features) to predict a target variable.
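
As a minimal sketch of this supervised setup, using the scikit-learn compatible API mentioned above (the dataset and parameter values here are purely illustrative, not recommendations):

```python
# Minimal supervised-learning sketch with XG-Boost's scikit-learn compatible API.
# Dataset and hyper-parameter values are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)                 # features and target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = XGBClassifier(n_estimators=100, learning_rate=0.3, max_depth=6)
model.fit(X_train, y_train)                                # learn from the training data
predictions = model.predict(X_test)                        # predict the target variable
print("accuracy:", accuracy_score(y_test, predictions))
```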

Algorithmic Enhancements in XG-Boost:

  1. Regularization: It penalizes more complex models through both LASSO (L1) and Ridge (L2) regularization to prevent overfitting.
  2. Sparsity Awareness: XG-Boost naturally handles sparse input features by automatically learning the best direction for missing values from the training loss, and it deals with different sparsity patterns in the data efficiently.
  3. Weighted Quantile Sketch: XG-Boost employs the distributed weighted Quantile Sketch algorithm to effectively find the optimal split points among weighted datasets.
  4. Cross-validation: The algorithm comes with a built-in cross-validation method at each iteration, removing the need to program this search explicitly or to specify the exact number of boosting iterations in a single run (a short sketch follows this list).
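
As a quick illustration of the built-in cross-validation mentioned in point 4 (a sketch only; the xgb.cv call is from the Python package and the parameter values are illustrative):

```python
# Sketch of XG-Boost's built-in cross-validation; values are illustrative.
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1}
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=200,        # upper bound on boosting iterations
    nfold=5,                    # 5-fold cross-validation at each iteration
    metrics="logloss",
    early_stopping_rounds=10,   # stop once the CV metric stops improving
)
print("rounds kept after early stopping:", len(cv_results))
```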

What is Gradient Boosting?

When we try to predict the target variable using any machine learning algorithm, the main causes of difference between actual and predicted values are noise, variance, and bias. Ensemble learning helps to reduce these factors (except noise, which is irreducible error). Ensemble learning is a machine learning technique that combines several base models in order to produce one optimal predictive model. It can broadly be divided into two categories: bagging and boosting.

Bagging: Bagging is a simple ensemble learning technique in which we build many independent predictors/models/learners and combine them using a model-averaging technique (e.g. weighted average, majority vote, or simple average); Random Forest is a well-known example.

Boosting: Boosting is an ensemble technique in which the predictors are not built independently (in parallel) but sequentially. Each subsequent predictor learns from the mistakes of the previous predictors. Therefore, observations have an unequal probability of appearing in subsequent models, and those with the highest error appear most often.

With a basic understanding of the logic behind boosting, let us look at XG-Boost in more detail.

XG-Boost is a sequential learning algorithm: it combines a set of weak learners to deliver a model with improved prediction accuracy. At any stage of the algorithm, the model outcomes are weighted based on the outcomes of the previous stage. Correctly predicted observations are given a lower weight and misclassified ones are weighted higher. Note that a weak learner is one that performs only slightly better than random guessing.

XG-Boost with an example: four classifiers (in four boxes, described below) try to classify + and – classes as homogeneously as possible.

  1. Box 1: The first classifier (usually a decision stump) creates a vertical line (split) at D1. It says anything to the left of D1 is + and anything to the right of D1 is –. However, this classifier misclassifies three + points. Note that a decision stump is a decision tree that splits only at one level, so its prediction is based on a single feature.
  2. Box 2: The second classifier gives more weight to the three misclassified + points (hence their bigger size) and creates a vertical line at D2. Again it says anything to the right of D2 is – and anything to the left is +. Still, it makes mistakes by incorrectly classifying three – points.
  3. Box 3: The third classifier gives more weight to the three misclassified – points and creates a horizontal line at D3. Still, this classifier fails to classify the circled points correctly.
  4. Box 4: This is a weighted combination of the weak classifiers (Boxes 1, 2 and 3). As you can see, it does a good job of classifying all the points correctly.

That is the basic idea behind boosting algorithms: the errors of weak models are used sequentially to build a new, stronger model, each round capitalizing on the misclassification error of the previous model and trying to reduce it. XG-Boost is, at its core, a tree-based algorithm in which trees are ensembled sequentially. The tree ensemble model is a set of classification and regression trees (CART). Trees are grown one after another, and attempts to reduce the misclassification rate are made in subsequent iterations.
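
To make the "trees grown one after another on the errors of the previous ones" idea concrete, here is a rough from-scratch sketch of boosting on residuals with a squared-error loss. This only illustrates the principle; XG-Boost itself adds regularization, second-order gradient information and many system-level optimizations on top of it.

```python
# Illustrative sketch of boosting regression trees on residuals (squared-error loss).
# This shows the bare principle only; it is not how XG-Boost is implemented internally.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_rounds=100, eta=0.3, max_depth=3):
    """Grow trees sequentially, each one fit to the current residuals."""
    base = y.mean()
    prediction = np.full(len(y), base)          # start from a constant prediction
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction              # errors of the current ensemble
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        prediction += eta * tree.predict(X)     # shrink each tree's contribution
        trees.append(tree)
    return base, trees

def predict_boosted(base, trees, X, eta=0.3):
    """Sum the (shrunken) contributions of all trees on top of the base value."""
    return base + eta * sum(tree.predict(X) for tree in trees)
```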

XG-Boost Parameters: The XG-Boost algorithm has three types of parameters: general parameters, booster parameters and learning task parameters. (Note: these parameters configure the algorithm itself and have nothing to do with assumptions about the data distribution.)

General parameters: These determine which booster we use for boosting, commonly a tree-based model or a linear model (either can be used for classification or regression).

Booster parameters: Booster parameters depend on which booster you have chosen.

Learning task parameters: Learning task parameters decide on the learning scenario. For example, regression tasks may use different parameters than ranking tasks.

General Parameters:

1: booster[default=gbtree]: Sets the booster type (gbtree, gblinear or dart) to use. For classification problems you can use gbtree or dart; for regression you can use any of them.

2: nthread[default=maximum cores available]: Activates parallel computation. Generally, people don’t change it as using maximum cores leads to the fastest computation.

3: silent[default=0]: With the default of 0, the console prints running messages while training; setting it to 1 suppresses them. Usually there is no need to change it.
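
As a hedged sketch, these general parameters correspond to keys in the parameter dictionary passed to the native training API (values below are illustrative; note that recent XG-Boost releases replace silent with a verbosity parameter):

```python
# Illustrative general-parameter settings for the native API.
general_params = {
    "booster": "gbtree",   # gbtree, gblinear or dart
    "nthread": 4,          # usually left at the default (all available cores)
    "silent": 0,           # 0 prints running messages, 1 suppresses them
}
```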

Booster Parameters:

As mentioned above, parameters for tree and linear boosters are different. Let’s understand each one of them:

Parameters for Tree Booster

  1. nrounds[default=100]
    • It controls the maximum number of iterations. For classification, it is similar to the number of trees to grow.
    • Should be tuned using CV
  2. eta[default=0.3][range: (0,1)]
    • It controls the learning rate, i.e., the rate at which our model learns patterns in data. After every round, it shrinks the feature weights to reach the best optimum.
    • Lower eta leads to slower computation. It must be supported by increase in nrounds.
    • Typically, it lies between 0.01 – 0.3
  3. gamma[default=0][range: (0,Inf)]
    • It specifies the minimum loss reduction required to make a further split on a leaf node, so it acts as regularization (it prevents overfitting). The optimal value of gamma depends on the data set and the other parameter values.
    • The higher the value, the stronger the regularization: splits that do not reduce the loss by at least gamma are not made. default = 0 means no such constraint.
    • Tune trick: Start with 0 and check CV error rate. If you see train error >>> test error, bring gamma into action. Higher the gamma, lower the difference in train and test CV. If you have no clue what value to use, use gamma=5 and see the performance. Remember that gamma brings improvement when you want to use shallow (low max_depth) trees.
  4. max_depth[default=6][range: (0,Inf)]
    • It controls the depth of the tree.
    • Larger the depth, more complex the model; higher chances of overfitting. There is no standard value for max_depth. Larger data sets require deep trees to learn the rules from data.
    • Should be tuned using CV
  5. min_child_weight[default=1][range:(0,Inf)]
    • It is the minimum sum of instance weights (second-order partial derivatives, i.e. the hessian) required in a child node. For squared-error regression this simply corresponds to the minimum number of instances in the node; if a candidate split would create a leaf whose weight sum is lower than min_child_weight, the split is not made.
    • In simple words, it blocks the potential feature interactions to prevent overfitting. Should be tuned using CV.
  6. subsample[default=1][range: (0,1)]
    • It controls the fraction of observations (rows) randomly sampled for each tree.
    • Typically, its values lie between (0.5-0.8)
  7. colsample_bytree[default=1][range: (0,1)]
    • It controls the fraction of features (columns) randomly sampled for each tree.
    • Typically, its values lie between (0.5,0.9)
  8. lambda[default=1]
    • It controls L2 regularization (equivalent to Ridge regression) on weights. It is used to avoid overfitting.
  9. alpha[default=0]
    • It controls L1 regularization (equivalent to Lasso regression) on weights. In addition to shrinkage, enabling alpha also results in feature selection. Hence, it’s more useful on high dimensional data sets.
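
Putting the tree-booster parameters above together, a hedged sketch with the native Python API might look like the following (the R argument nrounds corresponds to num_boost_round here; all values are illustrative starting points, not tuned settings):

```python
# Illustrative tree-booster configuration; values are starting points, not tuned.
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "booster": "gbtree",
    "objective": "binary:logistic",
    "eta": 0.1,               # learning rate
    "gamma": 0,               # minimum loss reduction required for a split
    "max_depth": 6,           # depth of each tree
    "min_child_weight": 1,    # minimum sum of instance weights in a child
    "subsample": 0.8,         # fraction of rows sampled per tree
    "colsample_bytree": 0.8,  # fraction of columns sampled per tree
    "lambda": 1,              # L2 regularization on weights
    "alpha": 0,               # L1 regularization on weights
}
booster = xgb.train(params, dtrain, num_boost_round=100)  # nrounds in the R API
```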

Parameters for Linear Booster

The linear booster has relatively fewer parameters to tune, so it also computes much faster than the gbtree booster.

  1. nrounds[default=100]
    • It controls the maximum number of iterations (steps) required for gradient descent to converge.
    • Should be tuned using CV
  2. lambda[default=0]
    • It enables Ridge Regression. Same as above
  3. alpha[default=0]
    • It enables Lasso Regression. Same as above
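
Switching to the linear booster is, as a sketch, just a change of the booster key plus its few parameters (values are illustrative):

```python
# Illustrative linear-booster configuration; only a handful of parameters apply.
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "booster": "gblinear",
    "objective": "binary:logistic",
    "lambda": 0,   # L2 (ridge) regularization
    "alpha": 0,    # L1 (lasso) regularization
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```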

Learning Task Parameters

These parameters specify methods for the loss function and model evaluation. In addition to the parameters listed below, you are free to use a customized objective / evaluation function.

  1. objective [default=reg:linear]
    • reg:linear – for linear regression
    • binary:logistic – logistic regression for binary classification; it returns predicted probabilities
    • multi:softmax – multiclass classification using the softmax objective; it returns predicted class labels and requires setting the num_class parameter (the number of unique classes)
    • multi:softprob – multiclass classification using the softmax objective; it returns predicted class probabilities
  2. eval_metric [no default, depends on objective selected]
    • These metrics are used to evaluate a model’s accuracy on validation data. For regression, default metric is RMSE. For classification, default metric is error.
    • Available error functions are as follows:
      • mae – mean absolute error (used in regression)
      • logloss – negative log-likelihood (used in classification)
      • auc – area under the ROC curve (used in classification)
      • rmse – root mean square error (used in regression)
      • error – binary classification error rate [#wrong cases/#all cases]
      • mlogloss – multiclass logloss (used in classification)
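
As a sketch of how the learning task parameters fit together, a multiclass setup might look like this (dataset and values are illustrative):

```python
# Illustrative multiclass setup: objective, num_class and eval_metric together.
import xgboost as xgb
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "multi:softprob",  # returns class probabilities
    "num_class": 3,                 # required for the multiclass objectives
    "eval_metric": "mlogloss",      # multiclass logloss on the evaluation set
    "max_depth": 4,
    "eta": 0.1,
}
booster = xgb.train(params, dtrain, num_boost_round=50,
                    evals=[(dtrain, "train")], verbose_eval=10)
```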

Conclusion: Machine learning is a very active research area, and there are already several viable alternatives to XG-Boost. Microsoft Research has released the LightGBM framework for gradient boosting, which shows great potential, and CatBoost, developed by Yandex, has been delivering impressive benchmarking results. It is only a matter of time before another framework beats XG-Boost in terms of prediction performance, flexibility, explainability, and pragmatism. Until a strong challenger comes along, however, XG-Boost will continue to reign over the machine learning world!

 
