Tree Ensembles

Instead of picking a single model, an ensemble method combines multiple models to fit the training data. There are two primary approaches: "bagging" and "boosting." In bagging, we train each model on its own subset of the training data, drawn by picking n random samples out of the N training records with replacement (a bootstrap sample). After multiple models are trained, we use a voting scheme to combine their predictions on future data.
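To make the bagging idea concrete, here is a small hand-rolled sketch (illustrative only, assuming the rpart package is installed; the randomForest code below does this and more): each tree is trained on a bootstrap sample, and predictions are combined by majority vote.

    library(rpart)

    # Train B trees, each on a bootstrap sample of the training data
    bag_trees <- function(data, formula, B = 25) {
      lapply(seq_len(B), function(b) {
        idx <- sample(nrow(data), nrow(data), replace = TRUE)  # n samples with replacement
        rpart(formula, data = data[idx, ], method = "class")
      })
    }

    # Predict by majority vote across the trees
    bag_predict <- function(trees, newdata) {
      votes <- sapply(trees, function(t) as.character(predict(t, newdata, type = "class")))
      apply(votes, 1, function(v) names(which.max(table(v))))
    }

    # Hypothetical usage on the built-in iris data
    trees <- bag_trees(iris, Species ~ ., B = 25)
    table(bag_predict(trees, iris), iris$Species)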

Random Forest is one of the most popular bagging models. In addition to training each tree on a bootstrap sample of n records drawn from the N training records, at each decision node it randomly selects m of the M input features (typically m ≈ √M) and considers only those as split candidates. Each tree is grown from its sample, and the trees in the forest then vote on the final prediction. Here is the R code to use Random Forest:

    > library(randomForest)
    # Train 500 trees; a random subset of attributes is tried at each split
    > model <- randomForest(Species~., data=iristrain, ntree=500)
    # Predict using the forest
    > prediction <- predict(model, newdata=iristest, type='class')
    > table(prediction, iristest$Species)
    > importance(model)
    MeanDecreaseGini
    Sepal.Length         7.807602
    Sepal.Width          1.677239
    Petal.Length        31.145822
    Petal.Width         38.617223
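
The number of features tried at each split is controlled by randomForest's mtry argument, which defaults to roughly √M for classification; it can also be set explicitly, for example (mtry=2 is an arbitrary value chosen for illustration):

    > model <- randomForest(Species~., data=iristrain, ntree=500, mtry=2)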

"Boosting" is another approach in Ensemble Method. Instead of sampling the input features, it samples the training data records. It puts more emphasis, though, on the training data that is wrongly predicted in previous iterations. Initially, each training data is equally weighted. At each iteration, the data that is wrongly classified will have its weight increased.

The Gradient Boosting Method is one of the most popular boosting methods. It builds the model incrementally, at each step adding a function that fits the residuals of the current model (more precisely, the negative gradient of the loss).

Set i = 0 at the beginning and repeat until convergence (a minimal sketch of this loop follows the list):

  • Learn a function Fi(X) to predict Y: find F that minimizes the expected loss E[L(Y, F(X))], where L is the loss function of the residual.
  • Learn another function gi(X) that predicts the negative gradient of the loss with respect to Fi(X); for squared-error loss this is simply the residual Y - Fi(X).
  • Update Fi+1(X) = Fi(X) + a·gi(X), where a is the learning rate.
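
A minimal R sketch of this loop for squared-error loss is shown below (illustrative only, assuming the rpart package; this is not the gbm implementation used afterward). With L(Y, F) = (Y - F)^2 / 2, the negative gradient is just the residual Y - F, so each gi is a small regression tree fit to the current residuals.

    library(rpart)

    gb_sketch <- function(X, y, n_trees = 100, rate = 0.1) {
      Fi <- rep(mean(y), length(y))             # F0: start from the mean of y
      trees <- list()
      for (i in seq_len(n_trees)) {
        resid <- y - Fi                         # negative gradient of the squared-error loss
        d <- data.frame(X, resid = resid)
        g <- rpart(resid ~ ., data = d,         # gi: small regression tree on the residuals
                   control = rpart.control(maxdepth = 2, cp = 0))
        Fi <- Fi + rate * predict(g, d)         # F(i+1) = F(i) + a * gi(X)
        trees[[i]] <- g
      }
      list(init = mean(y), rate = rate, trees = trees)
    }

    # Hypothetical usage: boost a numeric target from the iris data
    fit <- gb_sketch(iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")],
                     iris$Sepal.Length, n_trees = 50, rate = 0.1)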

Below is a gradient-boosted tree model that uses decision trees as the base learner F. Here is the sample code in R, using the gbm package:

> library(gbm)
    > iris2 <- iris
    > newcol = data.frame(isVersicolor=(iris2$Species=='versicolor'))
    > iris2 <- cbind(iris2, newcol)
    > iris2[45:55,]
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species isVersicolor
    45          5.1         3.8          1.9         0.4     setosa        FALSE
    46          4.8         3.0          1.4         0.3     setosa        FALSE
    47          5.1         3.8          1.6         0.2     setosa        FALSE
    48          4.6         3.2          1.4         0.2     setosa        FALSE
    49          5.3         3.7          1.5         0.2     setosa        FALSE
    50          5.0         3.3          1.4         0.2     setosa        FALSE
    51          7.0         3.2          4.7         1.4 versicolor         TRUE
    52          6.4         3.2          4.5         1.5 versicolor         TRUE
    53          6.9         3.1          4.9         1.5 versicolor         TRUE
    54          5.5         2.3          4.0         1.3 versicolor         TRUE
    55          6.5         2.8          4.6         1.5 versicolor         TRUE
    > formula <- isVersicolor ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
    > model <- gbm(formula, data=iris2, n.trees=1000, interaction.depth=2, distribution="bernoulli")
    Iter   TrainDeviance   ValidDeviance   StepSize   Improve
    1        1.2714         -1.#IND     0.0010    0.0008
    2        1.2705         -1.#IND     0.0010    0.0004
    3        1.2688         -1.#IND     0.0010    0.0007
    4        1.2671         -1.#IND     0.0010    0.0008
    5        1.2655         -1.#IND     0.0010    0.0008
    6        1.2639         -1.#IND     0.0010    0.0007
    7        1.2621         -1.#IND     0.0010    0.0008
    8        1.2614         -1.#IND     0.0010    0.0003
    9        1.2597         -1.#IND     0.0010    0.0008
    10        1.2580         -1.#IND     0.0010    0.0008
    100        1.1295         -1.#IND     0.0010    0.0008
    200        1.0090         -1.#IND     0.0010    0.0005
    300        0.9089         -1.#IND     0.0010    0.0005
    400        0.8241         -1.#IND     0.0010    0.0004
    500        0.7513         -1.#IND     0.0010    0.0004
    600        0.6853         -1.#IND     0.0010    0.0003
    700        0.6266         -1.#IND     0.0010    0.0003
    800        0.5755         -1.#IND     0.0010    0.0002
    900        0.5302         -1.#IND     0.0010    0.0002
    1000        0.4901         -1.#IND     0.0010    0.0002

 
    > prediction <- predict.gbm(model, iris2[45:55,], type="response", n.trees=1000)
    > round(prediction, 3)
    [1] 0.127 0.131 0.127 0.127 0.127 0.127 0.687 0.688 0.572 0.734 0.722
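    # The Bernoulli model returns the probability that isVersicolor is TRUE;
    # a 0.5 cut-off (an assumed threshold, not part of the original run) turns
    # the probabilities into class predictions.
    > table(prediction > 0.5, iris2[45:55,]$isVersicolor)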
    > summary(model)
    var       rel.inf
    1 Petal.Length 61.4203761582
    2  Petal.Width 34.7557511871
    3  Sepal.Width  3.5407662531
    4 Sepal.Length  0.2831064016

The gbm package also reports the relative influence of the input features, shown in the table above and drawn as a bar graph by summary(model).

[Figure: relative influence of the input features, as plotted by the gbm R package]