# Ensembles

Model ensembling is a powerful technique for increasing accuracy on a wide variety of ML tasks. It also often reduces the generalization error, which is useful for further inference.

Most popular ML libraries and platforms don’t provide ensembles out of the box; instead, they assume the user will assemble them. Building optimal ensembles, however, is typically beyond the expertise of most developers and business analysts.

AugerML provides several ensemble techniques and state-of-the-art algorithms to boost the performance of the final model.

### Averaging

This is one of the classic approaches to model ensembling: it averages the predictions of selected pipelines (here model == pipeline == estimator). The selection procedure is based on a meta-score: it computes the decorrelation between the algorithms’ predictions and combines it with each pipeline’s performance metric to rank the candidates and select the best ones. Moreover, all ensemble algorithms use a pre-selection routine that does more than pull the top models from the leaderboard (which could yield, say, 100 random forests if they all outperform the other estimators). Instead, it selects the top pipelines from each group, where models are grouped by their topology properties.
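As a minimal sketch (not AugerML’s API — the function name and array shapes are assumptions), the averaging step itself is straightforward: average the class-probability outputs of the selected pipelines and take the argmax.

```python
import numpy as np

# Minimal illustration of prediction averaging: combine the probability
# outputs of already-fitted pipelines with equal weight.
def average_predictions(prob_list):
    """prob_list: list of (n_samples, n_classes) probability arrays."""
    stacked = np.stack(prob_list)   # (n_models, n_samples, n_classes)
    avg = stacked.mean(axis=0)      # simple unweighted average
    return avg.argmax(axis=1)       # hard labels from averaged probabilities

p1 = np.array([[0.9, 0.1], [0.4, 0.6]])
p2 = np.array([[0.6, 0.4], [0.2, 0.8]])
print(average_predictions([p1, p2]))  # -> [0 1]
```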

**Pros**:

- Simple
- Fast
- Powerful enough
- Averaging predictions often reduces overfit

**Cons**:

- Shallow analysis
- Not super powerful

### Voting

This algorithm is similar to the `averaging` one, but each incorporated pipeline contributes with a weight. In the `averaging` case all model predictions are summed up with the same weight, whereas `voting` computes a weighted sum. For classification tasks it tends to operate on probabilities (i.e., soft labels).

For hard labels ({0, 1, ..., 5}), it relies on the majority-vote rule. For example, take 3 independent binary classifiers (A, B, C), each with 70% accuracy (each outputs the correct label 70% of the time and the wrong one 30% of the time). Majority voting gives us about 78% accuracy:

All three are correct: 0.7 * 0.7 * 0.7 = 0.343

Two are correct: 0.7 * 0.7 * 0.3 + 0.7 * 0.3 * 0.7 + 0.3 * 0.7 * 0.7 = 0.441

Two are wrong: 0.3 * 0.3 * 0.7 + 0.3 * 0.7 * 0.3 + 0.7 * 0.3 * 0.3 = 0.189

All three are wrong: 0.3 * 0.3 * 0.3 = 0.027

In total this gives ~78% (0.343 + 0.441 = 0.784). A voting ensemble of 5 independent classifiers with 70% accuracy would be correct ~84% of the time: in about 67% of the votes, one or two individual errors are corrected by the majority.
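The arithmetic above generalizes to any odd number of voters; a short check (plain Python, not AugerML code) reproduces both figures:

```python
from math import comb

# Probability that a majority of n independent classifiers, each correct
# with probability p, votes for the right label.
def majority_accuracy(n, p):
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

print(round(majority_accuracy(3, 0.7), 3))  # -> 0.784
print(round(majority_accuracy(5, 0.7), 3))  # -> 0.837
```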

**Pros**:

- Simple
- Fast
- Powerful enough
- Can correct the predictions of weak models

**Cons**:

- Shallow analysis
- Not super powerful

### Greedy Selection

The main idea of this method is to select models greedily: starting with the best pipeline, another one is added only if the resulting combination performs better than the previous best model (or combination). This method tries to find combinations that bring additional benefit. Moreover, the algorithm bags such combinations: it initializes each bag by randomly selecting pipelines from a span near the top of the provided set, performs a forward greedy search within each bag, and combines the bags into the final (bagged) model.
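A minimal sketch of the forward greedy step on hold-out predictions (illustrative only; the scoring function is a simplification and the bagging step is omitted):

```python
import numpy as np

# Caruana-style forward greedy selection: repeatedly add the model whose
# inclusion most improves the averaged prediction on a hold-out set; stop
# when no candidate improves the ensemble.
def greedy_select(preds, y_true, n_steps=10):
    """preds: list of (n_samples,) hold-out predictions; returns chosen indices."""
    chosen, best_score = [], -np.inf
    for _ in range(n_steps):
        step_best, step_score = None, best_score
        for i, p in enumerate(preds):
            candidate = np.mean([preds[j] for j in chosen] + [p], axis=0)
            score = -np.mean((candidate - y_true) ** 2)  # negative MSE
            if score > step_score:
                step_best, step_score = i, score
        if step_best is None:  # no candidate improves the ensemble
            break
        chosen.append(step_best)
        best_score = step_score
    return chosen

y = np.array([0.0, 1.0, 1.0, 0.0])
good = np.array([0.0, 1.0, 1.0, 0.0])  # accurate pipeline
bad = np.array([1.0, 0.0, 0.0, 1.0])   # inaccurate pipeline
print(greedy_select([good, bad], y))    # -> [0]
```

Note that models may be added with replacement, which is what lets a strong model receive a larger effective weight in the final average.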

**Paper**: *R. Caruana et. al. Getting the Most Out of Ensemble Selection. ICDM 2006.*

**Pros**:

- Simple enough
- Powerful
- Fast enough

**Cons**:

- Performance mostly depends on the number of provided pipelines

### Super Learner

This ensemble technique relies heavily on cross-validation predictions (using “soft” labels for classification) to form what is called the “level-one” data, on which the meta-learner is trained. This data looks like a prediction matrix whose rows represent the samples and whose columns represent the selected models. For a classification task it becomes a cube, with one more dimension equal to the number of classes. L-BFGS, SLSQP, and NNLS (for regression) are used as meta-learners.

Ensemble construction algorithm:

Define inputs:

- Specify a set of N base models (from the leaderboard).
- Specify a meta-learning algorithm (L-BFGS, SLSQP, NNLS).

Construction (selection) stage: k-fold cross-validation produces, for each of the N selected models, a predicted value for every sample in the provided data (say S samples). This yields an S x N matrix, or an S x N x C cube for classification (C being the number of classes).

- Train the meta-learning algorithm on this matrix (cube), subject to the optimization constraints, to get a weight for each provided model.
- Eliminate models whose weights fall below a specified threshold.

Prediction: perform a weighted prediction with the obtained weights.
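For regression, the NNLS meta-learning step can be sketched as follows (a minimal illustration using SciPy’s `nnls` solver; the threshold value and function names are assumptions, not AugerML’s implementation):

```python
import numpy as np
from scipy.optimize import nnls

# Z is the "level-one" matrix of cross-validated predictions
# (n_samples x n_models); fit non-negative weights, drop weak models.
def fit_super_learner(Z, y, threshold=0.01):
    w, _ = nnls(Z, y)          # non-negative least-squares weights
    w[w < threshold] = 0.0     # eliminate models below the threshold
    if w.sum() > 0:
        w /= w.sum()           # normalize the surviving weights
    return w

rng = np.random.default_rng(0)
y = rng.normal(size=50)
Z = np.column_stack([y + 0.1 * rng.normal(size=50),  # accurate model
                     rng.normal(size=50)])           # pure-noise model
w = fit_super_learner(Z, y)    # weight concentrates on the accurate model
y_pred = Z @ w                 # weighted prediction at inference time
```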

**Paper**: *M. van der Laan et. al. Super Learner. 2007.*

**Pros**:

- Simple enough
- Fast enough
- Powerful

**Cons**:

- Shallow optimization

### Deep Super Learner

This is the most modern of these ensembling techniques, and it shows promising results (both in the paper and in our internal experiments). It works like N-layer stacking: each layer is a Super Learner whose averaged predictions (after the optimization step) are fed to the next layer as additional features. Layers are constructed on top of each other until the loss reaches a predefined threshold (a form of early stopping) or the maximum number of iterations is exceeded.
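The layering loop can be sketched as follows (a hypothetical outline, not AugerML’s implementation; `fit_layer` stands in for a full Super Learner step):

```python
import numpy as np

# Each layer refits a super-learner on the original features augmented with
# the previous layer's averaged predictions, stopping early when the loss
# no longer improves or max_layers is reached.
def deep_super_learner(X, y, fit_layer, max_layers=5, tol=1e-4):
    feats, prev_loss, preds = X, np.inf, None
    for _ in range(max_layers):
        preds, loss = fit_layer(feats, y)     # one super-learner layer
        if prev_loss - loss < tol:            # early stopping on the loss
            break
        prev_loss = loss
        feats = np.column_stack([X, preds])   # feed predictions forward
    return preds

# Stand-in "layer": an ordinary least-squares fit instead of a real
# super-learner, just to exercise the loop.
def fit_layer(F, t):
    w, *_ = np.linalg.lstsq(F, t, rcond=None)
    p = F @ w
    return p, float(np.mean((p - t) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, 2.0, 3.0])
preds = deep_super_learner(X, y, fit_layer)
```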

**Paper**: *S.Young et. al. Deep Super Learner: A Deep Ensemble for Classification Problems. 2018.*

**Pros**:

- Super Powerful

**Cons**:

- Slow