Synopsis
Boosting operator based on Bayes' theorem.
Description
This operator trains an ensemble of classifiers for boolean target attributes. In each iteration the training set is reweighted, so that previously discovered patterns and other kinds of prior knowledge are "sampled out" {@rapidminer.cite Scholz/2005b}. An inner classifier, typically a rule or decision tree induction algorithm, is applied sequentially several times, and the resulting models are combined into a single global model. The maximum number of models to be trained is specified by the parameter iterations.
If the parameter rescale_label_priors is set, then the example set is reweighted so that all classes are equally probable (or frequent). For two-class problems this turns the problem of fitting models to maximize weighted relative accuracy into the more common task of classifier induction {@rapidminer.cite Scholz/2005a}. Applying a rule induction algorithm as an inner learner makes it possible to do subgroup discovery. This option is also recommended for data sets with class skew if a "very weak learner" like a decision stump is used. If rescale_label_priors is not set, then the operator performs boosting based on probability estimates.
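The effect of rescaling the label priors can be illustrated with a minimal Python sketch. It is not the operator's code: the function name rescale_label_priors and the list-based data layout are assumptions made for illustration. It only shows the stated goal, namely that every class carries the same total weight after reweighting.

```python
from collections import defaultdict


def rescale_label_priors(labels, weights):
    """Rescale example weights so that each class has the same total weight.

    Hypothetical helper for illustration; not the operator's implementation.
    """
    # Sum the current weight of each class.
    class_totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        class_totals[label] += weight

    # Each class should receive an equal share of the overall weight.
    target = sum(class_totals.values()) / len(class_totals)

    # Scale every example by the factor that brings its class to the target.
    return [
        weight * target / class_totals[label]
        for label, weight in zip(labels, weights)
    ]


# A skewed two-class example: three positives, one negative.
labels = [True, True, True, False]
weights = [1.0, 1.0, 1.0, 1.0]
new_weights = rescale_label_priors(labels, weights)

positive_total = sum(w for y, w in zip(labels, new_weights) if y)
negative_total = sum(w for y, w in zip(labels, new_weights) if not y)
```

In this example both classes end up with the same total weight, while the overall weight of the example set is unchanged.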
The estimates used by this operator may either be computed using the same set as for training, or in each iteration the training set may be split randomly, so that a model is fitted based on the first subset, and the probabilities are estimated based on the second. The first solution may be advantageous in situations where data is scarce. Set the parameter ratio_internal_bootstrap to 1 to use the same set for training as for estimation. Set this parameter to a value lower than 1 to use the specified fraction of the data for training, and the remaining examples for probability estimation.
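The two modes described above can be sketched in Python as follows. This is an illustration under assumptions, not the operator's code: the function name split_for_estimation, its seed argument, and the rounding of the split point are all hypothetical.

```python
import random


def split_for_estimation(examples, ratio, seed=None):
    """Return (training_part, estimation_part) for one boosting iteration.

    Hypothetical helper: with ratio == 1 the same examples serve both
    purposes; with ratio < 1 the examples are split at random.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in the interval (0, 1]")

    examples = list(examples)
    if ratio == 1.0:
        # No separate estimation set: train and estimate on the same data.
        return examples, list(examples)

    # Random split: the first part trains the model, the rest is used
    # to estimate probabilities.
    rng = random.Random(seed)
    rng.shuffle(examples)
    cut = int(round(ratio * len(examples)))
    return examples[:cut], examples[cut:]


train, estimate = split_for_estimation(range(10), 0.7, seed=1)
same_train, same_estimate = split_for_estimation(range(10), 1.0)
```

A fresh split would be drawn in each iteration, so a fixed seed is only useful for making a single run reproducible.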
If the parameter allow_marginal_skews is not set, then the support of each subset defined in terms of common base model predictions does not change from one iteration to the next. Analogously, the class priors do not change. This is the procedure originally described in {@rapidminer.cite Scholz/2005b} in the context of subgroup discovery.
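One reweighting rule that has both invariants is to divide each example's weight by the lift of its (prediction, label) combination. The Python sketch below illustrates that rule only; whether the operator computes its weights in exactly this way is an assumption, and the function name reweight_by_inverse_lift is hypothetical.

```python
from collections import defaultdict


def reweight_by_inverse_lift(predictions, labels, weights):
    """Divide each weight by lift = P(pred, label) / (P(pred) * P(label)).

    Illustrative sketch: the total weight of every prediction subset and
    of every class stays the same, while prediction and label become
    independent under the new weights.
    """
    total = sum(weights)
    by_pred = defaultdict(float)
    by_label = defaultdict(float)
    by_joint = defaultdict(float)
    for p, y, w in zip(predictions, labels, weights):
        by_pred[p] += w
        by_label[y] += w
        by_joint[(p, y)] += w

    new_weights = []
    for p, y, w in zip(predictions, labels, weights):
        joint = by_joint[(p, y)]
        if joint == 0.0:
            # Only possible if the example itself has zero weight.
            new_weights.append(0.0)
            continue
        lift = joint * total / (by_pred[p] * by_label[y])
        new_weights.append(w / lift)
    return new_weights


predictions = [True, True, True, False, False, False]
labels = [True, True, False, True, False, False]
weights = [1.0] * 6
new_weights = reweight_by_inverse_lift(predictions, labels, weights)

# Totals per prediction subset and per class after reweighting.
pred_true_total = sum(w for p, w in zip(predictions, new_weights) if p)
label_true_total = sum(w for y, w in zip(labels, new_weights) if y)
```

In this example, correctly predicted examples are down-weighted and misclassified ones are up-weighted, yet the subset predicted as true and the positive class each keep their original total weight of 3.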
Setting the allow_marginal_skews option to true leads to a procedure that changes the marginal weights/probabilities of subsets, if this is beneficial in a boosting context, and stratifies the two classes to be equally likely. As for AdaBoost, the total weight upper-bounds the training error in this case. This bound is reduced more quickly by the BayesianBoosting operator, however.
In sum, to reproduce the sequential sampling, or knowledge-based sampling, from {@rapidminer.cite Scholz/2005b} for subgroup discovery, two of the default parameter settings of this operator have to be changed: rescale_label_priors must be set to true, and allow_marginal_skews must be set to false. In addition, a boolean (binomial) label has to be used.
The operator requires an example set as its input. To sample out prior knowledge of a different form, it is possible to provide another model as an optional additional input. The predictions of this model are used to produce an initial weighting of the training set. The output of the operator is a classification model applicable for estimating conditional class probabilities or for plain crisp classification. It contains up to the specified number of inner base models. If an optional initial model is provided, it is also stored in the output model, so that the same initial weighting is produced during model application.
Input
- training set: expects: ExampleSet
- model: optional: PredictionModel
Output
- model:
- example set:
Parameters
- use subset for training: Fraction of examples used for training; the remaining ones are used to estimate the confusion matrix. Set to 1 to turn off the test set.
- iterations: The maximum number of iterations.
- rescale label priors: Specifies whether the proportion of labels should be equal by construction after the first iteration.
- allow marginal skews: Allows the marginal distribution P(x) to be skewed during learning.
- use local random seed: Indicates if a local random seed should be used.
- local random seed: Specifies the local random seed.