Synopsis
X-Validation encapsulates a cross-validation in order to estimate the performance of a learning operator.
Description
X-Validation performs a cross-validation process. The input ExampleSet S is split into as many subsets S_i as given by the number of validations parameter. The inner subprocesses are then applied once per subset, each time using S_i as the test set (input of the Testing subprocess) and S \ S_i as the training set (input of the Training subprocess).
The Training subprocess must return a model, which is usually trained on the input ExampleSet. The Testing subprocess must return a PerformanceVector, which is usually generated by applying the model and measuring its performance. Additional objects can be passed from the Training to the Testing subprocess using the through ports. Please note that the performance calculated by this scheme is only an estimate, not an exact calculation, of the performance that would be achieved with the model built on the complete delivered data set. That model, the one built on the complete input data, is delivered at the corresponding port for convenient access.
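The scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RapidMiner's implementation: the function name `cross_validate` and the callables `train` and `test` are hypothetical stand-ins for the Training and Testing subprocesses, the performance is reduced to a single number, and the subsets are built by a simple consecutive split.

```python
from statistics import mean


def cross_validate(examples, k, train, test):
    """Estimate performance with k-fold cross-validation.

    examples: list of examples (plays the role of the ExampleSet S)
    k:        number of validations
    train:    hypothetical callable, train(training_examples) -> model
    test:     hypothetical callable, test(model, test_examples) -> float
    """
    n = len(examples)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of examples")

    # Consecutive split: subset i covers indices bounds[i] .. bounds[i + 1] - 1.
    bounds = [i * n // k for i in range(k + 1)]

    performances = []
    for i in range(k):
        test_set = examples[bounds[i]:bounds[i + 1]]                      # S_i
        training_set = examples[:bounds[i]] + examples[bounds[i + 1]:]    # S \ S_i
        model = train(training_set)
        performances.append(test(model, test_set))

    # The estimate is the average over all iterations; the delivered model
    # is built on the complete data, not on any single training subset.
    complete_model = train(examples)
    return mean(performances), complete_model


if __name__ == "__main__":
    # Toy "learner": predict the mean of the training values,
    # measured by mean absolute error on the test subset.
    def train_mean(values):
        return mean(values)

    def mean_abs_error(model, values):
        return mean(abs(v - model) for v in values)

    estimate, model = cross_validate([1, 2, 3, 4, 5, 6], 3, train_mean, mean_abs_error)
    print(estimate, model)
```

Note that the returned estimate comes from models trained on only part of the data, while the returned model is trained on all of it, which is why the estimate is not an exact measurement of that model.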
Like other validation schemes, the RapidMiner cross-validation can use several types of sampling for building the subsets.
Linear sampling simply divides the example set into partitions without changing the order of the examples. Shuffled sampling builds random subsets from the data. Stratified sampling builds random subsets and ensures that the class distribution in the subsets is the same as in the whole example set. To make the random splits independent from the rest of the process, a local random seed can be used. See the parameters for details.
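The three sampling types can be illustrated with a short Python sketch. The function `split_indices` and its arguments are hypothetical and chosen for illustration only; in particular, the round-robin dealing used here for the stratified case is one simple way to keep class proportions approximately equal and is an assumption, not necessarily how RapidMiner builds its subsets.

```python
import random
from collections import defaultdict


def split_indices(labels, k, sampling_type="stratified", seed=None):
    """Return k lists of example indices, one list per subset."""
    n = len(labels)
    # A local generator keeps the split independent of any global random state,
    # similar in spirit to a local random seed.
    rng = random.Random(seed)

    if sampling_type == "linear":
        # Consecutive partitions, original order preserved.
        return [list(range(i * n // k, (i + 1) * n // k)) for i in range(k)]

    if sampling_type == "shuffled":
        order = list(range(n))
        rng.shuffle(order)
        return [order[i::k] for i in range(k)]

    if sampling_type == "stratified":
        # Group indices by class, shuffle within each class, then deal the
        # class-grouped sequence round-robin so each subset receives a
        # near-equal share of every class.
        by_class = defaultdict(list)
        for index, label in enumerate(labels):
            by_class[label].append(index)
        order = []
        for indices in by_class.values():
            rng.shuffle(indices)
            order.extend(indices)
        return [order[i::k] for i in range(k)]

    raise ValueError(f"unknown sampling type: {sampling_type!r}")


if __name__ == "__main__":
    labels = ["a"] * 6 + ["b"] * 3
    for kind in ("linear", "shuffled", "stratified"):
        print(kind, split_indices(labels, 3, kind, seed=1))
```

With the same seed, the shuffled and stratified splits are reproducible from run to run.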
The cross-validation operator provides several values which can be logged by means of a Log operator. The number of the current iteration can be logged, which might be useful for ProcessLog operators wrapped inside a cross-validation. Besides that, all performance estimation operators of RapidMiner provide access to the average values calculated during the estimation. Since the operator cannot know the names of the delivered criteria in advance, the ProcessLog operator can access the values via the generic value names:
- performance: the value for the main criterion calculated by this validation operator
- performance1: the value of the first criterion of the performance vector calculated
- performance2: the value of the second criterion of the performance vector calculated
- performance3: the value of the third criterion of the performance vector calculated
- for the main criterion, the variance and the standard deviation can also be accessed where applicable.
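As a rough illustration of how the generic value names relate to the per-iteration results, the following Python sketch averages a list of performance vectors. It rests on several assumptions that are not taken from the operator itself: the function `summarize` is hypothetical, the main criterion is assumed to be the first entry of each vector, the dictionary keys `variance` and `standard_deviation` are illustrative names, and the population variance is used because the source does not say which variant the operator computes.

```python
from statistics import mean, pstdev, pvariance


def summarize(fold_vectors):
    """Average per-iteration performance vectors under generic names.

    fold_vectors: one list of criterion values per iteration; the main
    criterion is assumed (for this sketch only) to come first.
    """
    # Average each criterion over all iterations (column-wise mean).
    averages = [mean(column) for column in zip(*fold_vectors)]

    values = {"performance": averages[0]}
    for position, average in enumerate(averages, start=1):
        values[f"performance{position}"] = average

    # Spread of the main criterion across iterations (population variant,
    # chosen as an assumption).
    main = [vector[0] for vector in fold_vectors]
    values["variance"] = pvariance(main)
    values["standard_deviation"] = pstdev(main)
    return values


if __name__ == "__main__":
    print(summarize([[1.0, 0.5], [2.0, 0.5], [3.0, 0.5]]))
```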
Input
- training: expects: ExampleSet
Output
- model:
- training:
- averagable 1:
- averagable 2:
Parameters
- create complete model: Indicates if a model of the complete data set should additionally be built after estimation.
- average performances only: Indicates if only performance vectors should be averaged or all types of averagable result vectors.
- leave one out: Sets the number of validations to the number of examples. If set to true, number_of_validations is ignored.
- number of validations: Number of subsets for the cross-validation.
- sampling type: Defines the sampling type of the cross-validation (linear = consecutive subsets, shuffled = random subsets, stratified = random subsets with class distribution kept constant).
- use local random seed: Indicates if a local random seed should be used.
- local random seed: Specifies the local random seed.
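The interaction between leave one out and number of validations can be written down as a small Python sketch. The function name and the range check are assumptions made for illustration; the source only states that leave one out sets the number of validations to the number of examples and that number_of_validations is then ignored.

```python
def effective_number_of_validations(number_of_examples,
                                    number_of_validations,
                                    leave_one_out=False):
    """Return how many subsets the cross-validation would use."""
    if leave_one_out:
        # Every example forms its own test set; the configured
        # number of validations is ignored.
        return number_of_examples
    # Range check is an assumption of this sketch, not documented behavior.
    if not 2 <= number_of_validations <= number_of_examples:
        raise ValueError("number of validations must be between 2 "
                         "and the number of examples")
    return number_of_validations


if __name__ == "__main__":
    print(effective_number_of_validations(150, 10))
    print(effective_number_of_validations(150, 10, leave_one_out=True))
```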