Synopsis
This operator uses the distance between an example's label value and the result of a local polynomial regression to determine the weight of this example.
Description
This operator weights the examples, so the resulting ExampleSet contains a new weight attribute. If the input ExampleSet already includes a weight attribute, its values are used as initial values for this algorithm. Otherwise, each example is assigned an initial weight of 1.
To calculate the weights, this operator performs a local polynomial regression for each example. For more information about local polynomial regression, see the description of the Local Polynomial Regression operator.
After the predicted result has been calculated, the residuals are computed and rescaled using their median.
The rescaled residual is then transformed by a smooth function that cuts off values greater than a threshold. This means that examples without prediction error receive a weight of 1, while examples with an error greater than the threshold are down-weighted to 0.
This procedure is iterated as often as specified by the user and results in weights that penalize outliers heavily. This is especially useful for algorithms that use least squares optimization, such as Linear Regression, Polynomial Regression or Local Polynomial Regression, since least squares is very sensitive to outliers.
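The iteration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the operator's actual implementation: it handles a single numerical attribute, uses a plain k-nearest neighborhood without a smoothing kernel, and assumes a bisquare function with a threshold of 6 rescaled residuals as the smooth cut-off (the description names neither the function nor the threshold value). The names bisquare, local_poly_predict and robust_weights are hypothetical.

```python
import numpy as np

def bisquare(u):
    """Assumed smooth cut-off: 1 at u = 0, falling to 0 for |u| >= 1."""
    u = np.abs(u)
    return np.where(u < 1.0, (1.0 - u**2) ** 2, 0.0)

def local_poly_predict(x, y, w, x0, k, degree, ridge):
    """Weighted ridge polynomial fit on the k nearest neighbors of x0."""
    idx = np.argsort(np.abs(x - x0))[:k]
    # Center on x0 so the prediction at x0 is the intercept coefficient.
    A = np.vander(x[idx] - x0, degree + 1, increasing=True)
    sw = w[idx]
    # Simplification: the ridge term penalizes all coefficients alike.
    lhs = A.T @ (A * sw[:, None]) + ridge * np.eye(degree + 1)
    rhs = A.T @ (sw * y[idx])
    return np.linalg.solve(lhs, rhs)[0]

def robust_weights(x, y, iterations=3, k=10, degree=1, ridge=1e-9,
                   threshold=6.0, init=None):
    """Iteratively derive example weights from local regression residuals."""
    # Existing weights serve as initial values; otherwise start with 1.
    w = np.ones(len(y)) if init is None else np.asarray(init, dtype=float)
    for _ in range(iterations):
        pred = np.array([local_poly_predict(x, y, w, x0, k, degree, ridge)
                         for x0 in x])
        resid = np.abs(y - pred)
        scale = np.median(resid)  # rescale the residuals by their median
        if scale == 0.0:          # perfect fit for most examples: stop early
            break
        w = bisquare(resid / (threshold * scale))
    return w
```

On a noisy straight line with one grossly shifted label, a sketch like this drives the weight of the shifted example to 0 while most other examples keep weights close to 1.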
Input
- example set: expects: ExampleSet
Output
- example set:
Parameters
- degree: Specifies the degree of the locally fitted polynomial. Keep in mind that a degree higher than 2 greatly increases calculation time and is likely to overfit.
- ridge factor: Specifies the ridge factor. This factor is used to penalize high coefficients. Increasing it can help to avoid overfitting.
- iterations: The number of iterations performed for weight calculation. See operator description for details.
- numerical measure: Select measure
- neighborhood type: Determines which type of neighborhood is used: a fixed number of neighbors, all neighbors within a distance, or a mix of both.
- k: Specifies the number of neighbors in the neighborhood. Regardless of the local density, exactly this many samples are returned.
- fixed distance: Specifies the size of the neighborhood. All points within this distance are added.
- relative size: Specifies the size of the neighborhood relative to the total number of examples. A value of 0.04 would include 4% of the data points in the neighborhood.
- distance: Specifies the size of the neighborhood. All points within this distance are added.
- at least: If the neighborhood count is less than this number, the distance is increased until this number is met.
- smoothing kernel: Determines which kernel type is used to calculate the weights of distant examples.
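
The neighborhood parameters above can be illustrated with a short Python sketch. It is a simplified reading of the parameter descriptions, not the operator's code: the function names are hypothetical, the input is assumed to be an array of precomputed distances from one query example to all examples, and the rounding used for the relative size is an assumption.

```python
import numpy as np

def nearest_k(dists, k):
    """Fixed number: the k closest examples, regardless of local density."""
    return np.argsort(dists, kind="stable")[:k]

def within_distance(dists, radius):
    """Fixed distance: every example whose distance is at most radius."""
    return np.flatnonzero(dists <= radius)

def relative_size(dists, fraction):
    """Relative size: a share of all examples, e.g. 0.04 for 4%."""
    # Rounding to the nearest integer is an assumption of this sketch.
    k = max(1, int(round(fraction * len(dists))))
    return nearest_k(dists, k)

def distance_but_at_least(dists, radius, at_least):
    """Mixed: all examples within radius, but never fewer than at_least."""
    inside = within_distance(dists, radius)
    if len(inside) >= at_least:
        return inside
    # Growing the radius until at_least examples are covered is equivalent
    # (ignoring ties) to taking the at_least nearest examples.
    return nearest_k(dists, at_least)
```

For example, with distances [0.0, 0.5, 0.1, 2.0, 0.3, 1.0], a radius of 0.05 and at least 3 neighbors, the mixed variant falls back to the three nearest examples because only one example lies within the radius.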