Script to select the n best features for modeling a given dataset, using a greedy algorithm:
Initialize the set S of selected features to the empty set
Split your dataset into training and test sets
For i in 1 ... n:
    For each feature f not in S, model and evaluate with feature set S + f
    Greedily select the feature f' with the best performance and add it to S
The script takes as inputs the dataset to use and the number of features (that is, dataset fields) to return, and yields as output a list of the n selected features, as field identifiers.
To select the best performance, the script uses the metric average_phi in the evaluations it performs, which is only available for classification problems. Therefore, the script is only valid for categorical objective fields.
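The greedy loop described above can be sketched in plain Python. This is only an illustration, not the script's actual code: the names greedy_select and toy_score are hypothetical, and toy_score stands in for the real step of training a model on the training split and evaluating it (with average_phi) on the test split.

```python
from typing import Callable, FrozenSet, List, Sequence


def greedy_select(
    features: Sequence[str],
    n: int,
    evaluate: Callable[[FrozenSet[str]], float],
) -> List[str]:
    """Return up to n features chosen by greedy forward selection."""
    selected: List[str] = []      # the set S, kept in selection order
    remaining = list(features)    # features f not yet in S
    for _ in range(min(n, len(remaining))):
        # Score every candidate set S + f and keep the best-performing f.
        best = max(remaining, key=lambda f: evaluate(frozenset(selected + [f])))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical scoring function (an assumption for this example only):
# each feature adds a fixed amount, and "b" is mostly redundant with "a".
def toy_score(feature_set: FrozenSet[str]) -> float:
    weights = {"a": 0.5, "b": 0.4, "c": 0.3, "d": 0.0}
    score = sum(weights[f] for f in feature_set)
    if "a" in feature_set and "b" in feature_set:
        score -= 0.35
    return score


chosen = greedy_select(["a", "b", "c", "d"], 2, toy_score)
print(chosen)
```

With this toy score the second pick is "c" rather than "b", even though "b" scores higher on its own, because each candidate is evaluated together with the features already selected.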
This script implements feature selection using a version of the Boruta algorithm to detect important and unimportant fields in your dataset. The algorithm:
Retrieves the dataset information.
Creates a new extended dataset. In the new dataset, each field has a corresponding shadow field which has the same type but contains a random sample of the values contained in the original one.
Creates a random forest from the extended dataset.
Extracts the maximum of the importances for the shadow fields.
Uses this maximum plus and minus a minimum gain as the upper and lower thresholds. Original fields scoring below the lower threshold are considered unimportant, and fields scoring above the upper threshold are considered important.
Fields marked as unimportant are removed from the list of fields to be used as input fields for new datasets.
The procedure is repeated, and a new extended dataset is created with the remaining fields. The process stops when it reaches the user-given number of runs or when all the original fields in the dataset are marked as important or unimportant.
When iteration stops, a new dataset is created where unimportant fields have been removed.
The output of the script is a dataset ID.
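Two of the steps above, building shadow fields and applying the thresholds, can be sketched in plain Python. This is a rough illustration under stated assumptions, not the script's implementation: the function names, the "shadow_" prefix, and the hard-coded importances are all hypothetical, and in the real procedure the importances would come from the random forest built on the extended dataset. Drawing the shadow values as a random permutation of the column is also an assumption about what "random sample" means here.

```python
import random
from typing import Dict, List, Tuple


def add_shadow_fields(
    rows: List[Dict[str, object]], fields: List[str], rng: random.Random
) -> List[Dict[str, object]]:
    """Return copies of rows with one 'shadow_<field>' column per field."""
    extended = [dict(row) for row in rows]
    for field in fields:
        values = [row[field] for row in rows]
        # Same values and type as the original column, in random order.
        shuffled = rng.sample(values, k=len(values))
        for row, value in zip(extended, shuffled):
            row["shadow_" + field] = value
    return extended


def classify_fields(
    importances: Dict[str, float], fields: List[str], min_gain: float
) -> Tuple[List[str], List[str]]:
    """Split fields into (important, unimportant) using the shadow maximum."""
    shadow_max = max(importances["shadow_" + f] for f in fields)
    important = [f for f in fields if importances[f] > shadow_max + min_gain]
    unimportant = [f for f in fields if importances[f] < shadow_max - min_gain]
    return important, unimportant


rows = [{"x": 1, "y": "a"}, {"x": 2, "y": "b"}, {"x": 3, "y": "c"}]
extended = add_shadow_fields(rows, ["x", "y"], random.Random(0))

# Hypothetical importances, standing in for a random forest's output.
importances = {
    "x": 0.40, "y": 0.02, "z": 0.12,
    "shadow_x": 0.10, "shadow_y": 0.08, "shadow_z": 0.05,
}
important, unimportant = classify_fields(importances, ["x", "y", "z"], 0.03)
print(important, unimportant)
```

In this example field "z" falls between the two thresholds, so it is neither important nor unimportant yet and would be carried into the next run.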
Script to select the n best features for modeling a dataset using a recursive algorithm. It starts with all the features and, on each iteration, it creates a model and removes the least important feature from the dataset.
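A minimal sketch of this recursive elimination loop in plain Python follows. The names recursive_elimination and toy_importances are hypothetical, and toy_importances is an assumption that stands in for creating a model on the remaining features and reading its field importances.

```python
from typing import Callable, Dict, List, Sequence


def recursive_elimination(
    features: Sequence[str],
    n: int,
    importances_for: Callable[[List[str]], Dict[str, float]],
) -> List[str]:
    """Drop the least important feature until only n features remain."""
    remaining = list(features)
    while len(remaining) > n:
        # Re-model on the current features and read their importances.
        importances = importances_for(remaining)
        least = min(remaining, key=lambda f: importances[f])
        remaining.remove(least)
    return remaining


# Hypothetical importance function: fixed weights, renormalized over
# whichever features are still in play.
def toy_importances(current: List[str]) -> Dict[str, float]:
    weights = {"a": 5.0, "b": 1.0, "c": 3.0, "d": 2.0}
    total = sum(weights[f] for f in current)
    return {f: weights[f] / total for f in current}


kept = recursive_elimination(["a", "b", "c", "d"], 2, toy_importances)
print(kept)
```

Importances are recomputed on every iteration because removing one feature can change how much the model relies on the others.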
For more information, please see the readme