Script to select the n best features for modeling a given dataset, using a greedy algorithm:
Initialize the set S of selected features to the empty set
Split your dataset into training and test sets
For i in 1 ... n:
For each feature f not in S, model and evaluate with feature set S + f
Greedily select the feature f' with the best performance and add it to S
The script takes as inputs the dataset to use and the number of features (that is, dataset fields) to return, and yields as output a list of the n selected features, as field identifiers.
To select the best performance, the script uses the metric average_phi in the evaluations it performs, which is only available for classification problems. Therefore, the script is only valid for categorical objective fields.
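The greedy loop above can be sketched in a few lines of Python. This is a minimal illustration, not the script itself: `greedy_select`, `toy_score`, and `TOY_GAIN` are hypothetical names, and `toy_score` stands in for the real step of training on S + f, evaluating on the test split, and reading average_phi.

```python
from typing import Callable, FrozenSet, List, Sequence


def greedy_select(
    features: Sequence[str],
    n: int,
    score: Callable[[FrozenSet[str]], float],
) -> List[str]:
    """Pick up to n features, adding at each step the one that scores best."""
    selected: List[str] = []
    for _ in range(min(n, len(features))):
        # Candidates are the features f that are not yet in S.
        candidates = [f for f in features if f not in selected]
        # Evaluate S + f for every candidate and keep the best one.
        best = max(candidates, key=lambda f: score(frozenset(selected + [f])))
        selected.append(best)
    return selected


# Hypothetical stand-in for "model and evaluate": each feature contributes
# a fixed, made-up amount to the score (illustration only).
TOY_GAIN = {"age": 0.30, "income": 0.50, "zip": 0.05, "plan": 0.20}


def toy_score(feature_set: FrozenSet[str]) -> float:
    return sum(TOY_GAIN[f] for f in feature_set)


chosen = greedy_select(list(TOY_GAIN), 2, toy_score)
print(chosen)
```

Because each round evaluates every remaining feature, selecting n features out of m costs on the order of n × m model builds and evaluations.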
A very simple script in which we decide whether it's better to use a model or an ensemble for making predictions by creating both (given an input source) and evaluating the results, choosing the one with best f-1 measure in its evaluation if the objective field is categorical, or r-measure for regression problems.
Given an input source:
Create a dataset with the input source.
Split it into training and test parts (80%/20%).
Create a model using the training dataset.
Create an ensemble using the training dataset.
Evaluate both the model and the ensemble using the test dataset.
Compare their evaluations and choose the best.
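The split and the final comparison can be sketched as follows. This is only an illustration under stated assumptions: `split_rows` and `pick_best` are hypothetical helpers, the metric keys `f_measure` and `r_measure` are made-up names mirroring the text, and the evaluation numbers are invented for the example. The model and ensemble training steps are not shown.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_rows(
    rows: Sequence[dict], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[dict], List[dict]]:
    """Shuffle deterministically, then cut into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def pick_best(evaluations: Dict[str, Dict[str, float]], categorical: bool) -> str:
    """Return the name of the candidate with the highest relevant metric."""
    # Hypothetical metric keys: one for classification, one for regression.
    metric = "f_measure" if categorical else "r_measure"
    return max(evaluations, key=lambda name: evaluations[name][metric])


train_rows, test_rows = split_rows([{"id": i} for i in range(10)])

# Made-up evaluation results, for illustration only.
winner = pick_best(
    {"model": {"f_measure": 0.71}, "ensemble": {"f_measure": 0.78}},
    categorical=True,
)
print(len(train_rows), len(test_rows), winner)
```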
This script takes as inputs a cluster identifier, an instance, i.e., a map with values for all fields used by the cluster, and a positive count n. It then:
Finds the centroid in the cluster closest to the given instance p
Selects within that centroid's dataset the n instances that are closest to p
If there are fewer than n rows in the centroid's dataset, missing instances are read from the next closest centroid.
This workflow uses Flatline to compute the distance between p and the centroid datasets (via the row-distance-squared Flatline function) and add an extra column to the dataset, and then creates a sample of the result, ordered by the computed distance.
The input instance can be specified using either field identifiers or field names.
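The selection logic can be illustrated with a small Python sketch. It assumes purely numeric instances represented as tuples and in-memory lists of rows per centroid; `squared_distance` and `closest_instances` are hypothetical helpers, not the Flatline function or the script's own code.

```python
from typing import Dict, List, Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two numeric points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest_instances(
    centroids: Dict[str, Point],
    members: Dict[str, List[Point]],
    p: Point,
    n: int,
) -> List[Point]:
    """Take the n rows nearest to p, starting with the nearest centroid."""
    # Visit centroids from nearest to farthest from p.
    order = sorted(centroids, key=lambda c: squared_distance(centroids[c], p))
    picked: List[Point] = []
    for name in order:
        # Within one centroid's rows, the ones closest to p come first.
        rows = sorted(members[name], key=lambda r: squared_distance(r, p))
        picked.extend(rows[: n - len(picked)])
        if len(picked) == n:
            break
    return picked


# Toy data, for illustration only.
centroids = {"a": (0.0, 0.0), "b": (10.0, 10.0)}
members = {
    "a": [(2.0, 2.0), (0.0, 1.0)],
    "b": [(12.0, 12.0), (9.0, 9.0)],
}
nearest = closest_instances(centroids, members, p=(1.0, 1.0), n=3)
print(nearest)
```

Centroid "a" holds only two rows, so the third instance is taken from "b", the next closest centroid.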
The idea behind this script is to take a dataset as input and return a "clean" dataset with no missing values (except possibly in the objective) and only "preferred" fields.
The script "completes" missing fields by using predictive models to impute values where they are missing. The result is a dataset with the columns containing missing values replaced by columns with the missing values imputed. In addition, for each completed column, we add a binary column indicating whether or not the value was missing in the original dataset. Finally, we also remove non-preferred columns.
Check this readme for more information.
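The shape of the output, imputed columns plus a binary "was missing" column per completed field, can be sketched as below. Note the swapped-in technique: where the script trains a predictive model per field, this sketch simply uses the mean of the observed values as a stand-in. `complete_rows` and the `_was_missing` suffix are hypothetical names, and the removal of non-preferred fields is not shown.

```python
from statistics import mean
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def complete_rows(rows: List[Row], objective: str) -> List[Row]:
    """Fill missing non-objective values and flag where they were missing."""
    fields = [f for f in rows[0] if f != objective]
    has_missing = [f for f in fields if any(r[f] is None for r in rows)]
    # Stand-in "model": the mean of the observed values of each field.
    fill = {f: mean(r[f] for r in rows if r[f] is not None) for f in has_missing}
    completed: List[Row] = []
    for r in rows:
        new_row = dict(r)
        for f in has_missing:
            # Binary indicator: 1 if the original value was missing.
            new_row[f + "_was_missing"] = 1 if r[f] is None else 0
            if r[f] is None:
                new_row[f] = fill[f]
        completed.append(new_row)
    return completed


# Toy rows, for illustration only; "target" is the objective field.
rows = [
    {"x": 1.0, "y": 4.0, "target": 1.0},
    {"x": None, "y": 6.0, "target": 0.0},
    {"x": 3.0, "y": 5.0, "target": None},
]
clean = complete_rows(rows, objective="target")
print(clean[1])
```

Missing values in the objective are left untouched, matching the "except possibly in the objective" caveat above.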
This is a simple script that, given an input dataset, creates an anomaly detector and uses it to identify its top anomalous rows, proceeding then to create a new dataset without them using a Flatline filter.
The objective of this script is to perform a k-fold cross validation of a model built from a dataset. The algorithm:
Divides the dataset into k parts
Holds out the data in one of the parts and builds a model with the rest of the data
Evaluates the model with the held-out data
The second and third steps are repeated with each of the k parts, so that k evaluations are generated
Finally, the evaluation metrics are averaged to provide the cross-validation metrics.
The output of the script will be an evaluation ID. This evaluation is a cross-validation, meaning that its metrics are averages of the k evaluations created in the cross-validation process.
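As an illustration of the procedure, here is a minimal k-fold sketch in Python. It makes several assumptions: folds are built by dealing rows round-robin rather than by random sampling, the "model" is a trivial majority-class predictor, and `k_fold_indices`, `cross_validate`, `train_majority`, and `evaluate_accuracy` are hypothetical names.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple[int, str]  # (feature value, label)


def k_fold_indices(n_rows: int, k: int) -> List[List[int]]:
    """Deal row indices round-robin into k folds."""
    return [list(range(i, n_rows, k)) for i in range(k)]


def cross_validate(
    rows: Sequence[Row],
    k: int,
    train: Callable[[List[Row]], str],
    evaluate: Callable[[str, List[Row]], Dict[str, float]],
) -> Dict[str, float]:
    """Run k train/evaluate rounds and average each metric."""
    results = []
    for held_out in k_fold_indices(len(rows), k):
        held = set(held_out)
        training = [r for i, r in enumerate(rows) if i not in held]
        testing = [rows[i] for i in held_out]
        results.append(evaluate(train(training), testing))
    return {m: mean(r[m] for r in results) for m in results[0]}


def train_majority(training: List[Row]) -> str:
    """Toy model: always predict the most common training label."""
    return Counter(label for _, label in training).most_common(1)[0][0]


def evaluate_accuracy(model: str, testing: List[Row]) -> Dict[str, float]:
    hits = sum(1 for _, label in testing if label == model)
    return {"accuracy": hits / len(testing)}


labels = ["yes", "yes", "no", "yes", "yes", "yes", "no", "yes"]
data = list(enumerate(labels))
cv = cross_validate(data, 4, train_majority, evaluate_accuracy)
print(cv)
```

Each row is held out exactly once, so the averaged metric uses every row for testing without ever testing a model on rows it was trained on.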
Find the global field importance across a cluster
Please see the readme for more information.
Given an input dataset, we use SMACdown to find the best parameters for creating an ensemble from that dataset.
The script takes as inputs, besides the identifier of the dataset, the evaluation metric to maximize (defaulting to average_phi), the objective field, and a string used as a prefix when naming intermediate resources created by the workflow. You can select the metric to optimize (see below).
This workflow will generate a large number of auxiliary resources when executed. To instruct the script to delete all of them before finishing, use the delete-resources execution input parameter.
This script implements feature selection using a version of the Boruta algorithm to detect important and unimportant fields in your dataset. The algorithm:
Retrieves the dataset information.
Creates a new extended dataset. In the new dataset, each field has a corresponding shadow field which has the same type but contains a random sample of the values contained in the original one.
Creates a random forest from the extended dataset.
Extracts the maximum of the importances for the shadow fields.
Uses this maximum plus and minus a minimum gain as the upper and lower thresholds. Original fields scoring below the lower threshold are considered unimportant, and fields scoring above the upper threshold are considered important.
Fields marked as unimportant are removed from the list of fields to be used as input fields for new datasets.
The procedure is repeated, and a new extended dataset is created with the remaining fields. The process stops when it reaches the user-given number of runs or when all the original fields in the dataset are marked as important or unimportant.
When iteration stops, a new dataset is created where unimportant fields have been removed.
The output of the script is a dataset ID.
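Two pieces of one iteration, building shadow fields and applying the thresholds, can be sketched as follows. This is an illustration under assumptions, not the script's code: shadow values are drawn here by shuffling each column (one way of sampling from the original values), the random forest is not shown, the importances are made-up numbers, and `add_shadow_fields`, `classify_fields`, and the `shadow_` prefix are hypothetical names.

```python
import random
from typing import Dict, List, Tuple


def add_shadow_fields(rows: List[Dict[str, object]], seed: int = 0) -> List[Dict[str, object]]:
    """Return copies of rows with a shuffled 'shadow_<field>' per field."""
    rng = random.Random(seed)
    fields = list(rows[0])
    extended = [dict(r) for r in rows]
    for field in fields:
        values = [r[field] for r in rows]
        rng.shuffle(values)  # same values, detached from the original rows
        for row, value in zip(extended, values):
            row["shadow_" + field] = value
    return extended


def classify_fields(
    importances: Dict[str, float], min_gain: float
) -> Tuple[List[str], List[str], List[str]]:
    """Split original fields into important, unimportant, and undecided."""
    shadow_max = max(v for k, v in importances.items() if k.startswith("shadow_"))
    important, unimportant, undecided = [], [], []
    for name, value in importances.items():
        if name.startswith("shadow_"):
            continue
        if value > shadow_max + min_gain:
            important.append(name)
        elif value < shadow_max - min_gain:
            unimportant.append(name)
        else:
            undecided.append(name)  # re-examined in the next run
    return important, unimportant, undecided


toy_rows = [{"age": 30, "plan": "a"}, {"age": 41, "plan": "b"}, {"age": 25, "plan": "a"}]
extended = add_shadow_fields(toy_rows)

# Made-up importances standing in for a random forest's output.
toy_importances = {
    "age": 0.40, "zip": 0.02, "plan": 0.11,
    "shadow_age": 0.10, "shadow_zip": 0.08, "shadow_plan": 0.05,
}
verdict = classify_fields(toy_importances, min_gain=0.03)
print(verdict)
```

Fields that land between the two thresholds stay undecided, which is why the procedure is repeated until every field is classified or the run limit is reached.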