The objective of this script is to perform a 5-fold cross-validation of the model built from a dataset, using the default choices for all the available configuration parameters. Thus, the only input needed for the script to run is the name of the dataset used to both train and test the models in the cross-validation. The algorithm:
Divides the dataset into 5 parts.
Holds out the data in one of the parts and builds a model with the rest of the data.
Evaluates the model with the held-out data.
The second and third steps are repeated with each of the 5 parts, so that 5 evaluations are generated.
Finally, the evaluation metrics are averaged to provide the cross-validation metrics.
The output of the script will be an evaluation ID. This evaluation is a cross-validation, meaning that its metrics are averages of the 5 evaluations created in the cross-validation process.
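The script itself is a workflow executed on the platform, but the underlying procedure is easy to mimic locally. The sketch below is only an illustration of that procedure; the dataset, model type, and metric (scikit-learn's iris data, a decision tree, and accuracy) are assumptions made for the example, not part of the script:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

fold_scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Hold out one of the 5 parts, build a model with the rest of the data ...
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    # ... and evaluate the model with the held-out data.
    fold_scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# The cross-validation metric is the average of the 5 per-fold evaluations.
print("5-fold accuracy: %.3f" % np.mean(fold_scores))
```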
Script to select the n best features for modeling a given dataset,
using a greedy algorithm:
Initialize the set S of selected features to the empty set
Split your dataset into training and test sets
For i in 1 ... n:
For each feature f not in S, model and evaluate with feature set S + f
Greedily select the feature f' with the best performance and add it to S
The script takes as inputs the dataset to use and the number of features (that is, dataset fields) to return, and yields as output the list of the n selected features, given as field identifiers.
To select the best-performing feature, the script uses the average_phi metric from the evaluations it performs, which is only available for classification problems. Therefore, the script is only valid for categorical objective fields.
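As a rough local analogue of the greedy loop (assumptions: scikit-learn, a decision tree as the model, and the Matthews correlation coefficient standing in for average_phi; none of these come from the script itself):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import matthews_corrcoef  # phi-style metric used as a stand-in
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def select_best_features(X, y, n):
    """Greedily select n feature indices, one at a time."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    selected, candidates = [], list(range(X.shape[1]))
    for _ in range(n):
        best_f, best_score = None, -np.inf
        for f in candidates:
            # Model and evaluate with the feature set S + f.
            cols = selected + [f]
            model = DecisionTreeClassifier().fit(X_tr[:, cols], y_tr)
            score = matthews_corrcoef(y_te, model.predict(X_te[:, cols]))
            if score > best_score:
                best_f, best_score = f, score
        # Greedily add the best-performing feature to S.
        selected.append(best_f)
        candidates.remove(best_f)
    return selected


X, y = load_iris(return_X_y=True)
print(select_best_features(X, y, n=2))
```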
The objective of this script is to perform a k-fold cross validation of a model built from a dataset. The algorithm:
Divides the dataset into k parts.
Holds out the data in one of the parts and builds a model with the rest of the data.
Evaluates the model with the held-out data.
The second and third steps are repeated with each of the k parts, so that k evaluations are generated.
Finally, the evaluation metrics are averaged to provide the cross-validation metrics.
The output of the script will be an evaluation ID. This evaluation is a cross-validation, meaning that its metrics are averages of the k evaluations created in the cross-validation process.
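For reference, the same split/build/evaluate loop and the final averaging can be reproduced locally in a single call; the library, model, and dataset below are illustrative assumptions only:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
k = 10  # number of folds, an input of the script

# One score per fold; the cross-validation metric is their average.
fold_scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=k)
print("%d-fold accuracy: %.3f" % (k, fold_scores.mean()))
```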
The idea behind this script is to take a dataset as input and return a "clean" dataset with no missing values (except possibly in the objective) and only "preferred" fields.
The script "completes" missing fields by using predictive models to impute value where they are missing. The result is a dataset with the columns containing missing values replaced by columns with the missing values imputed. In addition, for each completed column, we add a binary column indicating whether or not the value was missing in the original dataset. Finally, we also remove non-preferred columns.
The objective of this script is to perform a k-fold cross validation of an ensemble built from a dataset. The algorithm:
Divides the dataset into k parts.
Holds out the data in one of the parts and builds an ensemble with the rest of the data.
Evaluates the ensemble with the held-out data.
The second and third steps are repeated with each of the k parts, so that k evaluations are generated.
Finally, the evaluation metrics are averaged to provide the cross-validation metrics.
The output of the script will be an evaluation ID. This evaluation is a cross-validation, meaning that its metrics are averages of the k evaluations created in the cross-validation process.
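The procedure is identical to the model case above, only an ensemble is built on each fold. A short local sketch, assuming scikit-learn and a random forest as a stand-in for the ensembles the script builds:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
k = 5

# One evaluation per fold, averaged to give the cross-validation metric.
ensemble = RandomForestClassifier(n_estimators=10, random_state=0)
fold_scores = cross_val_score(ensemble, X, y, cv=k)
print("%d-fold ensemble accuracy: %.3f" % (k, fold_scores.mean()))
```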
A very simple script in which we decide whether it's better to use a model or an ensemble for making predictions by creating both (given an input source) and evaluating the results, choosing the one with the best f-measure in its evaluation if the objective field is categorical, or the best r_squared for regression problems.
Given an input source:
Create a dataset with the input source.
Split it into training and test parts (80%/20%).
Create a model using the training dataset.
Create an ensemble using the training dataset.
Evaluate both the model and the ensemble using the test dataset.
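A compact local sketch of the comparison, assuming scikit-learn, a decision tree versus a random forest, and macro-averaged F1 as the classification metric (again, illustrative choices rather than the script's own):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# 80%/20% training/test split, as in the workflow above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier().fit(X_tr, y_tr)
ensemble = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Evaluate both on the held-out test part and pick the winner.
model_f1 = f1_score(y_te, model.predict(X_te), average="macro")
ensemble_f1 = f1_score(y_te, ensemble.predict(X_te), average="macro")
print("use an ensemble" if ensemble_f1 > model_f1 else "use a model")
```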
This script takes as inputs a cluster identifier, an instance (i.e., a map with values for all the fields used by the cluster), and a positive count n. It then:
Finds the centroid in the cluster closest to the given instance p.
Selects within that centroid's dataset the n instances that are closest to p.
If there are fewer than n rows in the centroid's dataset, the remaining instances are taken from the next closest centroid's dataset.
This workflow uses Flatline to compute the distance between p and the rows in the centroid datasets (via the row-distance-squared Flatline function), adding an extra column with that distance to the dataset, and then creates a sample of the result ordered by the computed distance.
The input instance can be specified using either field identifiers or field names.
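A local sketch of the same lookup, assuming scikit-learn's KMeans in place of the platform cluster and a plain squared Euclidean distance in place of the Flatline row-distance-squared computation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = clusterer.labels_


def closest_instances(p, n):
    """Return the n rows closest to p, taken from the nearest centroid's
    points first and, if those run out, from the next closest centroid."""
    # Order centroids by squared distance to p.
    centroid_order = np.argsort(((clusterer.cluster_centers_ - p) ** 2).sum(axis=1))
    rows = []
    for c in centroid_order:
        members = X[labels == c]
        # Order this centroid's rows by their squared distance to p.
        members = members[np.argsort(((members - p) ** 2).sum(axis=1))]
        rows.extend(members[: n - len(rows)])
        if len(rows) == n:
            break
    return np.array(rows)


print(closest_instances(np.array([5.0, 3.4, 1.5, 0.2]), n=5))
```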
This is a simple script that, given an input dataset, creates an anomaly detector, uses it to identify the dataset's top anomalous rows, and then creates a new dataset without them using a Flatline filter.
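A minimal local sketch, assuming scikit-learn's IsolationForest as the anomaly detector and a fixed number of rows to drop (both choices are assumptions made for the example):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest

X, _ = load_iris(return_X_y=True)
detector = IsolationForest(random_state=0).fit(X)

# Drop the top_n most anomalous rows (lowest scores) and keep the rest.
top_n = 10
scores = detector.score_samples(X)   # lower score = more anomalous
anomalous = np.argsort(scores)[:top_n]
clean = np.delete(X, anomalous, axis=0)
print(clean.shape)
```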
Given an input dataset, we use SMACdown to find the best parameters for creating an ensemble from that dataset.
The script uses as inputs, besides the identifier of the dataset, the evaluation metric to maximize (defaulting to average_phi), the objective field, and a string used as a prefix when naming intermediate resources created by the workflow. You can select the metric to optimize from the lists below.
Classification metrics:
average_recall
average_phi
accuracy
average_precision
average_f_measure
Regression metrics:
r_squared
mean_absolute_error
mean_squared_error
This workflow will generate a large number of auxiliary resources when executed. To instruct the script to delete all of them before finishing, set the delete-resources execution input parameter to true.
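To illustrate the optimization idea only, the sketch below swaps SMACdown for plain random search over ensemble parameters; scikit-learn, the random forest, the parameter grid, and f1_macro are all stand-ins chosen for the example, not the workflow's actual search space or metric:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

param_space = {
    "n_estimators": [10, 50, 100, 200],
    "max_depth": [None, 4, 8, 16],
    "max_features": ["sqrt", "log2", None],
}
# scoring plays the role of the configurable metric (e.g. average_f_measure).
search = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_space,
                            n_iter=10, scoring="f1_macro", random_state=0)
search.fit(X, y)
print(search.best_params_)
```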