BigML.com's gallery of WhizzML scripts

FREE

Basic 5-fold cross-validation whizzml

The objective of this script is to perform a 5-fold cross-validation of the model built from a dataset, using the default values for all the available configuration parameters. Thus, the only input the script needs is the name of the dataset used to both train and test the models in the cross-validation. The algorithm:

Divides the dataset into 5 parts.
Holds out the data in one of the parts and builds a model with the rest of the data.
Evaluates the model with the held-out data.
The second and third steps are repeated with each of the 5 parts, so that 5 evaluations are generated.
Finally, the evaluation metrics are averaged to provide the cross-validation metrics.

The output of the script will be an evaluation ID. This evaluation is a cross-validation, meaning that its metrics are averages of the 5 evaluations created in the cross-validation process.
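The steps above can be sketched in plain Python. This is only an illustration of the procedure, not the WhizzML script itself: the `train` and `evaluate` callbacks, the toy majority-class "model" and the toy data are all assumptions made for the example.

```python
import random
from statistics import mean


def five_fold_cross_validation(rows, train, evaluate, k=5, seed=0):
    """Average the metrics of k hold-out evaluations (k=5 by default).

    train(rows) -> model and evaluate(model, rows) -> dict of metrics
    are hypothetical callbacks standing in for model and evaluation creation.
    """
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    # Divide the dataset into k parts of (nearly) equal size.
    folds = [shuffled[i::k] for i in range(k)]
    evaluations = []
    for i, held_out in enumerate(folds):
        # Build the model with every part except the held-out one.
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = train(training)
        # Evaluate the model with the held-out part.
        evaluations.append(evaluate(model, held_out))
    # Average each metric over the k evaluations.
    return {m: mean([e[m] for e in evaluations]) for m in evaluations[0]}


# Toy stand-ins: the "model" is just the most frequent label.
def train_majority(rows):
    labels = [label for _, label in rows]
    return max(set(labels), key=labels.count)


def evaluate_accuracy(model, rows):
    hits = [1.0 if label == model else 0.0 for _, label in rows]
    return {"accuracy": mean(hits)}


data = [(x, "a" if x % 4 else "b") for x in range(20)]
result = five_fold_cross_validation(data, train_majority, evaluate_accuracy)
```

Because every fold has the same size here, the averaged accuracy equals the overall share of rows whose label matches the majority class.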

Evaluations Models Cross-validation

10.8 KB

727

FREE

Best-first feature selection whizzml

Script to select the n best features for modeling a given dataset, using a greedy algorithm:

Initialize the set S of selected features to the empty set
Split your dataset into training and test sets
For i in 1 ... n:
    For each feature f not in S, model and evaluate with feature set S + f
    Greedily select the feature f' with the best performance and add it to S

The script takes as inputs the dataset to use and the number of features (that is, dataset fields) to return and yields as output a list of the n selected features, as field identifiers.

To select the best performance, the script uses the metric average_phi in the evaluations it performs, which is only available for classification problems. Therefore, the script is only valid for categorical objective fields.
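A minimal Python sketch of the greedy loop described above. The `score` callback is a hypothetical stand-in for "build a model on the training set with these features and read average_phi from its evaluation on the test set"; the toy weights are invented for the example.

```python
def best_first_features(features, n, score):
    """Greedily pick up to n features.

    score(feature_list) -> float is a hypothetical callback where a
    higher value means better performance.
    """
    selected = []  # the set S, kept as a list to preserve selection order
    for _ in range(min(n, len(features))):
        candidates = [f for f in features if f not in selected]
        # Evaluate S + f for every remaining feature and keep the best f.
        best = max(candidates, key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected


# Toy score: each feature has a fixed weight, and "b" is largely
# redundant once "a" has been selected.
WEIGHTS = {"a": 0.5, "b": 0.4, "c": 0.3, "d": 0.1}


def toy_score(feature_set):
    total = sum(WEIGHTS[f] for f in feature_set)
    if "a" in feature_set and "b" in feature_set:
        total -= 0.35
    return total


chosen = best_first_features(list(WEIGHTS), 3, toy_score)
```

In this toy run "b" is never chosen despite its high individual weight, which shows why each candidate is scored together with the features already in S rather than on its own.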

Optimization

4.0 KB

669

FREE

Model's k-fold cross-validation whizzml

The objective of this script is to perform a k-fold cross validation of a model built from a dataset. The algorithm:

Divides the dataset into k parts.
Holds out the data in one of the parts and builds a model with the rest of the data.
Evaluates the model with the held-out data.
The second and third steps are repeated with each of the k parts, so that k evaluations are generated.
Finally, the evaluation metrics are averaged to provide the cross-validation metrics.

The output of the script will be an evaluation ID. This evaluation is a cross-validation, meaning that its metrics are averages of the k evaluations created in the cross-validation process.
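The final averaging step can be sketched as follows. This is an assumption about how the averaging could work, not the script's actual code: each evaluation is represented as a plain dictionary of metric names to numbers, and only metrics present in every evaluation are averaged.

```python
from statistics import mean


def average_evaluations(evaluations):
    """Average each metric shared by all k evaluations."""
    shared = set(evaluations[0])
    for evaluation in evaluations[1:]:
        shared &= set(evaluation)
    return {m: mean([e[m] for e in evaluations]) for m in sorted(shared)}


# Three hypothetical fold evaluations (k = 3).
fold_evaluations = [
    {"accuracy": 0.8, "average_phi": 0.5},
    {"accuracy": 0.9, "average_phi": 0.7},
    {"accuracy": 1.0, "average_phi": 0.9},
]
cross_validation = average_evaluations(fold_evaluations)
```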

Evaluations Models Cross-validation

13.7 KB

571

FREE

Auto-complete Missing Fields whizzml

The idea behind this script is to take a dataset as input and return a "clean" dataset with no missing values (except possibly in the objective) and only "preferred" fields.

The script "completes" missing fields by using predictive models to impute values where they are missing. The result is a dataset in which the columns containing missing values are replaced by columns with those values imputed. In addition, for each completed column, we add a binary column indicating whether or not the value was missing in the original dataset. Finally, we also remove non-preferred columns.
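The shape of the output can be illustrated with a small Python sketch. Note the simplification: the script imputes with predictive models, whereas this example fills gaps with the column mean as a stand-in. The `_was_missing` suffix, the use of `None` for missing values and the numeric-only fields are assumptions made for the example.

```python
from statistics import mean


def complete_missing(rows, preferred, objective):
    """Return rows with preferred fields only, gaps filled and flagged.

    Mean imputation is a stand-in for the predictive models that the
    script uses; the objective field is left untouched.
    """
    completed = [{f: r[f] for f in preferred} for r in rows]
    for field in preferred:
        if field == objective:
            continue
        known = [r[field] for r in rows if r[field] is not None]
        if not known or len(known) == len(rows):
            continue  # nothing to learn from, or nothing missing
        fill = mean(known)
        for original, new in zip(rows, completed):
            # Binary column recording where the original value was missing.
            new[field + "_was_missing"] = original[field] is None
            if original[field] is None:
                new[field] = fill
    return completed


rows = [
    {"id": 1, "age": 30, "income": 10.0, "label": "y"},
    {"id": 2, "age": None, "income": 20.0, "label": None},
    {"id": 3, "age": 50, "income": None, "label": "n"},
]
clean = complete_missing(rows, ["age", "income", "label"], "label")
```

The non-preferred `id` column is dropped, and the missing objective value in the second row is kept as is.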

Check this readme for more information.

missing data

7.8 KB

447

FREE

Ensemble's k-fold cross-validation whizzml

The objective of this script is to perform a k-fold cross validation of an ensemble built from a dataset. The algorithm:

Divides the dataset into k parts.
Holds out the data in one of the parts and builds an ensemble with the rest of the data.
Evaluates the ensemble with the held-out data.
The second and third steps are repeated with each of the k parts, so that k evaluations are generated.
Finally, the evaluation metrics are averaged to provide the cross-validation metrics.

The output of the script will be an evaluation ID. This evaluation is a cross-validation, meaning that its metrics are averages of the k evaluations created in the cross-validation process.
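What differs from the single-model case is the evaluation of an ensemble on the held-out part. A minimal sketch, assuming a classification ensemble whose members are plain Python callables combined by majority vote (the platform's ensembles offer more combination options than this):

```python
from collections import Counter


def ensemble_predict(models, instance):
    """Majority vote over the member models' predictions."""
    votes = Counter(model(instance) for model in models)
    return votes.most_common(1)[0][0]


def evaluate_ensemble(models, held_out):
    """Accuracy of the ensemble on (instance, label) pairs."""
    correct = sum(
        1 for instance, label in held_out
        if ensemble_predict(models, instance) == label
    )
    return {"accuracy": correct / len(held_out)}


# Three hypothetical member models with slightly different thresholds.
members = [
    lambda x: "high" if x >= 8 else "low",
    lambda x: "high" if x >= 10 else "low",
    lambda x: "high" if x >= 12 else "low",
]
held_out = [(9, "low"), (11, "high"), (3, "low"), (15, "high")]
fold_evaluation = evaluate_ensemble(members, held_out)
```

Individual members misclassify 9 or 11, but the vote corrects them, which is the usual motivation for evaluating the ensemble as a whole.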

Evaluations Ensembles Cross-validation

16.8 KB

388

FREE

Model or ensemble whizzml

A very simple script that decides whether it is better to use a model or an ensemble for making predictions. It creates both (given an input source), evaluates them, and chooses the one with the best f-measure in its evaluation if the objective field is categorical, or the best r-squared for regression problems.

Given an input source:

Create a dataset with the input source.
Split it into training and test parts (80%/20%).
Create a model using the training dataset.
Create an ensemble using the training dataset.
Evaluate both the model and the ensemble using the test dataset.
Compare their evaluations and choose the best.
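The split and the final comparison can be sketched in Python. The metric keys `average_f_measure` and `r_squared` are an assumption about which evaluation entries correspond to the measures named above, and evaluations are represented as plain dictionaries.

```python
import random


def split_dataset(rows, train_fraction=0.8, seed=0):
    """Shuffle and split rows into training and test parts (80%/20%)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def choose_best(model_evaluation, ensemble_evaluation, categorical):
    """Pick "model" or "ensemble" by the metric for the problem type."""
    metric = "average_f_measure" if categorical else "r_squared"
    if ensemble_evaluation[metric] > model_evaluation[metric]:
        return "ensemble"
    return "model"  # ties favor the simpler resource


train, test = split_dataset(list(range(10)))
winner = choose_best(
    {"average_f_measure": 0.70},
    {"average_f_measure": 0.80},
    categorical=True,
)
```

Keeping the single model on a tie is a design choice of this sketch, not something the description specifies.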

Model Ensemble

2.2 KB

322

FREE

Find neighbors whizzml

This script takes as inputs a cluster identifier, an instance p (that is, a map with values for all fields used by the cluster), and a positive count n. It then:

Finds the centroid in the cluster closest to the given instance p
Selects within that centroid's dataset the n instances that are closest to p
If there are fewer than n rows in the centroid's dataset, the remaining instances are taken from the next closest centroid.

This workflow uses Flatline to compute the distance between p and the rows of the centroid datasets (via the row-distance-squared Flatline function), adds that distance as an extra column to the dataset, and then creates a sample of the result, ordered by the computed distance.

The input instance can be specified using either field identifiers or field names.
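The lookup logic can be sketched in plain Python. This is only an illustration under simplifying assumptions: all fields are numeric, centroids and rows are dictionaries keyed by field name, and `squared_distance` is a hand-written stand-in for the Flatline distance function.

```python
def squared_distance(p, q, fields):
    """Squared Euclidean distance over the given numeric fields."""
    return sum((p[f] - q[f]) ** 2 for f in fields)


def find_neighbors(centroids, datasets, p, n):
    """Return the n rows closest to p, searching centroid by centroid.

    centroids: {name: point}; datasets: {name: list of rows}.
    """
    fields = sorted(p)
    # Visit centroids from closest to farthest from p.
    ordered = sorted(
        centroids, key=lambda c: squared_distance(p, centroids[c], fields)
    )
    neighbors = []
    for name in ordered:
        rows = sorted(
            datasets[name], key=lambda r: squared_distance(p, r, fields)
        )
        # Take only as many rows as are still needed.
        neighbors.extend(rows[: n - len(neighbors)])
        if len(neighbors) >= n:
            break
    return neighbors


centroids = {"c0": {"x": 0, "y": 0}, "c1": {"x": 10, "y": 10}}
datasets = {
    "c0": [{"x": 0, "y": 1}, {"x": 1, "y": 1}],
    "c1": [{"x": 9, "y": 9}, {"x": 11, "y": 11}, {"x": 10, "y": 10}],
}
found = find_neighbors(centroids, datasets, {"x": 1, "y": 0}, 3)
```

The closest centroid's dataset has only two rows, so the third neighbor comes from the next closest centroid.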

Clusters Nearest neighbor

4.2 KB

318

FREE

Remove Anomalies whizzml

This is a simple script that, given an input dataset, creates an anomaly detector and uses it to identify its top anomalous rows, proceeding then to create a new dataset without them using a Flatline filter.
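The score-then-filter pattern can be sketched in Python. The scoring function below is a deliberately naive stand-in (sum of absolute deviations from the mean, scaled by the standard deviation), not the platform's anomaly detector; only the overall flow of ranking rows and filtering out the top ones mirrors the description.

```python
from statistics import mean, pstdev


def anomaly_scores(rows, fields):
    """Naive stand-in score: scaled distance from each field's mean."""
    stats = {}
    for f in fields:
        values = [r[f] for r in rows]
        # Fall back to 1.0 when a field is constant (zero deviation).
        stats[f] = (mean(values), pstdev(values) or 1.0)
    return [
        sum(abs(r[f] - stats[f][0]) / stats[f][1] for f in fields)
        for r in rows
    ]


def remove_top_anomalies(rows, fields, top_n):
    """Return a new list of rows without the top_n highest-scoring ones."""
    scores = anomaly_scores(rows, fields)
    ranked = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)
    anomalous = set(ranked[:top_n])
    return [r for i, r in enumerate(rows) if i not in anomalous]


readings = [{"v": 10}, {"v": 11}, {"v": 9}, {"v": 100}, {"v": 10}]
cleaned = remove_top_anomalies(readings, ["v"], top_n=1)
```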

Cleansing Anomalies

1.0 KB

232

FREE

Ensemble optimization whizzml

Given an input dataset, we use SMACdown to find the best parameters for creating an ensemble from that dataset.

Besides the identifier of the dataset, the script takes as inputs the evaluation metric to maximize (defaulting to average_phi), the objective field, and a string used as a prefix when naming intermediate resources created by the workflow. The metrics available for optimization are listed below.

Classification metrics:

average_recall
average_phi
accuracy
average_precision
average_f_measure

Regression metrics:

r_squared
mean_absolute_error
mean_squared_error

This workflow generates a large number of auxiliary resources when executed. To instruct the script to delete all of them before finishing, set the delete-resources execution input parameter to true.
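The idea of searching ensemble parameters against a chosen metric can be sketched in Python. To keep the example short and deterministic it uses an exhaustive grid search in place of SMACdown's search strategy, so it does not reproduce the script's algorithm. The parameter names, the `evaluate` callback and the decision to negate error metrics so that lower values win are all assumptions of this sketch.

```python
from itertools import product

# Assumption: error metrics are better when lower, so they are negated.
ERROR_METRICS = {"mean_absolute_error", "mean_squared_error"}


def optimize_ensemble(evaluate, space, metric="average_phi"):
    """Try every parameter combination and keep the best-scoring one.

    evaluate(params) -> dict of metrics is a hypothetical callback that
    would create an ensemble with params and evaluate it.
    """
    sign = -1.0 if metric in ERROR_METRICS else 1.0
    names = sorted(space)
    best_params, best_score = None, float("-inf")
    for values in product(*(space[name] for name in names)):
        params = dict(zip(names, values))
        score = sign * evaluate(params)[metric]
        if score > best_score:
            best_params, best_score = params, score
    return best_params, sign * best_score


# Toy evaluation that peaks at 10 models and a 0.8 sample rate.
def toy_evaluate(params):
    phi = (
        1.0
        - abs(params["number_of_models"] - 10) / 20
        - abs(params["sample_rate"] - 0.8)
    )
    return {"average_phi": phi, "mean_squared_error": 1.0 - phi}


space = {"number_of_models": [5, 10, 20], "sample_rate": [0.5, 0.8, 1.0]}
best, score = optimize_ensemble(toy_evaluate, space)
```

A grid grows multiplicatively with each added parameter, which is one reason a guided search is preferable for realistic parameter spaces.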

Ensemble SMACDown

4.7 KB

223

