
BigML.io—The BigML API
Documentation
Quick Start
Last Updated: Monday, 2017-10-30 10:31
This page helps you quickly create your first source, dataset, model, and prediction.
To get started with BigML.io you need:
- Your username and your API key.
- A terminal with curl or any other command-line tool that implements standard HTTPS methods.
- Some sample data. You can use:
  - A CSV file with some data. You can download the "Iris dataset" or "Diabetes dataset" from our servers.
  - Even easier, you can just use a URL that points to your data. For example, you can use https://static.bigml.com/csv/iris.csv or https://static.bigml.com/csv/diabetes.csv.
  - Easier still, you can just send some inline test data.
Jump to:
- Getting a Toy Data File
- Authentication
- Creating a Source
- Creating a Remote Source
- Creating an Inline Source
- Creating a Dataset
- Creating a Model
- Creating a Prediction
Getting a Toy Data File
If you do not have any dataset handy, you can download Fisher’s Iris dataset using the curl command below or by just clicking on the link.
curl -o iris.csv https://static.bigml.com/csv/iris.csv
$ Getting iris.csv
Authentication
The following snippet will help you set up an environment variable (i.e., BIGML_AUTH) to store your username and API key and avoid typing them again in the rest of examples. See this section for more details.
Note: Use your own username and API Key.
export BIGML_USERNAME=alfred
export BIGML_API_KEY=79138a622755a2383660347f895444b1eb927730
export BIGML_AUTH="username=$BIGML_USERNAME;api_key=$BIGML_API_KEY"
$ Setting Alfred's Authentication Parameters
Creating a Source
To create a new source, POST the file containing your data to the source base URL.
curl "https://bigml.io/source?$BIGML_AUTH" -F file=@iris.csv
> Creating a source
To create more sources simply repeat the curl command above using another file. Make sure to use the full path if the file is not in your current directory.
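For instance, assuming you have also downloaded the diabetes.csv file to a (hypothetical) /home/alfred/data directory, you could create a second source like this:
curl "https://bigml.io/source?$BIGML_AUTH" -F file=@/home/alfred/data/diabetes.csv
> Creating a source from a file outside the current directory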
Creating a Remote Source
You can also create a source using a valid URL that points to your data or some public data. For example:
curl "https://bigml.io/source?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"remote": "https://static.bigml.com/csv/iris.csv"}'
> Creating a remote source
Creating an Inline Source
You can also create a source using some inline data. For example:
curl "https://bigml.io/source?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"data": "a,b,c,d\n1,2,3,4\n5,6,7,8"}'
> Creating an inline source
{
"code": 201,
"content_type": "application/octet-stream",
"created": "2012-03-01T05:29:07.217968",
"credits": 0.0087890625,
"file_name": "iris.csv",
"md5": "d1175c032e1042bec7f974c91e4a65ae",
"name": "iris.csv",
"number_of_datasets": 0,
"number_of_models": 0,
"number_of_predictions": 0,
"private": true,
"resource": "source/4f52824203ce893c0a000053",
"size": 4608,
"source_parser": {
"header": true,
"locale": "en-US",
"missing_tokens": [
"N/A",
"n/a",
"NA",
"na",
"-",
"?"
],
"quote": "\"",
"separator": ",",
"trim": true
},
"status": {
"code": 2,
"elapsed": 0,
"message": "The source creation has been started"
},
"type": 0,
"updated": "2012-03-01T05:29:07.217990"
}
< Example source JSON response
Creating a Dataset
To create a dataset, POST the source/id from the previous step to the dataset base URL as follows.
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"source": "source/4f52824203ce893c0a000053"}'
> Creating a dataset
{
"code": 201,
"columns": 5,
"created": "2012-03-04T02:58:11.910363",
"credits": 0.0087890625,
"fields": {
"000000": {
"column_number": 0,
"name": "sepal length",
"optype": "numeric"
},
"000001": {
"column_number": 1,
"name": "sepal width",
"optype": "numeric"
},
"000002": {
"column_number": 2,
"name": "petal length",
"optype": "numeric"
},
"000003": {
"column_number": 3,
"name": "petal width",
"optype": "numeric"
},
"000004": {
"column_number": 4,
"name": "species",
"optype": "categorical"
}
},
"name": "iris' dataset",
"number_of_models": 0,
"number_of_predictions": 0,
"private": true,
"resource": "dataset/4f52da4303ce896fe3000000",
"rows": 0,
"size": 4608,
"source": "source/4f52824203ce893c0a000053",
"source_parser": {
"header": true,
"locale": "en-US",
"missing_tokens": [
"N/A",
"n/a",
"NA",
"na",
"-",
"?"
],
"quote": "\"",
"separator": ",",
"trim": true
},
"source_status": true,
"status": {
"code": 1,
"message": "The dataset is being processed and will be created soon"
},
"updated": "2012-03-04T02:58:11.910387"
}
< Dataset
Creating a Model
To create a model, POST the dataset/id from the previous step to the model base URL. By default BigML.io will include all fields as predictors and will treat the last non-text field as the objective. In the Models Section you will learn how to customize the input fields or the objective field.
curl "https://bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f52da4303ce896fe3000000"}'
> Creating a model
{
"code": 201,
"columns": 5,
"created": "2012-03-04T03:46:53.033372",
"credits": 0.03515625,
"dataset": "dataset/4f52da4303ce896fe3000000",
"dataset_status": true,
"holdout": 0.0,
"input_fields": [],
"max_columns": 5,
"max_rows": 150,
"name": "iris' dataset model",
"number_of_predictions": 0,
"objective_fields": [],
"private": true,
"range": [
1,
150
],
"resource": "model/4f52e5ad03ce898798000000",
"rows": 150,
"size": 4608,
"source": "source/4f52824203ce893c0a000053",
"source_status": true,
"status": {
"code": 1,
"message": "The model is being processed and will be created soon"
},
"updated": "2012-03-04T03:46:53.033396"
}
< Model
Creating a Prediction
To create a prediction, POST the model/id and some input data to the prediction base URL.
curl "https://bigml.io/prediction?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"model": "model/4f52e5ad03ce898798000000", "input_data": {"000000": 5, "000001": 3}}'
> Creating a prediction
{
"code": 201,
"created": "2012-03-04T04:11:10.433996",
"credits": 0.01,
"dataset": "dataset/4f52da4303ce896fe3000000",
"dataset_status": true,
"fields": {
"000000": {
"column_number": 0,
"datatype": "double",
"name": "sepal length",
"optype": "numeric"
},
"000001": {
"column_number": 1,
"datatype": "double",
"name": "sepal width",
"optype": "numeric"
},
"000002": {
"column_number": 2,
"datatype": "double",
"name": "petal length",
"optype": "numeric"
},
"000004": {
"column_number": 4,
"datatype": "string",
"name": "species",
"optype": "categorical"
}
},
"input_data": {
"000000": 5,
"000001": 3
},
"model": "model/4f52e5ad03ce898798000000",
"model_status": true,
"name": "Prediction for species",
"objective_fields": [
"000004"
],
"prediction": {
"000004": "Iris-virginica"
},
"prediction_path": {
"bad_fields": [],
"next_predicates": [
{
"count": 100,
"field": "000002",
"operator": ">",
"value": 2.45
},
{
"count": 50,
"field": "000002",
"operator": "<=",
"value": 2.45
}
],
"path": [],
"unknown_fields": []
},
"private": true,
"resource": "prediction/4f52eb5e03ce898798000009",
"source": "source/4f52824203ce893c0a000053",
"source_status": true,
"status": {
"code": 5,
"message": "The prediction has been created"
},
"updated": "2012-03-04T04:11:10.434030"
}
< Prediction
Overview
Last Updated: Monday, 2018-01-29 08:31
This page provides an introduction to BigML.io—The BigML API. A quick start guide for the impatient is here.
BigML.io is a Machine Learning REST API to easily build, run, and bring predictive models to your project. You can use BigML.io for basic supervised and unsupervised machine learning tasks and also to create sophisticated machine learning pipelines.
BigML.io is a REST-style API for creating and managing BigML resources programmatically. That is to say, using BigML.io you can create, retrieve, update and delete BigML resources using standard HTTP methods.
BigML.io gives you:
- Secure programmatic access to all your BigML resources.
- Fully white-box access to your datasets, models, clusters and anomaly detectors.
- Asynchronous creation of resources.
- Near real-time predictions.
Jump to:
- BigML Resources
- REST API
- HTTPS
- Base URL
- Version
- Summary of Resource URL Patterns
- Summary of HTTP Methods
- Resource ID
- Libraries
- Limits
BigML Resources
BigML.io gives you access to the following resources: project, source, dataset, sample, correlation, statisticaltest, model, ensemble, logisticregression, cluster, anomaly, association, topicmodel, timeseries, deepnet, prediction, centroid, anomalyscore, associationset, topicdistribution, forecast, batchprediction, batchcentroid, batchanomalyscore, batchtopicdistribution, evaluation, library, script, execution, and configuration.
The four original BigML resources are: source, dataset, model, and prediction.
The most basic flow consists of using some local (or remote) training data to create a source, then using the source to create a dataset, later using the dataset to create a model, and, finally, using the model and new input data to create a prediction.
The training data is usually in tabular format. Each row in the data represents an instance (or example) and each column a field (or attribute). These fields are also known as predictors or covariates.
When the machine learning task is supervised, one of the columns (usually the last one) represents a special attribute known as the objective field (or target) that assigns a label (or class) to each instance. Training data in this format is called labeled, and the task of learning from it is called supervised learning.
Once a source is created, it can be used to create multiple datasets. Likewise, a dataset can be used to create multiple models and a model can be used to create multiple predictions.
A model can be either a classification or a regression model depending on whether the objective field is respectively categorical or numeric.
Often an ensemble (or collection of models) can perform better than just a single model. Thus, a dataset can also be used to create an ensemble instead of a single model.
A dataset can also be used to create a cluster or an anomaly detector. Clusters and Anomaly Detectors are both built using unsupervised learning and therefore an objective field is not needed. In these cases, the training data is named unlabeled.
A centroid is to a cluster what a prediction is to a model. Likewise, an anomaly score is to an anomaly detector what a prediction is to a model.
There are scenarios where generating predictions for a relatively big collection of input data is very convenient. For these scenarios, BigML.io offers batch resources such as: batchprediction, batchcentroid, and batchanomalyscore. These resources take a dataset and, respectively, a model (or ensemble), a cluster, or an anomaly detector to create a new dataset that contains a new column with the corresponding prediction, centroid, or anomaly score computed for each instance in the dataset.
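As a minimal sketch of this batch flow (the exact arguments are covered in the corresponding resource sections), a batch prediction is created by posting a model and a dataset together:
curl "https://bigml.io/batchprediction?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"model": "model/4f52e5ad03ce898798000000", "dataset": "dataset/4f52da4303ce896fe3000000"}'
> Creating a batch prediction (sketch)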
When dealing with multiple projects, it's better to keep the resources that belong to each project separated. Thus, BigML also has a resource named project that helps you group together all the other resources. As you will see, you just need to assign a source to a pre-existing project and all the subsequent resources will be created in that project.
Note: In the snippets below you should substitute your own username and API key for Alfred's.
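For example, assuming you have already created a project, you can assign a new remote source to it at creation time by passing its project id (a sketch with an illustrative project id):
curl "https://bigml.io/source?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"remote": "https://static.bigml.com/csv/iris.csv", "project": "project/54c8168df0a5eae58c000019"}'
> Creating a remote source directly in a project (sketch)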
REST API
BigML.io conforms to the design principles of Representational State Transfer (REST). BigML.io is entirely HTTPS-based.
You can create, read, update, and delete resources using the respective standard HTTP methods: POST, GET, PUT and DELETE.
All communication with BigML.io is JSON formatted except for source creation. Source creation is handled with an HTTP POST using the "multipart/form-data" content-type.
HTTPS
All access to BigML.io must be performed over HTTPS. In this way communication between your application and BigML.io is encrypted and the integrity of traffic between both is verified.
Base URL
All BigML.io HTTP commands use the following base URL:
https://bigml.io
Base URL
Version
The BigML.io API is versioned using code names instead of version numbers. The current version name is "andromeda", so URLs that pin this version are written as follows: https://bigml.io/andromeda/
Version
Specifying the version name is optional. If you omit the version name in your API requests, you will always get access to the latest API version. While we will do our best to make future API versions backward compatible, it is possible that a future API release could cause your application to fail.
Specifying the API version in your HTTP calls will ensure that your application continues to function for the life cycle of the API release.
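For example, to pin a request to the andromeda release, you simply prefix the resource path with the version name:
curl "https://bigml.io/andromeda/source?$BIGML_AUTH"
$ Listing your sources with a versioned URL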
Summary of Resource URL Patterns
BigML.io gives you access to the following resources: project, source, dataset, sample, correlation, statistical test, model, ensemble, logistic regression, cluster, anomaly detector, association, topic model, time series, deepnet, prediction, centroid, anomaly score, association set, topic distribution, forecast, batch prediction, batch centroid, batch anomaly score, batch topic distribution, evaluation, library, script, execution, and configuration.
https://bigml.io/project
https://bigml.io/source
https://bigml.io/dataset
https://bigml.io/sample
https://bigml.io/correlation
https://bigml.io/statisticaltest
https://bigml.io/model
https://bigml.io/ensemble
https://bigml.io/logisticregression
https://bigml.io/cluster
https://bigml.io/anomaly
https://bigml.io/association
https://bigml.io/topicmodel
https://bigml.io/timeseries
https://bigml.io/deepnet
https://bigml.io/prediction
https://bigml.io/centroid
https://bigml.io/anomalyscore
https://bigml.io/associationset
https://bigml.io/topicdistribution
https://bigml.io/forecast
https://bigml.io/batchprediction
https://bigml.io/batchcentroid
https://bigml.io/batchanomalyscore
https://bigml.io/batchtopicdistribution
https://bigml.io/evaluation
https://bigml.io/library
https://bigml.io/script
https://bigml.io/execution
https://bigml.io/configuration
Resource URL Patterns
Summary of HTTP Methods
BigML.io uses the standard POST, GET, PUT, and DELETE HTTP methods to create, retrieve, update and delete resources, respectively.
Operation | HTTP method | Semantics |
---|---|---|
CREATE | POST | Creates a new resource. Only certain fields are "postable". This method is not idempotent. Each valid POST request results in a new directly accessible resource. |
RETRIEVE | GET | Retrieves either a specific resource or a list of resources. This method is idempotent. The content type of the resources is always "application/json; charset=utf-8". |
UPDATE | PUT | Updates partial content of a resource. Only certain fields are "putable". This method is idempotent. |
DELETE | DELETE | Deletes a resource. This method is idempotent. |
Resource ID
All BigML resources are identified by a name composed of two parts separated by a slash "/". The first part is the type of the resource and the second part is a 24-char unique identifier. See the examples below:
source/4f510d2003ce895676000069
dataset/4f510cfc03ce895676000040
model/4f51473203ce89b7ef000005
ensemble/523e9017035d0772e600b285
prediction/4f51473b03ce89b7ef000008
evaluation/50a30a453c19200bd1000839
Examples of resource ids
Libraries
We have developed light-weight API bindings for Python, Node.js, and Java.
A number of libraries for many other languages have been developed by the growing BigML community: C#, Ruby, PHP, and iOS. If you are interested in library support for a particular language, let us know. Or, if you are motivated to develop a library, we will give you all the support that we can.
Limits
BigML.io is currently limited to 1,000,000 (one million) requests per API key per hour. Please email us if you have a specific use case that requires a higher rate limit.
Authentication
Last Updated: Thursday, 2018-02-01 21:20
All requests to BigML.io must be authenticated with your username and your API key, passed as query string parameters, as in the example below.
https://bigml.io/source?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730
Example URL to list your sources
Your BigML API Key is a unique identifier that is assigned exclusively to your account. You can manage your BigML API Key in your account settings. Remember to keep your API key secret.
To use BigML.io from the command line, we recommend setting your username and API key as environment variables. Using environment variables is also an easy way to keep your credentials out of your source code.
Note: Use your own username and API Key.
export BIGML_USERNAME=alfred
export BIGML_API_KEY=79138a622755a2383660347f895444b1eb927730
export BIGML_AUTH="username=$BIGML_USERNAME;api_key=$BIGML_API_KEY"
$ Setting Alfred's Authentication Parameters
set BIGML_USERNAME=alfred
set BIGML_API_KEY=79138a622755a2383660347f895444b1eb927730
set BIGML_AUTH=username^=%BIGML_USERNAME%;api_key^=%BIGML_API_KEY%
$ Setting Alfred's Authentication Parameters in Windows
Here is an example of an authenticated API request to list the sources in your account from a command line.
curl "https://bigml.io/source?$BIGML_AUTH"
$ Example request to list your sources
Alternative Keys
Alternative Keys allow you to give fine-grained access to your BigML resources. To create an alternative key you need to use BigML's web interface. There you can define which resources an alternative key can access and which operations (i.e., create, list, retrieve, update or delete) are allowed with it. This is useful in scenarios where you want to grant different roles and privileges to different applications: for example, an application for the IT folks that collects data and creates sources in BigML, another that is accessed by data scientists to create and evaluate models, and a third that is used by the marketing folks to create predictions.
You can read more about alternative keys here.
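As a sketch of how an alternative key is used (assuming it is passed in the api_key parameter just like your main key, and using a made-up key value):
export BIGML_ALT_KEY=22f13a622755a2383660347f895444b1eb927731
curl "https://bigml.io/source?username=$BIGML_USERNAME;api_key=$BIGML_ALT_KEY"
$ Example request using an alternative key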
Organizations
Last Updated: Monday, 2018-02-19 10:04
An organization is a permission-based grouping of resources that helps you centralize your organization's resources. The permissions can be managed in a company-specific dashboard, and a user can be a member of multiple organizations at the same time. All resources are created under a specific project in the organization. A project can be configured as private or public, and you can control who has access to your projects and the resources under them.
Organization Member Types
There are four types of membership in an organization.
- A restricted member can create, retrieve, update, and delete resources in the organization project, and view public or private projects that the user has access to.
- A member has the restricted member privileges and can also create public or private projects in the organization. A public project can be accessed by any user of the organization, and a private project can be accessed only by those who have permission to the project.
When a project is created or updated, certain organization users can be assigned the manage, write, or read permission. A user with the admin permission or an organization administrator can update and delete the project. A user with the write permission can create, retrieve, update, and delete resources in the project, and a user with the read permission can only read the existing resources in the project. The user who creates the project automatically has the admin permission until the user is specifically removed from the project or the organization.
For example, let's say John, a user with the member role, is in the sales department. John has created a private project Sales Reports and added users Amy and Mike to the write permission list. Now John has been transferred to the marketing department and should no longer have access to the Sales Reports project. John can grant Amy or another organization user the admin permission, allowing that user to update or delete the project in the future, and then remove himself from the list. If John has already been removed or is unavailable, this can also be done by any administrator.
Any user with the write permission on the project can create, update, and delete resources and move their personal resources into the project. However, once a personal resource is moved under an organization project, it cannot be moved back to the personal account.
Finally, users with the read permission can view all resources in the project, but they cannot update or delete them, or create new ones.
- An administrator has full access to all projects and resources in the organization, and can manage the users and their membership of the organization.
- The owner has all privileges that an administrator has plus billing, and is the only one who can update and delete the organization.
Each user can have only one role. If a user is assigned multiple roles, only the role with the highest privilege is considered. For example, if a user is assigned both the member and restricted member roles, the user's final role in the organization will be member.
All resources created under the organization have the username and user_id properties filled with the owner's username and id, and a separate property creator which is the username of the user who actually created the resource.
Authentication
In addition to your username and api_key, all access to BigML organization resources requires an extra parameter in the query string for authentication. As explained above, every organization resource lives under a project, so to create, retrieve, update, or delete an organization resource you must pass project in the query string. Even if project is defined in the body of a POST request, it will simply be ignored in favor of the project in the query string. To manage the projects of an organization themselves, however, you need to pass organization instead. See the examples below.
https://bigml.io/source?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730;project=project/5948be694e17273079000000
Example URL to list your sources in an organization project
https://bigml.io/project?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730;organization=organization/5728cce44e1727587a000000
Example URL to list your projects in an organization
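Putting it together, a curl command that creates a source in that organization project would look like this:
curl "https://bigml.io/source?$BIGML_AUTH;project=project/5948be694e17273079000000" \
-F file=@iris.csv
> Creating a source in an organization project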
Requests
Last Updated: Monday, 2018-01-29 08:31
BigML.io uses the standard POST, GET, PUT, and DELETE HTTP methods to create, retrieve, update, and delete individual resources, respectively. You can also list all your resources for each resource type.
Jump to:
- Creating a Resource
- Retrieving a Resource
- Updating a Resource
- Deleting a Resource
- Listing Resources
- Paginating Resources
- Filtering Resources
- Ordering Resources
Creating a Resource
To create a new resource, you need to POST an object to the resource's base URL. The content-type must always be "application/json". The only exception is source creation, which requires the "multipart/form-data" content type.
For example, to create a model with a dataset, you can use curl like this:
curl "https://bigml.io/model/?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006"}'
> Creating a model
The following is an example of what a request header would look like for the request:
POST /model?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
> Example model create request
BigML.io will return a newly created resource document if the request succeeds.
A number of required and optional arguments exist for each type of resource. You can see a detailed arguments list for each resource in their respective sections: project, source, dataset, sample, correlation, statistical test, model, ensemble, logistic regression, cluster, anomaly detector, association, topic model, time series, deepnet, prediction, centroid, anomaly score, association set, topic distribution, forecast, batch prediction, batch centroid, batch anomaly score, batch topic distribution, evaluation, library, script, execution, and configuration.
Retrieving a Resource
To retrieve a resource, you need to issue an HTTP GET request to the resource/id to be retrieved. Each resource has a unique identifier in the form resource/id, where resource is the type of resource, such as dataset or model, and id is a string of 24 alphanumeric characters that you can use to retrieve the resource or as a parameter to create other resources from it.
For example, using curl you can do something like this to retrieve a dataset:
curl "https://bigml.io/dataset/54d86680f0a5ea5fc0000011?$BIGML_AUTH"
$ Retrieving a dataset from the command line
The following is an example of what a request header would look like for a dataset GET request:
GET /dataset/54d86680f0a5ea5fc0000011?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
> Example dataset retrieve request
Once a resource has been successfully created, it will have properties. A number of properties exist for each type of resource. You can see a detailed property list for each resource in their respective sections: projects, sources, datasets, samples, correlations, statisticaltests, models, ensembles, logisticregressions, clusters, anomalies, associations, topicmodels, timeseries, deepnets, predictions, centroids, anomalyscores, associationsets, topicdistributions, forecasts, batchpredictions, batchcentroids, batchanomalyscores, batchtopicdistributions, evaluations, libraries, scripts, executions, and configurations.
Updating a Resource
To update a resource, you need to PUT an object containing the fields that you want to update to the resource's base URL. The content-type must always be: "application/json".
If the request succeeds, BigML.io will respond with a 202 accepted code and with the new updated resource in the body of the message.
For example, to update a project with a new name, a new category, a new description, and new tags you can use curl like this:
curl "https://bigml.io/project/54d9553bf0a5ea5fc0000016?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "My new Project",
"category": 3,
"description": "My first BigML Project",
"tags": ["fraud", "detection"]}'
$ Updating a project
The following is an example of what a request header would look like for the request:
PUT /project/54d9553bf0a5ea5fc0000016?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
> Example project update request
Deleting a Resource
To delete a resource, you need to issue an HTTP DELETE request to the resource/id to be deleted.
For example, using curl you can do something like this to delete a dataset:
curl -X DELETE "https://bigml.io/dataset/54d86680f0a5ea5fc0000011?$BIGML_AUTH"
$ Deleting a dataset from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return HTTP 204 responses with no body.
HTTP/1.1 204 NO CONTENT
Content-Length: 0
< Successful response
Once you delete a resource, it is permanently deleted. That is, a delete request cannot be undone.
For example, if you try to delete a dataset a second time, or a dataset that does not exist, you will receive an error like this:
{
"code": 404,
"status": {
"code": -1201,
"extra": [
"A dataset matching the provided arguments could not be found"
],
"message": "Id does not exist"
}
}
< Error trying to delete a dataset that does not exist
The following is an example of what a request header would look like for a dataset DELETE request:
DELETE /dataset/54d86680f0a5ea5fc0000011?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
> Example dataset delete request
Listing Resources
To list resources of a given type, use that type's base URL. By default, only the 20 most recent resources will be returned. You can see below how to change this number using the limit parameter.
You can get the list of each resource type directly in your browser using your own username and API key with the following links.
https://bigml.io/project?$BIGML_AUTH
https://bigml.io/source?$BIGML_AUTH
https://bigml.io/dataset?$BIGML_AUTH
https://bigml.io/sample?$BIGML_AUTH
https://bigml.io/correlation?$BIGML_AUTH
https://bigml.io/statisticaltest?$BIGML_AUTH
https://bigml.io/model?$BIGML_AUTH
https://bigml.io/ensemble?$BIGML_AUTH
https://bigml.io/logisticregression?$BIGML_AUTH
https://bigml.io/cluster?$BIGML_AUTH
https://bigml.io/anomaly?$BIGML_AUTH
https://bigml.io/association?$BIGML_AUTH
https://bigml.io/topicmodel?$BIGML_AUTH
https://bigml.io/timeseries?$BIGML_AUTH
https://bigml.io/deepnet?$BIGML_AUTH
https://bigml.io/prediction?$BIGML_AUTH
https://bigml.io/centroid?$BIGML_AUTH
https://bigml.io/anomalyscore?$BIGML_AUTH
https://bigml.io/associationset?$BIGML_AUTH
https://bigml.io/topicdistribution?$BIGML_AUTH
https://bigml.io/forecast?$BIGML_AUTH
https://bigml.io/batchprediction?$BIGML_AUTH
https://bigml.io/batchcentroid?$BIGML_AUTH
https://bigml.io/batchanomalyscore?$BIGML_AUTH
https://bigml.io/batchtopicdistribution?$BIGML_AUTH
https://bigml.io/evaluation?$BIGML_AUTH
https://bigml.io/library?$BIGML_AUTH
https://bigml.io/script?$BIGML_AUTH
https://bigml.io/execution?$BIGML_AUTH
https://bigml.io/configuration?$BIGML_AUTH
> Listing resources from a browser
You can also easily list them from the command line using curl as follows:
curl "https://bigml.io/project?$BIGML_AUTH"
curl "https://bigml.io/source?$BIGML_AUTH"
curl "https://bigml.io/dataset?$BIGML_AUTH"
curl "https://bigml.io/sample?$BIGML_AUTH"
curl "https://bigml.io/correlation?$BIGML_AUTH"
curl "https://bigml.io/statisticaltest?$BIGML_AUTH"
curl "https://bigml.io/model?$BIGML_AUTH"
curl "https://bigml.io/ensemble?$BIGML_AUTH"
curl "https://bigml.io/logisticregression?$BIGML_AUTH"
curl "https://bigml.io/cluster?$BIGML_AUTH"
curl "https://bigml.io/anomaly?$BIGML_AUTH"
curl "https://bigml.io/association?$BIGML_AUTH"
curl "https://bigml.io/topicmodel?$BIGML_AUTH"
curl "https://bigml.io/timeseries?$BIGML_AUTH"
curl "https://bigml.io/deepnet?$BIGML_AUTH"
curl "https://bigml.io/prediction?$BIGML_AUTH"
curl "https://bigml.io/centroid?$BIGML_AUTH"
curl "https://bigml.io/anomalyscore?$BIGML_AUTH"
curl "https://bigml.io/associationset?$BIGML_AUTH"
curl "https://bigml.io/topicdistribution?$BIGML_AUTH"
curl "https://bigml.io/forecast?$BIGML_AUTH"
curl "https://bigml.io/batchprediction?$BIGML_AUTH"
curl "https://bigml.io/batchcentroid?$BIGML_AUTH"
curl "https://bigml.io/batchanomalyscore?$BIGML_AUTH"
curl "https://bigml.io/batchtopicdistribution?$BIGML_AUTH"
curl "https://bigml.io/evaluation?$BIGML_AUTH"
curl "https://bigml.io/library?$BIGML_AUTH"
curl "https://bigml.io/script?$BIGML_AUTH"
curl "https://bigml.io/execution?$BIGML_AUTH"
curl "https://bigml.io/configuration?$BIGML_AUTH"
$ Listing resources from the command line
The following is an example of what a request header would look like when you request a list of models:
GET /model?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730
Host: bigml.io
> Example model list request
A list response is a JSON object with the following properties:
Property | Type | Description |
---|---|---|
meta | Object | Specifies in which page of the listing you are, how to get to the previous page and next page, and the total number of resources. |
objects | Array of resources | A list of resources filtered and ordered according to the criteria that you supply in your request. See the filtering and ordering options for more details. |
The meta object contains the limit, next, offset, previous, and total_count properties. For example, when you list your projects, the response will look like this:
{
"meta": {
"limit": 20,
"next": "/?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730&offset=20"
"offset": 0,
"previous": null,
"total_count": 54
},
"objects": [
{
"category": 0,
"code": 200,
"created": "2015-01-27T22:51:57.488000",
"description": "",
"name": "Project 1",
"private": true,
"resource": "project/54c8168df0a5eae58c000019",
...
},
{
"category": 0,
"code": 200,
"created": "2015-01-29T04:08:12.696000",
"description": "",
"name": "Project 2",
"private": true,
"resource": "project/54c9b22cf0a5ea7765000000",
...
},
...
]
}
< Listing of projects template
Paginating Resources
There are two parameters, limit and offset, that help you retrieve just a portion of your resources and paginate them.
If a limit is given, no more than that many resources will be returned, but possibly fewer if the request itself yields fewer resources. The offset parameter skips that many resources from the start of the listing.
For example, if you want to retrieve only the third and fourth most recent projects:
curl "https://bigml.io/project?$BIGML_AUTH;limit=2;offset=2"
$ Paginating projects from the command line
To paginate results, you need to start off with an offset of zero, then increment it by whatever value you use for the limit each time. So if you wanted to return resources 1-10, then 11-20, then 21-30, and so on, you would use "limit=10;offset=0", "limit=10;offset=10", and "limit=10;offset=20", respectively.
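For instance, to walk through your projects ten at a time you would issue successive requests like these:
curl "https://bigml.io/project?$BIGML_AUTH;limit=10;offset=0"
curl "https://bigml.io/project?$BIGML_AUTH;limit=10;offset=10"
curl "https://bigml.io/project?$BIGML_AUTH;limit=10;offset=20"
$ Paginating projects ten at a time from the command line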
Filtering Resources
The listings of resources can be filtered by any of the fields that we labeled as filterable in the table describing the properties of a resource type. For example, to retrieve all the projects tagged with "fraud":
https://bigml.io/project?$BIGML_AUTH;tags__in=fraud
> Filtering projects by tag from a browser
curl "https://bigml.io/project?$BIGML_AUTH;tags__in=fraud"
$ Filtering projects by tag from the command line
In addition to exact matches, there are more filters that you can use. To add one of these filters to your request, you just need to append one of the suffixes in the following table to the name of the property that you want to use as a filter; see the example after the table.
Filter | Description |
---|---|
! (optional) | Not. Example: !size=1048576 (<>1MB) |
__gt (optional) | Greater than. Example: size__gt=1048576 (>1MB) |
__gte (optional) | Greater than or equal to. Example: size__gte=1048576 (>=1MB) |
__contains (optional) | Case-sensitive word match. Example: name__contains=test |
__icontains (optional) | Case-insensitive word match. Example: name__icontains=test |
__in (optional) | Case-sensitive list word match. Example: tags__in=fraud,test |
__lt (optional) | Less than. Example: created__lt=2016-08-20T00:00:00.000000 (before 2016-08-20) |
__lte (optional) | Less than or equal to. Example: created__lte=2016-08-20T00:00:00.000000 (before or on 2016-08-20) |
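For example, using the size field from the examples above (assuming it is labeled as filterable for datasets), you could list only the datasets larger than 1MB:
curl "https://bigml.io/dataset?$BIGML_AUTH;size__gt=1048576"
$ Filtering datasets by size from the command line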
Ordering Resources
A list of resources can also be ordered by any of the fields that we labeled as sortable in the table describing the properties of a resource type.
For example, you can list your projects ordered by descending name directly in your browser, using your own username and API key, with the following link.
https://bigml.io/project?$BIGML_AUTH;order_by=-name
> Listing projects ordered by name from a browser
You can do the same thing from the command line using curl as follows:
curl "https://bigml.io/project?$BIGML_AUTH;order_by=-name"
$ Listing projects ordered by name from the command line
Responses
Last Updated: Monday, 2017-10-30 10:31
Each BigML.io response includes an HTTP header and, for most requests, a JSON body. For example, a successful dataset creation returns a 201 header and a JSON document like the ones below.
HTTP/1.1 201 CREATED
Server: nginx/1.0.5
Date: Sat, 03 Mar 2012 23:28:59 GMT
Content-Type: application/json; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Location: https://bigml.io/dataset/4f5a59b203ce8945c200000a
< Example HTTP response
{
"code": 201,
"columns": 5,
"created": "2012-03-03T23:28:59.404542",
"credits": 0.0087890625,
"fields": {
"000000": {
"column_number": 0,
"name": "sepal length",
"optype": "numeric"
},
"000001": {
"column_number": 1,
"name": "sepal width",
"optype": "numeric"
},
"000002": {
"column_number": 2,
"name": "petal length",
"optype": "numeric"
},
"000003": {
"column_number": 3,
"name": "petal width",
"optype": "numeric"
},
"000004": {
"column_number": 4,
"name": "species",
"optype": "categorical"
}
},
"name": "iris' dataset",
"number_of_models": 0,
"number_of_predictions": 0,
"private": true,
"resource": "dataset/4f5a59b203ce8945c200000a",
"rows": 0,
"size": 4608,
"source": "source/4f52824203ce893c0a000053",
"source_parser": {
"header": true,
"locale": "en-US",
"missing_tokens": [
"N/A",
"n/a",
"NA",
"na",
"-",
"?"
],
"quote": "\"",
"separator": ",",
"trim": true
},
"source_status": true,
"status": {
"code": 1,
"message": "The dataset is being processed and will be created soon"
},
"updated": "2012-03-03T23:28:59.404561"
}
< Example JSON response
Error Codes
Errors also use conventional HTTP response headers. For example, here is the header for a 404 response:
HTTP/1.1 404 NOT FOUND
Content-Type: application/json; charset=utf-8
Date: Fri, 03 Mar 2012 23:29:18 GMT
Server: nginx/1.1.11
Content-Length: 169
Connection: keep-alive
< Example HTTP error response
{
"code": 404,
"status": {
"code": -1201,
"extra": [
"4f5157f1035d07306600005b"
],
"message": "Id does not exist"
}
}
< Example JSON error response
Status Codes
Last Updated: Monday, 2017-10-30 10:31
This section lists the different status codes BigML.io sends in responses. First, we list the HTTP status codes, then the codes that define a resource creation status, and finally detailed error codes for every resource.
Jump to:
- HTTP Status Code Summary
- Resource Status Code Summary
- Error Code Summary
- Source Error Code Summary
- Dataset Error Code Summary
- Download Dataset Unsuccessful Requests
- Sample Error Code Summary
- Correlation Error Code Summary
- Statistical Test Error Code Summary
- Model Error Code Summary
- Ensemble Error Code Summary
- Logistic Regression Error Code Summary
- Cluster Error Code Summary
- Anomaly Error Code Summary
- Association Error Code Summary
- Topic Model Error Code Summary
- Time Series Error Code Summary
- Deepnet Error Code Summary
- Prediction Error Code Summary
- Centroid Error Code Summary
- Anomaly Score Error Code Summary
- Association Set Error Code Summary
- Topic Distribution Error Code Summary
- Forecast Error Code Summary
- Batch Prediction Error Code Summary
- Batch Centroid Error Code Summary
- Batch Anomaly Score Error Code Summary
- Batch Topic Distribution Error Code Summary
- Evaluation Error Code Summary
- Whizzml Library Error Code Summary
- Whizzml Script Error Code Summary
- Whizzml Execution Error Code Summary
HTTP Status Code Summary
BigML.io returns meaningful HTTP status codes for every request. The same status code is returned in both the HTTP header of the response and in the JSON body.
Code | Status | Semantics |
---|---|---|
200 | OK | Your request was successful and the JSON response should include the resource that you requested. |
201 | Created | A new resource was created. You can get the new resource's complete location through the HTTP headers or the resource/id through the resource key of the JSON response. |
202 | Accepted | Received after sending a request to update a resource if it was processed successfully. |
204 | No Content | Received after sending a request to delete a resource if it was processed successfully. |
400 | Bad Request | Your request is malformed, is missing a required parameter, or uses an invalid parameter value. |
401 | Unauthorized | Your request used the wrong username or API key. |
402 | Payment Required | Your subscription plan does not allow you to perform this action because you have exceeded your subscription limit. Please wait until your running tasks complete or upgrade your plan. |
403 | Forbidden | Your request is trying to access a resource that you do not own. |
404 | Not Found | The resource that you requested or used as parameter in a request does not exist anymore. |
405 | Not Allowed | Your request is trying to use an HTTP method that is not supported or to change fields of a resource that cannot be modified. |
411 | Length Required | Your request is trying to PUT or POST without sending any content or specifying its length. |
413 | Request Entity Too Large | The size of the content in your request is greater than what we support for PUT or POST requests. |
415 | Unsupported Media Type | Your request is trying to POST 'multipart/form-data' content but it is actually sending the wrong content-type. |
429 | Too Many Requests | You have sent too many requests in a given amount of time. |
500 | Internal Server Error | Your request could not be processed because something went wrong on BigML's end. |
503 | Service Unavailable | BigML.io is undergoing maintenance. |
Resource Status Code Summary
The creation of resources involves a computational task that can last a few seconds or a few days depending on the size of the data. Consequently, some HTTP POST requests to create a resource may launch an asynchronous task and return immediately. In order to know the completion status of this task, each resource has a status field that reports the current state of the request. This status is useful for monitoring progress during resource creation (see the sketch after the table below). The possible states for a task are:
Code | Status | Semantics | |
---|---|---|---|
0 | Waiting | The resource is waiting for another resource to be finished before BigML.io can start processing it. | |
1 | Queued | The task that is going to create the resource has been accepted but has been queued because there are other tasks using the system. | |
2 | Started | The task to create the resource has started and you should expect partial results soon. | |
3 | In Progress | The task has computed the first partial resource but still needs to do more computations. | |
4 | Summarized | This status is specific to datasets. It happens when the dataset has been computed but its data has not been serialized yet. The dataset is final, but you cannot use it yet to create a model; if you do, the model will wait until the dataset is finished. | |
5 | Finished | The task is completed and the resource is final. | |
-1 | Faulty | The task has failed. We either could not process the task as you requested it or have an internal issue. | |
-2 | Unknown | The task has reached a state that we cannot verify at this time. This is a status you should never see unless BigML.io suffers a major outage. |
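As a minimal sketch of how you might use this status field from the command line (it assumes the jq JSON processor is installed and, for brevity, ignores the faulty states), you could poll a resource until its status code reaches 5 (Finished):
until [ "$(curl -s "https://bigml.io/dataset/4f52da4303ce896fe3000000?$BIGML_AUTH" | jq '.status.code')" -eq 5 ]; do
  sleep 2
done
$ Waiting for a dataset to finish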
Error Code Summary
This is the list of possible general error codes you can receive from BigML.io when managing any type of resource.
Error Code | Semantics |
---|---|
-1100 | Unauthorized use |
-1101 | Not enough credits |
-1102 | Wrong resource |
-1104 | Cloned resources cannot be public |
-1105 | Price cannot be changed |
-1107 | Too many projects |
-1108 | Too many tasks |
-1109 | Subscription required |
-1200 | Missing parameter |
-1201 | Invalid Id |
-1203 | Field Error |
-1204 | Bad Request |
-1205 | Value Error |
-1206 | Validation Error |
-1207 | Unsupported Format |
-1208 | Invalid Sort Error |
Source Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing sources.
Error Code | Semantics |
---|---|
-2000 | This source cannot be read properly |
-2001 | Bad request to create a source |
-2002 | The source could not be created |
-2003 | The source cannot be retrieved |
-2004 | The source cannot be deleted now |
-2005 | Faulty source |
Dataset Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing datasets.
Error Code | Semantics |
---|---|
-3000 | The source is not ready yet |
-3001 | Bad request to create a dataset |
-3002 | The dataset cannot be created |
-30021 | The dataset cannot be created now |
-3003 | The dataset cannot be retrieved |
-3004 | The dataset cannot be deleted now |
-3005 | Faulty dataset |
-3006 | The dataset could not be created properly. This happens when a 1-click model has been requested and the corresponding dataset could not be created |
-3008 | The dataset could not be cloned properly. This happens when there is an internal error when you try to buy or clone another user's dataset |
-3010 | The clone of the origin dataset is not finished yet |
-3020 | The source does not contain readable data |
-3030 | The source cannot be parsed |
-3040 | The filter expression is not valid |
Download Dataset Unsuccessful Requests
This is the list of possible specific error codes you can receive from BigML.io managing downloads.
Error Code | Semantics |
---|---|
-9000 | The dataset export is not ready yet |
-9001 | Bad request to perform a dataset export |
-9002 | The dataset export cannot be performed |
-90021 | The dataset export cannot be performed now |
-9003 | The dataset export cannot be retrieved now |
-9004 | The dataset export cannot be deleted now |
-9005 | The dataset export could not be performed |
-9006 | Dataset exports aren't available for cloned datasets |
Sample Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing samples.
Error Code | Semantics |
---|---|
-16000 | The sample is not ready yet |
-16001 | Bad request to create a sample |
-16002 | Your sample cannot be created |
-16021 | Your sample cannot be created now |
-16003 | The sample cannot be retrieved now |
-16004 | Cannot delete sample now |
-16005 | The sample could not be created |
Correlation Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing correlations.
Error Code | Semantics |
---|---|
-18000 | The correlation is not ready yet |
-18001 | Bad request to create a correlation |
-18002 | Your correlation cannot be created |
-18021 | Your correlation cannot be created now |
-18003 | The correlation cannot be retrieved now |
-18004 | Cannot delete correlation now |
-18005 | The correlation could not be created |
Statistical Test Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing statistical tests.
Error Code | Semantics |
---|---|
-17000 | The statistical test is not ready yet |
-17001 | Bad request to create a statistical test |
-17002 | Your statistical test cannot be created |
-17021 | Your statistical test cannot be created now |
-17003 | The statistical test cannot be retrieved now |
-17004 | Cannot delete statistical test now |
-17005 | The statistical test could not be created |
Model Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing models.
Error Code | Semantics |
---|---|
-4000 | The dataset is not ready. A one-click model has been requested but the corresponding dataset is not ready yet |
-4001 | Bad request to create a model |
-4002 | The model cannot be created |
-40021 | The model cannot be created now |
-4003 | The model cannot be retrieved |
-4004 | The model cannot be deleted now |
-4005 | Faulty model |
-4006 | The dataset is empty |
-4007 | The input fields are empty |
-4008 | The model could not be cloned properly. This happens when there is an internal error when you try to buy or clone another user's model |
-4008 | Wrong objective field |
-6060 | The (sampled) input dataset is empty |
Ensemble Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing ensembles.
Error Code | Semantics |
---|---|
-8001 | Bad request to create an ensemble |
-8002 | The ensemble cannot be created |
-80021 | The ensemble cannot be created now |
-8003 | The ensemble cannot be retrieved now |
-8004 | The ensemble cannot be deleted now |
-8005 | The ensemble could not be created |
-8008 | The ensemble could not be cloned properly |
Logistic Regression Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing logistic regressions.
Error Code | Semantics |
---|---|
-22000 | The logistic regression is not ready yet |
-22001 | Bad request to create a logistic regression |
-22002 | Your logistic regression cannot be created |
-22021 | Your logistic regression cannot be created now |
-22003 | The logistic regression cannot be retrieved now |
-22004 | Cannot delete logistic regression now |
-22005 | The logistic regression could not be created |
Cluster Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing clusters.
Error Code | Semantics |
---|---|
-10000 | The cluster is not ready yet |
-10001 | Bad request to create a cluster |
-10002 | The cluster cannot be created |
-10003 | The cluster cannot be created now |
-10004 | The cluster cannot be retrieved now |
-10005 | The cluster cannot be deleted now |
-10008 | The cluster could not be cloned properly |
Anomaly Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing anomaly detectors.
Error Code | Semantics |
---|---|
-13000 | The anomaly detector is not ready yet |
-13001 | Bad request to create an anomaly detector |
-13002 | The anomaly detector cannot be created |
-13021 | The anomaly detector cannot be created now |
-13003 | The anomaly detector cannot be retrieved now |
-13004 | The anomaly detector cannot be deleted now |
-13005 | The anomaly detector could not be created |
-13008 | The anomaly detector could not be cloned properly |
Association Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing associations.
Error Code | Semantics |
---|---|
-23000 | The association is not ready yet |
-23001 | Bad request to create an association |
-23002 | Your association cannot be created |
-23021 | Your association cannot be created now |
-23003 | The association cannot be retrieved now |
-23004 | Cannot delete association now |
-23005 | The association could not be created |
Topic Model Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing topic models.
Error Code | Semantics |
---|---|
-26000 | The topic model is not ready yet |
-26001 | Bad request to create a topic model |
-26002 | Your topic model cannot be created |
-26021 | Your topic model cannot be created now |
-26003 | The topic model cannot be retrieved now |
-26004 | Cannot delete topic model now |
-26005 | The topic model could not be created |
Time Series Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing time series.
Error Code | Semantics |
---|---|
-30000 | The time series is not ready yet |
-30001 | Bad request to create a time series |
-30002 | Your time series cannot be created |
-30021 | Your time series cannot be created now |
-30003 | The time series cannot be retrieved now |
-30004 | Cannot delete time series now |
-30005 | The time series could not be created |
Deepnet Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing deepnets.
Error Code | Semantics |
---|---|
-33001 | Bad request to create a deepnet |
-33002 | The deepnet cannot be created |
-330021 | The deepnet cannot be created now |
-33003 | The deepnet cannot be retrieved now |
-33004 | The deepnet cannot be deleted now |
-33005 | The deepnet could not be created |
Prediction Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing predictions.
Error Code | Semantics |
---|---|
-5000 | This model is not ready yet |
-5001 | Bad request to create a prediction |
-5002 | The prediction cannot be created |
-5003 | The prediction cannot be retrieved |
-5004 | The prediction cannot be deleted now |
-5005 | The prediction could not be created |
Centroid Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing centroids.
Error Code | Semantics |
---|---|
-11001 | Bad request to create a centroid |
-11002 | Your centroid cannot be created now |
-11003 | The centroid cannot be retrieved now |
-11004 | Cannot delete centroid now |
-11005 | The centroid could not be created |
Anomaly Score Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing anomaly scores.
Error Code | Semantics |
---|---|
-14001 | Bad request to create an anomaly score |
-14002 | Your anomaly score cannot be created now |
-14003 | The anomaly score cannot be retrieved now |
-14004 | Cannot delete anomaly score now |
-14005 | The anomaly score could not be created |
Association Set Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing association sets.
Error Code | Semantics |
---|---|
-24001 | Bad request to create an association set |
-24002 | Your association set cannot be created now |
-24003 | The association set cannot be retrieved now |
-24004 | Cannot delete association set now |
-24005 | The association set could not be created |
Topic Distribution Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing topic distributions.
Error Code | Semantics |
---|---|
-27001 | Bad request to create a topic distribution |
-27002 | Your topic distribution cannot be created now |
-27003 | The topic distribution cannot be retrieved now |
-27004 | Cannot delete topic distribution now |
-27005 | The topic distribution could not be created |
Forecast Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing forecasts.
Error Code | Semantics |
---|---|
-31001 | Bad request to create a forecast |
-31002 | Your forecast cannot be created now |
-31003 | The forecast cannot be retrieved now |
-31004 | Cannot delete forecast now |
-31005 | The forecast could not be created |
Batch Prediction Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing batch predictions.
Error Code | Semantics |
---|---|
-6001 | Bad request to perform a batch prediction |
-6002 | The batch prediction cannot be performed |
-60021 | The batch prediction cannot be performed now |
-6003 | The batch prediction cannot be retrieved now |
-6004 | The batch prediction cannot be deleted now |
-6005 | The batch prediction could not be performed |
Batch Centroid Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing batch centroids.
Error Code | Semantics |
---|---|
-12001 | Bad request to perform a batch centroid |
-12002 | The batch centroid cannot be performed |
-12021 | The batch centroid cannot be performed now |
-12003 | The batch centroid cannot be retrieved now |
-12004 | The batch centroid cannot be deleted now |
-12005 | The batch centroid could not be performed |
Batch Anomaly Score Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing batch anomaly scores.
Error Code | Semantics |
---|---|
-15001 | Bad request to perform a batch anomaly score |
-15002 | The batch anomaly score cannot be performed |
-15021 | The batch anomaly score cannot be performed now |
-15003 | The batch anomaly score cannot be retrieved now |
-15004 | The batch anomaly score cannot be deleted now |
-15005 | The batch anomaly score could not be performed |
Batch Topic Distribution Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing batch topic distributions.
Error Code | Semantics |
---|---|
-28001 | Bad request to perform a batch topic distribution |
-28002 | The batch topic distribution cannot be performed |
-28021 | The batch topic distribution cannot be performed now |
-28003 | The batch topic distribution cannot be retrieved now |
-28004 | The batch topic distribution cannot be deleted now |
-28005 | The batch topic distribution could not be performed |
Evaluation Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing evaluations.
Error Code | Semantics |
---|---|
-7001 | Bad request to perform an evaluation |
-7002 | The evaluation cannot be performed |
-70021 | The evaluation cannot be performed now |
-7003 | The evaluation cannot be retrieved now |
-7004 | The evaluation cannot be deleted now |
-7005 | The evaluation could not be performed |
Whizzml Library Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing libraries.
Error Code | Semantics |
---|---|
-19000 | The library is not ready yet |
-19001 | Bad request to create a library |
-19002 | Your library cannot be created |
-19021 | Your library cannot be created now |
-19003 | The library cannot be retrieved now |
-19004 | Cannot delete library now |
-19005 | The library could not be created |
WhizzML Script Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing scripts.
Error Code | Semantics |
---|---|
-20000 | The script is not ready yet |
-20001 | Bad request to create a script |
-20002 | Your script cannot be created |
-20021 | Your script cannot be created now |
-20003 | The script cannot be retrieved now |
-20004 | Cannot delete script now |
-20005 | The script could not be created |
WhizzML Execution Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing executions.
Error Code | Semantics |
---|---|
-21000 | The execution is not ready yet |
-21001 | Bad request to create an execution |
-21002 | Your execution cannot be created |
-21021 | Your execution cannot be created now |
-21003 | The execution cannot be retrieved now |
-21004 | Cannot delete execution now |
-21005 | The execution could not be created |
Category Codes
Last Updated: Monday, 2017-10-30 10:31
Category | Description |
---|---|
-1 | Uncategorized |
0 | Miscellaneous |
1 | Automotive, Engineering & Manufacturing |
2 | Energy, Oil & Gas |
3 | Banking & Finance |
4 | Fraud & Crime |
5 | Healthcare |
6 | Physical, Earth & Life Sciences |
7 | Consumer & Retail |
8 | Sports & Games |
9 | Demographics & Surveys |
10 | Aerospace & Defense |
11 | Chemical & Pharmaceutical |
12 | Higher Education & Scientific Research |
13 | Human Resources & Psychology |
14 | Insurance |
15 | Law & Order |
16 | Media, Marketing & Advertising |
17 | Public Sector & Nonprofit |
18 | Professional Services |
19 | Technology & Communications |
20 | Transportation & Logistics |
21 | Travel & Leisure |
22 | Utilities |
Category | Description |
---|---|
-1 | Uncategorized |
0 | Miscellaneous |
1 | Advanced Workflow |
2 | Anomaly Detection |
3 | Association Discovery |
4 | Basic Workflow |
5 | Boosting |
6 | Classification |
7 | Classification/Regression |
8 | Correlations |
9 | Cluster Analysis |
10 | Data Transformation |
11 | Evaluation |
12 | Feature Engineering |
13 | Feature Extraction |
14 | Feature Selection |
15 | Hyperparameter Optimization |
16 | Model Selection |
17 | Prediction and Scoring |
18 | Regression |
19 | Stacking |
20 | Statistical Test |
Projects
Last Updated: Monday, 2017-10-30 10:31
A project is an abstract resource that helps you group related BigML resources together.
A project must have a name and optionally a category, description, and multiple tags to help you organize and retrieve your projects.
When you create a new source you can assign it to a pre-existing project. All the subsequent resources created using that source will belong to the same project.
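For example, assuming the project/id shown in the example response later in this section, a minimal sketch of creating a remote source directly inside an existing project, using the project argument documented in the Sources section, could look like this:
curl "https://bigml.io/source?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"remote": "https://static.bigml.com/csv/iris.csv",
         "project": "project/54d9553bf0a5ea5fc0000016"}'
> Creating a source inside an existing project (sketch)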
All the resources created within a project will inherit the name, description, and tags of the project unless you change them when you create the resources or update them later.
When you select a project on your BigML dashboard, you will only see the BigML resources related to that project. Using your BigML dashboard you can also create, update, and delete projects (and all their associated resources).
BigML.io allows you to create, retrieve, update, and delete your projects. You can also list all of your projects.
Jump to:
- Project Base URL
- Creating a Project
- Project Arguments
- Retrieving a Project
- Project Properties
- Updating a Project
- Deleting a Project
- Listing Projects
Project Base URL
You can use the following base URL to create, retrieve, update, and delete projects. https://bigml.io/project
Project base URL
All requests to manage your projects must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Project
To create a new project, you just need to POST the name you want to give to the new project to the project base URL.
You can easily do this using curl.
curl "https://bigml.io/project?$BIGML_AUTH" \
-H 'content-type: application/json' \
-d '{"name": "My First Project"}'
> Creating a project
BigML.io will return a newly created project document, if the request succeeded.
{
"category":0,
"created":"2015-02-02T07:49:20.226764",
"description":"",
"name":"My First Project",
"private":true,
"resource":"project/54d9553bf0a5ea5fc0000016",
"stats":{
"anomalies":{
"count":0
},
"anomalyscores":{
"count":0
},
"batchanomalyscores":{
"count":0
},
"batchcentroids":{
"count":0
},
"batchpredictions":{
"count":0
},
"batchtopicdistributions":{
"count":0
},
"centroids":{
"count":0
},
"clusters":{
"count":0
},
"configurations":{
"count":0
},
"correlations":{
"count":0
},
"datasets":{
"count":0
},
"ensembles":{
"count":0
},
"evaluations":{
"count":0
},
"models":{
"count":0
},
"predictions":{
"count":0
},
"sources":{
"count":0
},
"statisticaltests":{
"count":0
},
"topicmodels":{
"count":0
},
"topicdistributions":{
"count":0
}
},
"status":{
"code":5,
"message":"The project has been created"
},
"tags":[],
"updated":"2015-02-02T07:49:20.226781"
}
< Example project JSON response
In addition to the name, you can also use the following arguments.
Project Arguments
Argument | Type | Description |
---|---|---|
category (optional) | Integer, default is 0 | The category that best describes the project. See the category codes for the complete list of categories. Example: 1 |
description (optional) | String | A description of the project up to 8192 characters long. Example: "This is a description of my new project" |
name (optional) | String, default is Project Number | The name you want to give to the new project. Example: "my new project" |
tags (optional) | Array of Strings | A list of strings that help classify and index your project. Example: ["best customers", "2018"] |
You can also use curl to customize your new project with a category, description, or tags. For example, you can create a new project with all those arguments as follows:
curl "https://bigml.io/project?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{
"name": "Fraud Detection",
"category": 4,
"description": "Detecting fraud in bank transactions",
"tags": ["fraud", "detection"]
}'
> Creating a project with arguments
Retrieving a Project
Each project has a unique identifier in the form "project/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the project.
To retrieve a project with curl:
curl "https://bigml.io/project/54d9553bf0a5ea5fc0000016?$BIGML_AUTH"
$ Retrieving a project from the command line
You can also use your browser to visualize the project using the full BigML.io URL or pasting the project/id into the BigML.com dashboard.
Project Properties
Once a project has been successfully created it will have the following properties.
Property | Type | Description |
---|---|---|
category (filterable, sortable, updatable) | Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
code | Integer | HTTP status code. This will be 201 upon successful creation of the project and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the project creation has been completed without errors. |
created (filterable, sortable) | ISO-8601 Datetime | This is the date and time in which the project was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
description (updatable) | String | A text describing the project. It can contain restricted markdown to decorate the text. |
name (filterable, sortable, updatable) | String | The name of the project as provided. |
private (filterable, sortable) | Boolean | Whether the project is public or not. |
resource | String | The project/id. |
stats | Object | An object keyed by resource that informs of the number of resources created. |
status | Object | A description of the status of the project. It includes a code, a message, and some extra information. See the table below. |
tags (filterable, updatable) | Array of Strings | A list of user tags that can help classify and index this resource. |
updated (filterable, sortable) | ISO-8601 Datetime | This is the date and time in which the project was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
Updating a Project
To update a project, you need to PUT an object containing the fields that you want to update to the project's base URL. The content-type must always be "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated project.
For example, to update a project with a new name, a new category, a new description, and new tags you can use curl like this:
curl "https://bigml.io/project/54d9553bf0a5ea5fc0000016?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "My New Project",
"category": 3,
"description": "My first BigML project",
"tags": ["fraud", "detection"]}'
$ Updating a project
Deleting a Project
To delete a project, you need to issue a HTTP DELETE request to the project/id to be deleted.
Using curl you can do something like this to delete a project:
curl -X DELETE "https://bigml.io/project/54d9553bf0a5ea5fc0000016?$BIGML_AUTH"
$ Deleting a project from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a project, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a project a second time, or a project that does not exist, you will receive a "404 not found" response.
However, if you try to delete a project that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Projects
To list all the projects, you can use the project base URL. By default, only the 20 most recent projects will be returned. You can see below how to change this number using the limit parameter.
You can get your list of projects directly in your browser using your own username and API key with the following links.
https://bigml.io/project?$BIGML_AUTH
> Listing projects from a browser
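You can issue the same request with curl. The limit parameter below is a sketch of changing the default of 20 most recent projects; it assumes the standard limit query string parameter described in the listings documentation:
curl "https://bigml.io/project?$BIGML_AUTH;limit=5"
$ Listing the 5 most recent projects from the command line (sketch)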
Sources
Last Updated: Tuesday, 2018-03-13 12:20
A source is the raw data that you want to use to create a predictive model. A source is usually a (big) file in comma-separated values (CSV) format. See the example below. Each row represents an instance (or example). Each column in the file represents a feature or field. The last column usually represents the class or objective field. The file may have a first row, called the header, with a name for each field.
Plan, Talk,Text,Purchases,Data,Age,Churn?
family, 148, 72, 0, 33.6, 50, TRUE
business, 85, 66, 0, 26.6, 31, FALSE
business, 83, 64, 0, 23.3, 32,TRUE
individual, 9, 66, 94, 28.1, 21, FALSE
family, 15, 0, 0, 35.3, 29, FALSE
individual, 66, 72, 175, 25.8, 51,TRUE
business, 0, 0, 0, 30, 32, TRUE
family, 18, 84, 230, 45.8, 31,TRUE
individual, 71, 110, 240, 45.4, 54, TRUE
family, 59, 64, 0, 27.4, 40, FALSE
Example CSV file
A source:
- Should be a comma-separated values (CSV) file. Spaces, tabs, and semicolons are also valid separators.
- Weka's ARFF files are also supported.
- JSON in a few formats is also supported. See below for more details.
- Microsoft Excel and Mac OS Numbers files should also work most of the time, but it is better to export them to CSV (comma-separated values).
- Cannot be bigger than 64 GB.
- Can be gzipped (.gz) or compressed (.bz2). It can be zipped (.zip), but only if the archive contains one single file.
You can also create sources from remote locations using a variety of protocols like https, hdfs, s3, asv, odata/odatas, dropbox, gcs, gdrive, etc. See below for more details.
BigML.io allows you to create, retrieve, update, and delete your sources. You can also list all of your sources.
Jump to:
- JSON Sources
- Source Base URL
- Creating a Source
- Creating a Source Using a Local File
- Creating a Source Using a URL
- Creating a Source Using Inline Data
- Creating a Source with Automatically Generated Synthetic Data
- Text Processing
- Items Detection
- Datetime Detection
- Source Arguments
- Retrieving a Source
- Source Properties
- Filtering and Paginating Fields from a Source
- Updating a Source
- Deleting a Source
- Listing Sources
JSON Sources
BigML.io can parse JSON data in one of two formats:
- A top-level list of lists of atomic values, each one defining a row:
[
  ["sepal length","sepal width","petal length","petal width","species"],
  [5.1,3.5,1.4,0.2,"Iris-setosa"],
  [4.9,3.0,1.4,0.2,"Iris-setosa"],
  ...
]
Valid JSON Source format
- A top-level list of dictionaries, where each dictionary's values represent the row values and the corresponding keys the column names. The first dictionary defines the keys that will be selected.
[
  {"sepal length":5.1,"sepal width":3.5,"petal length":1.4,"petal width":0.2,"species":"Iris-setosa"},
  {"sepal length":4.9,"sepal width":3.0,"petal length":1.4,"petal width":0.2,"species":"Iris-setosa"},
  ...
]
Valid JSON Source format
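For instance, assuming you have saved either of the structures above to a local file named iris.json (a hypothetical file name), a minimal sketch of creating a source from it is the same as for any other local file:
curl "https://bigml.io/source?$BIGML_AUTH" -F file=@iris.json
> Creating a source from a local JSON file (sketch)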
Source Base URL
You can use the following base URL to create, retrieve, update, and delete sources. https://bigml.io/source
Source base URL
All requests to manage your sources must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Source
You can create a new source in any of the following four ways:
- Local Sources: Using a local file. You need to post the file content in "multipart/form-data". The maximum size allowed is 64 GB per file.
- Remote Sources: Using a URL that points to your data. The maximum size allowed is 64 GB or 5 TB if you use a file stored in Amazon S3.
- Inline Sources: Using some inline data. The content type must be "application/json". The maximum size in this case is limited to 10 MB per post.
- Synthetic Sources: Automatically generate synthetic data sources, presumably for activities such as testing, prototyping, and benchmarking.
Creating a Source Using a Local File
To create a new source, you need to POST the file containing your data to the source base URL. The file must be attached in the post as a file upload. The Content-Type in your HTTP request must be "multipart/form-data" according to RFC 2388. This allows you to upload binary files in a compressed format (.Z, .gz, etc.) that will be uploaded faster.
You can easily do this using curl. The option -F (--form) lets curl emulate a filled-in form in which a user has pressed the submit button. You need to prefix the file path name with "@".
curl "https://bigml.io/source?$BIGML_AUTH" -F file=@iris.csv
> Creating a source
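As a sketch, the same call also works for compressed files, and you can optionally override the detected content type with the type option (iris.csv.gz is a hypothetical compressed copy of the sample file):
curl "https://bigml.io/source?$BIGML_AUTH" -F file=@iris.csv.gz
curl "https://bigml.io/source?$BIGML_AUTH" -F "file=@iris.csv;type=text/csv"
> Creating sources from a gzipped file and with an explicit content type (sketch)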
Creating a Source Using a URL
To create a new remote source you need a URL that points to the data file that you want BigML to download for you.
You can easily do this using curl. The option -H lets curl set the content type header while the option -X sets the http method. You can send the URL within a JSON object as follows:
curl "https://bigml.io/source?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"remote": "https://static.bigml.com/csv/iris.csv"}'
> Creating a remote source
You can use the following types of URLs to create remote sources:
- HTTP or HTTPS. They can also include basic realm authorization.
Example URLs:
https://test:test@static.bigml.com/csv/iris.csv
http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
- Public or private files in Amazon S3.
Example Amazon S3 URLs:
s3://bigml-public/csv/iris.csv
s3://bigml-test/csv/iris.csv?access-key=AKIAIF6IUYDYUQ7BALJQ&secret-key=XgrQV/hHBVymD75AhFOzveX4qz7DYrO6q8WsM6ny
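For example, a sketch of creating a remote source from the public S3 URL above:
curl "https://bigml.io/source?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"remote": "s3://bigml-public/csv/iris.csv"}'
> Creating a remote source from Amazon S3 (sketch)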
Creating a remote source from Google Drive and Google Storage
You have two options to create a remote data source from Google Drive or Google Storage via the API:
- Using BigML:
Allow BigML to access your Google Drive or Google Storage from the Cloud Storages section of your Account or from your Dashboard sources list. You will get the access token and the refresh token.
Google Drive example:
- Select the option to create a source from Google Drive.
- Allow BigML access to your Google Drive.
- Get the access token and refresh token.
You can easily create the remote source using curl as in the examples below:
curl "https://bigml.io/source?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"remote":"gdrive://noserver/0BxGbAMhJezOScTFBUVFPMy1xT1E?token=ya29.AQGn2MO578Fso0fVF0hGb0Q60ILCagAwvFEZoiovK5kvPWSt5z2QMbXDyAvqUtQeYdJbA39YkyRq3A&refresh-token=1/00qQ1040yDSyh_0DRLYZRnkC_62gE8M5Tpb68sfvmj8"}'
> Creating a remote source from Google Drive
curl "https://bigml.io/source?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"remote":"gcs://company_bucket/Iris.csv?token=ya29.AQGn2MO578Fso0fVF0hGb0Q60ILCagAwvFEZoiovK5kvPWSt5z2QMbXDyAvqUtQeYdJbA39YkyRq3A&refresh-token=1/00qQ1040yDSyh_0DRLYZRnkC_62gE8M5Tpb68sfvmj8"}'
> Creating a remote source from Google Cloud Storage
- Using your own app:
You can also create a remote source from your own App. You first need to authorize BigML access from your own Google Apps application. BigML only needs authorization for read-only authentication scope (
https://www.googleapis.com/auth/devstorage.read_only
,https://www.googleapis.com/auth/drive.readonly
), but you can have any of the other available scopes (find authentication scopes available for Google Drive and Google Storage). After the authorization process you will get your access token and refresh token from the Google Authorization Server.
Then the process is the same as creating a remote source using BigML application described above. You need to POST to the source endpoint an object containing at least the file ID (for Google Drive) or the bucket and the file name (for Google Storage) and the access token, but in this case you will also need to include the app secret and app client from your App. Again, including the refresh token is optional.
Your value for app-client appears as Client ID and your value for app-secret appears as Client secret in the Google Developers Console.
curl "https://bigml.io/source?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"remote":"gdrive://noserver/0BxGbAMhJezOSXy1oRU5MSU90SUU?token=ya29.AQGn2MO578Fso0fVF0hGb0Q60ILCagAwvFEZoiovK5kvPWSt5z2QMbXDyAvqUtQeYdJbA39YkyRq3A&refresh-token=1/00qQ1040yDSyh_0DRLYZRnkC_62gE8M5Tpb68sfvmj8&app-secret=AvFake1Secretjt27HQWTm4h&app-client=667300000007-07gjg5o912o1v422hfake2cli3nt3no6.apps.googleusercontent.com"}'
> Creating a remote source from Google Drive using your app
curl "https://bigml.io/source?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"remote":"gcs://company_bucket/Iris.csv?token=ya29.AQGn2MO578Fso0fVF0hGb0Q60ILCagAwvFEZoiovK5kvPWSt5z2QMbXDyAvqUtQeYdJbA39YkyRq3A&refresh-token=1/00qQ1040yDSyh_0DRLYZRnkC_62gE8M5Tpb68sfvmj8&app-secret=AvFake1Secretjt27HQWTm4h&app-client=667300000007-07gjg5o912o1v422hfake2cli3nt3no6.apps.googleusercontent.com"}'
> Creating a remote source from Google Cloud Storage using your app
Creating a Source Using Inline Data
You can also create sources by sending some inline data within the body of a POST HTTP request. This is especially useful if you want to model small amounts of data generated by an application.
To create an inline source using curl you can use the following example:
curl "https://bigml.io/source?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"data": "a,b,c,d\n1,2,3,4\n5,6,7,8"}'
> Creating an inline source
Regardless of how you create a new source (local, remote, or inline), BigML.io will return a newly created source document if the request succeeded.
{
"category": 0,
"code": 201,
"content_type": "application/octet-stream",
"created": "2012-11-15T02:24:59.686739",
"credits": 0.0,
"description": "",
"disable_datetime": false,
"fields_meta": {
"count": 0,
"limit": 200,
"offset": 0,
"total": 0
},
"file_name": "iris.csv",
"md5": "d1175c032e1042bec7f974c91e4a65ae",
"name": "iris.csv",
"number_of_datasets": 0,
"number_of_models": 0,
"number_of_predictions": 0,
"private": true,
"project": null,
"resource": "source/4f603fe203ce89bb2d000000",
"size": 4608,
"source_parser": {},
"status": {
"code": 1,
"message": "The request has been queued and will be processed soon"
},
"tags": [],
"type": 0,
"updated": "2012-11-15T02:24:59.686758"
}
< Example source JSON response
Creating a Source with Automatically Generated Synthetic Data
You can also synthetically create sources using automatically generated data for activities such as testing, prototyping, and benchmarking.
To create a synthetic source using curl, you can use the following example:
curl "https://bigml.io/source?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"synthetic": {"fields": 10, "rows": 10}}'
> Creating a synthetic source
In addition to the file, you can also use the following arguments.
Source Arguments
Argument | Type | Description |
---|---|---|
category
optional |
Integer, default is 0 |
The category that best describes the data. See the category codes for the complete list of categories.
Example: 1 |
data
optional |
String |
Data for inline source creation.
Example: "a,b,c,d\n1,2,3,4\n5,6,7,8" |
description
optional |
String |
A description of the source up to 8192 characters long.
Example: "This is a description of my new source" |
disable_datetime
optional |
Boolean, default is false |
Whether or not BigML should generate new fields from existing date-time fields.
Example: true |
file
optional |
multipart/form-data; charset=utf-8 | File containing your data in csv format. It can be compressed, gzipped, or zipped if the archive contains only one file |
item_analysis
optional |
Object, default is shown in the table below |
Set of parameters to activate item analysis for the source.
Example:
|
name
optional |
String, default is Unnamed source |
The name you want to give to the new source.
Example: "my new source" |
project
optional |
String |
The project/id you want the source to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
remote
optional |
String |
A URL pointing to file containing your data in csv format. It can be compressed, gzipped, or zipped.
Example: https://static.bigml.com/csv/iris.csv |
source_parser
optional |
Object, default is shown in the table below |
Set of parameters to parse the source.
Example:
|
synthetic
optional |
Object, default is shown in the table below |
Set of parameters to generate a synthetic source presumably for activities such as testing, prototyping and benchmarking.
Example:
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your source.
Example: ["best customers", "2018"] |
term_analysis
optional |
Object, default is shown in the table below |
Set of parameters to activate text analysis for the source.
Example:
|
A source parser object is composed of any combination of the following properties.
You can also use curl to customize your new source with a name and a different parser. For example, to create a new source named "my source", without a header and with "x" as the only missing token:
curl "https://bigml.io/source?$BIGML_AUTH" \
-F file=@iris.csv \
-F 'name=my source' \
-F 'source_parser={"header": false, "missing_tokens":["x"]}'
> Creating a source with arguments
If you do not specify a name, BigML.io will assign to the source the same name as the file that you uploaded. If you do not specify a source_parser, BigML.io will do its best to automatically select the parsing parameters for you. However, if you do specify it, BigML.io will not try to second-guess you.
An item_analysis object is composed of any combination of the following properties.
A term_analysis object is composed of any combination of the following properties.
A synthetic object is composed of the following properties.
Text Processing
While the handling of numeric, categorical, or items fields within a decision tree framework is fairly straightforward, the handling of text fields can be done in a number of different ways. BigML.io takes a basic and reasonably robust approach, leveraging some basic NLP techniques along with a simple bag-of-words style method of feature generation.
At the source level, BigML.io attempts to do basic language detection. Initially the language can be English ("en"), Spanish ("es"), Catalan/Valencian ("ca"), Dutch ("nl"), French ("fr"), German ("de"), Portuguese ("pt"), or "none" if no language is detected. In the near future, BigML.io will support many more languages.
For text fields, BigML.io adds potentially five keys to the detected fields, all of which are placed in a map under term_analysis.
The first is language, which is mapped to the detected language.
There are also three boolean keys: case_sensitive, use_stopwords, and stem_words. The case_sensitive key is false by default. use_stopwords should be true if stopwords should be included in the vocabulary for the detected field during text summarization. stem_words should be true if BigML.io should perform word stemming on this field, which maps forms of the same term to the same key when summarizing or generating models. By default, use_stopwords is false and stem_words is true for languages other than "none"; neither key is present otherwise.
Finally, token_mode determines the tokenization strategy. It may be set to tokens_only, full_terms_only, or all. When set to tokens_only, individual words are used as terms. For example, "ML for all" becomes ["ML", "for", "all"]. When full_terms_only is selected, the entire field is treated as a single term as long as it is shorter than 256 characters; in this case "ML for all" stays ["ML for all"]. If all is selected, then both full terms and tokenized terms are used, so ["ML for all"] becomes ["ML", "for", "all", "ML for all"]. The default for token_mode is all.
There are a few details to note:
- If full_terms_only is selected, then no stemming will occur even if stem_words is true.
- Also, when either all or tokens_only is selected, a term must appear at least twice to be selected for the tag cloud. However, full_terms_only lowers this limit to a single occurrence.
- Finally, if the language is "none", or if a language does not have an algorithm available for stopword removal or stemming, the use_stopwords and stem_words keys will have no effect.
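As an illustration, here is a sketch of overriding some of these keys for a single text field through a source update. The field id 000005 is hypothetical; the PUT pattern is the same one used for datetime fields later in this section:
curl "https://bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
    -X PUT \
    -H 'content-type: application/json' \
    -d '{"fields": {"000005": {"term_analysis": {
          "case_sensitive": true,
          "stem_words": false,
          "token_mode": "tokens_only"}}}}'
> Updating the term_analysis of a text field (sketch)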
Items Detection
BigML automatically detects as items those fields that have many different categorical values per instance, separated by non-alphanumeric characters, so they can’t be considered either categorical or text fields.
These kinds of fields can be found in transactional datasets where each instance is associated with a different set of products contained within one field. For example, datasets containing all the products bought by users, or prescription datasets where each patient is associated with different treatments. These datasets are commonly used for Association Discovery to find relationships between different items.
Find the two CSV examples below that could be considered items fields:
User, Prescription
John Doe, medicine 1; medicine 2
Jane Roe, medicine 1; medicine 3; medicine 4; medicine 6
Transaction, Product
12345, product 1; product 2; product 5; product 6; product 7
67890, product 1; product 3; product 4
In the examples above, the fields Prescription and Product will be considered items fields, and each different value will be a unique item.
Once a field has been detected as items, BigML tries to automatically detect which is the best separator for your items. For example, for the following itemset {hot dog; milk, skimmed; chocolate}, the best separator is the semicolon which yields three different items: 'hot dog', 'milk, skimmed' and 'chocolate'.
For items fields, there are five different parameters you can configure under the property group item_analysis, which includes separator that allows you to specify which separator you want to set for your items.
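For instance, a sketch of forcing the semicolon as the items separator for a single field through a source update (the field id 000001 is hypothetical):
curl "https://bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
    -X PUT \
    -H 'content-type: application/json' \
    -d '{"fields": {"000001": {"item_analysis": {"separator": ";"}}}}'
> Updating the item_analysis separator of an items field (sketch)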
Note that items fields are not eligible as target fields for models, logistic regressions, and ensembles, but they can be used as predictors. For anomaly detection, they can’t be included as an input field to calculate the anomaly score, although they can be selected as summary fields.
Datetime Detection
During the source pre-scan, BigML tries to determine the data type of each field in your file. This process automatically detects datetime fields and, unless disable_datetime is explicitly set to true, BigML will generate additional fields with their components.
For instance, if a field named "date" has been identified as a datetime with format "YYYY-MM-dd", four new fields will be automatically added to the source, namely "date.year", "date.month", "date.day-of-month" and "date.day-of-week". For each row, these new fields will be filled in automatically by parsing the value of their parent field, "date". For example, if the latter contains the value "1969-07-14", the autogenerated columns in that row will have the values 1969, 7, 14 and 1 (because that day was a Monday). As noted before, autogeneration can be disabled by setting the disable_datetime option to "true", either in the create source request or later in an update source operation.
When a field is detected as datetime, BigML tries to determine its format for parsing the values and generate the fields with their components. By default, BigML accepts ISO 8601 time formats (YYYY-MM-DD) as well as a number of other common European and US formats, as seen in the table below:
time_format Name | Example |
---|---|
basic-date-time | 19690714T173639.592Z |
basic-date-time-no-ms | 19690714T173639Z |
basic-ordinal-date-time | 1969195T173639.592Z |
basic-ordinal-date-time-no-ms | 1969195T173639Z |
basic-t-time | T173639.592Z |
basic-t-time-no-ms | T173639Z |
basic-time | 173639.592Z |
basic-time-no-ms | 173639Z |
basic-week-date | 1969W297 |
basic-week-date-time | 1969W297T173639.592Z |
basic-week-date-time-no-ms | 1969W297T173639Z |
clock-minute | 5:36 PM |
clock-minute-nospace | 5:36PM |
clock-second | 5:36:39 PM |
clock-second-nospace | 5:36:39PM |
date | 1969-07-14 |
date-hour | 1969-07-14T17 |
date-hour-minute | 1969-07-14T17:36 |
date-hour-minute-second | 1969-07-14T17:36:39 |
date-hour-minute-second-fraction | 1969-07-14T17:36:39.592 |
date-hour-minute-second-ms | 1969-07-14T17:36:39.592 |
date-time | 1969-07-14T17:36:39.592Z |
date-time-no-ms | 1969-07-14T17:36:39Z |
eu-date | 14/7/1969 |
eu-date-clock-minute | 14/7/1969 5:36 PM |
eu-date-clock-minute-nospace | 14/7/1969 5:36PM |
eu-date-clock-second | 14/7/1969 5:36:39 PM |
eu-date-clock-second-nospace | 14/7/1969 5:36:39PM |
eu-date-millisecond | 14/7/1969 17:36:39.592 |
eu-date-minute | 14/7/1969 17:36 |
eu-date-second | 14/7/1969 17:36:39 |
eu-sdate | 14-7-1969 |
eu-sdate-clock-minute | 14-7-1969 5:36 PM |
eu-sdate-clock-minute-nospace | 14-7-1969 5:36PM |
eu-sdate-clock-second | 14-7-1969 5:36:39 PM |
eu-sdate-clock-second-nospace | 14-7-1969 5:36:39PM |
eu-sdate-millisecond | 14-7-1969 17:36:39.592 |
eu-sdate-minute | 14-7-1969 17:36 |
eu-sdate-second | 14-7-1969 17:36:39 |
hour-minute | 17:36 |
hour-minute-second | 17:36:39 |
hour-minute-second-fraction | 17:36:39.592 |
hour-minute-second-ms | 17:36:39.592 |
mysql | 1969-07-14 17:36:39 |
no-t-date-hour-minute | 1969-7-14 17:36 |
odata-format | /Datetime(-14752170831)/ |
ordinal-date-time | 1969-195T17:36:39.592Z |
ordinal-date-time-no-ms | 1969-195T17:36:39Z |
rfc822 | Mon, 14 Jul 1969 17:36:39 +0000 |
t-time | T17:36:39.592Z |
t-time-no-ms | T17:36:39Z |
time | 17:36:39.592Z |
time-no-ms | 17:36:39Z |
timestamp | -14718201 |
timestamp-msecs | -14718201000 |
twitter-time | Mon Jul 14 17:36:39 +0000 1969 |
twitter-time-alt | 1969-7-14 17:36:39 +0000 |
twitter-time-alt-2 | 1969-7-14 17:36 +0000 |
twitter-time-alt-3 | Mon Jul 14 17:36 +0000 1969 |
us-date | 7/14/1969 |
us-date-clock-minute | 7/14/1969 5:36 PM |
us-date-clock-minute-nospace | 7/14/1969 5:36PM |
us-date-clock-second | 7/14/1969 5:36:39 PM |
us-date-clock-second-nospace | 7/14/1969 5:36:39PM |
us-date-millisecond | 7/14/1969 17:36:39.592 |
us-date-minute | 7/14/1969 17:36 |
us-date-second | 7/14/1969 17:36:39 |
us-sdate | 7-14-1969 |
us-sdate-clock-minute | 7-14-1969 5:36 PM |
us-sdate-clock-minute-nospace | 7-14-1969 5:36PM |
us-sdate-clock-second | 7-14-1969 5:36:39 PM |
us-sdate-clock-second-nospace | 7-14-1969 5:36:39PM |
us-sdate-millisecond | 7-14-1969 17:36:39.592 |
us-sdate-minute | 7-14-1969 17:36 |
us-sdate-second | 7-14-1969 17:36:39 |
week-date | 1969-W29-7 |
week-date-time | 1969-W29-7T17:36:39.592Z |
week-date-time-no-ms | 1969-W29-7T17:36:39Z |
weekyear-week | 1969-W29 |
weekyear-week-day | 1969-W29-7 |
year-month | 1969-07 |
year-month-day | 1969-07-14 |
It might happen that BigML is not able to determine the right format of your datetime field. In that case, it will be considered either a text or a categorical field. You can override that assignment by setting the optype of the field to datetime and passing the appropriate format in time_formats. For instance:
curl "https://bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
-X PUT \
-d '{"fields": {"000004": {"optype": "datetime", "time_formats": ["date"]}}}' \
-H 'content-type: application/json'
> Updating a source field with optype "datetime"
curl "https://bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
-X PUT \
-d '{"fields": {"000004": {"optype": "datetime", "time_formats": ["YYYY-MM-dd"]}}}' \
-H 'content-type: application/json'
> Updating a source field with custom "time_formats"
Retrieving a Source
Each source has a unique identifier in the form "source/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the source.
To retrieve a source with curl:
curl "https://bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH"
$ Retrieving a source from the command line
You can also use your browser to visualize the source using the full BigML.io URL or pasting the source/id into the BigML.com dashboard.
Source Properties
Once a source has been successfully created it will have the following properties.
Property | Type | Description |
---|---|---|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
code | Integer | HTTP status code. This will be 201 upon successful creation of the source and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the source creation has been completed without errors. |
content_type
filterable, sortable |
String | This is the MIME content-type as provided by your HTTP client. The content-type can help BigML.io to better parse your file. For example, if you use curl, you can alter it using the type option "-F file=@iris.csv;type=text/csv". |
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the source was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
credits
filterable, sortable |
Float | The number of credits it cost you to create this source. |
description
updatable |
String | A text describing the source. It can contain restricted markdown to decorate the text. |
disable_datetime
updatable |
Boolean | False when BigML didn't generate new fields from existing date-time fields. |
fields
updatable |
Object |
A dictionary with an entry per field (column) in your data. Each entry includes the column number, the name of the field, the type of the field, a specific locale if it differs from the source's, and specific missing tokens if they differ from the source's. This property is very handy for updating sources according to your own parsing preferences.
Example:
|
fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned. |
file_name
filterable, sortable |
String | The name of the file as you submitted it. |
md5 | String | The file MD5 Message-Digest Algorithm as specified by RFC 1321. |
name
filterable, sortable, updatable |
String | The name of the source as you provided it, or the name of the file by default. |
number_of_anomalies
filterable, sortable |
Integer | The current number of anomalies that use this source. |
number_of_anomalyscores
filterable, sortable |
Integer | The current number of anomaly scores that use this source. |
number_of_associations
filterable, sortable |
Integer | The current number of associations that use this source. |
number_of_associationsets
filterable, sortable |
Integer | The current number of association sets that use this source. |
number_of_centroids
filterable, sortable |
Integer | The current number of centroids that use this source. |
number_of_clusters
filterable, sortable |
Integer | The current number of clusters that use this source. |
number_of_correlations
filterable, sortable |
Integer | The current number of correlations that use this source. |
number_of_datasets
filterable, sortable |
Integer | The current number of datasets that use this source. |
number_of_ensembles
filterable, sortable |
Integer | The current number of ensembles that use this source. |
number_of_forecasts
filterable, sortable |
Integer | The current number of forecasts that use this source. |
number_of_logisticregressions
filterable, sortable |
Integer | The current number of logistic regressions that use this source. |
number_of_models
filterable, sortable |
Integer | The current number of models that use this source. |
number_of_predictions
filterable, sortable |
Integer | The current number of predictions that use this source. |
number_of_statisticaltests
filterable, sortable |
Integer | The current number of statistical tests that use this source. |
number_of_timeseries
filterable, sortable |
Integer | The current number of time series that use this source. |
number_of_topicdistributions
filterable, sortable |
Integer | The current number of topic distributions that use this source. |
number_of_topicmodels
filterable, sortable |
Integer | The current number of topic models that use this source. |
private
filterable, sortable |
Boolean | Whether the source is public or not. |
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
remote | String | URL of the remote data source. |
resource | String | The source/id. |
shared
filterable, sortable |
Boolean | Whether the source is shared using a private link or not. |
shared_hash | String | The hash that gives access to this source if it has been shared using a private link. |
sharing_key | String | The alternative key that gives read access to this source. |
size
filterable, sortable |
Integer | The number of bytes of the source. |
source_parser
updatable |
Object | Set of parameters to parse the source. |
status | Object | A description of the status of the source. It includes a code, a message, and some extra information. See the table below. |
subscription
filterable, sortable |
Boolean | Whether the source was created using a subscription plan or not. |
synthetic | Object | Set of parameters to generate a synthetic source presumably for activities such as testing, prototyping and benchmarking. |
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
term_analysis
updatable |
Object | Set of parameters that define how text analysis should work for text fields. |
type
filterable, sortable |
Integer |
The type of source.
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the source was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
Source Fields
The property fields is a dictionary keyed by an auto-generated id per each field in the source. Each field has as a value an object with the following properties:
For fields classified with optype "text", the default values specified in the term_analysis at the top-level of the source are used.
Flags not provided in term_analysis take their default value, i.e., false for booleans and none for language.
Besides these global default values, which apply to all text fields (and potential text fields, such as categorical ones that might overflow to text during dataset creation), it's possible to specify term_analysis flags on a per-field basis.
For fields classified with optype "items", the default values specified in the item_analysis at the top-level of the source are used.
As with term_analysis, flags not provided in item_analysis take their default value, and it's possible to specify item_analysis flags on a per-field basis as well as at the global level.
Source Status
Before a source is successfully created, BigML.io makes sure that it has been uploaded in an understandable format, that the data that it contains is parseable, and that the types for each column in the data can be inferred successfully. The source goes through a number of states until all these analyses are completed. Through the status field in the source you can determine when the source has been fully processed and is ready to be used to create a dataset. These are the fields that a source's status has:
Property | Type | Description |
---|---|---|
code | Integer | A status code that reflects the status of the source creation. It can be any of those that are explained here. |
elapsed | Integer | Number of milliseconds that BigML.io took to process the source. |
message | String | A human readable message explaining the status. |
Once a source has been successfully created, it will look like:
{
"category": 0,
"code": 200,
"content_type": "application/octet-stream",
"created": "2012-11-15T02:24:59.686000",
"credits": 0.0,
"description": "",
"fields": {
"000000": {
"column_number": 0,
"name": "sepal length",
"optype": "numeric",
"order": 0
},
"000001": {
"column_number": 1,
"name": "sepal width",
"optype": "numeric",
"order": 1
},
"000002": {
"column_number": 2,
"name": "petal length",
"optype": "numeric",
"order": 2
},
"000003": {
"column_number": 3,
"name": "petal width",
"optype": "numeric",
"order": 3
},
"000004": {
"column_number": 4,
"name": "species",
"optype": "categorical",
"order": 4
}
},
"fields_meta": {
"count": 5,
"limit": 200,
"offset": 0,
"total": 5
},
"file_name": "iris.csv",
"md5": "d1175c032e1042bec7f974c91e4a65ae",
"name": "iris.csv",
"number_of_datasets": 0,
"number_of_models": 0,
"number_of_predictions": 0,
"private": true,
"project": null,
"resource": "source/4f603fe203ce89bb2d000000",
"size": 4608,
"source_parser": {
"header": true,
"locale": "en_US",
"missing_tokens": [
"",
"N/A",
"n/a",
"NULL",
"null",
"-",
"#DIV/0",
"#REF!",
"#NAME?",
"NIL",
"nil",
"NA",
"na",
"#VALUE!",
"#NULL!",
"NaN",
"#N/A",
"#NUM!",
"?"
],
"quote": "\"",
"separator": ","
},
"status": {
"code": 5,
"elapsed": 244,
"message": "The source has been created"
},
"tags": [],
"type": 0,
"updated": "2012-11-15T02:25:00.001000"
}
< Example source JSON response
Filtering and Paginating Fields from a Source
A source might be composed of hundreds or even thousands of fields. Thus when retrieving a source, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the source is the same as the one you would get without any filtering parameter above.
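For instance, a sketch of retrieving only the third and fourth fields of the iris source, assuming the limit and offset parameters reflected in fields_meta:
curl "https://bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH;limit=2;offset=2"
$ Paginating the fields of a source (sketch)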
The fields_meta field can help you paginate fields. Its structure is as follows:
Updating a Source
To update a source, you need to PUT an object containing the fields that you want to update to the source's base URL. The content-type must always be "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated source.
For example, to update a source with a new name and a new locale you can use curl like this:
curl "https://bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name", "source_parser": {"locale": "es-ES"}}' \
-H 'content-type: application/json'
$ Updating a source's name and locale
Deleting a Source
To delete a source, you need to issue a HTTP DELETE request to the source/id to be deleted.
Using curl you can do something like this to delete a source:
curl -X DELETE "https://bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH"
$ Deleting a source from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a source, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a source a second time, or a source that does not exist, you will receive a "404 not found" response.
However, if you try to delete a source that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Sources
To list all the sources, you can use the source base URL. By default, only the 20 most recent sources will be returned. You can see below how to change this number using the limit parameter.
You can get your list of sources directly in your browser using your own username and API key with the following links.
https://bigml.io/source?$BIGML_AUTH
> Listing sources from a browser
Datasets
Last Updated: Monday, 2017-10-30 10:31
A dataset is a structured version of a source where each field has been processed and serialized according to its type. The possible field types are numeric, categorical, text, date-time, or items. For each field, you can also get the number of errors that were encountered processing it. Errors are mostly missing values or values that do not match with the type assigned to the column.
When you create a new dataset, histograms of the field values are created for the categorical and numeric fields. In addition, for the numeric fields, a collection of statistics about the field distribution such as minimum, maximum, sum, and sum of squares are also computed.
For date-time fields, BigML attempts to parse the format and automatically generate the related subfields (year, month, day, and so on) present in the format.
For items fields which have many different categorical values per instance separated by non-alphanumeric characters, BigML tries to automatically detect which is the best separator for your items.
Finally, for text fields, BigML handles plain text fields with some light-weight natural language processing; BigML separates the field into words using punctuation and whitespace, attempts to detect the language, groups word forms together using word stemming, and eliminates words that are too common or too rare to be useful. We are then left with somewhere between a few dozen and a few hundred interesting words per text field, the occurrences of which can be features in a model.

BigML.io allows you to create, retrieve, update, and delete your datasets. You can also list all of your datasets.
Jump to:
- Dataset Base URL
- Creating a Dataset
- Dataset Arguments
- Filtering Rows
- Retrieving a Dataset
- Dataset Properties
- Filtering and Paginating Fields from a Dataset
- Updating a Dataset
- Deleting a Dataset
- Listing Datasets
- Multi-Datasets
- Resources Accepting Multi-Datasets Input
- Transformations
- Cloning a Dataset
- Sampling a Dataset
- Filtering a Dataset
- Extending a Dataset
- Filtering the New Fields Output
- Discretization of a Continuous Field
- Outlier Elimination
- Lisp and JSON Syntaxes
- Final Remarks
Dataset Base URL
You can use the following base URL to create, retrieve, update, and delete datasets. https://bigml.io/dataset
Dataset base URL
All requests to manage your datasets must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Dataset
To create a new dataset, you need to POST to the dataset base URL an object containing at least the source/id that you want to use to create the dataset. The content-type must always be "application/json".
You can easily create a new dataset using curl as follows. All you need is a valid source/id and your authentication variable set up as shown above.
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"source": "source/50a4527b3c1920186d000041"}'
> Creating a dataset
BigML.io will return the newly created dataset if the request succeeded.
{
"category": 0,
"code": 201,
"columns": 0,
"created": "2012-11-15T02:29:09.293711",
"credits": 0.00439453125,
"description": "",
"excluded_fields": [],
"fields": {},
"fields_meta": {
"count": 0,
"limit": 200,
"offset": 0,
"total": 0
},
"input_fields": [],
"locale": "en-US",
"name": "iris' dataset",
"number_of_evaluations": 0,
"number_of_models": 0,
"number_of_predictions": 0,
"price": 0.0,
"private": true,
"project": null,
"resource": "dataset/52b9359a3c19205ff100002a",
"rows": 0,
"size": 4608,
"source": "source/50a4527b3c1920186d000041",
"source_status": true,
"status": {
"code": 1,
"message": "The dataset is being processed and will be created soon"
},
"tags": [],
"updated": "2012-11-15T02:29:09.293733",
"views": 0
}
< Example dataset JSON response
Dataset Arguments
By default, the dataset will include all fields in the corresponding source, but this behaviour can be fine-tuned via the input_fields and excluded_fields lists of identifiers. The former specifies the list of fields to be included in the dataset, and defaults to all fields in the source when empty. To specify excluded fields, you can use excluded_fields: identifiers in that list are removed from the list constructed using input_fields.
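For example, a sketch of creating a dataset that leaves out a single field, using the excluded_fields argument described in the table below (the field id 000001 is hypothetical):
curl "https://bigml.io/dataset?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"source": "source/50a4527b3c1920186d000041",
         "excluded_fields": ["000001"]}'
> Creating a dataset with an excluded field (sketch)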
See below the full list of arguments that you can POST to create a dataset.
Argument | Type | Description |
---|---|---|
category
optional |
Integer, default is the category of the source |
The category that best describes the dataset. See the category codes for the complete list of categories.
Example: 1 |
description
optional |
String |
A description of the dataset up to 8192 characters long.
Example: "This is a description of my new dataset" |
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the source is excluded. |
Specifies the fields that won't be included in the dataset.
Example:
|
fields
optional |
Object, default is {}, an empty dictionary. That is, no names, labels or descriptions are changed. |
Updates the names, labels, and descriptions of the fields in the dataset with respect to the original names in the source. An entry keyed with the field id generated in the source for each field that you want the name updated.
Example:
|
input_fields
optional |
Array, default is []. All the fields in the source. |
Specifies the fields to be included in the dataset.
Example:
|
json_filter
optional |
Array |
A JSON list representing a filter over the rows in the datasource. The first element is an operator and the rest of the elements its arguments. See the section below for more details.
Example: [">", 3.14, ["field", "000002"]] |
lisp_filter
optional |
String |
A string representing a Lisp s-expression to filter rows from the datasource.
Example: "(> 3.14 (field 2))" |
name
optional |
String, default is source's name |
The name you want to give to the new dataset.
Example: "my new dataset" |
objective_field
optional |
Object, default is the last non-auto-generated field in the dataset. |
Specifies the default objective field.
Example:
|
project
optional |
String |
The project/id you want the dataset to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
refresh_field_types
optional |
Boolean, default is false |
Specifies whether field types need to be recomputed or not.
Example: true |
refresh_objective
optional |
Boolean, default is false |
Specifies whether the default objective field of the dataset needs to be recomputed or not.
Example: true |
refresh_preferred
optional |
Boolean, default is false |
Specifies whether preferred field flags need to be recomputed or not.
Example: true |
size
optional |
Integer, default is the source's size |
The number of bytes from the source that you want to use.
Example: 1073741824 |
source | String |
A valid source/id.
Example: source/4f665b8103ce8920bb000006 |
tags
optional |
Array of Strings |
A list of strings that help classify and index your dataset.
Example: ["best customers", "2018"] |
term_limit
optional |
Integer |
The maximum total number of terms to be used in text analysis.
Example: 500 |
You can also use curl to customize a new dataset with a name, a different size, and only a few fields from the original source. For example, to create a new dataset named "my dataset", with only 500 bytes, and with only two fields:
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"source": "source/4f665b8103ce8920bb000006", "name": "my dataset", "size": 500, "fields": {"000001": {"name": "width_1"}, "000003": {"name": "width_2"}}}'
> Creating a customized dataset
If you do not specify a name, BigML.io will assign the new dataset the source's name. If you do not specify a size, BigML.io will use the source's full size. If you do not specify any fields, BigML.io will include all the fields in the source with their corresponding names.
Filtering Rows
The dataset creation request can include an argument, json_filter, specifying a predicate that the input rows from the source have to satisfy in order to be included in the dataset. This predicate is specified as a (possibly nested) JSON list whose first element is an operator and the rest of the elements its arguments. Here's an example of a filter specification to choose only those rows whose field "000002" is less than 3.14:
[">", 3.14, ["field", "000002"]]
Filter Example
As you see, the list starts with the operator we want to use, ">", followed by its operands: the number 3.14, and the value of the field with identifier "000002", which is denoted by the operator "field". As another example, this filter:
["=", ["field", "000002"], ["field", "000003"], ["field", "000004"]]
Filter Example
selects rows for which the three fields with identifiers "000002", "000003" and "000004" have identical values. Note how you're not limited to two arguments. It's also worth noting that for a filter like that one to be accepted, all three fields must have the same optype (e.g. numeric), otherwise they cannot be compared.
The field operator also accepts as arguments the field's name (as a string) or the row column (as an integer). For instance, if field "000002" had column number 12, and field "000003" was named "Stock prize", our previous query could have been written:
["=", ["field", 12], ["field", "Stock prize"], ["field", "000004"]]
Filter Example
If the name is not unique, the first matching field found is picked, consistently over the whole filter expression. If you have duplicated field names, the best thing to do is to use either column numbers or field identifiers in your filters, to avoid ambiguities.
Besides a field's value, one can also ask whether it's missing or not. For instance, to include only those rows for which field "000002" contains a missing token, you would use:
["missing", "000002"]
Filter Example
["and", ["not", ["missing", 12]]
, ["not", ["missing", "Stock prize"]]]
Filter Example
["or", ["=", 3, ["field", "000001"]]
, [">", "1969-07-14T06:10", ["field", "000111"]]
, ["and", ["missing", 23]
, ["=", "Cat", ["field", "000002"]]
, ["<", 2, ["field", "000003"], 4]]]
Filter Example
In the examples above, you can also see how dates are allowed and can be compared as numerical values (provided the implied fields are of the correct optype).
Finally, it's also possible to use the arithmetic operators +, -, * and / with numeric fields and constants, as in the following example:
[">", ["/", ["+", ["-", ["field", "000000"]
, 4.4]
, ["field", "000003"]
, ["*", 2
, ["field", "Class"]
, ["field", "000004"]]]
, 3]
, 5.5]
Filter Example
These are all the accepted operators:
=, !=, >, >=, <, <=, and, or, not, field, missing, +, -, *, /. To be accepted by the API, the filter must evaluate to a boolean value and contain at least one operator. So, for instance, a constant or an expression evaluating to a number will be rejected.
Since writing and reading the above expressions in pure JSON might be a bit involved, you can also send your query to the server as a string representing a Lisp s-expression using the argument lisp_filter, e.g.
(> (/ (+ (- (field "000000") 4.4)
(field 23)
(* 2 (field "Class") (field "000004")))
3)
5.5)
Filter Example
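For instance, here is a minimal sketch of a dataset creation request that uses lisp_filter and is equivalent to the first json_filter example above (the source/id is just illustrative):
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"source": "source/4f665b8103ce8920bb000006",
"lisp_filter": "(> 3.14 (field \"000002\"))"}'
> Creating a dataset with a lisp_filter (sketch)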
Retrieving a Dataset
Each dataset has a unique identifier in the form "dataset/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the dataset. Notice that to download the dataset file in the CSV format, you will need to append "/download" to the resource id, and to download it in the Tableau tde format, append "/download?format=tde".
To retrieve a dataset with curl:
curl "https://bigml.io/dataset/52b9359a3c19205ff100002a?$BIGML_AUTH"
$ Retrieving a dataset from the command line
To download the dataset file in the CSV format with curl:
curl "https://bigml.io/dataset/52b9359a3c19205ff100002a/download?$BIGML_AUTH"
$ Downloading a dataset csv file from the command line
To download the dataset file in the Tableau tde format with curl:
curl "https://bigml.io/dataset/52b9359a3c19205ff100002a/download?format=tde;$BIGML_AUTH"
$ Downloading a dataset tde file from the command line
You can also use your browser to visualize the dataset using the full BigML.io URL or pasting the dataset/id into the BigML.com dashboard.
Dataset Properties
Once a dataset has been successfully created it will have the following properties.
Property | Type | Description |
---|---|---|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
code | Integer | HTTP status code. This will be 201 upon successful creation of the dataset and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the dataset creation has been completed without errors. |
columns
filterable, sortable |
Integer | The number of fields in the dataset. |
correlations | Object |
A dictionary where each entry represents a field (column) in your data with the last calculated correlation/id for it.
Example:
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the dataset was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
credits
filterable, sortable |
Float | The number of credits it cost you to create this dataset. |
description
updatable |
String | A text describing the dataset. It can contain restricted markdown to decorate the text. |
excluded_fields | Array | The list of fields's ids that were excluded to build the model. |
field_types | Object | A dictionary that informs about the number of fields of each type. It has an entry for each field type (categorical, datetime, numeric, and text), an entry for preferred fields, and an entry for the total number of fields. In new datasets, it uses the key effective_fields to inform of the effective number of fields. That is the total number of fields including those created under the hood to support text fields. |
fields | Object | A dictionary with an entry per field (column) in your data. Each entry includes the column number, the name of the field, the type of the field, and the summary. |
fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned. |
input_fields | Array | The list of input fields' ids used to create the dataset. |
locale | String | The source's locale. |
name
filterable, sortable, updatable |
String | The name of the dataset as you provided it or, by default, based on the name of the source. |
number_of_anomalies
filterable, sortable |
Integer | The current number of anomalies that use this dataset. |
number_of_anomalyscores
filterable, sortable |
Integer | The current number of anomaly scores that use this dataset. |
number_of_associations
filterable, sortable |
Integer | The current number of associations that use this dataset. |
number_of_associationsets
filterable, sortable |
Integer | The current number of association sets that use this dataset. |
number_of_batchanomalyscores
filterable, sortable |
Integer | The current number of batch anomaly scores that use this dataset. |
number_of_batchcentroids
filterable, sortable |
Integer | The current number of batch centroids that use this dataset. |
number_of_batchpredictions
filterable, sortable |
Integer | The current number of batch predictions that use this dataset. |
number_of_batchtopicdistributions
filterable, sortable |
Integer | The current number of batch topic distributions that use this dataset. |
number_of_centroids
filterable, sortable |
Integer | The current number of centroids that use this dataset. |
number_of_clusters
filterable, sortable |
Integer | The current number of clusters that use this dataset. |
number_of_correlations
filterable, sortable |
Integer | The current number of correlations that use this dataset. |
number_of_ensembles
filterable, sortable |
Integer | The current number of ensembles that use this dataset. |
number_of_evaluations
filterable, sortable |
Integer | The current number of evaluations that use this dataset. |
number_of_forecasts
filterable, sortable |
Integer | The current number of forecasts that use this dataset. |
number_of_logisticregressions
filterable, sortable |
Integer | The current number of logistic regressions that use this dataset. |
number_of_models
filterable, sortable |
Integer | The current number of models that use this dataset. |
number_of_predictions
filterable, sortable |
Integer | The current number of predictions that use this dataset. |
number_of_statisticaltests
filterable, sortable |
Integer | The current number of statistical tests that use this dataset. |
number_of_timeseries
filterable, sortable |
Integer | The current number of time series that use this dataset. |
number_of_topicdistributions
filterable, sortable |
Integer | The current number of topic distributions that use this dataset. |
number_of_topicmodels
filterable, sortable |
Integer | The current number of topic models that use this dataset. |
objective_field
updatable |
Object | The default objective field. |
out_of_bag
filterable, sortable |
Boolean | Whether the out-of-bag instances were used to clone the dataset instead of the sampled instances. |
price
filterable, sortable, updatable |
Float | The price other users must pay to clone your dataset. |
private
filterable, sortable, updatable |
Boolean | Whether the dataset is public or not. |
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
range | Array | The range of instances used to clone the dataset. |
refresh_field_types
filterable, sortable |
Boolean | Whether the field types of the dataset have been recomputed or not. |
refresh_objective
filterable, sortable |
Boolean | Whether the default objective field of the dataset has been recomputed or not. |
refresh_preferred
filterable, sortable |
Boolean | Whether the preferred flags of the dataset fields have been recomputed or not. |
replacement
filterable, sortable |
Boolean | Whether the instances sampled to clone the dataset were selected using replacement or not. |
resource | String | The dataset/id. |
rows
filterable, sortable |
Integer | The total number of rows in the dataset. |
sample_rate
filterable, sortable |
Float | The sample rate used to select instances from the dataset. |
seed
filterable, sortable |
String | The string that was used to generate the sample. |
shared
filterable, sortable, updatable |
Boolean | Whether the dataset is shared using a private link or not. |
shared_clonable
filterable, sortable, updatable |
Boolean | Whether the shared dataset can be cloned or not. |
shared_hash | String | The hash that gives access to this dataset if it has been shared using a private link. |
sharing_key | String | The alternative key that gives read access to this dataset. |
size
filterable, sortable |
Integer | The number of bytes of the source that were used to create this dataset. |
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
statisticaltest
filterable, sortable |
String | The last statisticaltest/id that was generated for this dataset. |
status | Object | A description of the status of the dataset. It includes a code, a message, and some extra information. See the table below. |
subscription
filterable, sortable |
Boolean | Whether the dataset was created using a subscription plan or not. |
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
term_limit
filterable, sortable |
Integer | The maximum total number of terms used by all the text fields. |
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the dataset was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
Dataset Fields
The property fields is a dictionary keyed by each field's id in the source. Each field's id has as a value an object with the following properties:
Numeric Summary
Numeric summaries come with all the fields described below. If the number of unique values in the data is greater than 32, then 'bins' will be used for the summary. If not, 'counts' will be available.
Property | Type | Description |
---|---|---|
bins | Array | An array that represents an approximate histogram of the distribution. It consists of value pairs, where the first value is the mean of a histogram bin and the second value is the bin population. bins is only available when the number of distinct values is greater than 32. For more information, see our blog post or read this paper. |
counts | Array | An array of pairs where the first element of each pair is one of the unique values found in the field and the second element is the count. Only available when the number of distinct values is less than or equal to 32. |
kurtosis | Number | The sample kurtosis. A measure of 'peakiness' or heavy tails in the field's distribution. |
maximum | Number | The maximum value found in this field. |
mean | Number | The arithmetic mean of non-missing field values. |
median | Number | The approximate median of the non-missing values in this field. |
minimum | Number | The minimum value found in this field. |
missing_count | Integer | Number of instances missing this field. |
population | Integer | The number of instances containing data for this field. |
skewness | Number | The sample skewness. A measure of asymmetry in the field's distribution. |
standard_deviation | Number | The unbiased sample standard deviation. |
sum | Number | Sum of all values for this field (used for the mean calculation). |
sum_squares | Number | Sum of squared values (used for the variance calculation). |
variance | Number | The unbiased sample variance. |
Categorical Summary
Categorical summaries give you a count for each category, and a missing count in case any of the instances contain missing values.
Text Summary
Text summaries give statistics about the vocabulary of a text field, and the number of instances containing missing values.
Dataset Status
Before a dataset is successfully created, BigML.io makes sure that it has been uploaded in an understandable format, that the data that it contains is parseable, and that the types for each column in the data can be inferred successfully. The dataset goes through a number of states until all these analyses are completed. Through the status field in the dataset you can determine when the dataset has been fully processed and ready to be used to create a model. These are the fields that a dataset's status has:
Property | Type | Description |
---|---|---|
bytes | Integer | Number of bytes processed so far. |
code | Integer | A status code that reflects the status of the dataset creation. It can be any of those that are explained here. |
elapsed | Integer | Number of milliseconds that BigML.io took to process the dataset. |
field_errors | Object |
Information about ill-formatted fields that includes the total format errors for the field and a sample of the ill-formatted tokens.
Example:
|
message | String | A human readable message explaining the status. |
row_format_errors | Array | Information about ill-formatted rows. It includes the total row-format errors and a sampling of the ill-formatted rows. |
serialized_rows | Integer | The number of rows serialized so far. |
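As a rough sketch, and assuming the dataset/id used above plus the jq command-line JSON parser (which is not part of BigML), you can poll the dataset until its status code reaches 5 (finished):
while [ "$(curl -s "https://bigml.io/dataset/52b9359a3c19205ff100002a?$BIGML_AUTH" | jq '.status.code')" != "5" ]; do
  sleep 2  # wait a couple of seconds before asking for the status again
done
$ Polling a dataset until it has been created (sketch)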
Once a dataset has been successfully created, it will look like:
{
"category": 0,
"code": 200,
"columns": 5,
"created": "2012-11-15T02:29:09.293000",
"credits": 0.00439453125,
"description": "",
"excluded_fields": [],
"fields": {
"000000": {
"column_number": 0,
"datatype": "double",
"name": "sepal length",
"optype": "numeric",
"order": 0,
"preferred": true,
"summary": {
"bins": [
[
4.3,
1
],
[
4.425,
4
],
[
4.6,
4
],
[
4.7,
2
],
[
4.8,
5
],
[
4.9,
6
],
[
5,
10
],
[
5.1,
9
],
[
5.2,
4
],
[
5.3,
1
],
[
5.4,
6
],
[
5.5,
7
],
[
5.6,
6
],
[
5.7,
8
],
[
5.8,
7
],
[
5.9,
3
],
[
6,
6
],
[
6.1,
6
],
[
6.2,
4
],
[
6.3,
9
],
[
6.44167,
12
],
[
6.6,
2
],
[
6.7,
8
],
[
6.8,
3
],
[
6.92,
5
],
[
7.1,
1
],
[
7.2,
3
],
[
7.3,
1
],
[
7.4,
1
],
[
7.6,
1
],
[
7.7,
4
],
[
7.9,
1
]
],
"maximum": 7.9,
"mean": 5.84333,
"median": 5.77889,
"minimum": 4.3,
"missing_count": 0,
"population": 150,
"splits": [
4.51526,
4.67252,
4.81113,
4.89582,
4.96139,
5.01131,
5.05992,
5.11148,
5.18177,
5.35681,
5.44129,
5.5108,
5.58255,
5.65532,
5.71658,
5.77889,
5.85381,
5.97078,
6.05104,
6.13074,
6.23023,
6.29578,
6.35078,
6.41459,
6.49383,
6.63013,
6.70719,
6.79218,
6.92597,
7.20423,
7.64746
],
"standard_deviation": 0.82807,
"sum": 876.5,
"sum_squares": 5223.85,
"variance": 0.68569
}
},
"000001": {
"column_number": 1,
"datatype": "double",
"name": "sepal width",
"optype": "numeric",
"order": 1,
"preferred": true,
"summary": {
"counts": [
[
2,
1
],
[
2.2,
3
],
[
2.3,
4
],
[
2.4,
3
],
[
2.5,
8
],
[
2.6,
5
],
[
2.7,
9
],
[
2.8,
14
],
[
2.9,
10
],
[
3,
26
],
[
3.1,
11
],
[
3.2,
13
],
[
3.3,
6
],
[
3.4,
12
],
[
3.5,
6
],
[
3.6,
4
],
[
3.7,
3
],
[
3.8,
6
],
[
3.9,
2
],
[
4,
1
],
[
4.1,
1
],
[
4.2,
1
],
[
4.4,
1
]
],
"maximum": 4.4,
"mean": 3.05733,
"median": 3.02044,
"minimum": 2,
"missing_count": 0,
"population": 150,
"standard_deviation": 0.43587,
"sum": 458.6,
"sum_squares": 1430.4,
"variance": 0.18998
}
},
"000002": {
"column_number": 2,
"datatype": "double",
"name": "petal length",
"optype": "numeric",
"order": 2,
"preferred": true,
"summary": {
"bins": [
[
1,
1
],
[
1.1,
1
],
[
1.2,
2
],
[
1.3,
7
],
[
1.4,
13
],
[
1.5,
13
],
[
1.63636,
11
],
[
1.9,
2
],
[
3,
1
],
[
3.3,
2
],
[
3.5,
2
],
[
3.6,
1
],
[
3.75,
2
],
[
3.9,
3
],
[
4.0375,
8
],
[
4.23333,
6
],
[
4.46667,
12
],
[
4.6,
3
],
[
4.74444,
9
],
[
4.94444,
9
],
[
5.1,
8
],
[
5.25,
4
],
[
5.46,
5
],
[
5.6,
6
],
[
5.75,
6
],
[
5.95,
4
],
[
6.1,
3
],
[
6.3,
1
],
[
6.4,
1
],
[
6.6,
1
],
[
6.7,
2
],
[
6.9,
1
]
],
"maximum": 6.9,
"mean": 3.758,
"median": 4.34142,
"minimum": 1,
"missing_count": 0,
"population": 150,
"splits": [
1.25138,
1.32426,
1.37171,
1.40962,
1.44567,
1.48173,
1.51859,
1.56301,
1.6255,
1.74645,
3.23033,
3.675,
3.94203,
4.0469,
4.18243,
4.34142,
4.45309,
4.51823,
4.61771,
4.72566,
4.83445,
4.93363,
5.03807,
5.1064,
5.20938,
5.43979,
5.5744,
5.6646,
5.81496,
6.02913,
6.38125
],
"standard_deviation": 1.7653,
"sum": 563.7,
"sum_squares": 2582.71,
"variance": 3.11628
}
},
"000003": {
"column_number": 3,
"datatype": "double",
"name": "petal width",
"optype": "numeric",
"order": 3,
"preferred": true,
"summary": {
"counts": [
[
0.1,
5
],
[
0.2,
29
],
[
0.3,
7
],
[
0.4,
7
],
[
0.5,
1
],
[
0.6,
1
],
[
1,
7
],
[
1.1,
3
],
[
1.2,
5
],
[
1.3,
13
],
[
1.4,
8
],
[
1.5,
12
],
[
1.6,
4
],
[
1.7,
2
],
[
1.8,
12
],
[
1.9,
5
],
[
2,
6
],
[
2.1,
6
],
[
2.2,
3
],
[
2.3,
8
],
[
2.4,
3
],
[
2.5,
3
]
],
"maximum": 2.5,
"mean": 1.19933,
"median": 1.32848,
"minimum": 0.1,
"missing_count": 0,
"population": 150,
"standard_deviation": 0.76224,
"sum": 179.9,
"sum_squares": 302.33,
"variance": 0.58101
}
},
"000004": {
"column_number": 4,
"datatype": "string",
"name": "species",
"optype": "categorical",
"order": 4,
"preferred": true,
"summary": {
"categories": [
[
"Iris-versicolor",
50
],
[
"Iris-setosa",
50
],
[
"Iris-virginica",
50
]
],
"missing_count": 0
}
}
},
"fields_meta": {
"count": 5,
"limit": 200,
"offset": 0,
"total": 5
},
"input_fields": [
"000000",
"000001",
"000002",
"000003",
"000004"
],
"locale": "en_US",
"name": "iris' dataset",
"number_of_evaluations": 0,
"number_of_models": 0,
"number_of_predictions": 0,
"price": 0.0,
"private": true,
"project": null,
"resource": "dataset/52b9359a3c19205ff100002a",
"rows": 150,
"size": 4608,
"source": "source/50a4527b3c1920186d000041",
"source_status": true,
"status": {
"bytes": 4608,
"code": 5,
"elapsed": 163,
"field_errors": [],
"message": "The dataset has been created",
"row_format_errors": [],
"serialized_rows": 150
},
"tags": [],
"updated": "2012-11-15T02:29:10.537000",
"views": 0
}
< Example dataset JSON response
Filtering and Paginating Fields from a Dataset
A dataset might be composed of hundreds or even thousands of fields. Thus when retrieving a dataset, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the dataset is the same as the one you would get without any of the filtering parameters above.
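As a sketch, and assuming the offset and limit query-string parameters referenced by fields_meta, you could retrieve only the first two fields of the example dataset like this:
curl "https://bigml.io/dataset/52b9359a3c19205ff100002a?offset=0;limit=2;$BIGML_AUTH"
$ Paginating the fields of a dataset (sketch)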
The fields_meta field can help you paginate fields. Its structure is as follows:
Updating a Dataset
To update a dataset, you need to PUT an object containing the fields that you want to update to the dataset's base URL. The content-type must always be "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated dataset.
For example, to update a dataset with a new name you can use curl like this:
curl "https://bigml.io/dataset/52b9359a3c19205ff100002a?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name"}' \
-H 'content-type: application/json'
$ Updating a dataset's name
Deleting a Dataset
To delete a dataset, you need to issue a HTTP DELETE request to the dataset/id to be deleted.
Using curl you can do something like this to delete a dataset:
curl -X DELETE "https://bigml.io/dataset/52b9359a3c19205ff100002a?$BIGML_AUTH"
$ Deleting a dataset from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a dataset, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a dataset a second time, or a dataset that does not exist, you will receive a "404 not found" response.
However, if you try to delete a dataset that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Datasets
To list all the datasets, you can use the dataset base URL. By default, only the 20 most recent datasets will be returned. You can see below how to change this number using the limit parameter.
You can get your list of datasets directly in your browser using your own username and API key with the following links.
https://bigml.io/dataset?$BIGML_AUTH
> Listing datasets from a browser
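For instance, a minimal sketch that uses the limit parameter mentioned above to list only the five most recent datasets:
curl "https://bigml.io/dataset?limit=5;$BIGML_AUTH"
$ Listing the 5 most recent datasets (sketch)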
Multi-Datasets
BigML.io now allows you to create a new dataset merging multiple datasets. This functionality can be very useful when you use multiple sources of data, and in online scenarios as well. Imagine, for example, that you collect data on an hourly basis and want to create a dataset aggregating the data collected over the whole day. You only need to send the newly generated data to BigML each hour, create a source and a dataset for each batch, and then merge all the individual datasets into one at the end of the day.
We usually call datasets created in this way multi-datasets. BigML.io allows you to aggregate up to 32 datasets in the same API request. You can merge multi-datasets too, so you can grow a dataset as much as you want.
To create a multi-dataset, you can specify a list of dataset identifiers as input using the argument origin_datasets. The example below will construct a new dataset that is the concatenation of three other datasets.
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
"dataset/52bc7fc83c1920e4a3000012",
"dataset/52bc7fd03c1920e4a3000016",
"dataset/52bc80233c1920e4a300001a"]}'
> Creating a multi dataset
By convention, the first dataset defines the final dataset fields. However, there can be cases where each dataset might come from a different source and therefore have different field ids. In these cases, you might need to use a fields_maps argument to match each field in a dataset to the fields of the first dataset.
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
"dataset/52bc7fc83c1920e4a3000012",
"dataset/52bc7fd03c1920e4a3000016",
"dataset/52bc80233c1920e4a300001a",
"dataset/52bc851b3c1920e4a3000022"],
"fields_maps": {
"dataset/52bc7fd03c1920e4a3000016": {
"000000":"000023",
"000001":"000024",
"000002":"00003a"},
"dataset/52bc80233c1920e4a300001a": {
"000000":"000023",
"000001":"000004",
"000002":"00000f"}}}'
> Creating a multi dataset mapping fields
In the request above, we use four datasets as input, and the first one defines the final dataset fields. Let's say that the dataset "dataset/52bc7fc83c1920e4a3000012" in this example has three fields with identifiers "000001", "000002" and "000003". Those will be the resulting fields by default, together with their datatypes and so on. Then we need to specify, for each of the remaining datasets in the list, a mapping from those "standard" fields to the fields in the corresponding dataset. In our example, we're saying that the fields of the second dataset to be used during the concatenation are "000023", "000024" and "00003a", which correspond to the final fields used as keys in the map. In the case of the third dataset, the fields used will be "000023", "000004" and "00000f". For the last one, since there's no entry in fields_maps, we'll try to use the same identifiers as those of the first dataset.
The optypes of the paired fields should match and, in the case of categorical fields, the categories of the mapped field should be a subset of the final field's categories. If a final field has optype text, however, all values are converted to strings.
BigML.io also allows you to sample each dataset individually before merging it. You can specify the sample options for each dataset using the arguments sample_rates, replacements, seeds, and out_of_bags. All are dictionaries that must be keyed using the dataset/id of the dataset you want to specify parameters for. The next request will create a multi-dataset sampling the two input datasets differently.
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
"dataset/52bc7fc83c1920e4a3000012",
"dataset/52bc851b3c1920e4a3000022"],
"sample_rates": {
"dataset/52bc7fc83c1920e4a3000012": 0.5,
"dataset/52bc851b3c1920e4a3000022": 0.8},
"replacements": {
"dataset/52bc7fc83c1920e4a3000012": false,
"dataset/52bc851b3c1920e4a3000022": true}
}'
> Creating a multi dataset
Argument | Type | Description |
---|---|---|
fields_maps
optional |
Object |
A dictionary keyed by dataset/id with object values. Each entry maps fields in the first dataset to fields in the dataset referenced by the key.
Example:
|
out_of_bags
optional |
Object |
A dictionary keyed by dataset/id with boolean values. Setting this parameter to true for a dataset will return a dataset containing a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example:
|
replacements
optional |
Object |
A dictionary keyed by dataset/id with boolean values indicating whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example:
|
sample_rates
optional |
Object |
A dictionary keyed by dataset/id with float values. Each value is a number between 0 and 1 specifying the sample rate for the dataset. See the Section on Sampling for more details.
Example:
|
seeds
optional |
Object |
A dictionary keyed by dataset/id with string values indicating the seed to be used for each dataset to generate deterministic samples. See the Section on Sampling for more details.
Example:
|
When you create a dataset from multiple datasets, BigML.io builds the new dataset in the following steps (a combined sketch follows this list):
- Sample each individual dataset according to the specifications provided in the arguments sample_rates, replacements, seeds, and out_of_bags.
- Merge all the datasets together using the fields_maps argument to match fields in case they come from different sources (i.e., have different field ids).
- Sample the merged dataset, as in a regular dataset sampling, using the arguments sample_rate, replacement, seed, and out_of_bag.
- Filter the sampled dataset using input_fields, excluded_fields, and either a json_filter or lisp_filter.
- Extend the dataset with new fields according to the specifications provided in the new_fields argument.
- Filter the output of the new fields using either an output_json_filter or output_lisp_filter.
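As a rough sketch of these steps combined, and reusing dataset/ids from the examples above, a single request can merge two datasets, sample them, extend the result, and filter the generated rows. The argument names are the ones documented in this section, while the Flatline expressions and field ids are merely illustrative:
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
"dataset/52bc7fc83c1920e4a3000012",
"dataset/52bc7fd03c1920e4a3000016"],
"sample_rates": {"dataset/52bc7fd03c1920e4a3000016": 0.8},
"sample_rate": 0.9,
"new_fields": [{"field": "(+ (f \"000001\") (f \"000002\"))", "name": "Sum"}],
"output_lisp_filter": "(< 0 (f \"Sum\"))"}'
> Merging, sampling, extending, and filtering in one request (sketch)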
Resources Accepting Multi-Datasets Input
You can also create a model using multiple datasets as input at once, that is, without merging all the datasets together into a new dataset first. The same applies to correlations, statistical tests, ensembles, clusters, anomaly detectors, and evaluations. All the multi-dataset arguments above can be used. You just need to use the datasets argument instead of the regular dataset. See the examples below to create a multi-dataset model, a multi-dataset ensemble, and a multi-dataset evaluation.
curl "https://bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"datasets": [
"dataset/52bc7fc83c1920e4a3000012",
"dataset/52bc851b3c1920e4a3000022"],
"sample_rates": {
"dataset/52bc7fc83c1920e4a3000012": 0.5,
"dataset/52bc851b3c1920e4a3000022": 0.8},
"replacements": {
"dataset/52bc7fc83c1920e4a3000012": false,
"dataset/52bc851b3c1920e4a3000022": true}
}'
> Creating a multi-dataset model
curl "https://bigml.io/ensemble?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"datasets": [
"dataset/52bc7fc83c1920e4a3000012",
"dataset/52bc851b3c1920e4a3000022"],
"sample_rates": {
"dataset/52bc7fc83c1920e4a3000012": 0.5,
"dataset/52bc851b3c1920e4a3000022": 0.8},
"replacements": {
"dataset/52bc7fc83c1920e4a3000012": false,
"dataset/52bc851b3c1920e4a3000022": true}
}'
> Creating a multi-dataset ensemble
curl "https://bigml.io/evaluation?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"model": "model/52bcb43e3c1920e4a3000026",
"datasets": [
"dataset/52bc7fc83c1920e4a3000012",
"dataset/52bc851b3c1920e4a3000022"],
"out_of_bags": {
"dataset/52bc7fc83c1920e4a3000012": true,
"dataset/52bc851b3c1920e4a3000022": true},
"sample_rates": {
"dataset/52bc7fc83c1920e4a3000012": 0.5,
"dataset/52bc851b3c1920e4a3000022": 0.8},
"replacements": {
"dataset/52bc7fc83c1920e4a3000012": false,
"dataset/52bc851b3c1920e4a3000022": true}
}'
> Creating a multi-dataset evaluation
Transformations
Once you have created a dataset, BigML.io allows you to derive new datasets from it by sampling, filtering, adding new fields, or concatenating it to other datasets. We apply the term dataset transformations (or just transformations for short) to the set of operations that create new, modified versions of your original dataset.
We use the term:
- Cloning for the general operation of generating a new dataset.
- Sampling when the original dataset is sampled.
- Filtering when the original dataset is filtered.
- Extending when new fields are generated.
- Merging when a multi-dataset is created.
Keep in mind that you can sample, filter, and extend a dataset all at once in a single API request, as shown in the sketch below.
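For illustration only, here is a minimal sketch combining the three operations in a single request; the origin_dataset, sample_rate, lisp_filter, and new_fields arguments are all described in the subsections below, and the field ids and expressions are merely illustrative:
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b8fdff3c19205ff100001e",
"sample_rate": 0.8,
"lisp_filter": "(not (missing? \"000001\"))",
"new_fields": [{"field": "(* 2 (f \"000001\"))", "name": "Doubled"}]}'
> Sampling, filtering, and extending a dataset in one request (sketch)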
So let's start with the most basic transformation: cloning a dataset.
Cloning a Dataset
To clone a dataset you just need to use the origin_dataset argument to send the dataset/id of the dataset that you want to clone. For example:
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b8fdff3c19205ff100001e"}'
> Cloning a dataset
Argument | Type | Description |
---|---|---|
category
optional |
Integer |
The category that best describes the dataset. See the category codes for the complete list of categories.
Example:
|
fields
optional |
Object |
Updates the names, labels, and descriptions of the fields in the new dataset. An entry keyed with the field id of the original dataset for each field that will be updated.
Example:
|
name
optional |
String |
The name you want to give to the new dataset.
Example: "my new dataset" |
origin_dataset | String |
The dataset/id of the dataset to be cloned.
Example:
|
Sampling a Dataset
It is also possible to provide a sampling specification to be used when cloning the dataset. The sample will be applied to the origin_dataset rows. For example:
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b8fdff3c19205ff100001e",
"sample_rate": 0.8,
"replacement": true,
"seed": "myseed"}'
> Sampling a dataset
Argument | Type | Description |
---|---|---|
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a dataset containing a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example:
|
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example:
|
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example:
|
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
Filtering a Dataset
A dataset can be filtered in different ways:
- Excluding a few fields using the excluded_fields argument.
- Selecting only a few fields using the input_fields argument.
- Filtering rows using a json_filter or lisp_filter similarly to the way you can filter a source.
- Specifying a range of rows.
As illustrated in the following example, it's possible to provide a list of input fields, selecting which fields from the filtered input dataset will end up in the new dataset. Filtering happens before field picking and, therefore, the row filter can use fields that won't end up in the cloned dataset.
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b8fdff3c19205ff100001e",
"input_fields": ["000000", "000001", "000003"],
"json_filter": [">", 3.14, ["field", "000002"]],
"range": [50, 100]}'
> Filtering a dataset
Argument | Type | Description |
---|---|---|
excluded_fields
optional |
Array |
Specifies the fields that won't be included in the new dataset.
Example:
|
input_fields
optional |
Array |
Specifies the fields to be included in the dataset.
Example:
|
json_filter
optional |
Array |
A JSON list representing a filter over the rows in the origin dataset. The first element is an operator and the rest of the elements its arguments. See the Section on filtering sources for more details.
Example:
|
lisp_filter
optional |
String |
A string representing a Lisp s-expression to filter rows from the origin dataset.
Example:
|
range
optional |
Array |
The range of successive instances used to create the new dataset.
Example:
|
Extending a Dataset
You can clone a dataset and extend it with brand new fields using the new_fields argument. Each new field is created using a Flatline expression and, optionally, a name, label, and description.
A Flatline expression is a Lisp-like expression that allows you to reference and process the columns and rows of the origin dataset. See the full Flatline reference here. Let's see a first example that clones a dataset and adds a new field named "Celsius" to it, using an expression that converts the values from the "Fahrenheit" field to Celsius.
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb64713c192015e3000010",
"new_fields": [{
"field": "(/ (* 5 (- (f Fahrenheit) 32)) 9)",
"name": "Celsius"}]}'
> Extending a dataset
If you set all_fields to false, only the generated fields will be included in the new dataset. A single generator can also produce several new fields at once, using the fields and names keys:
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb64713c192015e3000010",
"all_fields": false,
"new_fields": [{
"fields": "(fields 0 1)",
"names": ["Day", "Temperature"]}]}'
> Extending a dataset
Each new field can also carry a label and a description, and you can mix single-field and multi-field generators in the same request:
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb64713c192015e3000010",
"new_fields": [
{"field": "(avg (window Fahrenheit -6 0))",
"name": "Weekly AVG",
"label":"Weekly Average",
"description": "Temperature average over the last seven days"},
{"fields": "(list (f 0 -1) (f 0 1))",
"names": ["Yesterday", "Tomorrow"],
"labels": ["Yesterday prediction", "Tomorrow prediction"],
"descriptions": ["Prediction for the previous day", "Prediction for the next day"]}]}'
> Extending a dataset
Filtering the New Fields Output
The generation of new fields works by traversing the input dataset row by row and applying the Flatline expression of each new field to each row in turn. The list of values generated from each input row that way constitutes an output row of the generated dataset.
It is possible to limit the number of input rows that the generator sees by means of filters and/or sample specifications, for example:
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb2c263c192015e3000004",
"lisp_filter": "(not (= 0 (f 000001)))",
"new_fields": [
{"field": "(/ 1 (f 000001))",
"name": "Inverse value"}]}'
> Extending a dataset
And, as an additional convenience, it is also possible to specify either an output_lisp_filter or an output_json_filter, that is, a Flatline row filter that acts on the generated rows instead of on the input data:
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb2c263c192015e3000004",
"lisp_filter": "(not (= 0 (f 000001)))",
"new_fields": [
{"field": "(/ 1 (f 000001))",
"name": "Inverse value"}],
"output_lisp_filter": "(< 0.25 (f \"Inverse value\"))"}'
> Extending a dataset
You can also skip any number of rows in the input, starting the generation at an offset given by row_offset, and traverse the input rows by any step specified by row_step. For instance, the following request will generate a dataset whose rows are created by putting together every three consecutive values of the input field "Price":
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b7f0ba3c19208c13000131",
"row_offset": 2,
"row_step": 3,
"new_fields": [
{"fields": "(window \"Price\" -2 0)",
"names": ["Price-2", "Price-1", "Price"]}]}'
> Extending a dataset
With the specification above, the new field will start with the third row in the input dataset, generate an output row (which uses values from the current input row as well as from the two previous ones), skip to the 6th input row, generate a new output, and so on and so forth.
Next, we'll list all the arguments that can be used to extend a dataset.
Argument | Type | Description |
---|---|---|
all_but
optional |
Array |
Specifies the fields to be excluded from the new dataset; all other fields are included.
Example:
|
all_fields
optional |
Boolean |
Whether all fields should be included in the new dataset or not.
Example:
|
new_fields
optional |
Array |
Specifies the new fields to be included in the dataset. See the table below for more details.
Example:
|
output_json_filter
optional |
Array |
A JSON list representing a filter over the rows of the dataset once the new fields have been generated. The first element is an operator and the rest of the elements its arguments. See the Section on filtering rows for more details.
Example:
|
output_lisp_filter
optional |
String |
A string representing a Lisp s-expression to filter rows after the new fields have been generated.
Example:
|
row_offset
optional |
Integer |
The initial number of rows to skip from the input dataset before processing starts.
Example:
|
row_step
optional |
Integer |
The number of rows to advance in the input dataset at each step.
Example:
|
Discretization of a Continuous Field
Here's an example discretizing the "temp" field into three homogeneous levels:
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb64713c192015e3000010",
"new_fields": [{
"field": "(cond (< (f \"temp\") 0) \"SOLID\"
(< (f \"temp\") 100) \"LIQUID\"
\"GAS\")",
"name":"Discrete Temp"}]}'
Discretizing a field
You can also use percentiles to discretize a field into levels of roughly equal population:
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b9359a3c19205ff100002a",
"new_fields": [{
"field": "(cond (> (percentile \"age\" 0.1) (f \"age\")) \"baby\"
(> (percentile \"age\" 0.2) (f \"age\")) \"child\"
(> (percentile \"age\" 0.6) (f \"age\")) \"adult\"
(> (percentile \"age\" 0.9) (f \"age\")) \"old\"
\"elder\")",
"name":"Discrete Age"}]}'
Discretizing a field
Outlier Elimination
You can use, for instance, the following predicate in a filter to remove outliers:
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b9359a3c19205ff100002a",
"lisp_filter": "(< (percentile \"age\" 0.5) (f \"age\") (percentile \"age\" 0.95))"}'
Eliminating outliers
The within-percentiles? predicate provides a shorter way to express the same kind of filter:
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b9359a3c19205ff100002a",
"lisp_filter": "(within-percentiles? \"age\" 0.5 0.95)"}'
Eliminating outliers
You can also use Flatline to replace missing values. For example, to fill in missing values of the "temp" field with the field's mean:
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb64713c192015e3000010",
"new_fields": [{
"field": "(if (missing? \"temp\") (mean \"temp\") (field \"temp\"))",
"name": "no missing temp"}]}'
Changing missing values
Lisp and JSON Syntaxes
Flatline also has a JSON-like flavor with exactly the same semantics as the Lisp-like version. Basically, a Flatline expression can easily be translated to its JSON-like variant and vice versa by just changing parentheses to brackets, turning symbols into quoted strings, and adding commas to separate each sub-expression. For example, the following two expressions are the same for BigML.io.
"(/ (* 5 (- (f Fahrenheit) 32)) 9)"
Lisp-like expression
["/", ["*", 5, ["-", ["f", "Fahrenheit"], 32]], 9]
JSON-like expression
Final Remarks
A few important details that you should keep in mind:
- Cloning a dataset also implies creating a copy of its serialized form, so you get an asynchronous resource whose status evolves from the Summarized (4) to the Finished (5) state.
- If you specify both sampling and filtering arguments, the former are applied first.
- As with filters applied to datasources, dataset filters can use the full Flatline language to specify the boolean expression to use when sifting the input.
- Flatline performs type inference and will, in general, figure out the proper optype for the generated fields, which are subsequently summarized by the dataset creation process, then reaching their final datatype (just as with a regular dataset created from a datasource). In case you need to fine-tune Flatline's inferences, you can provide an optype (or optypes) key and value in the corresponding output field entry (together with generator and names), but in general this shouldn't be needed (see the sketch after this list).
- Please check the Flatline reference manual for a full description of the language for field generation and the many pre-built functions it provides.
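For illustration, here is a minimal sketch that forces the optype of a generated field, reusing the dataset/id and the "temp" field from the examples above; the expression and field names are merely illustrative:
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb64713c192015e3000010",
"new_fields": [{
"field": "(if (< (f \"temp\") 0) \"freezing\" \"not freezing\")",
"name": "Freezing",
"optype": "categorical"}]}'
> Setting the optype of a generated field (sketch)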
Samples
Last Updated: Thursday, 2018-02-22 12:54
A sample provides fast access to the raw data of a dataset on an on-demand basis.
When a new sample is requested, a copy of the dataset is stored in a special format in an in-memory cache. Multiple and different samples of the data can then be extracted using HTTPS parameterized requests by sampling sizes and simple query string filters.
Samples are ephemeral. That is to say, a sample will be available as long as GETs are requested within periods smaller than a pre-established TTL (Time to Live). The expiration timer of a sample is reset every time a new GET is received.
If requested, a sample can also perform linear regression and compute Pearson's and Spearman's correlations for either one numeric field against all other numeric fields or between two specific numeric fields.
BigML.io allows you to create, retrieve, update, and delete your samples. You can also list all of your samples.
Jump to:
- Sample Base URL
- Creating a Sample
- Sample Arguments
- Retrieving a Sample
- Sample Properties
- Filtering and Paginating Fields from a Sample
- Filtering Rows from a Sample
- Updating a Sample
- Deleting a Sample
- Listing Samples
Sample Base URL
You can use the following base URL to create, retrieve, update, and delete samples. https://bigml.io/sample
Sample base URL
All requests to manage your samples must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Sample
To create a new sample, you need to POST to the sample base URL an object containing at least the dataset/id that you want to use to create the sample. The content-type must always be "application/json".
You can easily create a new sample using curl as follows. All you need is a valid dataset/id and your authentication variable set up as shown above.
curl "https://bigml.io/sample?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/5484b109f0a5ea59a6000018"}'
> Creating a sample
BigML.io will return the newly created sample if the request succeeded.
{
"category":0,
"code":201,
"created":"2015-02-03T08:53:08.782775",
"credits":0,
"dataset":"dataset/5484b109f0a5ea59a6000018",
"description":"",
"fields_meta":{
"count":0,
"limit":1000,
"offset":0,
"total":0
},
"input_fields":[
"000000",
"000001",
"000002",
"000003",
"000004",
"000005",
"000006",
"000007",
"000008",
"000009",
"00000a",
"00000b",
"00000c",
"00000d"
],
"max_columns":14,
"max_rows":32561,
"name":"census' dataset sample",
"private":true,
"project":null,
"resource":"sample/54d9c6f4f0a5ea0b1600003a",
"seed":"c30d76cd14e24ef7ab7d28f98b3c8488",
"size":3292068,
"status":{
"code":1,
"message":"The sample is being processed and will be created soon"
},
"subscription":false,
"tags":[],
"updated":"2015-02-03T08:53:08.782792"
}
< Example sample JSON response
Sample Arguments
See below the full list of arguments that you can POST to create a sample.
Argument | Type | Description |
---|---|---|
category
optional |
Integer, default is the category of the dataset |
The category that best describes the sample. See the category codes for the complete list of categories.
Example: 1 |
dataset | String |
A valid dataset/id.
Example: dataset/4f665b8103ce8920bb000006 |
description
optional |
String |
A description of the sample up to 8192 characters long.
Example: "This is a description of my new sample" |
name
optional |
String, default is dataset's name sample |
The name you want to give to the new sample.
Example: "my new sample" |
project
optional |
String |
The project/id you want the sample to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
tags
optional |
Array of Strings |
A list of strings that help classify and index your sample.
Example: ["best customers", "2018"] |
You can also use curl to customize a new sample with a name. For example, to create a new sample named "my sample" with some tags:
curl "https://bigml.io/sample?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/5484b109f0a5ea59a6000018",
"name": "my sample",
"tags": ["potential customers", "2015"]}'
> Creating a customized sample
If you do not specify a name, BigML.io will assign the dataset's name to the new sample.
Retrieving a Sample
Each sample has a unique identifier in the form "sample/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the sample.
To retrieve a sample with curl:
curl "https://bigml.io/sample/54d9c6f4f0a5ea0b1600003a?$BIGML_AUTH"
$ Retrieving a sample from the command line
You can also use your browser to visualize the sample using the full BigML.io URL or pasting the sample/id into the BigML.com dashboard.
Sample Properties
Once a sample has been successfully created it will have the following properties.
Property | Type | Description |
---|---|---|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
code | Integer | HTTP status code. This will be 201 upon successful creation of the sample and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the sample creation has been completed without errors. |
columns
filterable, sortable |
Integer | The number of fields returned in the sample's fields. |
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the sample was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
credits
filterable, sortable |
Float | The number of credits it cost you to create this sample. |
dataset
filterable, sortable |
String | The dataset/id that was used to create the sample. |
description
updatable |
String | A text describing the sample. It can contain restricted markdown to decorate the text. |
fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned. |
input_fields | Array | The list of input fields' ids available to filter the sample. |
locale | String | The source's locale. |
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the sample. |
max_rows
filterable, sortable |
Integer | The max number of rows in the sample. |
name
filterable, sortable, updatable |
String | The name of the sample as provided or based on the name of the dataset by default. |
private
filterable, sortable |
Boolean | Whether the sample is public or not. |
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
resource | String | The sample/id. |
rows
filterable, sortable |
Integer | The total number of rows in the sample. |
sample | Object | All the information that you need to analyze the sample on your own. It includes the fields' dictionary describing the fields and their summaries and the rows. |
seed
filterable, sortable |
String | The string that was used to generate the sample. |
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this sample. |
status | Object | A description of the status of the sample. It includes a code, a message, and some extra information. See the table below. |
subscription
filterable, sortable |
Boolean | Whether the sample was created using a subscription plan or not. |
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the sample was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
A Sample Object has the following properties:
Property | Type | Description |
---|---|---|
fields
updatable |
Array | A list with an element per field in the dataset used to build the sample. Fields are paginated according to the fields_meta attribute. Each entry includes the column number in the original dataset, the name of the field, the type of the field, and the summary. See this Section for more details. |
rows | Array of Arrays | A list of lists representing the rows of the sample. Values in each list are ordered according to the fields list. |
Sample Status
Through the status field in the sample you can determine when the sample has been fully processed and ready to be used. These are the fields that a sample's status has:
Property | Type | Description |
---|---|---|
code | Integer | A status code that reflects the status of the sample creation. It can be any of those that are explained here. |
elapsed | Integer | Number of milliseconds that BigML.io took to process the sample. |
message | String | A human readable message explaining the status. |
Once a sample has been successfully created, it will look like:
{
"category":0,
"code":200,
"columns":2,
"created":"2015-02-03T18:21:07.001000",
"credits":0,
"dataset":"dataset/5484b109f0a5ea59a6000018",
"description":"",
"fields_meta":{
"count":2,
"limit":2,
"offset":0,
"query_total":14,
"total":14
},
"input_fields":[
"000000",
"000001",
"000002",
"000003",
"000004",
"000005",
"000006",
"000007",
"000008",
"000009",
"00000a",
"00000b",
"00000c",
"00000d"
],
"locale":"en-US",
"max_columns":14,
"max_rows":32561,
"name":"my dataset",
"private":true,
"project":null,
"resource":"sample/54d9c6f4f0a5ea0b1600003a",
"rows":2,
"sample":{
"fields":[
{
"column_number":0,
"datatype":"int8",
"id":"000000",
"input_column":0,
"name":"age",
"optype":"numeric",
"order":0,
"preferred":true,
"summary":{
"bins":[
[
18.75643,
2410
],
[
21.51515,
1485
],
[
23.47642,
1675
],
[
25.48278,
1626
],
[
27.5094,
1702
],
[
29.51434,
1674
],
[
31.48252,
1716
],
[
33.50312,
1761
],
[
35.5062,
1774
],
[
37.4908,
1685
],
[
39.49317,
1610
],
[
41.49118,
1588
],
[
43.48461,
1494
],
[
46.38942,
2722
],
[
50.4325,
2252
],
[
53.47213,
879
],
[
55.46624,
785
],
[
57.50552,
724
],
[
59.46777,
667
],
[
61.46237,
558
],
[
63.47489,
438
],
[
65.45732,
328
],
[
67.4428,
271
],
[
69.45178,
197
],
[
71.48201,
139
],
[
73.44348,
115
],
[
75.50549,
91
],
[
77.44231,
52
],
[
80.28947,
76
],
[
83.95,
20
],
[
87.75,
4
],
[
90,
43
]
],
"maximum":90,
"mean":38.58165,
"median":37.03324,
"minimum":17,
"missing_count":0,
"population":32561,
"splits":[
18.58199,
20.00208,
21.38779,
22.6937,
23.89609,
25.137,
26.40151,
27.62339,
28.8206,
30.03925,
31.20051,
32.40167,
33.57212,
34.72468,
35.87617,
37.03324,
38.24651,
39.49294,
40.76573,
42.0444,
43.3639,
44.75256,
46.13703,
47.60107,
49.39145,
51.09725,
53.14627,
55.56526,
58.35547,
61.50785,
66.43583
],
"standard_deviation":13.64043,
"sum":1256257,
"sum_squares":54526623,
"variance":186.0614
}
},
{
"column_number":1,
"datatype":"string",
"id":"000001",
"input_column":1,
"name":"workclass",
"optype":"categorical",
"order":1,
"preferred":true,
"summary":{
"categories":[
[
"Private",
22696
],
[
"Self-emp-not-inc",
2541
],
[
"Local-gov",
2093
],
[
"State-gov",
1298
],
[
"Self-emp-inc",
1116
],
[
"Federal-gov",
960
],
[
"Without-pay",
14
],
[
"Never-worked",
7
]
],
"missing_count":1836
},
"term_analysis":{
"enabled":true
}
}
],
"rows":[
[
48,
"Private",
"HS-grad",
9,
"Divorced",
"Transport-moving",
"Not-in-family",
"White",
"Male",
0,
0,
65,
"United-States",
"<=50K"
],
[
71,
"Private",
"9th",
5,
"Married-civ-spouse",
"Other-service",
"Husband",
"White",
"Male",
0,
0,
40,
"United-States",
"<=50K"
]
]
},
"seed":"0493a6f8ca7aeb2aaccca22560e4b8cb",
"size":3292068,
"status":{
"code":5,
"elapsed":1,
"message":"The sample has been created",
"progress":1
},
"subscription":false,
"tags":[
"potential customers",
"2015"
],
"updated":"2015-02-03T18:21:14.537000"
}
< Example sample JSON response
Filtering and Paginating Fields from a Sample
A sample might be composed of hundreds or even thousands of fields. Thus when retrieving a sample, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the sample is the same as the one you would get without any of the filtering parameters above.
The fields_meta field can help you paginate fields. Its structure is as follows:
Filtering Rows from a Sample
A sample might be composed of thousands or even millions of rows. Thus, when retrieving a sample, it's possible to specify that only a subset of rows be retrieved by using any combination of the following parameters in the query string (unrecognized parameters are ignored). BigML will never return more than 1000 rows in the same response. However, you can send additional requests to get different random samples (see the sketch after the table below).
Parameter | Type | Description |
---|---|---|
!field=
optional |
Blank |
With field the identifier of a field, select only those rows where field is not missing (i.e., it has a definite value).
Example:
|
!field=from,to
optional |
List |
With field the identifier of a numeric field, returns the values not in the specified interval. As with inclusion, it's possible to include or exclude the boundaries of the specified interval using square or round brackets.
Example:
|
!field=value
optional |
List |
With field the identifier of a numeric field, returns rows for which the field doesn't equal that value.
Example:
|
!field=value1&!field=value2&...
optional |
String |
With field the identifier of a categorical field, select only those rows with the value of that field not one of the provided categories (when the parameter is repeated).
Example:
|
field=
optional |
Blank |
With field the identifier of a field, select only those rows where field is missing.
Example:
|
field=from,to
optional |
List |
With field the identifier of a numeric field and from, to optional numbers, specifies a filter for the numeric values of that field in the range [from, to]. One of the limits can be omitted.
Example:
|
field=value
optional |
List |
With field the identifier of a numeric field, returns rows for which the field equals that value.
Example:
|
field=value1&field=value2&...
optional |
String |
With field the identifier of a categorical field, select only those rows with the value of that field one of the provided categories (when the parameter is repeated).
Example:
|
index
optional |
Boolean |
When set to true, every returned row will have a first extra value which is the absolute row number, i.e., a unique row identifier. This can be useful, for instance, when you're performing various GET requests and want to compute the union of the returned regions.
Example: index=true |
mode
optional |
String |
One amongst deterministic, random, or linear. The way we sample the resulting rows, if needed; random means a random sample, deterministic is also random but using a fixed seed so that it's repeatable, and linear means that BigML just returns the first size rows after filtering; defaults to "deterministic".
Example: mode=random |
occurrence
optional |
Boolean |
When set to true, rows have prepended a value which denotes the number of times the row was present in the sample. You'll want this only when unique is set to true, otherwise all those extra values will be equal to 1. When index is also set to true (see above), the multiplicity column is added after the row index.
Example: occurrence=true |
precision
optional |
Integer |
The number of significant decimal numbers to keep in the returned values, for fields of type float or double. For instance, if you set precision=0, all returned numeric values will be truncated to their integral part.
Example: precision=2 |
row_fields
optional |
List |
You can provide a list of field identifiers to be present in the sample's rows, specifying which ones you actually want to see and in which order.
Example: row_fields=000000,000002 |
row_offset
optional |
Integer |
Skip the given number of rows. Useful when paginating over the sample in linear mode.
Example: row_offset=300 |
row_order_by
optional |
String |
A field identifier that causes the returned rows to be sorted by the value of the given field, in ascending order or, when the - prefix is used, in descending order.
Example: row_order_by=-000000 |
rows
optional |
Integer |
The total number of rows to be returned; if less than the number resulting from the rest of the filter parameters, the latter will be sampled according to mode.
Example: rows=300 |
seed
optional |
String |
When mode is random, you can specify your own seed in this parameter; otherwise, we choose it at random, and return the value we've used in the body of the response: that way you can make a random sampling deterministic if you happen to like a particular result.
Example: seed=mysample |
stat_field
optional |
String |
A field_id that corresponds to the identifier of a numeric field will cause the answer to include the Pearson's and Spearman's correlations, and linear regression terms of this field with all other numeric fields in the sample. Those values will be returned in maps keyed by "other" field id and named spearman_correlations, pearson_correlations, slopes, and intercepts.
Example: stat_field=000000 |
stat_fields
optional |
String |
Two field_ids that correspond to the identifiers of numeric fields will cause the answer to include the Pearson's and Spearman's correlations, and linear regression terms between the two fields. Those values will be returned in maps named spearman_correlation, pearson_correlation, slope, and intercept.
Example: stat_fields=000000,000003 |
unique
optional |
Boolean |
When set to true, repeated rows will be removed from the sample.
Example: unique=true |
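As an illustration of how these parameters combine, here is a hedged sketch of two GET requests against the sample/id used elsewhere in this section. The parameter names are the ones documented in the table above; the concrete values (100 rows, the mysample seed, the offsets) are purely illustrative, and the extra parameters are appended with the same ";" separator convention that $BIGML_AUTH already uses.
curl "https://bigml.io/sample/54d9c6f4f0a5ea0b1600003a?$BIGML_AUTH;mode=random;rows=100;seed=mysample;index=true"
$ Retrieving 100 random rows with their row indices (sketch)
curl "https://bigml.io/sample/54d9c6f4f0a5ea0b1600003a?$BIGML_AUTH;mode=linear;rows=1000;row_offset=1000"
$ Paginating over the sample in linear mode (sketch)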
Updating a Sample
To update a sample, you need to PUT an object containing the fields that you want to update to the sample's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return with an HTTP 202 response with the updated sample.
For example, to update a sample with a new name you can use curl like this:
curl "https://bigml.io/sample/54d9c6f4f0a5ea0b1600003a?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating a sample's name
Deleting a Sample
To delete a sample, you need to issue an HTTP DELETE request to the sample/id to be deleted.
Using curl you can do something like this to delete a sample:
curl -X DELETE "https://bigml.io/sample/54d9c6f4f0a5ea0b1600003a?$BIGML_AUTH"
$ Deleting a sample from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a sample, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a sample a second time, or a sample that does not exist, you will receive a "404 not found" response.
However, if you try to delete a sample that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Samples
To list all the samples, you can use the sample base URL. By default, only the 20 most recent samples will be returned. You can see below how to change this number using the limit parameter.
You can get your list of samples directly in your browser using your own username and API key with the following links.
https://bigml.io/sample?$BIGML_AUTH
> Listing samples from a browser
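The limit parameter mentioned above controls how many samples are returned per request. As a hedged sketch (the value 100 is illustrative, and the parameter is appended with the same ";" separator convention that $BIGML_AUTH uses):
curl "https://bigml.io/sample?$BIGML_AUTH;limit=100"
$ Listing up to 100 samples (sketch)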
Correlations
Last Updated: Thursday, 2018-02-22 12:54
A correlation resource allows you to compute advanced statistics for the fields in your dataset by applying various exploratory data analysis techniques to compare the distributions of the fields in your dataset against an objective_field.
BigML.io allows you to create, retrieve, update, and delete your correlations. You can also list all of your correlations.
Jump to:
- Correlation Base URL
- Creating a Correlation
- Correlation Arguments
- Retrieving a Correlation
- Correlation Properties
- Filtering and Paginating Fields from a Correlation
- Updating a Correlation
- Deleting a Correlation
- Listing Correlations
Correlation Base URL
You can use the following base URL to create, retrieve, update, and delete correlations. https://bigml.io/correlation
Correlation base URL
All requests to manage your correlations must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Correlation
To create a new correlation, you need to POST to the correlation base URL an object containing at least the dataset/id that you want to use to create the correlation. The content-type must always be "application/json".
You can easily create a new correlation using curl as follows. All you need is a valid dataset/id and your authentication variable set up as shown above.
curl "https://bigml.io/correlation?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/54222a14f0a5eaaab000000c"}'
> Creating a correlation
BigML.io will return the newly created correlation if the request succeeded.
{
"category": 0,
"clones": 0,
"code": 201,
"columns": 0,
"correlations": null,
"created": "2015-06-23T21:45:24.002925",
"credits": 15.161365509033203,
"dataset": "dataset/55806fc2545e5f09b400002b",
"dataset_field_types": {
"categorical": 9,
"datetime": 0,
"numeric": 6,
"preferred": 14,
"text": 0,
"total": 15
},
"dataset_status": true,
"dataset_type": 0,
"description": "",
"excluded_fields": [ ],
"fields_meta": {
"count": 0,
"limit": 1000,
"offset": 0,
"total": 0
},
"input_fields": [ ],
"locale": "en-US",
"max_columns": 15,
"max_rows": 32561,
"name": "adult's dataset correlation",
"objective_field": "000000",
"out_of_bag": false,
"price": 0,
"private": true,
"project": "project/5578cfd8545e5f6a17000000",
"range": [
1,
32561
],
"replacement": false,
"resource": "correlation/5589d374545e5f37fa000000",
"rows": 32561,
"sample_rate": 1,
"shared": false,
"size": 3974461,
"source": "source/5578d034545e5f6a17000006",
"source_status": true,
"status": {
"code": 1,
"message": "The correlation is being processed and will be created soon"
},
"subscription": false,
"tags": [ ],
"updated": "2015-06-23T21:45:24.003040",
"white_box": false
}
< Example correlation JSON response
Correlation Arguments
In addition to the dataset, you can also POST the following arguments.
Argument | Type | Description |
---|---|---|
categories
optional |
Object, default is {}, an empty dictionary. That is, no categories are specified. |
A dictionary between input field id and an array of categories to limit the analysis to. Each array must contain 2 or more unique and valid categories in string format. If omitted, each categorical field is limited to its 100 most frequent categorical values. This field has no impact if the input fields are non-categorical.
Example:
|
category
optional |
Integer, default is the category of the dataset |
The category that best describes the correlation. See the category codes for the complete list of categories.
Example: 1 |
dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
description
optional |
String |
A description of the correlation up to 8192 characters long.
Example: "This is a description of my new correlation" |
discretization | Object | Global numeric field transformation parameters. See the discretization table below. |
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields that won't be included in the correlation.
Example:
|
field_discretizations | Object | Per-field numeric field transformation parameters, taking precedence over discretization. See the field_discretizations table below. |
fields
optional |
Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed. |
This can be used to change the names of the fields in the correlation with respect to the original names in the dataset or to tell BigML that certain fields should be preferred. Add an entry keyed by the field id generated in the source for each field whose name you want updated.
Example:
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields to be considered to create the correlation.
Example:
|
name
optional |
String, default is dataset's name |
The name you want to give to the new correlation.
Example: "my new correlation" |
objective_field
optional |
String, default is dataset's pre-defined objective field |
The id of the field to be used as the objective for correlation tests.
Example: "000001" |
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
project
optional |
String |
The project/id you want the correlation to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
range
optional |
Array, default is [1, max rows in the dataset] |
The range of successive instances to build the correlation.
Example: [1, 150] |
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true |
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5 |
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
significance_levels
optional |
Array, default is [0.01, 0.05, 0.1] |
An array of significance levels between 0 and 1 to test against p_values.
Example: [0.01, 0.025, 0.05, 0.075, 0.1] |
tags
optional |
Array of Strings |
A list of strings that help classify and index your correlation.
Example: ["best customers", "2018"] |
Discretization is used to transform numeric input fields to categoricals before further processing. It is applied globally to all input fields. A Discretization object is composed of any combination of the following properties.
For example, let's say type is set to "width", size is 7, trim is 0.05, and pretty is false. This requests that numeric input fields be discretized into 7 bins of equal width, trimming the outer 5% of counts, and not rounding bin boundaries.
Field Discretizations is also used to transform numeric input fields to categoricals before further processing. However, it allows the user to specify parameters on a per field basis, taking precedence over the global discretization. It is a map whose keys are field ids and whose values are maps with the same format as discretization. It also accepts edges, which is a numeric array manually specifying edge boundary locations. If this parameter is present, the corresponding field will be discretized according to those defined bins, and the remaining discretization parameters will be ignored. The maximum value of the field's distribution is automatically set as the last value in the edges array. A value object of a Field Discretizations object is composed of any combination of the following properties.
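To make the two objects above concrete, here is a hedged sketch of a correlation request that sets a global discretization using exactly the type, size, trim, and pretty values from the example above, and overrides it for one field with manually specified edges. The dataset/id matches the earlier examples; the field id 000002 and the edge values are purely illustrative.
curl "https://bigml.io/correlation?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/54222a14f0a5eaaab000000c",
     "discretization": {"type": "width", "size": 7, "trim": 0.05, "pretty": false},
     "field_discretizations": {"000002": {"edges": [10, 20, 30, 40]}}}'
> Creating a correlation with discretization settings (sketch)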
You can also use curl to customize a new correlation. For example, to create a new correlation named "my correlation", with only certain rows, and with only three fields:
curl "https://bigml.io/correlation?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/54222a14f0a5eaaab000000c",
"objective_field": "000001",
"input_fields": ["000001", "000002", "000003"],
"name": "my correlation",
"range": [25, 125]}'
> Creating customized correlation
If you do not specify a name, BigML.io will assign to the new correlation the dataset's name. If you do not specify a range of instances, BigML.io will use all the instances in the dataset. If you do not specify any input fields, BigML.io will include all the input fields in the dataset.
Read the Section on Sampling Your Dataset to learn how to sample your dataset. Here's an example of a correlation request with range and sampling specifications:
curl "https://bigml.io/correlation?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/505f43223c1920eccc000297",
"range": [1, 5000],
"sample_rate": 0.5}'
> Creating a correlation using sampling
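The categories and significance_levels arguments can be combined in the same request. Here is a hedged sketch that reuses the category names and significance levels shown in the sample response below; the dataset/id is the one used in the previous example, and the field id 000003 is illustrative.
curl "https://bigml.io/correlation?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/505f43223c1920eccc000297",
     "categories": {"000003": ["Bachelors", "Some-college", "HS-grad"]},
     "significance_levels": [0.025, 0.01]}'
> Creating a correlation with restricted categories (sketch)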
Retrieving a Correlation
Each correlation has a unique identifier in the form "correlation/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the correlation.
To retrieve a correlation with curl:
curl "https://bigml.io/correlation/5589d374545e5f37fa000000?$BIGML_AUTH"
$ Retrieving a correlation from the command line
You can also use your browser to visualize the correlation using the full BigML.io URL or pasting the correlation/id into the BigML.com dashboard.
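For example, following the same pattern as the listing URLs used elsewhere in this document, the full URL for the correlation created above would be:
https://bigml.io/correlation/5589d374545e5f37fa000000?$BIGML_AUTH
> Retrieving a correlation from a browser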
Correlation Properties
Once a correlation has been successfully created it will have the following properties.
Property | Type | Description |
---|---|---|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
code | Integer | HTTP status code. This will be 201 upon successful creation of the correlation and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the correlation creation has been completed without errors. |
columns
filterable, sortable |
Integer | The number of fields in the correlation. |
correlations | Object | All the information that you need to recreate the correlation. It includes the field's dictionary describing the fields and their summaries, and the correlations. See the Correlations Object definition below. |
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the correlation was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
credits
filterable, sortable |
Float | The number of credits it cost you to create this correlation. |
dataset
filterable, sortable |
String | The dataset/id that was used to build the correlation. |
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
description
updatable |
String | A text describing the correlation. It can contain restricted markdown to decorate the text. |
excluded_fields | Array | The list of fields' ids that were excluded to build the correlation. |
fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned. |
input_fields | Array | The list of input fields' ids used to build the correlation. |
locale | String | The dataset's locale. |
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the correlation. |
max_rows
filterable, sortable |
Integer | The maximum number of instances in the dataset that can be used to build the correlation. |
name
filterable, sortable, updatable |
String | The name of the correlation as you provided it or based on the name of the dataset by default. |
objective_field |
String, default is dataset's pre-defined objective field |
The id of the field to be used as the objective for a correlations test.
Example: "000001" |
objective_field_details | Object | The details of the objective fields. See the Objective Field Details. |
out_of_bag
filterable, sortable |
Boolean | Whether the out-of-bag instances were used to create the correlation instead of the sampled instances. |
price
filterable, sortable, updatable |
Float | The price other users must pay to clone your correlation. |
private
filterable, sortable, updatable |
Boolean | Whether the correlation is public or not. |
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
range | Array | The range of instances used to build the correlation. |
replacement
filterable, sortable |
Boolean | Whether the instances sampled to build the correlation were selected using replacement or not. |
resource | String | The correlation/id. |
rows
filterable, sortable |
Integer | The total number of instances used to build the correlation |
sample_rate
filterable, sortable |
Float | The sample rate used to select instances from the dataset to build the correlation. |
seed
filterable, sortable |
String | The string that was used to generate the sample. |
shared
filterable, sortable, updatable |
Boolean | Whether the correlation is shared using a private link or not. |
shared_hash | String | The hash that gives access to this correlation if it has been shared using a private link. |
sharing_key | String | The alternative key that gives read access to this correlation. |
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this correlation. |
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
status | Object | A description of the status of the correlation. It includes a code, a message, and some extra information. See the table below. |
subscription
filterable, sortable |
Boolean | Whether the correlation was created using a subscription plan or not. |
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the correlation was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
white_box
filterable, sortable |
Boolean | Whether the correlation is publicly shared as a white-box. |
The Correlations Object has the following properties. Some correlation results will contain a p-value and a significant boolean array, indicating whether the p_value is less than the provided significance_levels (by default, [0.01, 0.05, 0.10] is used if not provided). If the p-value is greater than the accepted significance level, then it fails to reject the null hypothesis, meaning there is no statistically significant difference between the treatment groups. For example, if the significance levels are [0.01, 0.025, 0.05, 0.075, 0.1] and the p-value is 0.05, then significant is [false, false, false, true, true].
Property | Type | Description |
---|---|---|
categories | Object | A dictionary between input field id and arrays of category names selected for correlations. |
correlations | Array | Correlation results. See Correlation Results Object. |
fields
updatable |
Object | A dictionary with an entry per field in the dataset used to build the test. Fields are paginated according to the fields_meta attribute. Each entry includes the column number in the original source, the name of the field, the type of the field, and the summary. See this Section for more details. |
significance_levels | Array | An array of user provided significance levels to test against p_values. |
The Correlation Results Object has the following properties.
Property | Type | Description |
---|---|---|
name | String | Name of the correlation. Available values are coefficients, contingency_tables, and one_way_anova. |
result | Object | A correlation result which is a dictionary between field ids and the result. The type of result object varies based on the name of the correlation. When name is coefficients, it returns Coefficients Result Object, when contingency_tables, Contingency Tables Result Object, and when one_way_anova, One-way ANOVA Result Object. |
The Coefficients Result Object contains the correlation measures between objective_field and each of the input_fields when the two fields are numeric-numeric pairs. It has the following properties:
Property | Type | Description |
---|---|---|
pearson | Float | A measure of the linear correlation between two variables, giving a value between +1 and -1, where 1 is total positive correlation, 0 is no correlation, and -1 is total negative correlation. See Pearson's correlation coefficients for more information. |
pearson_p_value | Float |
A function used in the context of null hypothesis testing for pearson correlations in order to quantify the idea of statistical significance of evidence.
Example: 0.015 |
spearman | Float | A nonparametric (parameters are determined by the training data, not the model. Thus, the number of parameters grows with the amount of training data) measure of statistical dependence between two variables. It assesses how well the relationship between two variables can be described using a monotonic function. If there are no repeated data values, a perfect correlation of +1 or -1 occurs when each of the variables is a perfect monotone function of the other. See Spearman's correlation coefficients for more information. |
spearman_p_value | Float |
A function used in the context of null hypothesis testing for spearman correlations in order to quantify the idea of statistical significance of evidence.
Example: 0.015 |
The Contingency Tables Result Object contains the correlation measures between objective_field and each of the input_fields when the two fields are both categorical. It has the following properties:
Property | Type | Description |
---|---|---|
chi_square | Object | See Chi-Square Object. |
cramer | Float | A measure of association between two nominal variables. Its value ranges between 0 (no association between the variables) and 1 (complete association), and can reach 1 only when the two variables are equal to each other. It is based on Pearson's chi-squared statistic. See Cramer's V for more information. |
tschuprow | Float | A measure of association between two nominal variables. Its value ranges between 0 (no association between the variables) and 1 (complete association). It is closely related to Cramer's V, coinciding with it for square contingency tables. See Tschuprow's T for more information. |
two_way_table | Array |
Contingency Table as a nested row-major array with the frequency distribution of the variables. In other words, the table summarizes the distribution of values in the sample.
Example: [[2514, 362, 78, 38, 23], [889, 53, 39, 2, 1]] |
The Chi-Square Object contains the chi-square statistic used to investigate whether distributions of categorical variables differ from one another. This test is used to compare a collection of categorical data with some theoretical expected distribution. The object has the following properties.
The One-way ANOVA Result Object contains correlation measures between objective_field and each of the input_fields when the two fields are categorical-numerical pairs. ANOVA is used to compare the means of numerical data samples. The ANOVA tests the null hypothesis that samples in two or more groups are drawn from populations with the same mean values. See One-way Analysis of Variance for more information. The object has the following properties:
Property | Type | Description |
---|---|---|
eta_square | Float | A measure of effect size, i.e., of the strength of the relationship between two variables, for use in ANOVA. Its value ranges between 0 and 1. A rule of thumb is: 0.02 ~ small, 0.13 ~ medium, and 0.26 ~ large. See eta-squared for more information. |
f_ratio | Float | The value of the F statistic, which is used to assess whether the expected values of a quantitative variable within several pre-defined groups differ from each other. It is the ratio of the variance calculated among the means to the variance within the samples. |
p_value | Float |
A function used in the context of null hypothesis testing in order to quantify the idea of statistical significance of evidence.
Example: 0.015 |
significant | Array |
A boolean array indicating whether the test produced a significant result at each of the significance_levels. If p_value is less than the significance_level, then it indicates it is significant. The default significance_levels are [0.01, 0.05, 0.1].
Example: [false, true, true] |
An Objective Field Details Object has the following properties.
Correlation Status
Creating a correlation is a process that can take just a few seconds or a few days depending on the size of the dataset used as input and on the workload of BigML's systems. The correlation goes through a number of states until it's fully completed. Through the status field in the correlation you can determine when the correlation has been fully processed and is ready to be used. These are the properties of a correlation's status:
Property | Type | Description |
---|---|---|
code | Integer | A status code that reflects the status of the correlation creation. It can be any of those that are explained here. |
elapsed | Integer | Number of milliseconds that BigML.io took to process the correlation. |
message | String | A human readable message explaining the status. |
progress | Float, between 0 and 1 | How far BigML.io has progressed building the correlation. |
Once a correlation has been successfully created, it will look like:
{
"category": 0,
"clones": 0,
"code": 200,
"columns": 14,
"correlations": {
"categories": {
"000003": [
"Bachelors",
"Some-college",
"HS-grad"
],
"000005": [
"Divorced",
"Separated",
"Widowed"
]
},
"correlations": [
{
"name": "coefficients",
"result": {
"000002": {
"pearson": -0.07665,
"pearson_p_value": 0,
"spearman": -0.07814,
"spearman_p_value": 0
},
"000004": { … },
"00000a": { … },
"00000b": { … },
"00000c": { … }
}
},
{
"name": "one_way_anova",
"result": {
"000001": {
"eta_square": 0.05254,
"f_ratio": 243.34988,
"p_value": 0,
"significant": [
true,
true
]
},
"000003": { … },
"000005": { … },
"000006": { … },
"000007": { … },
"000008": { … },
"000009": { … },
"00000e": { … }
}
}
],
"fields": { … },
"significance_levels": [
0.025,
0.01
]
},
"created": "2015-06-23T21:45:24.002000",
"credits": 15.161365509033203,
"dataset": "dataset/55806fc2545e5f09b400002b",
"dataset_status": true,
"dataset_type": 0,
"description": "",
"excluded_fields": [
],
"fields_meta": {
"count": 14,
"limit": 1000,
"offset": 0,
"query_total": 14,
"total": 14
},
"input_fields": [
"000001",
"000002",
"000003",
"000004",
"000005",
"000006",
"000007",
"000008",
"000009",
"00000a",
"00000b",
"00000c",
"00000e"
],
"locale": "en-US",
"max_columns": 15,
"max_rows": 32561,
"name": "Sample correlation",
"objective_field": "000000",
"objective_field_details": {
"column_number": 0,
"datatype": "int8",
"name": "age",
"optype": "numeric",
"order": 0
},
"out_of_bag": false,
"price": 0,
"private": true,
"project": "project/5578cfd8545e5f6a17000000",
"range": [
1,
32561
],
"replacement": false,
"resource": "correlation/5589d374545e5f37fa000000",
"rows": 32561,
"sample_rate": 1,
"shared": false,
"size": 3974461,
"source": "source/5578d034545e5f6a17000006",
"source_status": true,
"status": {
"code": 5,
"elapsed": 11504,
"message": "The correlation has been created",
"progress": 1
},
"subscription": false,
"tags": [
],
"updated": "2015-06-23T21:45:56.066000",
"white_box": false
}
< Example correlation JSON response
Filtering and Paginating Fields from a Correlation
A correlation might be composed of hundreds or even thousands of fields. Thus when retrieving a correlation, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the resource is the same as the one you would get without any filtering parameter above.
The fields_meta field can help you paginate fields. Its structure is as follows:
Updating a Correlation
To update a correlation, you need to PUT an object containing the fields that you want to update to the correlation's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return with an HTTP 202 response with the updated correlation.
For example, to update a correlation with a new name you can use curl like this:
curl "https://bigml.io/correlation/5589d374545e5f37fa000000?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating a correlation's name
If you want to update a correlation with a new label and description for a specific field you can use curl like this:
curl "https://bigml.io/correlation/5589d374545e5f37fa000000?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"fields": {"000000": {
"label": "a longer name",
"description": "an even longer description"}}}'
$ Updating correlation's field,
label, and description
Deleting a Correlation
To delete a correlation, you need to issue an HTTP DELETE request to the correlation/id to be deleted.
Using curl you can do something like this to delete a correlation:
curl -X DELETE "https://bigml.io/correlation/5589d374545e5f37fa000000?$BIGML_AUTH"
$ Deleting a correlation from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a correlation, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a correlation a second time, or a correlation that does not exist, you will receive a "404 not found" response.
However, if you try to delete a correlation that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Correlations
To list all the correlations, you can use the correlation base URL. By default, only the 20 most recent correlations will be returned. You can see below how to change this number using the limit parameter.
You can get your list of correlations directly in your browser using your own username and API key with the following links.
https://bigml.io/correlation?$BIGML_AUTH
> Listing correlations from a browser
Statistical Tests
Last Updated: Tuesday, 2018-03-13 12:20
A statistical test resource automatically runs some advanced statistical tests on the numeric fields of a dataset. The goal of these tests is to check whether the values of individual fields conform to or differ from some distribution patterns. Statistical tests are useful in tasks such as fraud, normality, or outlier detection.
The tests are grouped in the following three categories:
-
Fraud Detection Tests:
- Benford: This statistical test compares the distribution of first significant digits (FSDs) of each value of the field to the distribution predicted by Benford's law. Benford's law applies to numerical distributions spanning several orders of magnitude, such as the values found on financial balance sheets. It states that the frequency distribution of leading, or first significant, digits in such distributions is not uniform. On the contrary, lower digits like 1 and 2 occur disproportionately often as leading significant digits. The test compares the distribution in the field to Benford's distribution using a Chi-square goodness-of-fit test and the Cho-Gaines d test. If a field has a dissimilar distribution, it may contain anomalous or fraudulent values.
-
Normality tests: These tests can be used to confirm the assumption that the data in each field of a dataset is distributed
according to a normal distribution. The results are relevant because many statistical and machine learning techniques rely on this assumption.
- Anderson-Darling: The Anderson-Darling test computes a test statistic based on the difference between the observed cumulative distribution function (CDF) to that of a normal distribution. A significant result indicates that the assumption of normality is rejected.
- Jarque-Bera: The Jarque-Bera test computes a test statistic based on the third and fourth central moments (skewness and kurtosis) of the data. Again, a significant result indicates that the normality assumption is rejected.
- Z-score: For a given sample size, the maximum deviation from the mean that would be expected in a sampling of a normal distribution can be computed based on the 68-95-99.7 rule. This test simply reports this expected deviation and the actual deviation observed in the data, as a sort of sanity check.
-
Outlier tests:
- Grubbs: When the values of a field are normally distributed, a few values may still deviate from the mean distribution. The outlier test reports whether at least one value in each numeric field differs significantly from the mean, using Grubbs' test for outliers. If an outlier is found, then its value will be returned.
Note that both the number of tests within each category and the categories may increase in the near future.
BigML.io allows you to create, retrieve, update, and delete your statistical tests. You can also list all of your statistical tests.
Jump to:
- Statistical Test Base URL
- Creating a Statistical Test
- Statistical Test Arguments
- Retrieving a Statistical Test
- Statistical Test Properties
- Filtering and Paginating Fields from a Statistical Test
- Updating a Statistical Test
- Deleting a Statistical Test
- Listing Statistical Tests
Statistical Test Base URL
You can use the following base URL to create, retrieve, update, and delete statistical tests. https://bigml.io/statisticaltest
Statistical Test base URL
All requests to manage your statistical tests must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Statistical Test
To create a new statistical test, you need to POST to the statistical test base URL an object containing at least the dataset/id that you want to use to create the statistical test. The content-type must always be "application/json".
You can easily create a new statistical test using curl as follows. All you need is a valid dataset/id and your authentication variable set up as shown above.
curl "https://bigml.io/statisticaltest?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/54222a14f0a5eaaab000000c"}'
> Creating a statistical test
BigML.io will return the newly created statistical test if the request succeeded.
{
"category": 0,
"clones": 0,
"code": 201,
"columns": 0,
"created": "2015-06-23T06:14:49.583473",
"credits": 0.09991455078125,
"dataset": "dataset/5579abc3545e5f4f8a000000",
"dataset_field_types": {
"categorical": 1,
"datetime": 0,
"numeric": 8,
"preferred": 9,
"text": 0,
"total": 9
},
"dataset_status": true,
"dataset_type": 0,
"description": "",
"excluded_fields": [ ],
"fields_meta": {
"count": 0,
"limit": 1000,
"offset": 0,
"total": 0
},
"input_fields": [ ],
"locale": "en-US",
"max_columns": 9,
"max_rows": 768,
"name": "Diabetes (all numeric) test",
"out_of_bag": false,
"price": 0,
"private": true,
"project": "project/5578cfd8545e5f6a17000000",
"range": [
1,
768
],
"replacement": false,
"resource": "statisticaltest/5588f959545e5fdc1e000007",
"rows": 768,
"sample_rate": 1,
"shared": false,
"size": 26192,
"source": "source/5578d077545e5f6a17000011",
"source_status": true,
"statistical_tests": null,
"status": {
"code": 1,
"message": "The statistical test is being processed and will be created soon"
},
"subscription": false,
"tags": [ ],
"updated": "2015-06-23T06:14:49.583623",
"white_box": false
}
< Example statistical test JSON response
Statistical Test Arguments
In addition to the dataset, you can also POST the following arguments.
Argument | Type | Description |
---|---|---|
ad_sample_size
optional |
Integer, default is 1024 |
The Anderson-Darling normality test is computed from a sample from the values of each field. This parameter specifies the number of samples to be used during the normality test. If not given, defaults to 1024.
Example: 128 |
ad_seed
optional |
String |
A string to be hashed to generate deterministic samples for the Anderson-Darling normality test.
Example: "MyADSeed" |
category
optional |
Integer, default is the category of the dataset |
The category that best describes the statistical test. See the category codes for the complete list of categories.
Example: 1 |
dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
description
optional |
String |
A description of the statistical test up to 8192 characters long.
Example: "This is a description of my new statistical test" |
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields that won't be included in the statistical test.
Example:
|
fields
optional |
Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed. |
This can be used to change the names of the fields in the statistical test with respect to the original names in the dataset or to tell BigML that certain fields should be preferred. Add an entry keyed by the field id generated in the source for each field whose name you want updated.
Example:
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields to be considered to create the statistical test.
Example:
|
name
optional |
String, default is dataset's name |
The name you want to give to the new statistical test.
Example: "my new statistical test" |
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
project
optional |
String |
The project/id you want the statistical test to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
range
optional |
Array, default is [1, max rows in the dataset] |
The range of successive instances to build the statistical test.
Example: [1, 150] |
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true |
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5 |
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
significance_levels
optional |
Array, default is [0.01, 0.05, 0.1] |
An array of significance levels between 0 and 1 to test against p_values.
Example: [0.01, 0.025, 0.05, 0.075, 0.1] |
tags
optional |
Array of Strings |
A list of strings that help classify and index your statistical test.
Example: ["best customers", "2018"] |
You can also use curl to customize a new statistical test. For example, to create a new statistical test named "my statistical test", with only certain rows, and with only three fields:
curl "https://bigml.io/statisticaltest?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/54222a14f0a5eaaab000000c",
"input_fields": ["000001", "000002", "000003"],
"name": "my statistical test",
"range": [25, 125]}'
> Creating a customized statistical test
If you do not specify a name, BigML.io will assign to the new statistical test the dataset's name. If you do not specify a range of instances, BigML.io will use all the instances in the dataset. If you do not specify any input fields, BigML.io will include all the input fields in the dataset.
Read the Section on Sampling Your Dataset to learn how to sample your dataset. Here's an example of a statistical test request with range and sampling specifications:
curl "https://bigml.io/statisticaltest?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/505f43223c1920eccc000297",
"range": [1, 5000],
"sample_rate": 0.5}'
> Creating a statistical test using sampling
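The test-specific arguments can be set in the same way. As a hedged sketch, here is a request combining ad_sample_size, ad_seed, and significance_levels with the values that appear in the sample response below; the dataset/id is the one used in the previous example.
curl "https://bigml.io/statisticaltest?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/505f43223c1920eccc000297",
     "ad_sample_size": 2048,
     "ad_seed": "MyADSeed",
     "significance_levels": [0.025, 0.01]}'
> Creating a statistical test with Anderson-Darling settings (sketch)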
Retrieving a Statistical Test
Each statistical test has a unique identifier in the form "statisticaltest/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the statistical test.
To retrieve a statistical test with curl:
curl "https://bigml.io/statisticaltest/5588f959545e5fdc1e000007?$BIGML_AUTH"
$ Retrieving a statistical test from the command line
You can also use your browser to visualize the statistical test using the full BigML.io URL or pasting the statisticaltest/id into the BigML.com dashboard.
Statistical Test Properties
Once a statistical test has been successfully created it will have the following properties.
Property | Type | Description |
---|---|---|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
code | Integer | HTTP status code. This will be 201 upon successful creation of the statistical test and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the statistical test creation has been completed without errors. |
columns
filterable, sortable |
Integer | The number of fields in the statistical test. |
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the statistical test was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
credits
filterable, sortable |
Float | The number of credits it cost you to create this statistical test. |
dataset
filterable, sortable |
String | The dataset/id that was used to build the statistical test. |
dataset_field_types | Object | A dictionary that informs about the number of fields of each type in the dataset used to create the statistical test. It has an entry per each field type (categorical, datetime, numeric, and text), an entry for preferred fields, and an entry for the total number of fields. |
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
description
updatable |
String | A text describing the statistical test. It can contain restricted markdown to decorate the text. |
excluded_fields | Array | The list of fields' ids that were excluded to build the statistical test. |
fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned. |
input_fields | Array | The list of input fields' ids used to build the statistical test. |
locale | String | The dataset's locale. |
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the statistical test. |
max_rows
filterable, sortable |
Integer | The maximum number of instances in the dataset that can be used to build the statistical test. |
name
filterable, sortable, updatable |
String | The name of the statistical test as you provided it or based on the name of the dataset by default. |
out_of_bag
filterable, sortable |
Boolean | Whether the out-of-bag instances were used to create the statistical test instead of the sampled instances. |
price
filterable, sortable, updatable |
Float | The price other users must pay to clone your statistical test. |
private
filterable, sortable, updatable |
Boolean | Whether the statistical test is public or not. |
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
range | Array | The range of instances used to build the statistical test. |
replacement
filterable, sortable |
Boolean | Whether the instances sampled to build the statistical test were selected using replacement or not. |
resource | String | The statisticaltest/id. |
rows
filterable, sortable |
Integer | The total number of instances used to build the statistical test |
sample_rate
filterable, sortable |
Float | The sample rate used to select instances from the dataset to build the statistical test. |
seed
filterable, sortable |
String | The string that was used to generate the sample. |
shared
filterable, sortable, updatable |
Boolean | Whether the statistical test is shared using a private link or not. |
shared_hash | String | The hash that gives access to this statistical test if it has been shared using a private link. |
sharing_key | String | The alternative key that gives read access to this statistical test. |
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this statistical test. |
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
statistical_tests | Object | All the information that you need to recreate the statistical test. It includes the field's dictionary describing the fields and their summaries, and the statistical tests. See the Statistical Tests Object definition below. |
status | Object | A description of the status of the statistical test. It includes a code, a message, and some extra information. See the table below. |
subscription
filterable, sortable |
Boolean | Whether the statistical test was created using a subscription plan or not. |
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the statistical test was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
white_box
filterable, sortable |
Boolean | Whether the statistical test is publicly shared as a white-box. |
The Statistical Tests Object has the following properties. Many statistical tests will contain a p-value and a significant boolean array, indicating whether the p_value is less than the provided significance_levels (by default, [0.01, 0.05, 0.10] is used if not provided). If the p-value is greater than the accepted significance level, then it fails to reject the null hypothesis, meaning there is no statistically significant difference between the treatment groups. For example, if the significance levels are [0.01, 0.025, 0.05, 0.075, 0.1] and the p-value is 0.05, then significant is [false, false, false, true, true].
Property | Type | Description |
---|---|---|
ad_sample_size | Integer | The sample size used for the Anderson-Darling normality test. |
ad_seed | String | A seed used to generate deterministic samples for the Anderson-Darling normality test. |
fields
updatable |
Object | A dictionary with an entry per field in the dataset used to build the test. Fields are paginated according to the fields_meta attribute. Each entry includes the column number in the original source, the name of the field, the type of the field, and the summary. See this Section for more details. |
fraud | Array | An array of anomalous fields detection test results for each numeric field. See Fraud Object. |
normality | Array | An array of data normality test results for each numeric field. See Normality Object. |
outliers | Array | An array of outlier detection test results for each numeric field. See Outliers Object. |
significance_levels | Array | An array of user provided significance levels to test against p_values. |
The Fraud Object has the following properties.
Property | Type | Description |
---|---|---|
name | String | Name of the fraud test. Currently only value available is benford. |
result | Object | A test result which is a dictionary between field ids and test result. The type of result object varies based on the name of the test. When name is benford, it returns Benford Result Object. |
The Benford Result Object has the following properties. Benford's Law is a simple yet powerful tool allowing quick screening of data for anomalies.
Property | Type | Description |
---|---|---|
chi_square | Object | See Chi-Square Object. |
cho_gaines | Object | See Cho-Gaines Object. |
distribution | Array |
The distribution of first significant digits (FSDs) compared to Benford's law distribution. For example, the FSD for 2015 is 2, and for 0.00609 it is 6. The array represents the number of occurrences of each digit from 1 to 9.
Example: [0, 0, 0, 22, 61, 54, 0, 0, 0] |
negatives | Integer | The number of negative values. |
zeros | Integer | The number of values exactly equal to 0. |
The Chi-Square Object contains the chi-square statistic used to investigate whether distributions of categorical variables differ from one another. This test is used to compare a collection of categorical data with some theoretical expected distribution. The object has the following properties.
The Cho-Gaines Object has the following properties.
Property | Type | Description |
---|---|---|
d_statistic | Float | A value based on the Euclidean distance from Benford's distribution in the 9-dimensional space occupied by any first-digit vector, used in the Cho-Gaines d test. |
significant | Array |
A boolean array indicating whether the test produced a significant result at each of the significance_levels. If p_value is less than the significance_level, then it indicates it is significant. It does not respect the values passed in significance_levels, but always uses [0.01, 0.05, 0.1].
Example: [false, true, true] |
The Normality Object has the following properties.
Property | Type | Description |
---|---|---|
name | String | Name of the normality test. Available values are anderson_darling, jarque_bera, and z_score. |
result | Object | A test result which is a dictionary between field ids and test result. The type of result object varies based on the name of the test. When name is anderson_darling, it returns Anderson-Darling Result Object, when jarque_bera, Jarque-Bera Result Object, and when z-score, Z-Score Result Object. |
The Anderson-Darling Result Object has the following properties. See Anderson-Darling Test for more information.
The Jarque-Bera Result Object has the following properties. See Jarque-Bera Test for more information.
The Z-Score Object has the following properties. A positive standard score indicates a datum above the mean, while a negative standard score indicates a datum below the mean. See z-score for more information.
Property | Type | Description |
---|---|---|
expected-max-z | Float | The expected maximum z-score for the sample size. |
max-z | Float | The maximum z-score. |
The Outliers Object has the following properties.
Property | Type | Description |
---|---|---|
name | String | Name of the outlier detection test. Currently only value available is grubbs. |
result | Object | A test result which is a dictionary between field ids and test result. The type of result object varies based on the name of the test. When name is grubbs, it returns Grubbs Result Object. |
The Grubbs' Test for Outliers Result Object has the following properties. It computes a t-test based on the maximum deviation from the mean. A significant result indicates that at least one outlier is present in the data. If an outlier is found, it also returns the value of the outlier. Note that this test assumes that the data are normally distributed. See Grubbs' test for outliers for more information.
Statistical Test Status
Creating a statistical test is a process that can take just a few seconds or a few days depending on the size of the dataset used as input and on the workload of BigML's systems. The statistical test goes through a number of states until it's fully completed. Through the status field in the statistical test you can determine when the test has been fully processed and is ready to be used. These are the properties of a statistical test's status:
Property | Type | Description |
---|---|---|
code | Integer | A status code that reflects the status of the statistical test creation. It can be any of those that are explained here. |
elapsed | Integer | Number of milliseconds that BigML.io took to process the statistical test. |
message | String | A human readable message explaining the status. |
progress | Float, between 0 and 1 | How far BigML.io has progressed building the statistical test. |
Once a statistical test has been successfully created, it will look like:
{
"category": 0,
"clones": 0,
"code": 200,
"columns": 9,
"created": "2015-06-23T06:14:49.583000",
"credits": 0.09991455078125,
"dataset": "dataset/5579abc3545e5f4f8a000000",
"dataset_status": true,
"dataset_type": 0,
"description": "",
"excluded_fields": [ ],
"fields_meta": {
"count": 9,
"limit": 1000,
"offset": 0,
"query_total": 9,
"total": 9
},
"input_fields": [
"000000",
"000001",
"000002",
"000003",
"000004",
"000005",
"000006",
"000007"
],
"locale": "en-US",
"max_columns": 9,
"max_rows": 768,
"name": "Diabetes test",
"out_of_bag": false,
"price": 0,
"private": true,
"project": "project/5578cfd8545e5f6a17000000",
"range": [
1,
768
],
"replacement": false,
"resource": "statisticaltest/5588f959545e5fdc1e000007",
"rows": 768,
"sample_rate": 1,
"shared": false,
"size": 26192,
"source": "source/5578d077545e5f6a17000011",
"source_status": true,
"statistical_tests": {
"ad_sample_size": 2048,
"ad_seed": "MyADSeed",
"fields": { … },
"fraud": [
{
"name": "benford",
"result": {
"000000": {
"chi_square": {
"chi_square_value": 5.67791,
"p_value": 0.68326,
"significant": [
false,
false
]
},
"cho_gaines": {
"d_statistic": 0.7654738225941359,
"significant": [
false,
false,
false
]
},
"distribution": [
193,
103,
75,
68,
57,
50,
45,
38,
28
],
"negatives": 0,
"zeros": 111
},
"000001": { … },
"000002": { … },
"000003": { … },
"000004": { … },
"000005": { … },
"000006": { … },
"000007": { … }
}
}
],
"normality": [
{
"name": "anderson_darling",
"result": {
"000000": {
"p_value": 0,
"significant": [
true,
true
]
},
"000001": { … },
"000002": { … },
"000003": { … },
"000004": { … },
"000005": { … },
"000006": { … },
"000007": { … }
}
},
{
"name": "jarque_bera",
"result": {
"000000": {
"p_value": 0,
"significant": [
true,
true
]
},
"000001": { … },
"000002": { … },
"000003": { … },
"000004": { … },
"000005": { … },
"000006": { … },
"000007": { … }
}
},
{
"name": "z_score",
"result": {
"000000": {
"expected_max_z": 3.21552,
"max_z": 3.90403
},
"000001": { … },
"000002": { … },
"000003": { … },
"000004": { … },
"000005": { … },
"000006": { … },
"000007": { … }
}
}
],
"outliers": [
{
"name": "grubbs",
"result": {
"000000": {
"p_value": 0.06734,
"significant": [
false,
false
]
},
"000001": { … },
"000002": { … },
"000003": { … },
"000004": { … },
"000005": { … },
"000006": { … },
"000007": { … }
}
}
],
"significance_levels": [
0.025,
0.01
]
},
"status": {
"code": 5,
"elapsed": 2244,
"message": "The statistical test has been created",
"progress": 1
},
"subscription": false,
"tags": [ ],
"updated": "2015-06-23T06:15:18.908000",
"white_box": false
}
< Example statistical test JSON response
Filtering and Paginating Fields from a Statistical Test
A statistical test might be composed of hundreds or even thousands of fields. Thus, when retrieving a statistical test, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the resource is the same as the one you would get without any filtering parameter above.
The fields_meta field can help you paginate fields. Its structure is as follows:
Updating a Statistical Test
To update a statistical test, you need to PUT an object containing the fields that you want to update to the statistical test's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return with an HTTP 202 response with the updated statistical test.
For example, to update a statistical test with a new name you can use curl like this:
curl "https://bigml.io/statisticaltest/5588f959545e5fdc1e000007?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating a statistical test's name
If you want to update a statistical test with a new label and description for a specific field you can use curl like this:
curl "https://bigml.io/statisticaltest/5588f959545e5fdc1e000007?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"fields": {"000000": {
"label": "a longer name",
"description": "an even longer description"}}}'
$ Updating a statistical test's field, label, and description
Deleting a Statistical Test
To delete a statistical test, you need to issue a HTTP DELETE request to the statisticaltest/id to be deleted.
Using curl you can do something like this to delete a statistical test:
curl -X DELETE "https://bigml.io/statisticaltest/5588f959545e5fdc1e000007?$BIGML_AUTH"
$ Deleting a statistical test from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a statistical test, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a statistical test a second time, or a statistical test that does not exist, you will receive a "404 not found" response.
However, if you try to delete a statistical test that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Statistical Tests
To list all the statistical tests, you can use the statisticaltest base URL. By default, only the 20 most recent statistical tests will be returned. You can see below how to change this number using the limit parameter.
You can get your list of statistical tests directly in your browser using your own username and API key with the following links.
https://bigml.io/statisticaltest?$BIGML_AUTH
> Listing statistical tests from a browser
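For instance, using the limit parameter mentioned above (shown here as a sketch; adjust the value as needed), you could ask for only the 5 most recent statistical tests:
curl "https://bigml.io/statisticaltest?$BIGML_AUTH;limit=5"
> Listing the 5 most recent statistical tests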
Models
Last Updated: Monday, 2017-10-30 10:31
A model is a tree-like representation of your dataset with predictive power. You can create a model selecting which fields from your dataset you want to use as input fields (or predictors) and which field you want to predict, the objective field.
Each node in the model corresponds to one of the input fields. Each node has an incoming branch, except the top node, also known as the root, which has none. Each node has a number of outgoing branches, except those at the bottom (the "leaves"), which have none.
Each branch represents a possible value for the input field where it originates. A leaf represents the value of the objective field given all the values for each input field in the chain of branches that goes from the root to that leaf.
When you create a new model, BigML.io will automatically compute a classification model or regression model depending on whether the objective field that you want to predict is categorical or numeric, respectively.

BigML.io allows you to create, retrieve, update, and delete your models. You can also list all of your models.
Jump to:
- Model Base URL
- Creating a Model
- Model Arguments
- Shuffling the Rows of Your Dataset
- Sampling Your Dataset
- Random Decision Forests
- Retrieving a Model
- Model Properties
- Filtering a Model
- PMML
- Filtering and Paginating Fields from a Model
- Updating a Model
- Deleting a Model
- Listing Models
- Weights
- Weight Field
- Objective Weights
- Automatic Balancing
Model Base URL
You can use the following base URL to create, retrieve, update, and delete models. https://bigml.io/model
Model base URL
All requests to manage your models must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Model
To create a new model, you need to POST to the model base URL an object containing at least the dataset/id that you want to use to create the model. The content-type must always be "application/json".
POST /model?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating model definition
curl "https://bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006"}'
> Creating a model
BigML.io will return the newly created model if the request succeeded.
{
"category": 0,
"code": 201,
"columns": 1,
"created": "2012-11-15T02:32:48.763534",
"credits": 0.017578125,
"credits_per_prediction": 0.0,
"dataset": "dataset/50a453753c1920186d000045",
"dataset_status": true,
"description": "",
"excluded_fields": [],
"fields_meta": {
"count": 0,
"limit": 200,
"offset": 0,
"total": 0
},
"input_fields": [],
"locale": "en-US",
"max_columns": 5,
"max_rows": 150,
"missing_splits": false,
"name": "iris' dataset model",
"number_of_evaluations": 0,
"number_of_predictions": 0,
"number_of_public_predictions": 0,
"objective_field": null,
"objective_fields": [],
"ordering": 0,
"out_of_bag": false,
"price": 0.0,
"private": true,
"project": null,
"randomize": false,
"range": [
1,
150
],
"replacement": false,
"resource": "model/50a454503c1920186d000049",
"rows": 150,
"sample_rate": 1.0,
"size": 4608,
"source": "source/50a4527b3c1920186d000041",
"source_status": true,
"status": {
"code": 1,
"message": "The model is being processed and will be created soon"
},
"tags": [
"species"
],
"updated": "2012-11-15T02:32:48.763566",
"views": 0,
"white_box": false
}
< Example model JSON response
Model Arguments
In addition to the dataset, you can also POST the following arguments.
Argument | Type | Description |
---|---|---|
category
optional |
Integer, default is the category of the dataset |
The category that best describes the model. See the category codes for the complete list of categories.
Example: 1 |
dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
depth_threshold
optional |
Integer, default is 512 |
When the depth in the tree exceeds this value, the tree stops growing. It has no effect if it's bigger than the node_threshold.
Example: 128 |
description
optional |
String |
A description of the model up to 8192 characters long.
Example: "This is a description of my new model" |
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields that won't be included in the model.
Example:
|
fields
optional |
Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed. |
This can be used to change the names of the fields in the model with respect to the original names in the dataset or to tell BigML that certain fields should be preferred. Add an entry keyed with the field id generated in the source for each field whose name you want updated.
Example:
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields to be included as predictors in the model.
Example:
|
missing_splits
optional |
Boolean, default is false |
Defines whether to explicitly include missing field values when choosing a split. When this option is enabled, the model generates predicates whose operators include an asterisk, such as >*, <=*, =*, or !=*. The presence of an asterisk means "or missing". So a split with the operator >* and the value 8 can be read as "x > 8 or x is missing". When using missing_splits there may also be predicates with operators = or != but with a null value. These mean "x is missing" and "x is not missing", respectively.
Example: true |
name
optional |
String, default is dataset's name |
The name you want to give to the new model.
Example: "my new model" |
node_threshold
optional |
Integer, default is 512 |
When the number of nodes in the tree exceeds this value, the tree stops growing.
Example: 1000 |
objective_field
optional |
String, default is the id of the last field in the dataset |
Specifies the id of the field that you want to predict.
Example: "000003" |
objective_fields
optional |
Array, default is an array with the id of the last field in the dataset |
Specifies the id of the field that you want to predict. Even though this is an array, BigML.io only accepts one objective field in the current version. If both objective_field and objective_fields are specified, objective_field takes precedence.
Example: ["000003"] |
ordering
optional |
Integer, default is 0 (deterministic). |
Specifies the type of ordering followed to build the model. There are three different types that you can specify: 0 (deterministic shuffling, the default), 1 (linear, no shuffling), and 2 (random shuffling). See the Section on Shuffling the Rows of Your Dataset below for details.
Example: 1 |
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
project
optional |
String |
The project/id you want the model to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
random_candidate_ratio
optional |
Float |
A real number between 0 and 1. When randomize is true and random_candidate_ratio is given, BigML randomizes the tree and uses random_candidate_ratio * total fields (counting the number of terms in text fields as fields). To get the final number of candidate fields we round down to the nearest integer, but if the result is 0 we'll use 1 instead. If both random_candidates and random_candidate_ratio are given, BigML ignores random_candidate_ratio.
Example: 0.2 |
random_candidates
optional |
Integer, default is the square root of the total number of input fields. |
Sets the number of random fields considered when randomize is true.
Example: 10 |
randomize
optional |
Boolean, default is false |
Setting this parameter to true will consider only a subset of the possible fields when choosing a split. See the Section on Random Decision Forests below.
Example: true |
range
optional |
Array, default is [1, max rows in the dataset] |
The range of successive instances to build the model.
Example: [1, 150] |
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true |
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5 |
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
split_candidates
optional |
Integer, default is 32 |
The number of split points that are considered whenever the tree evaluates a numeric field. Minimum 1 and maximum 1024.
Example: 128 |
stat_pruning
optional |
Boolean |
Activates statistical pruning on your decision tree model.
Example: true |
support_threshold
optional |
Float, default is 0 |
This parameter controls the minimum amount of support each child node must contain to be valid as a possible split. So, if it is 3, both children of a new split must have at least 3 instances supporting them. Since instances may have non-integer weights, non-integer values are valid.
Example: 16 |
tags
optional |
Array of Strings |
A list of strings that help classify and index your model.
Example: ["best customers", "2018"] |
You can also use curl to customize a new model. For example, to create a new model named "my model", with only certain rows, and with only three fields:
curl "https://bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006",
"input_fields": ["000001", "000003"],
"name": "my model",
"range": [25, 125]}'
> Creating a customized model
If you do not specify a name, BigML.io will assign the dataset's name to the new model. If you do not specify a range of instances, BigML.io will use all the instances in the dataset. If you do not specify any input fields, BigML.io will include all the input fields in the dataset, and if you do not specify an objective field, BigML.io will use the last field in your dataset.
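For instance, the request below (a sketch using the documented objective_field argument; the field id shown is just an illustration) creates a model that predicts the first field instead of the last one:
curl "https://bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006",
"objective_field": "000000"}'
> Creating a model with an explicit objective field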
Shuffling the Rows of Your Dataset
By default, rows from the input dataset are deterministically shuffled before being processed, to avoid inaccurate models caused by ordered fields in the input rows. Since the shuffling is deterministic, i.e., always the same for a given dataset, retraining a model for the same dataset will always yield the same result.
However, you can modify this default behaviour by including the ordering argument in the model creation request, where "ordering" here is a shortcut for "ordering for the traversal of input rows". When this property is absent or set to 0, deterministic shuffling takes place; otherwise, you can set it to:
- Linear: If you know that your input is already in random order. Setting "ordering" to 1 in your model request tells BigML to traverse the dataset in a linear fashion, without performing any shuffling (and therefore operating faster).
- Random: If you'd like to perform a really random shuffling, most probably different from any other one attempted before. Setting "ordering" to 2 will shuffle the input rows non-deterministically.
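As an illustration, a request like the sketch below uses the ordering argument described above to build a model traversing the dataset linearly, without shuffling:
curl "https://bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006", "ordering": 1}'
> Creating a model with linear ordering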
Sampling Your Dataset
You can limit the dataset rows that are used to create a model in two ways (which can be combined), namely, by specifying a row range and by asking for a sample of the (already clipped) input rows.
The row range is specified with the range argument defined in the Section on Arguments above.
To specify a sample, which is taken over the row range or over the whole dataset if a range is not provided, you can add the following arguments to the creation request:
- sample_rate : A positive number that specifies the sampling rate, i.e., how often we pick a row from the range. In other words, the final number of rows will be the size of the range multiplied by the sample_rate, unless "out_of_bag" is true (see below).
- replacement : A boolean indicating whether sampling should be performed with or without replacement, i.e., the same instance may be selected multiple times for inclusion in the result set. Defaults to false.
- out_of_bag : If an instance isn't selected as part of a sampling, it's called out of bag. Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. This can be useful when paired with "seed". When replacement is false, the final number of rows returned is the size of the range multiplied by one minus the sample_rate. Out-of-bag sampling with replacement gives rise to variable-size samples. Defaults to false.
- seed : Rows are sampled probabilistically using a random string, which means that, in general, two identical samples of the same row range of the same dataset will be different. If you provide a seed (as an arbitrary string), its hash value will be used as the seed, and it'll be possible for you to generate deterministic samples.
Finally, note that the "ordering" of the dataset described in the previous subsection is used on the result of the sampling.
Here's an example of a model request with range and sampling specifications:
curl "https://bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/505f43223c1920eccc000297", "range": [1, 5000], "sample_rate": 0.5, "replacement": true}'
Creating a model using sampling
Random Decision Forests
A model can be randomized by setting the randomize parameter to true. The default is false.
When randomized, the model considers only a subset of the possible fields when choosing a split. The size of the subset will be the square root of the total number of input fields. So if there are 100 input fields, each split will only consider 10 fields randomly chosen from the 100. Every split will choose a new subset of fields.
Although randomize could be used for other purposes, it's intended for growing random decision forests. To grow tree models for a random forest, set randomize to true and select a sample from the dataset. Traditionally this is a 1.0 sample rate with replacement, but we suggest a 0.63 sample rate without replacement.
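For example, a request like the sketch below combines the randomize and sampling arguments already described to grow one tree suitable for a random decision forest:
curl "https://bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006",
"randomize": true,
"sample_rate": 0.63,
"replacement": false}'
> Creating a randomized model for a random decision forest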
Retrieving a Model
Each model has a unique identifier in the form "model/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the model.
To retrieve a model with curl:
curl "https://bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH"
$ Retrieving a model from the command line
You can also use your browser to visualize the model using the full BigML.io URL or by pasting the model/id into the BigML.com dashboard.
Model Properties
Once a model has been successfully created it will have the following properties.
Property | Type | Description |
---|---|---|
boosted_ensemble
filterable, sortable |
Boolean | Whether the model was built as part of an ensemble with boosted trees. |
boosting | Object |
Boosting attribute for the boosted tree. See the Gradient Boosting section for more information.
Example:
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
code | Integer | HTTP status code. This will be 201 upon successful creation of the model and 200 afterwards. Check the code that comes with the status attribute to verify that the model creation has been completed without errors. |
columns
filterable, sortable |
Integer | The number of fields in the model. |
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the model was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
credits
filterable, sortable |
Float | The number of credits it cost you to create this model. |
credits_per_prediction
filterable, sortable, updatable |
Float | This is the number of credits that other users will consume to make a prediction with your model if you made it public. |
dataset
filterable, sortable |
String | The dataset/id that was used to build the model. |
dataset_field_types | Object | A dictionary that informs about the number of fields of each type in the dataset used to create the model. It has an entry per each field type (categorical, datetime, numeric, and text), an entry for preferred fields, and an entry for the total number of fields. |
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
description
updatable |
String | A text describing the model. It can contain restricted markdown to decorate the text. |
ensemble
filterable, sortable |
Boolean | Whether the model was built as part of an ensemble or not. |
ensemble_id
filterable, sortable |
String | The ensemble id. |
ensemble_index
filterable, sortable |
Integer | The index (order) of the model within the ensemble. |
excluded_fields | Array | The list of field ids that were excluded when building the model. |
fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned. |
input_fields | Array | The list of input fields' ids used to build the model. |
locale | String | The dataset's locale. |
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the model. |
max_rows
filterable, sortable |
Integer | The maximum number of instances in the dataset that can be used to build the model. |
missing_splits
filterable, sortable |
Boolean | Whether to explicitly include missing field values when choosing a split while growing a model. |
model | Object | All the information that you need to recreate or use the model on your own. It includes a very intuitive description of the tree-like structure that makes up the model and the fields dictionary describing the fields and their summaries. |
name
filterable, sortable, updatable |
String | The name of the model as you provided it or, by default, based on the name of the dataset. |
node_threshold
filterable, sortable |
String | The maximum number of nodes that the model will grow. |
number_of_batchpredictions
filterable, sortable |
Integer | The current number of batch predictions that use this model. |
number_of_evaluations
filterable, sortable |
Integer | The current number of evaluations that use this model. |
number_of_predictions
filterable, sortable |
Integer | The current number of predictions that use this model. |
number_of_public_predictions
filterable, sortable |
Integer | The current number of public predictions that use this model. |
objective_field | String | The id of the field that the model predicts. |
objective_fields | Array | Specifies the list of ids of the field that the model predicts. Even though this is an array, BigML.io only accepts one objective field in the current version. |
ordering
filterable, sortable |
Integer |
The order used to choose instances from the dataset to build the model. There are three different types: 0 (deterministic), 1 (linear), and 2 (random).
|
out_of_bag
filterable, sortable |
Boolean | Whether the out-of-bag instances were used to create the model instead of the sampled instances. |
price
filterable, sortable, updatable |
Float | The price other users must pay to clone your model. |
private
filterable, sortable, updatable |
Boolean | Whether the model is public or not. |
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
random_candidate_ratio
filterable, sortable |
Float | The random candidate ratio considered when randomize is true. |
random_candidates
filterable, sortable |
Integer | The number of random fields considered when randomize is true. |
randomize
filterable, sortable |
Boolean | Whether the model splits considered only a random subset of the fields or all the fields available. |
range | Array | The range of instances used to build the model. |
replacement
filterable, sortable |
Boolean | Whether the instances sampled to build the model were selected using replacement or not. |
resource | String | The model/id. |
rows
filterable, sortable |
Integer | The total number of instances used to build the model. |
sample_rate
filterable, sortable |
Float | The sample rate used to select instances from the dataset to build the model. |
seed
filterable, sortable |
String | The string that was used to generate the sample. |
selective_pruning
filterable, sortable |
Boolean | If true, selective pruning throttled the strength of the statistical pruning depending on the size of the dataset. |
shared
filterable, sortable, updatable |
Boolean | Whether the model is shared using a private link or not. |
shared_clonable
filterable, sortable, updatable |
Boolean | Whether the shared model can be cloned or not. |
shared_hash | String | The hash that gives access to this model if it has been shared using a private link. |
sharing_key | String | The alternative key that gives read access to this model. |
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this model. |
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
split_candidates
filterable, sortable |
Integer | The number of split points that are considered whenever the tree evaluates a numeric field. Minimum 1 and maximum 1024. |
stat_pruning
filterable, sortable |
Boolean | Whether statistical pruning was used when building the model. |
status | Object | A description of the status of the model. It includes a code, a message, and some extra information. See the table below. |
subscription
filterable, sortable |
Boolean | Whether the model was created using a subscription plan or not. |
support_threshold
filterable, sortable |
Float | The parameter controls the minimum amount of support each child node must contain to be valid as a possible split. |
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the model was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
white_box
filterable, sortable |
Boolean | Whether the model is publicly shared as a white-box. |
A Model Object has the following properties:
Property | Type | Description |
---|---|---|
depth_threshold | Integer | The depth, or generation, limit for a tree. |
distribution | Object | This dictionary gives information about how the training data is distributed across the tree leaves. More concretely, it contains the training data distribution with key training, and the distribution for the actual prediction values of the tree with key predictions. The former is just the objective_summary of the tree root (see below), copied for easier individual retrieval, and both have the format of the objective summary in the tree nodes. |
fields
updatable |
Object | A dictionary with an entry per field in the dataset used to build the model. Fields are paginated according to the fields_meta attribute. Each entry includes the column number in the original source, the name of the field, the type of the field, and the summary. See this Section for more details. |
importance | Array of Arrays | A list of pairs [field_id, importance]. Importance is the amount by which each field in the model reduces prediction error, normalized to be between zero and one. Note that fields with an importance of zero may still be correlated with the objective; they were just not used in the model. |
kind | String | The type of model. Currently, only stree is supported. |
missing_strategy | String | Default strategy followed by the model when it finds a missing value. Currently, last_prediction. At prediction time you can opt for using proportional. See this Section for more details. |
model_fields | Object | A dictionary with an entry per field used by the model (not all the fields that were available in the dataset). They follow the same structure as the fields attribute above except that the summary is not present. |
root | Object | A Node Object, a tree-like recursive structure representing the model. |
split_criterion | Integer | Method of choosing best attribute and split point for a given node. DEPRECATED |
support_threshold | Float | A number between 0 and 1. For a split to be valid, each child's support (instances / total instances) must be greater than this threshold. |
Node Objects have the following properties:
Property | Type | Description |
---|---|---|
children | Array | Array of Node Objects. |
confidence | Float | For classification models, a number between 0 and 1 that expresses how certain the model is of the prediction. For regression models, a number mapped to the top end of a 95% confidence interval around the expected error at that node (measured using the variance of the output at the node). See the Section on Confidence for more details. Note that for models created using the first versions of BigML this value might be null. |
count | Integer | Number of instances classified by this node. |
objective_summary | Object | An Objective Summary Object summarizes the objective field's distribution at this node. |
output | Number or String | Prediction at this node. |
predicate | Boolean or Object | Predicate structure to make a decision at this node. |
Objective Summary Objects have the following properties:
Property | Type | Description |
---|---|---|
bins | Array | If the objective field is numeric and the number of distinct values is greater than 32. An array that represents an approximate histogram of the distribution. It consists of value pairs, where the first value is the mean of a histogram bin and the second value is the bin population. For more information, see our blog post or read this paper. |
categories | Array | If the objective field is categorical, an array of pairs where the first element of each pair is one of the unique categories and the second element is the count for that category. |
counts | Array | If the objective field is numeric and the number of distinct values is less than or equal to 32, an array of pairs where the first element of each pair is one of the unique values found in the field and the second element is the count. |
maximum | Number | The maximum of the objective field's values. Available when 'bins' is present. |
minimum | Number | The minimum of the objective field's values. Available when 'bins' is present. |
Predicate Objects have the following properties:
Model Status
Creating a model is a process that can take just a few seconds or a few days depending on the size of the dataset used as input and on the workload of BigML's systems. The model goes through a number of states until it is fully completed. Through the status field in the model you can determine when the model has been fully processed and is ready to be used to create predictions. These are the properties that a model's status has:
Property | Type | Description |
---|---|---|
code | Integer | A status code that reflects the status of the model creation. It can be any of those that are explained here. |
elapsed | Integer | Number of milliseconds that BigML.io took to process the model. |
message | String | A human readable message explaining the status. |
progress | Float, between 0 and 1 | How far BigML.io has progressed building the model. |
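As a rough sketch (not an official client; a real application should parse the JSON response properly instead of pattern-matching it), you could poll the model until its status code reaches 5:
# crude polling loop: re-fetch the model until a status code of 5 ("finished") appears in the response
until curl -s "https://bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH" | grep -q '"code": 5'; do
sleep 2
done
$ Waiting for a model to finish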
Once a model has been successfully created, it will look like:
{
"category": 0,
"code": 200,
"columns": 5,
"created": "2012-11-15T02:32:48.763000",
"credits": 0.017578125,
"credits_per_prediction": 0.0,
"dataset": "dataset/50a453753c1920186d000045",
"dataset_status": true,
"description": "",
"excluded_fields": [],
"fields_meta": {
"count": 5,
"limit": 200,
"offset": 0,
"total": 5
},
"input_fields": [
"000000",
"000001",
"000002",
"000003"
],
"locale": "en_US",
"max_columns": 5,
"max_rows": 150,
"missing_splits": false,
"model": {
"depth_threshold": 20,
"fields": {
"000000": {
"column_number": 0,
"datatype": "double",
"name": "sepal length",
"optype": "numeric",
"order": 0,
"preferred": true,
"summary": {
"bins": [
[
4.3,
1
],
[
4.425,
4
],
[
4.6,
4
],
[
4.7,
2
],
[
4.8,
5
],
[
4.9,
6
],
[
5,
10
],
[
5.1,
9
],
[
5.2,
4
],
[
5.3,
1
],
[
5.4,
6
],
[
5.5,
7
],
[
5.6,
6
],
[
5.7,
8
],
[
5.8,
7
],
[
5.9,
3
],
[
6,
6
],
[
6.1,
6
],
[
6.2,
4
],
[
6.3,
9
],
[
6.44167,
12
],
[
6.6,
2
],
[
6.7,
8
],
[
6.8,
3
],
[
6.92,
5
],
[
7.1,
1
],
[
7.2,
3
],
[
7.3,
1
],
[
7.4,
1
],
[
7.6,
1
],
[
7.7,
4
],
[
7.9,
1
]
],
"maximum": 7.9,
"mean": 5.84333,
"median": 5.77889,
"minimum": 4.3,
"missing_count": 0,
"population": 150,
"splits": [
4.51526,
4.67252,
4.81113,
4.89582,
4.96139,
5.01131,
5.05992,
5.11148,
5.18177,
5.35681,
5.44129,
5.5108,
5.58255,
5.65532,
5.71658,
5.77889,
5.85381,
5.97078,
6.05104,
6.13074,
6.23023,
6.29578,
6.35078,
6.41459,
6.49383,
6.63013,
6.70719,
6.79218,
6.92597,
7.20423,
7.64746
],
"standard_deviation": 0.82807,
"sum": 876.5,
"sum_squares": 5223.85,
"variance": 0.68569
}
},
"000001": {
"column_number": 1,
"datatype": "double",
"name": "sepal width",
"optype": "numeric",
"order": 1,
"preferred": true,
"summary": {
"counts": [
[
2,
1
],
[
2.2,
3
],
[
2.3,
4
],
[
2.4,
3
],
[
2.5,
8
],
[
2.6,
5
],
[
2.7,
9
],
[
2.8,
14
],
[
2.9,
10
],
[
3,
26
],
[
3.1,
11
],
[
3.2,
13
],
[
3.3,
6
],
[
3.4,
12
],
[
3.5,
6
],
[
3.6,
4
],
[
3.7,
3
],
[
3.8,
6
],
[
3.9,
2
],
[
4,
1
],
[
4.1,
1
],
[
4.2,
1
],
[
4.4,
1
]
],
"maximum": 4.4,
"mean": 3.05733,
"median": 3.02044,
"minimum": 2,
"missing_count": 0,
"population": 150,
"standard_deviation": 0.43587,
"sum": 458.6,
"sum_squares": 1430.4,
"variance": 0.18998
}
},
"000002": {
"column_number": 2,
"datatype": "double",
"name": "petal length",
"optype": "numeric",
"order": 2,
"preferred": true,
"summary": {
"bins": [
[
1,
1
],
[
1.1,
1
],
[
1.2,
2
],
[
1.3,
7
],
[
1.4,
13
],
[
1.5,
13
],
[
1.63636,
11
],
[
1.9,
2
],
[
3,
1
],
[
3.3,
2
],
[
3.5,
2
],
[
3.6,
1
],
[
3.75,
2
],
[
3.9,
3
],
[
4.0375,
8
],
[
4.23333,
6
],
[
4.46667,
12
],
[
4.6,
3
],
[
4.74444,
9
],
[
4.94444,
9
],
[
5.1,
8
],
[
5.25,
4
],
[
5.46,
5
],
[
5.6,
6
],
[
5.75,
6
],
[
5.95,
4
],
[
6.1,
3
],
[
6.3,
1
],
[
6.4,
1
],
[
6.6,
1
],
[
6.7,
2
],
[
6.9,
1
]
],
"maximum": 6.9,
"mean": 3.758,
"median": 4.34142,
"minimum": 1,
"missing_count": 0,
"population": 150,
"splits": [
1.25138,
1.32426,
1.37171,
1.40962,
1.44567,
1.48173,
1.51859,
1.56301,
1.6255,
1.74645,
3.23033,
3.675,
3.94203,
4.0469,
4.18243,
4.34142,
4.45309,
4.51823,
4.61771,
4.72566,
4.83445,
4.93363,
5.03807,
5.1064,
5.20938,
5.43979,
5.5744,
5.6646,
5.81496,
6.02913,
6.38125
],
"standard_deviation": 1.7653,
"sum": 563.7,
"sum_squares": 2582.71,
"variance": 3.11628
}
},
"000003": {
"column_number": 3,
"datatype": "double",
"name": "petal width",
"optype": "numeric",
"order": 3,
"preferred": true,
"summary": {
"counts": [
[
0.1,
5
],
[
0.2,
29
],
[
0.3,
7
],
[
0.4,
7
],
[
0.5,
1
],
[
0.6,
1
],
[
1,
7
],
[
1.1,
3
],
[
1.2,
5
],
[
1.3,
13
],
[
1.4,
8
],
[
1.5,
12
],
[
1.6,
4
],
[
1.7,
2
],
[
1.8,
12
],
[
1.9,
5
],
[
2,
6
],
[
2.1,
6
],
[
2.2,
3
],
[
2.3,
8
],
[
2.4,
3
],
[
2.5,
3
]
],
"maximum": 2.5,
"mean": 1.19933,
"median": 1.32848,
"minimum": 0.1,
"missing_count": 0,
"population": 150,
"standard_deviation": 0.76224,
"sum": 179.9,
"sum_squares": 302.33,
"variance": 0.58101
}
},
"000004": {
"column_number": 4,
"datatype": "string",
"name": "species",
"optype": "categorical",
"order": 4,
"preferred": true,
"summary": {
"categories": [
[
"Iris-versicolor",
50
],
[
"Iris-setosa",
50
],
[
"Iris-virginica",
50
]
],
"missing_count": 0
}
}
},
"importance": [
[
"000002",
0.53159
],
[
"000003",
0.4633
],
[
"000000",
0.00511
],
[
"000001",
0
]
],
"kind": "stree",
"missing_strategy": "last_prediction",
"model_fields": {
"000000": {
"column_number": 0,
"datatype": "double",
"name": "sepal length",
"optype": "numeric",
"order": 0,
"preferred": true
},
"000001": {
"column_number": 1,
"datatype": "double",
"name": "sepal width",
"optype": "numeric",
"order": 1,
"preferred": true
},
"000002": {
"column_number": 2,
"datatype": "double",
"name": "petal length",
"optype": "numeric",
"order": 2,
"preferred": true
},
"000003": {
"column_number": 3,
"datatype": "double",
"name": "petal width",
"optype": "numeric",
"order": 3,
"preferred": true
},
"000004": {
"column_number": 4,
"datatype": "string",
"name": "species",
"optype": "categorical",
"order": 4,
"preferred": true
}
},
"root": {
"children": [
{
"confidence": 0.92865,
"count": 50,
"objective_summary": {
"categories": [
[
"Iris-setosa",
50
]
]
},
"output": "Iris-setosa",
"predicate": {
"field": "000002",
"operator": "<=",
"value": 2.45
}
},
{
"children": [
{
"children": [
{
"children": [
{
"children": [
{
"children": [
{
"confidence": 0.34237,
"count": 2,
"objective_summary": {
"categories": [
[
"Iris-virginica",
2
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000000",
"operator": ">",
"value": 5.95
}
},
{
"confidence": 0.20654,
"count": 1,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
1
]
]
},
"output": "Iris-versicolor",
"predicate": {
"field": "000000",
"operator": "<=",
"value": 5.95
}
}
],
"confidence": 0.20765,
"count": 3,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
1
],
[
"Iris-virginica",
2
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000000",
"operator": "<=",
"value": 6.4
}
},
{
"confidence": 0.20654,
"count": 1,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
1
]
]
},
"output": "Iris-versicolor",
"predicate": {
"field": "000000",
"operator": ">",
"value": 6.4
}
}
],
"confidence": 0.15004,
"count": 4,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
2
],
[
"Iris-virginica",
2
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000001",
"operator": ">",
"value": 2.9
}
},
{
"confidence": 0.60966,
"count": 6,
"objective_summary": {
"categories": [
[
"Iris-virginica",
6
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000001",
"operator": "<=",
"value": 2.9
}
}
],
"confidence": 0.49016,
"count": 10,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
2
],
[
"Iris-virginica",
8
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000002",
"operator": "<=",
"value": 5.05
}
},
{
"confidence": 0.90819,
"count": 38,
"objective_summary": {
"categories": [
[
"Iris-virginica",
38
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000002",
"operator": ">",
"value": 5.05
}
}
],
"confidence": 0.86024,
"count": 48,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
2
],
[
"Iris-virginica",
46
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000003",
"operator": ">",
"value": 1.65
}
},
{
"children": [
{
"confidence": 0.92444,
"count": 47,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
47
]
]
},
"output": "Iris-versicolor",
"predicate": {
"field": "000002",
"operator": "<=",
"value": 4.95
}
},
{
"children": [
{
"confidence": 0.43849,
"count": 3,
"objective_summary": {
"categories": [
[
"Iris-virginica",
3
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000000",
"operator": ">",
"value": 6.05
}
},
{
"children": [
{
"confidence": 0.20654,
"count": 1,
"objective_summary": {
"categories": [
[
"Iris-virginica",
1
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000003",
"operator": "<=",
"value": 1.55
}
},
{
"confidence": 0.20654,
"count": 1,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
1
]
]
},
"output": "Iris-versicolor",
"predicate": {
"field": "000003",
"operator": ">",
"value": 1.55
}
}
],
"confidence": 0.09453,
"count": 2,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
1
],
[
"Iris-virginica",
1
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000000",
"operator": "<=",
"value": 6.05
}
}
],
"confidence": 0.37553,
"count": 5,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
1
],
[
"Iris-virginica",
4
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000002",
"operator": ">",
"value": 4.95
}
}
],
"confidence": 0.81826,
"count": 52,
"objective_summary": {
"categories": [
[
"Iris-virginica",
4
],
[
"Iris-versicolor",
48
]
]
},
"output": "Iris-versicolor",
"predicate": {
"field": "000003",
"operator": "<=",
"value": 1.65
}
}
],
"confidence": 0.40383,
"count": 100,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
50
],
[
"Iris-virginica",
50
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000002",
"operator": ">",
"value": 2.45
}
}
],
"confidence": 0.26289,
"count": 150,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
50
],
[
"Iris-setosa",
50
],
[
"Iris-virginica",
50
]
]
},
"output": "Iris-virginica",
"predicate": true
},
"split_criterion": "information_gain_mix",
"support_threshold": 0
},
"name": "iris' dataset model",
"number_of_evaluations": 0,
"number_of_predictions": 0,
"number_of_public_predictions": 0,
"objective_field": "000004",
"objective_fields": [
"000004"
],
"ordering": 0,
"out_of_bag": false,
"price": 0.0,
"private": true,
"project": null,
"randomize": false,
"range": [
1,
150
],
"replacement": false,
"resource": "model/50a454503c1920186d000049",
"rows": 150,
"sample_rate": 1.0,
"selective_pruning": true,
"size": 4608,
"source": "source/50a4527b3c1920186d000041",
"source_status": true,
"stat_pruning": true,
"status": {
"code": 5,
"elapsed": 413,
"message": "The model has been created",
"progress": 1.0
},
"tags": [
"species"
],
"updated": "2012-11-15T02:32:50.149000",
"views": 0,
"white_box": false
}
< Example model JSON response
Filtering a Model
It is possible to filter the tree returned by a GET to the model location by means of two optional query string parameters, namely support and value.
Filter by Support
Support is a number from 0 to 1 that specifies the minimum fraction of the total number of instances that a given branch must cover to be retained in the resulting tree. Thus, asking for a (minimum) support of 0 is just asking for the whole tree, while something like:
curl "https://bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;support=1.0"
Filter Example
will return just the root node, that being the only one that covers all instances. If you repeat the support parameter in the query string, the last one is used. Non-parseable support values are ignored.
Filter by Values and Value Intervals
Value is a concrete value or interval of values (for regression trees) that a leaf must predict to be kept in the returning tree. For instance:
curl "https://bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;value=Iris-setosa"
Filter Example
will return only those branches in the tree whose leaves predict "Iris-setosa" as the value of the (categorical) objective field, while something like:
curl "https://bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;value=[10,20]"
Filter Example
for a regression model will include only those leaves predicting an objective value between 10 and 20. You can also specify exact values for regression models:
curl "https://bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;value=23.2"
Filter Example
will retrieve only those branches whose predictions are exactly 23.2. It is possible to specify multiple values, as in:
curl "https://bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;value=Iris-setosa&value=Iris-versicolor"
curl "https://bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;value=(10,20]&value=[-1.234,3.3)"
curl "https://bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;value=(10.2,20)&value=28.1&value=0.1"
Filter Example
in which case the union of the different predicates is used (i.e., the first query will return a tree with all leaves predicting "Iris-setosa" and all leaves predicting "Iris-versicolor").
Intervals can be closed or open in either end. For example, "(-2,10]", "[1,2)" or "(-1.234,0)", and the values of the left or right limits can be omitted, in which case they're taken as negative and positive infinity, respectively; thus "(,3]" denotes all values less or equal to three, as does "[,3]" (infinity not being a valid value for a numeric prediction), while "(0,)" accepts any positive value.
Filter by Confidence
Confidence is a concrete value or interval of values that a leaf must have to be kept in the returning tree. The specification of intervals follows the same conventions as those of value. Since confidences are a continuous value, the most common case will be asking for a range, but the service will accept also individual values. It's also possible to specify both a value and a confidence. For instance:
curl "https://bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;value=Iris-setosa&confidence=[0.3,]"
Filter Example
asks for a tree with only those leaves that predict "Iris-setosa" with a confidence greater or equal to 0.3, while
curl "https://bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;confidence=[,0.25)"
Filter Example
returns a model with only those leaves whose confidence is strictly less than 0.25.
Finally, note that it is also possible to specify support, value, and confidence parameters in the same query.
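For instance, a combined query (illustrative only, reusing the parameters and interval syntax shown above) could look like:
curl "https://bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;support=0.2;value=Iris-setosa;confidence=[0.3,]"
Filter Example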
PMML
The default model output format is JSON. However, the pmml parameter allows you to include a PMML version of the model. The model will then include an XML document that conforms to PMML v4.1. For example:
curl "https://bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;pmml=yes"
PMML Example
curl "https://bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;pmml=only"
PMML Example
Filtering and Paginating Fields from a Model
A model might be composed of hundreds or even thousands of fields. Thus when retrieving a model, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the source is the same as the one you would get without any filtering parameter above.
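For example, assuming the same limit and offset query-string parameters used for other resources (an assumption based on the fields_meta structure), you could fetch the model's fields two at a time:
# limit and offset are assumed here to control field pagination
curl "https://bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;limit=2;offset=2"
$ Paginating a model's fields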
The fields_meta field can help you paginate fields. Its structure is as follows:
Updating a Model
To update a model, you need to PUT an object containing the fields that you want to update to the model's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated model.
For example, to update a model with a new name you can use curl like this:
curl "https://bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name"}' \
-H 'content-type: application/json'
$ Updating a model's name
If you want to update a model with a new label and description for a specific field you can use curl like this:
curl "https://bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH" \
-X PUT \
-d '{"fields": {"000000": {"label": "a longer name", "description": "an even longer description"}}}' \
-H 'content-type: application/json'
$ Updating a model's field, label, and description
Deleting a Model
To delete a model, you need to issue a HTTP DELETE request to the model/id to be deleted.
Using curl you can do something like this to delete a model:
curl -X DELETE "https://bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH"
$ Deleting a model from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a model, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a model a second time, or a model that does not exist, you will receive a "404 not found" response.
However, if you try to delete a model that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Models
To list all the models, you can use the model base URL. By default, only the 20 most recent models will be returned. You can see below how to change this number using the limit parameter.
You can get your list of models directly in your browser using your own username and API key with the following links.
https://bigml.io/model?$BIGML_AUTH
> Listing models from a browser
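For example, to list only the 5 most recent models using the limit parameter mentioned above:
curl "https://bigml.io/model?$BIGML_AUTH;limit=5"
> Listing the 5 most recent models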
Weights
BigML.io has added three new ways in which you can use weights to deal with imbalanced datasets:
- Weight Field: considering the values one of the fields in the dataset as weight for the instances. This is valid for both regression and classification models.
- Objective Weights: submitting a specific weight for each class in classification models.
- Automatic Balancing: setting the balance argument to true to let BigML automatically balance all the classes evenly.
Let's see each method in more detail.
Weight Field
A weight_field may be declared for either regression or classification models. Any numeric field with no negative or missing values is valid as a weight field. Each instance will be weighted individually according to the weight field's value. See the toy dataset for credit card transactions below.
online, transaction, pending transactions, days since last transaction, distance, transactions today, balance, mtd, fraud, weight
yes, 10, 3, 31, low, 3, -3250, -1500, no, 1
no, 20, 30, 1, high, 0, 0, -300, no, 1
no, 40, 13, 210, low, 1, -19890, -30, no, 1
yes, 500, 0, 1, high, 0, 0, 0, yes, 10
no, 10, 1, 32, low, 0, -2500, -7891, no, 1
yes, 100, 0, 3, low, 0, -5194, -120, no, 1
yes, 100, 1, 4, low, 0, 0, 1500, no, 1
yes, 1000, 0, 1, high, 0, 0, 0, yes, 10
no, 150, 3, 1, low, 5, -3250, 1500, no, 1
no, 75, 5, 1, high, 1, -3250, 1500, no, 1
yes, 10, 23, 0, low, 1, -3250, 1500, no, 1
yes, 10, 3, 31, low, 3, -3250, -1500, no, 1
Example CSV file
curl "https://bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bcdc513c1920e4a300006e",
"objective_field": "000008",
"weight_field": "000009"
}'
> Using a weight field to create a new model
With Flatline, you can define arbitrarily complex functions to produce weight fields, making this the most flexible and powerful way to produce weighted models.
For instance, the request below would create a new dataset from the example above, adding a new weight field that doubles the previous weight whenever the transaction is fraudulent and its amount is higher than 500, and keeps it unchanged otherwise.
curl "https://bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bcdc513c1920e4a300006e",
"new_fields": [{
"field": "(if (and (= (f fraud) \"yes\") (> (f transaction) 500)) (* (f weight) 2) (f weight))",
"name": "new weight"}]
}'
> Creating a new weight field
Objective Weights
The second method for adding weights only applies to classification models. A set of objective_weights may be defined, one per objective class. Each instance will be weighted according to its class weight.
curl "https://bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bcdc513c1920e4a300006e",
"objective_field": "000008",
"excluded_fields": ["000009"],
"objective_weights": [["yes", 10], ["no", 1]]
}'
> Using objective weights to create a new model
If a class is not listed in the objective_weights, it is assumed to have a weight of 1. This means the example below is equivalent to the example above.
curl "https://bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bcdc513c1920e4a300006e",
"objective_field": "000008",
"excluded_fields": ["000009"],
"objective_weights": [["yes", 10]]
}'
> Using objective weights to create a new model
Weights of zero are valid as long as there are some positive-valued weights. If every weight ends up being zero (which is possible with sampled datasets), then the resulting model will have a single node with a nil output.
Automatic Balancing
Finally, we provide a convenience shortcut for specifying weights for a classification objective that are inversely proportional to their category counts, by means of the balance_objective flag.
For instance, if the category counts of the objective field are, say:
[["Iris-versicolor", 20], ["Iris-virginica", 10], ["Iris-setosa", 5]]
Category counts
the request:
curl "https://bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bb141a3c19200bdf00000c",
"balance_objective": true
}'
> Using balance_objective to create a new model
would be equivalent to:
curl "https://bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bb141a3c19200bdf00000c",
"objective_weights": [
["Iris-versicolor", 1],
["Iris-virginica", 2],
["Iris-setosa", 4]]}'
> Using objective_weights to create a new model
The next table summarizes all the available arguments to use weights.
The nodes for a weighted tree will include a weight and weighted_objective_distribution, which are the weighted analogs of count and objective_distribution. Confidence, importance, and pruning calculations also take weights into account.
{
"id":0,
"children":[
{
"id":1,
"children":[
{
"output":"Iris-virginica",
"count":10,
"objective_summary":{
"categories":[
[
"Iris-virginica",
10
]
]
},
"predicate":{
"value":1.7,
"operator":">",
"field":"000003"
},
"weighted_objective_summary":{
"categories":[
[
"Iris-virginica",
10
]
]
},
"weight":10,
"confidence":0.72246,
"id":2
},
{
"output":"Iris-versicolor",
"count":20,
"objective_summary":{
"categories":[
[
"Iris-versicolor",
20
]
]
},
"predicate":{
"value":1.7,
"operator":"<=",
"field":"000003"
},
"weighted_objective_summary":{
"categories":[
[
"Iris-versicolor",
20
]
]
},
"weight":20,
"confidence":0.83887,
"id":3
}
],
"weighted_objective_summary":{
"categories":[
[
"Iris-versicolor",
20
],
[
"Iris-virginica",
10
]
]
},
"weight":30,
"predicate":{
"value":0.6,
"operator":">",
"field":"000003"
},
"confidence":0.4878,
"count":30,
"output":"Iris-versicolor",
"objective_summary":{
"categories":[
[
"Iris-versicolor",
20
],
[
"Iris-virginica",
10
]
]
}
},
{
"output":"Iris-setosa",
"count":5,
"objective_summary":{
"categories":[
[
"Iris-setosa",
5
]
]
},
"predicate":{
"value":0.6,
"operator":"<=",
"field":"000003"
},
"weighted_objective_summary":{
"categories":[
[
"Iris-setosa",
100
]
]
},
"weight":100,
"confidence":0.56551,
"id":4
}
],
"weighted_objective_summary":{
"categories":[
[
"Iris-setosa",
100
],
[
"Iris-versicolor",
20
],
[
"Iris-virginica",
10
]
]
},
"weight":130,
"predicate":true,
"confidence":0.60745,
"count":35,
"output":"Iris-setosa",
"objective_summary":{
"categories":[
[
"Iris-versicolor",
20
],
[
"Iris-virginica",
10
],
[
"Iris-setosa",
5
]
]
}
}
< Example weighted model JSON response
Ensembles
Last Updated: Tuesday, 2018-03-13 12:20
Depending on the nature of your data and the specific parameters of the ensemble, an ensemble can significantly boost predictive performance over a single model, using exactly the same data.
You can create an ensemble just as you would create a model with the following three basic machine learning techniques: bagging, random decision forests, and gradient tree boosting.
Bagging, also known as bootstrap aggregating, is one of the simplest ensemble-based strategies but often outperforms strategies that are more complex. The basic idea is to use a different random subset of the original dataset for each model in the ensemble. Specifically BigML uses by default a sampling rate of 100% with replacement for each model. You can read more about bagging here.
Random decision forests are the second ensemble-based strategy that BigML provides. It consists, essentially, of selecting a new random set of the input fields at each split while an individual model is being built, instead of considering all the input fields. To create a random decision forest you just need to set the randomize argument to true. You can read more about random decision forests here.
Gradient tree boosting is the third strategy. Its predictions are additive: each tree modifies the predictions of the previously grown trees. You must specify the boosting argument in order to apply this technique.
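For example, a minimal request might look like the sketch below; the iterations sub-option and its value are shown only as an assumed illustration, and the complete set of boosting sub-arguments is described in the Gradient Boosting section:
curl "https://bigml.io/ensemble?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/50e8d4f03c19202d91000004",
"boosting": {"iterations": 10}}'
> Creating a boosted ensemble (the boosting options shown are assumed example values)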

BigML.io allows you to create, retrieve, update, and delete your ensembles. You can also list all of your ensembles.
Jump to:
- Ensemble Base URL
- Creating an Ensemble
- Ensemble Arguments
- Gradient Boosting
- Retrieving an Ensemble
- Ensemble Properties
- Updating an Ensemble
- Deleting an Ensemble
- Listing Ensembles
Ensemble Base URL
You can use the following base URL to create, retrieve, update, and delete ensembles. https://bigml.io/ensemble
Ensemble base URL
All requests to manage your ensembles must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating an Ensemble
To create a new ensemble, you need to POST to the ensemble base URL an object containing at least the dataset/id that you want to use to create the ensemble. The content-type must always be "application/json".
POST /ensemble?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating ensemble definition
curl "https://bigml.io/ensemble?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/50e8d4f03c19202d91000004"}'
> Creating an ensemble
BigML.io will return the newly created ensemble if the request succeeded.
{
"category": 0,
"code": 201,
"columns": 9,
"created": "2013-01-11T00:04:20.202976",
"credits": 0.7992858886718751,
"credits_per_prediction": 0.0,
"dataset": "dataset/50e8d4f03c19202d91000004",
"dataset_status": true,
"description": "",
"ensemble_sample": {
"rate": 0.8,
"replacement": true,
"seed": "my ensemble sample seed"
},
"error_models": 0,
"finished_models": 0,
"locale": "en-US",
"max_columns": 9,
"max_rows": 768,
"missing_splits": true,
"models": [
"model/50ef57043c19208c50000026",
"model/50ef57043c19208c50000029",
"model/50ef57043c19208c5000002c",
"model/50ef57053c19208c5000002f",
"model/50ef57053c19208c50000032",
"model/50ef57053c19208c50000035",
"model/50ef57063c19208c50000038",
"model/50ef57063c19208c5000003b",
"model/50ef57063c19208c5000003e",
"model/50ef57073c19208c50000041"
],
"name": "diabetes' dataset ensemble",
"number_of_evaluations": 0,
"number_of_models": 10,
"number_of_predictions": 0,
"number_of_public_predictions": 0,
"ordering": 0,
"out_of_bag": false,
"price": 0.0,
"private": true,
"project": null,
"randomize": false,
"range": [
1,
768
],
"replacement": false,
"resource": "ensemble/50ef57043c19208c50000022",
"rows": 768,
"sample_rate": 0.8,
"seed": "a0f717f2b3954111b27fcc23f5a85787",
"size": 209528,
"source": "source/50e8d4ea3c19202d91000000",
"source_status": true,
"status": {
"code": 3,
"message": "The ensemble creation has been started"
},
"tags": [
"diabetes"
],
"updated": "2013-01-11T00:04:20.203007",
"views": 0,
"white_box": false
}
< Example ensemble JSON response
Ensemble Arguments
In addition to the dataset, you can also POST the following arguments, and like models, you can use weights to deal with imbalanced datasets. Click here to find more information about weights.
Argument | Type | Description |
---|---|---|
boosting
optional |
Object |
Gradient boosting options for the ensemble. Required to create an ensemble with boosted trees. See the Gradient Boosting section for more information.
Example:
|
category
optional |
Integer, default is the category of the dataset |
The category that best describes the ensemble. See the category codes for the complete list of categories.
Example: 1 |
dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
depth_threshold
optional |
Integer, default is 512 |
When the depth in the tree exceeds this value, the tree stops growing. It has no effect if it's bigger than the node_threshold.
Example: 128 |
description
optional |
String |
A description of the ensemble up to 8192 characters long.
Example: "This is a description of my new ensemble" |
ensemble_sample
optional |
Object |
The sampling to be used for each tree in the ensemble. It can contain rate (default is 1), replacement (default is true), and seed parameters. Note that this is different from the sample_rate, replacement, and seed used in other models, predictions, or datasets, where sampling is applied once to the input dataset; here it is applied multiple times to the input in order to create a separate sampling for each tree composing the final ensemble. Therefore there is no out_of_bag parameter here, and the seed is used to create a different seed for each of the generated trees.
Example:
|
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields that won't be included in the models of the ensemble.
Example:
|
fields
optional |
Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed. |
This can be used to change the names of the fields in the ensemble with respect to the original names in the dataset, or to tell BigML that certain fields should be preferred. Include an entry keyed by the field id generated in the source for each field whose name you want updated.
Example:
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields to be included as predictors in the models of the ensemble.
Example:
|
missing_splits
optional |
Boolean, default is true |
Defines whether to explicitly include missing field values when choosing a split while growing the models of the ensemble. When this option is enabled, each model generates predicates whose operators include an asterisk, such as >*, <=*, =*, or !=*. The presence of an asterisk means "or missing". So a split with the operator >* and the value 8 can be read as "x > 8 or x is missing". When using missing_splits there may also be predicates with operators = or != but with a null value. These mean "x is missing" and "x is not missing", respectively.
Example: false |
name
optional |
String, default is dataset's name |
The name you want to give to the new ensemble.
Example: "my new ensemble" |
node_threshold
optional |
Integer, default is 512 |
When the number of nodes in the tree exceeds this value, the tree stops growing.
Example: 1000 |
number_of_models
optional |
Integer, default is 10 |
The number of models to build the ensemble. This parameter is ignored for boosted trees. See the Gradient Boosting section for more information.
Example: 100 |
objective_field
optional |
String, default is the id of the last field in the dataset |
Specifies the id of the field that the ensemble will predict.
Example: "000003" |
ordering
optional |
Integer, default is 0 (deterministic). |
Specifies the type of ordering followed to build the models of the ensemble. There are three different types that you can specify:
Example: 1 |
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
project
optional |
String |
The project/id you want the ensemble to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
random_candidate_ratio
optional |
Float |
A real number between 0 and 1. When randomize is true and random_candidate_ratio is given, BigML randomizes the trees and uses random_candidate_ratio * total fields (counting the number of terms in text fields as fields). To get the final number of candidate fields we round down to the nearest integer, but if the result is 0 we'll use 1 instead. If both random_candidates and random_candidate_ratio are given, BigML ignores random_candidate_ratio.
Example: 0.2 |
random_candidates
optional |
Integer, default is the square root of the total number of input fields. |
Sets the number of random fields considered when randomize is true.
Example: 10 |
randomize
optional |
Boolean, default is false |
Setting this parameter to true will consider only a subset of the possible fields when choosing a split. See the Section on Random Decision Forests for further details.
Example: true |
range
optional |
Array, default is [1, max rows in the dataset] |
The range of successive instances to build the ensemble.
Example: [1, 150] |
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true |
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5 |
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
split_candidates
optional |
Integer, default is 32 |
The number of split points that are considered whenever the tree evaluates a numeric field. The minimum is 1 and the maximum is 1024.
Example: 128 |
stat_pruning
optional |
Boolean, default is false |
Activates statistical pruning on each decision tree model. It doesn't apply to boosted trees.
Example: true |
support_threshold
optional |
Float, default is 0 |
This parameter controls the minimum amount of support each child node must contain to be valid as a possible split. So, if it is 3, then both children of a new split must have at least 3 instances supporting them. Since instances may have non-integer weights, non-integer values are valid.
Example: 16 |
tags
optional |
Array of Strings |
A list of strings that help classify and index your ensemble.
Example: ["best customers", "2018"] |
You can use curl to customize a new ensemble from the command line. For example, to create a new ensemble named "my ensemble", with only certain rows, and with only three fields:
curl "https://bigml.io/ensemble?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006", "input_fields": ["000001", "000003"], "name": "my dataset", "range": [25, 125]}'
> Creating a customized ensemble
If you do not specify a name, the dataset's name will be assigned to the new ensemble. If you do not specify a range of instances, the complete set of instances in the dataset will be used. If you do not specify any input fields, all the preferred input fields in the dataset will be included, and if you do not specify an objective field, the last field in your dataset will be considered the objective field.
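As another illustrative sketch, the request below combines arguments from the table above (randomize, number_of_models, and ensemble_sample) to create a random decision forest; the dataset/id and the specific values are only placeholders:
curl "https://bigml.io/ensemble?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006", "randomize": true, "number_of_models": 50, "ensemble_sample": {"rate": 0.8, "replacement": true}}'
> Creating a random decision forest (sketch)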
Gradient Boosting
When doing boosting, the number_of_models parameter described above is no longer valid as an input. The number_of_models will now indicate the maximum number of boosting iterations explained below. Note that when the gradient boosting option is applied to classification models, the actual number of models created will be the product of the number of classes (categories) and the number of iterations. For example, if you set the boosting iterations to 12 and the number of classes is 3, then the number of models created will be 36, or fewer if an early stopping strategy is used.
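For instance, a hedged sketch of creating a boosted ensemble follows; the boosting object is required, and the iterations key shown below is an assumption used only for illustration:
curl "https://bigml.io/ensemble?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006", "boosting": {"iterations": 10}}'
> Creating a boosted ensemble (sketch; the iterations key is an assumption)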
In addition, our implementation of boosted trees supports a number of additional parameters, all of which are part of the boosting object.
If the boosted trees are using one of the early stopping tests (early_out_of_bag or early_holdout), then it will also have a list of scores indicating the quality of the boosted trees after each iteration.
Individual trees in the boosted trees differ from trees in bagged or random forest ensembles. Primarily the difference is that boosted trees do not try to predict the objective field directly. Instead, they try to fit a gradient (correcting for mistakes made in previous iterations), and this will be stored under a new field, named gradient.
This means the predictions from boosted trees cannot be combined using the regular ensemble combiners. Instead, boosted trees use their own combiner, which relies on a few new parameters included with individual boosted trees. These new parameters are contained in the boosting attribute of each boosted tree, which may include the following properties.
- objective_class will indicate the class that each tree helps predict if boosting is used for a classification problem (there will be one tree for each class for every boosting iteration).
- objective_field: contains the field id of the original objective field, as boosted trees will always be regression trees whose new objective is a new generated field (the previously mentioned gradient).
- weight: captures the influence each tree has when computing predictions.
- lambda: helps regulate the strength of a tree's output. It's included for generating predictions when encountering missing data and using the proportional strategy.
Nodes in boosted trees will also contain two new boosting related parameters, g_sum and h_sum. These are sums of the first and second order gradients, and are needed for generating predictions when encountering missing data and using the proportional strategy.
For regression problems, a prediction is generated by finding the prediction from each individual tree and doing a weighted sum using each tree's weight. Predictions for classification problems are similar, but separate weighted sums are found for each objective_class. That vector of weighted sums is then transformed into class probabilities using the soft max function.
Retrieving an Ensemble
Each ensemble has a unique identifier in the form "ensemble/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the ensemble.
To retrieve an ensemble with curl:
curl "https://bigml.io/ensemble/50ef57043c19208c50000022?$BIGML_AUTH"
$ Retrieving an ensemble from the command line
You can also use your browser to visualize the ensemble using the full BigML.io URL or pasting the ensemble/id into the BigML.com dashboard.
Ensemble Properties
Once an ensemble has been successfully created it will have the following properties.
Property | Type | Description |
---|---|---|
boosting | Object | Gradient boosting options for the ensemble, including scores, which indicate the quality of the boosted trees after each iteration, and final_iterations. See the Gradient Boosting section for more information. |
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
code | Integer | HTTP status code. This will be 201 upon successful creation of the ensemble and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the ensemble creation has been completed without errors. |
columns
filterable, sortable |
Integer | The number of fields in the ensemble. |
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the ensemble was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
credits
filterable, sortable |
Float | The number of credits it cost you to create this ensemble. |
credits_per_prediction
filterable, sortable, updatable |
Float | This is the number of credits that other users will consume to make a prediction with your ensemble in case you decide to make it public. |
dataset
filterable, sortable |
String | The dataset/id that was used to build the ensemble. |
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
description
updatable |
String | A text describing the ensemble. It can contain restricted markdown to decorate the text. |
distributions | Array | Unordered list of distributions for each model in the ensemble. Each distribution is an Object with an entry for the distribution of instances in the training set and another for the distribution of predictions in the model. See a model distribution field for more details. Note that distributions must be accessed by the model_order below. |
ensemble_sample | Object | The sampling to be used for each tree in the ensemble. |
error_models
filterable, sortable |
Integer | The number of models in the ensemble that have failed. |
excluded_fields | Array | The list of fields' ids that were excluded to build the ensemble. |
finished_models
filterable, sortable |
Integer | The number of models in the ensemble that have finished correctly. |
importance | Object | Average importance per field |
input_fields | Array | The list of input fields' ids used to build the models of the ensemble. |
locale | String | The dataset's locale. |
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the ensemble. |
max_rows
filterable, sortable |
Integer | The maximum number of instances in the dataset that can be used to build the ensemble. |
missing_splits
filterable, sortable |
Boolean | Whether to explicitly include missing field values when choosing a split while growing the models of an ensemble. |
model_order | Array | Order in which each model in the list of models was finished. The distributions above must be accessed following this index. |
models | Array | Unordered list of model/ids that compose the ensemble. Models are ordered by the model_order above. |
name
filterable, sortable, updatable |
String | The name of the ensemble as you provided it or, by default, based on the name of the dataset. |
node_threshold
filterable, sortable |
Integer | The maximum number of nodes that the model will grow. |
number_of_batchpredictions
filterable, sortable |
Integer | The current number of batch predictions that use this ensemble. |
number_of_evaluations
filterable, sortable |
Integer | The current number of evaluations that use this ensemble. |
number_of_models
filterable, sortable |
Integer | The number of models in the ensemble. |
number_of_predictions
filterable, sortable |
Integer | The current number of predictions that use this ensemble. |
number_of_public_predictions
filterable, sortable |
Integer | The current number of public predictions that use this ensemble. |
objective_field | String |
Specifies the id of the field that the ensemble predicts.
Example: "000003" |
ordering
filterable, sortable |
Integer |
The order used to chose instances from the dataset to build the models of the ensemble. There are three different types:
|
out_of_bag
filterable, sortable |
Boolean | Whether the out-of-bag instances were used to create the ensemble instead of the sampled instances. |
price
filterable, sortable, updatable |
Float | The price other users must pay to clone your ensemble. |
private
filterable, sortable, updatable |
Boolean | Whether the ensemble is public or not. |
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
random_candidate_ratio
filterable, sortable |
Float | The random candidate ratio considered when randomize is true. |
random_candidates
filterable, sortable |
Integer | The number of random fields considered when randomize is true. |
randomize
filterable, sortable |
Boolean | Whether the splits of each model in the ensemble considered only a random subset of the fields or all the fields available. |
range | Array | The range of instances used to build the models of the ensemble. |
replacement
filterable, sortable |
Boolean | Whether the instances sampled to build the ensemble were selected using replacement or not. |
resource | String | The ensemble/id. |
rows
filterable, sortable |
Integer | The total number of instances used to build the models of the ensemble. |
sample_rate
filterable, sortable |
Float | The sample rate used to select instances from the dataset to build the models of the ensemble. |
seed
filterable, sortable |
String | The string that was used to generate the sample. |
shared
filterable, sortable, updatable |
Boolean | Whether the ensemble is shared using a private link or not. |
shared_hash | String | The hash that gives access to this ensemble if it has been shared using a private link. |
sharing_key | String | The alternative key that gives read access to this ensemble. |
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this ensemble. |
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
split_candidates
filterable, sortable |
Integer | The number of split points that are considered whenever the tree evaluates a numeric field. The minimum is 1 and the maximum is 1024. |
status | Object | A description of the status of the ensemble. It includes a code, a message, and some extra information. See the table below. |
subscription
filterable, sortable |
Boolean | Whether the ensemble was created using a subscription plan or not. |
support_threshold
filterable, sortable |
Float | The parameter controls the minimum amount of support each child node must contain to be valid as a possible split. |
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the ensemble was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
white_box
filterable, sortable |
Boolean | Whether the ensemble is publicly shared as a white-box. |
Ensemble Status
Creating an ensemble is a process that can take just a few seconds or a few days depending on the size of the dataset used as input, the number of models, and the workload of BigML's systems. The ensemble goes through a number of states until it is fully completed. Through the status field in the ensemble you can determine when the ensemble has been fully processed and is ready to be used to create predictions. These are the properties that an ensemble's status has:
Property | Type | Description |
---|---|---|
code | Integer | A status code that reflects the status of the ensemble creation. It can be any of those that are explained here. |
elapsed | Integer | Number of milliseconds that BigML.io took to process the ensemble. |
message | String | A human readable message explaining the status. |
progress | Float, between 0 and 1 | How far BigML.io has progressed building the ensemble. |
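If you are scripting against the API, a minimal shell sketch (assuming a POSIX shell with curl and grep available) that waits until the status code reaches 5 could look like this; the polling interval and the grep pattern are only illustrative:
until curl -s "https://bigml.io/ensemble/50ef57043c19208c50000022?$BIGML_AUTH" \
  | grep -q '"code": *5'; do
  sleep 2   # wait a couple of seconds between polls
done
$ Waiting for an ensemble to finish (sketch)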
Once an ensemble has been successfully created, it will look like:
{
"balance_objective":false,
"category":0,
"code":200,
"columns":9,
"created":"2016-07-08T18:26:36.351000",
"credits":0.09991455078125,
"credits_per_prediction":0,
"dataset":"dataset/5747ae334e172785fd000000",
"dataset_field_types":{
"categorical":1,
"datetime":0,
"effective_fields":9,
"items":0,
"numeric":8,
"preferred":9,
"text":0,
"total":9
},
"dataset_status":true,
"datasets":[],
"description":"",
"ensemble_sample": {
"rate": 0.8,
"replacement": true,
"seed": "my ensemble sample seed"
},
"distributions":[
{
"importance":[
[
"000001",
0.35199
],
[
"000005",
0.21031
],
[
"000006",
0.13889
],
[
"000007",
0.11932
],
[
"000002",
0.08733
],
[
"000003",
0.03628
],
[
"000000",
0.03297
],
[
"000004",
0.02291
]
],
"predictions":{
"categories":[
[
"false",
512
],
[
"true",
256
]
]
},
"training":{
"categories":[
[
"false",
512
],
[
"true",
256
]
]
}
},
{
"importance":[
[
"000001",
0.33276
],
[
"000005",
0.24432
],
[
"000006",
0.15996
],
[
"000007",
0.15378
],
[
"000003",
0.05712
],
[
"000002",
0.02431
],
[
"000000",
0.02199
],
[
"000004",
0.00575
]
],
"predictions":{
"categories":[
[
"false",
515
],
[
"true",
253
]
]
},
"training":{
"categories":[
[
"false",
515
],
[
"true",
253
]
]
}
},
{
"importance":[
[
"000001",
0.34203
],
[
"000005",
0.21501
],
[
"000006",
0.15173
],
[
"000007",
0.11991
],
[
"000003",
0.05734
],
[
"000002",
0.04411
],
[
"000000",
0.03615
],
[
"000004",
0.03372
]
],
"predictions":{
"categories":[
[
"false",
501
],
[
"true",
267
]
]
},
"training":{
"categories":[
[
"false",
501
],
[
"true",
267
]
]
}
},
{
"importance":[
[
"000001",
0.38199
],
[
"000005",
0.23932
],
[
"000007",
0.17913
],
[
"000006",
0.06526
],
[
"000002",
0.06305
],
[
"000000",
0.05733
],
[
"000003",
0.00766
],
[
"000004",
0.00625
]
],
"predictions":{
"categories":[
[
"false",
461
],
[
"true",
307
]
]
},
"training":{
"categories":[
[
"false",
461
],
[
"true",
307
]
]
}
},
{
"importance":[
[
"000001",
0.39081
],
[
"000005",
0.16745
],
[
"000007",
0.14195
],
[
"000006",
0.09129
],
[
"000002",
0.088
],
[
"000004",
0.07009
],
[
"000003",
0.03207
],
[
"000000",
0.01834
]
],
"predictions":{
"categories":[
[
"false",
495
],
[
"true",
273
]
]
},
"training":{
"categories":[
[
"false",
495
],
[
"true",
273
]
]
}
},
{
"importance":[
[
"000001",
0.31956
],
[
"000005",
0.23029
],
[
"000006",
0.12127
],
[
"000007",
0.11578
],
[
"000002",
0.06947
],
[
"000003",
0.05644
],
[
"000000",
0.04405
],
[
"000004",
0.04314
]
],
"predictions":{
"categories":[
[
"false",
511
],
[
"true",
257
]
]
},
"training":{
"categories":[
[
"false",
511
],
[
"true",
257
]
]
}
},
{
"importance":[
[
"000001",
0.33974
],
[
"000007",
0.17589
],
[
"000005",
0.15404
],
[
"000002",
0.14244
],
[
"000006",
0.099
],
[
"000000",
0.04024
],
[
"000004",
0.03316
],
[
"000003",
0.0155
]
],
"predictions":{
"categories":[
[
"false",
493
],
[
"true",
275
]
]
},
"training":{
"categories":[
[
"false",
493
],
[
"true",
275
]
]
}
},
{
"importance":[
[
"000001",
0.32296
],
[
"000005",
0.18728
],
[
"000007",
0.18258
],
[
"000006",
0.15218
],
[
"000002",
0.07172
],
[
"000003",
0.04563
],
[
"000000",
0.03449
],
[
"000004",
0.00316
]
],
"predictions":{
"categories":[
[
"false",
501
],
[
"true",
267
]
]
},
"training":{
"categories":[
[
"false",
501
],
[
"true",
267
]
]
}
},
{
"importance":[
[
"000001",
0.32899
],
[
"000005",
0.21858
],
[
"000000",
0.10723
],
[
"000006",
0.10542
],
[
"000007",
0.09207
],
[
"000004",
0.06614
],
[
"000003",
0.0455
],
[
"000002",
0.03606
]
],
"predictions":{
"categories":[
[
"false",
478
],
[
"true",
290
]
]
},
"training":{
"categories":[
[
"false",
478
],
[
"true",
290
]
]
}
},
{
"importance":[
[
"000001",
0.36743
],
[
"000005",
0.20641
],
[
"000007",
0.13267
],
[
"000006",
0.09049
],
[
"000002",
0.0669
],
[
"000004",
0.05171
],
[
"000003",
0.0514
],
[
"000000",
0.03299
]
],
"predictions":{
"categories":[
[
"false",
517
],
[
"true",
251
]
]
},
"training":{
"categories":[
[
"false",
517
],
[
"true",
251
]
]
}
}
],
"error_models":0,
"fast":true,
"fields_maps":null,
"finished_models":10,
"importance":{
"000000":0.04258,
"000001":0.34783,
"000002":0.06934,
"000003":0.04049,
"000004":0.0336,
"000005":0.2073,
"000006":0.11755,
"000007":0.14131
},
"input_fields":[
"000000",
"000001",
"000002",
"000003",
"000004",
"000005",
"000006",
"000007"
],
"locale":"en-us",
"max_columns":9,
"max_rows":768,
"missing_splits":false,
"models":[
"model/577ff05f4e17277ffd000003",
"model/577ff0604e17277ffd000005",
"model/577ff0604e17277ffd000007",
"model/577ff0604e17277ffd000009",
"model/577ff0604e17277ffd00000b",
"model/577ff0604e17277ffd00000d",
"model/577ff0604e17277ffd00000f",
"model/577ff0614e17277ffd000011",
"model/577ff0614e17277ffd000013",
"model/577ff0614e17277ffd000015"
],
"name":"diabetes dataset's ensemble",
"node_threshold":512,
"number_of_batchpredictions":0,
"number_of_evaluations":0,
"number_of_models":10,
"number_of_predictions":0,
"number_of_public_predictions":0,
"objective_field":"000008",
"objective_field_name":"diabetes",
"objective_field_type":"categorical",
"ordering":0,
"out_of_bag": false,
"price":0,
"private":true,
"project": null,
"randomize":false,
"range":[
1,
768
],
"replacement": false,
"resource":"ensemble/50ef57043c19208c50000022",
"rows":768,
"sample_rate":0.8,
"seed": "a0f717f2b3954111b27fcc23f5a85787",
"shared":false,
"size":209528,
"source":"source/5747ae194e172785fc000000",
"source_status":true,
"stat_pruning":false,
"status":{
"code":5,
"elapsed":2704,
"message":"The ensemble has been created",
"progress":1
},
"subscription":true,
"tags":[
"diabetes"
],
"updated":"2016-07-08T18:26:45.524000"
}
< Example ensemble JSON response
Updating an Ensemble
To update an ensemble, you need to PUT an object containing the fields that you want to update to the ensemble's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated ensemble.
For example, to update an ensemble with a new name you can use curl like this:
curl "https://bigml.io/ensemble/50ef57043c19208c50000022?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name"}' \
-H 'content-type: application/json'
$ Updating an ensemble's name
Deleting an Ensemble
To delete an ensemble, you need to issue an HTTP DELETE request to the ensemble/id to be deleted.
Using curl you can do something like this to delete an ensemble:
curl -X DELETE "https://bigml.io/ensemble/50ef57043c19208c50000022?$BIGML_AUTH"
$ Deleting an ensemble from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete an ensemble, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete an ensemble a second time, or an ensemble that does not exist, you will receive a "404 not found" response.
However, if you try to delete an ensemble that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Ensembles
To list all the ensembles, you can use the ensemble base URL. By default, only the 20 most recent ensembles will be returned. You can see below how to change this number using the limit parameter.
You can get your list of ensembles directly in your browser using your own username and API key with the following links.
https://bigml.io/ensemble?$BIGML_AUTH
> Listing ensembles from a browser
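You can issue the same request with curl. As a sketch, the limit parameter mentioned above can be appended to the query string to change how many ensembles are returned (the value 5 below is only illustrative):
curl "https://bigml.io/ensemble?$BIGML_AUTH;limit=5"
$ Listing the 5 most recent ensembles (sketch)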
Logistic Regressions
Last Updated: Tuesday, 2018-03-13 12:20
A logistic regression is a supervised machine learning method for solving classification problems. The probability of the objective being a particular class is modeled as the value of a logistic function, whose argument is a linear combination of feature values. You can create a logistic regression by selecting which fields from your dataset you want to use as input fields (or predictors) and which categorical field you want to predict (the objective field).
Logistic regression seeks to learn the coefficient values b0, b1, b2, ..., bk from the training data, using maximum likelihood estimation techniques:
p = 1 / (1 + e^(-f(X)))
where
f(X) = b0 + b1*X1 + b2*X2 + ... + bk*Xk
For this formulation to be valid the features X1, X2, ... Xk must be numeric values. To adapt this model to all the datatypes that BigML supports, we apply the following transformations to the inputs:
- Categorical fields are 'one-hot' encoded by default. That is, a separate 0-1 numeric field is created for each category, and exactly one of those fields has a value of 1, corresponding to the categorical value for the individual instance. To specify different coding behavior, see the Coding Categorical Fields section for more details.
- Each term present in a text field is mapped to a corresponding numeric field, whose value is the number of occurrences of that term in the instance. Text fields without term analysis enabled are excluded from the model.
- Each item present in an items field is mapped to a corresponding numeric field, whose value is the number of occurrences of that item in the instance.
- Missing values in numeric fields can be explicitly included as another valid value by using the argument missing_numerics or they can be replaced specifying a default_numeric_value. If none of those arguments are enabled, instances containing missing numeric values will be ignored for training the model.
BigML.io allows you to create, retrieve, update, and delete your logistic regressions. You can also list all of your logistic regressions.
Jump to:
- Logistic Regression Base URL
- Creating a Logistic Regression
- Logistic Regression Arguments
- Retrieving a Logistic Regression
- Logistic Regression Properties
- Filtering and Paginating Fields from a Logistic Regression
- Updating a Logistic Regression
- Deleting a Logistic Regression
- Listing Logistic Regressions
- Weights
- Objective Weights
- Automatic Balancing
Logistic Regression Base URL
You can use the following base URL to create, retrieve, update, and delete logistic regressions. https://bigml.io/logisticregression
Logistic Regression base URL
All requests to manage your logistic regressions must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Logistic Regression
To create a new logistic regression, you need to POST to the logistic regression base URL an object containing at least the dataset/id that you want to use to create the logistic regression. The content-type must always be "application/json".
POST /logisticregression?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating logistic regression definition
curl "https://bigml.io/logisticregression?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006"}'
> Creating a logistic regression
BigML.io will return the newly created logistic regression if the request succeeded.
{
"category":0,
"code":201,
"columns":5,
"created":"2015-09-29T18:28:38.755738",
"credits":0.01815032958984375,
"credits_per_prediction":0,
"dataset":"dataset/554e8fcf545e5f1474000010",
"dataset_field_types":{
"categorical":1,
"datetime":0,
"numeric":4,
"preferred":5,
"text":0,
"total":5
},
"dataset_status":true,
"description":"",
"excluded_fields":[],
"fields_meta":{
"count":0,
"limit":1000,
"offset":0,
"total":0
},
"input_fields":[],
"locale":"en-US",
"logistic_regression":null,
"max_columns":5,
"max_rows":150,
"name":"iris' dataset's logistic regression",
"number_of_batchpredictions":0,
"number_of_evaluations":0,
"number_of_predictions":0,
"objective_field":"000004",
"objective_field_name":null,
"objective_field_type":null,
"objective_fields":[
"000004"
],
"out_of_bag":false,
"private":true,
"project":"project/54dc6d05545e5f822c00043f",
"range":[
1,
150
],
"replacement":false,
"resource":"logisticregression/55efc3564e1727d635000004",
"rows":150,
"sample_rate":1,
"shared":false,
"size":4758,
"source":"source/554e8fac545e5f1474000004",
"source_status":true,
"status":{
"code":1,
"message":"The logistic regression is being processed and will be created soon"
},
"subscription":false,
"tags":[
"species"
],
"updated":"2015-09-29T18:28:38.755806",
"white_box":false
}
< Example logistic regression JSON response
Logistic Regression Arguments
In addition to the dataset, you can also POST the following arguments, and like models, you can use weights to deal with imbalanced datasets. Click here to find more information about weights.
Argument | Type | Description |
---|---|---|
balance_fields
optional |
Boolean, default is false |
Whether to scale each numeric field such that its values are zero mean with a standard deviation of 1, based on the field summary statistics at training time.
Example: true |
bias
optional |
Boolean, default is true |
Whether to include a bias term in the solution.
Example: false |
c
optional |
Float, default is 1 |
The inverse of the regularization strength. Must be greater than 0.
Example: 2 |
category
optional |
Integer, default is the category of the dataset |
The category that best describes the logistic regression. See the category codes for the complete list of categories.
Example: 1 |
compute_stats
optional |
Boolean, default is false |
Whether to compute statistics and significance tests.
Example: true |
dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
description
optional |
String |
A description of the logistic regression up to 8192 characters long.
Example: "This is a description of my new logistic regression" |
eps
optional |
Float, default is 0.0001 |
Stopping criteria for solver. If the difference between the results from the current and last iterations is less than eps, then the solver is finished.
Example: 0.1 |
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields that won't be included in the logistic regression.
Example:
|
field_codings
optional |
List |
Coding schemes for categorical fields: dummy, contrast, or other. Value is a map between field identifiers and a coding scheme for that field. See the Coding Categorical Fields for more details. If not specified, one numeric variable is created per categorical value, plus one for missing values.
Example:
|
fields
optional |
Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed. |
This can be used to change the names of the fields in the logistic regression with respect to the original names in the dataset, or to tell BigML that certain fields should be preferred. Include an entry keyed by the field id generated in the source for each field whose name you want updated.
Example:
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields to be included as predictors in the logistic regression.
Example:
|
missing_numerics
optional |
Boolean, default is true |
Whether to create an additional binary predictor for each numeric field, denoting a missing value. If false, these predictors are not created, and rows containing missing numeric values are dropped.
Example: false |
name
optional |
String, default is dataset's name |
The name you want to give to the new logistic regression.
Example: "my new logistic regression" |
normalize
optional |
Boolean, default is false |
Whether to normalize feature vectors in training and predicting.
Example: true |
objective_field
optional |
String, default is the id of the last field in the dataset |
Specifies the id of the field that you want to predict. The type of the field must be categorical.
Example: "000003" |
objective_fields
optional |
Array, default is an array with the id of the last field in the dataset |
Specifies the id of the field that you want to predict. Even if this is an array, BigML.io only accepts one objective field in the current version. If both objective_field and objective_fields are specified, then objective_field takes preference. The type of the field must be categorical.
Example: ["000003"] |
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
project
optional |
String |
The project/id you want the logistic regression to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
range
optional |
Array, default is [1, max rows in the dataset] |
The range of successive instances to build the logistic regression.
Example: [1, 150] |
regularization
optional |
String, default is "l2" |
Either l1 or l2, which selects the norm to minimize when regularizing the solution. Regularizing with respect to the l1 norm causes more coefficients to be zero, while using the l2 norm forces the magnitudes of all coefficients towards zero.
Example: "l1" |
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true |
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5 |
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
stats_sample_seed
optional |
String |
Random seed value used for stats sampling.
Example: "My stats seed" |
stats_sample_size
optional |
Integer, default is -1 |
The number of rows to sample for calculating statistics. If -1 is given, then the number of rows will be calculated such that (rows x coefficients) <= 1E+8. The minimum between that number and the total number of input rows will be used.
Example: 1000 |
tags
optional |
Array of Strings |
A list of strings that help classify and index your logistic regression.
Example: ["best customers", "2018"] |
Coding Categorical Fields
Categorical fields must be converted to numerical values in order to be used in training a logistic regression model. By default, they are "one-hot" coded. That is, one numeric variable is created per categorical value, plus one for missing values. For a given instance, the variable corresponding to the instance's categorical value has its value set to 1, while the other variables are set to 0.
Using the iris dataset as an example, we can express this coding scheme as the following table:
Value | C0 | C1 | C2 | C3 |
---|---|---|---|---|
setosa | 1 | 0 | 0 | 0 |
versicolor | 0 | 1 | 0 | 0 |
virginica | 0 | 0 | 1 | 0 |
[MISSING] | 0 | 0 | 0 | 1 |
To specify different coding behavior, use the field_codings parameter.
The parameter value is an array where each element is a map describing the coding scheme to apply to a particular field, and containing the following keys:
- field: The name or identifier of the field to code.
- coding: The type of coding to use, either dummy, contrast, or other.
- dummy_class: The class value to treat as the control value in dummy coding.
- coefficients: A nested array of floating-point values to be used with contrast or other coding.
The value for coding determines which of the following methods is used to code the field:
- dummy: Use dummy coding. The value is a string specifying the value to use as the control. For example, the value {"field": "species", "coding": "dummy", "dummy_class": "virginica"} defines the following coding:
Value | C0 | C1 | C2 |
---|---|---|---|
setosa | 1 | 0 | 0 |
versicolor | 0 | 1 | 0 |
virginica | 0 | 0 | 0 |
[MISSING] | 0 | 0 | 1 |
- contrast: Use contrast coding. The value is an array of vectors, each specifying the coding of an individual variable. The vectors are checked for length. If the lengths are less than the expected length by 1, then a 0 is implicitly appended to the end of each vector, so that missing values are ignored for the model. In addition, each vector is checked that its elements sum to 0, and the entire collection of vectors is checked for orthogonality. For example, the value {"field": "species", "coding": "contrast", "coefficients": [[0.5,-0.25,-0.25,0],[-1,2,0,-1]]} defines the following coding:
Value | C0 | C1 |
---|---|---|
setosa | 0.50 | -1 |
versicolor | -0.25 | 2 |
virginica | -0.25 | 0 |
[MISSING] | 0.00 | -1 |
- other: A user-specified coding scheme. Uses an array of vectors as in contrast, but only length is checked. The coefficients should be listed in the same order in which the corresponding values appear in the field summary, for example [[1, 2, 3, 4, 5, 6, 7, 8], [-2, 0, -2, 0, 2, 0, 2, 0]].
If multiple coding schemes are listed for a single field, then the coding closest to the end of the list is used. Codings given for non-categorical variables are ignored.
If compute_stats is set to true, then all categorical fields without specified codings will be assigned dummy coding. The dummy class will be the first by alphabetical order. This is because the default one-hot encoding produces collinearity effects which result in an ill-formed covariance matrix.
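As a hedged illustration, the request below sketches how the field_codings argument might be passed at creation time, reusing the dummy coding for the species field shown above (the dataset/id is a placeholder):
curl "https://bigml.io/logisticregression?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006", "field_codings": [{"field": "species", "coding": "dummy", "dummy_class": "virginica"}]}'
> Creating a logistic regression with dummy coding (sketch)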
You can also use curl to customize a new logistic regression. For example, to create a new logistic regression named "my logistic regression", with only certain rows, and with only three fields:
curl "https://bigml.io/logisticregression?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006",
"input_fields": ["000001", "000003"],
"name": "my logistic regression",
"range": [25, 125]}'
> Creating a customized logistic regression
If you do not specify a name, BigML.io will assign the dataset's name to the new logistic regression. If you do not specify a range of instances, BigML.io will use all the instances in the dataset. If you do not specify any input fields, BigML.io will include all the input fields in the dataset, and if you do not specify an objective field, BigML.io will use the last field in your dataset.
Retrieving a Logistic Regression
Each logistic regression has a unique identifier in the form "logisticregression/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the logistic regression.
To retrieve a logistic regression with curl:
curl "https://bigml.io/logisticregression/55efc3564e1727d635000004?$BIGML_AUTH"
$ Retrieving a logistic regression from the command line
You can also use your browser to visualize the logistic regression using the full BigML.io URL or pasting the logisticregression/id into the BigML.com dashboard.
Logistic Regression Properties
Once a logistic regression has been successfully created it will have the following properties.
Property | Type | Description |
---|---|---|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
code | Integer | HTTP status code. This will be 201 upon successful creation of the logistic regression and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the logistic regression creation has been completed without errors. |
columns
filterable, sortable |
Integer | The number of fields in the logistic regression. |
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the logistic regression was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
credits
filterable, sortable |
Float | The number of credits it cost you to create this logistic regression. |
credits_per_prediction
filterable, sortable, updatable |
Float | This is the number of credits that other users will consume to make a prediction with your logistic regression if you made it public. |
dataset
filterable, sortable |
String | The dataset/id that was used to build the logistic regression. |
dataset_field_types | Object | A dictionary that informs about the number of fields of each type in the dataset used to create the logistic regression. It has an entry per each field type (categorical, datetime, numeric, and text), an entry for preferred fields, and an entry for the total number of fields. |
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
description
updatable |
String | A text describing the logistic regression. It can contain restricted markdown to decorate the text. |
excluded_fields | Array | The list of fields' ids that were excluded to build the logistic regression. |
fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset and limit, and the number of fields (count) returned. |
input_fields | Array | The list of input fields' ids used to build the logistic regression. |
locale | String | The dataset's locale. |
logistic_regression | Object | All the information that you need to recreate or use the logistic regression on your own. It includes a list of coefficients and the field's dictionary describing the fields and their summaries. See here for more details. |
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the logistic regression. |
max_rows
filterable, sortable |
Integer | The maximum number of instances in the dataset that can be used to build the logistic regression. |
name
filterable, sortable, updatable |
String | The name of the logistic regression as you provided it or, by default, based on the name of the dataset. |
number_of_batchpredictions
filterable, sortable |
Integer | The current number of batch predictions that use this logistic regression. |
number_of_evaluations
filterable, sortable |
Integer | The current number of evaluations that use this logistic regression. |
number_of_predictions
filterable, sortable |
Integer | The current number of predictions that use this logistic regression. |
objective_field | String | The id of the field that the logistic regression predicts. |
objective_fields | Array | Specifies the list of ids of the field that the logistic regression predicts. Even if this is an array, BigML.io only accepts one objective field in the current version. |
out_of_bag
filterable, sortable |
Boolean | Whether the out-of-bag instances were used to create the logistic regression instead of the sampled instances. |
price
filterable, sortable, updatable |
Float | The price other users must pay to clone your logistic regression. |
private
filterable, sortable, updatable |
Boolean | Whether the logistic regression is public or not. |
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
range | Array | The range of instances used to build the logistic regression. |
replacement
filterable, sortable |
Boolean | Whether the instances sampled to build the logistic regression were selected using replacement or not. |
resource | String | The logisticregression/id. |
rows
filterable, sortable |
Integer | The total number of instances used to build the logistic regression. |
sample_rate
filterable, sortable |
Float | The sample rate used to select instances from the dataset to build the logistic regression. |
seed
filterable, sortable |
String | The string that was used to generate the sample. |
shared
filterable, sortable, updatable |
Boolean | Whether the logistic regression is shared using a private link or not. |
shared_hash | String | The hash that gives access to this logistic regression if it has been shared using a private link. |
sharing_key | String | The alternative key that gives read access to this logistic regression. |
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this logistic regression. |
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
status | Object | A description of the status of the logistic regression. It includes a code, a message, and some extra information. See the table below. |
subscription
filterable, sortable |
Boolean | Whether the logistic regression was created using a subscription plan or not. |
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the logistic regression was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
white_box
filterable, sortable |
Boolean | Whether the logistic regression is publicly shared as a white-box. |
A Logistic Regression Object has the following properties:
Property | Type | Description |
---|---|---|
balance_fields | Boolean | Whether to scale each numeric field such that its values are zero mean with a standard deviation of 1, based on the field summary statistics at training time. |
bias | Boolean | Whether to include a bias term in the solution. |
c | Float | The inverse of the regularization strength. |
coefficients | Array of Arrays | Coefficients of the logistic regression for each category in the objective field. |
compute_stats | Boolean | Whether to compute statistics and significance tests. |
eps | Float | Stopping criteria for solver. If the difference between the results from the current and last iterations is less than eps, then the solver is finished. |
field_codings | List | Coding schemes for categorical fields. See the Coding Categorical Fields for more details. |
fields
updatable |
Object | A dictionary with an entry per field in the dataset used to build the logistic regression. Fields are paginated according to the fields_meta attribute. Each entry includes the column number in the original source, the name of the field, the type of the field, and the summary. See this Section for more details. |
missing_class_in_coefficients | Boolean | Whether there is a missing class in the coefficients of the logistic regression. |
missing_numerics | Boolean | Whether to create an additional binary predictor for each numeric field, denoting a missing value. If false, these predictors are not created, and rows containing missing numeric values are dropped. |
normalize | Boolean | Whether to normalize feature vectors in training and predicting. |
regularization | String | Either l1 or l2 at the moment. It selects the norm to minimize when regularizing the solution. Regularizing with respect to the l1 norm causes more coefficients to be zero, and using the l2 norm forces the magnitudes of all coefficients towards zero. |
stats | Object | Statistical tests to assess the quality of the model's fit to the data. See this Section for more details. |
stats_sample_seed | String | Random seed value used for stats sampling. |
stats_sample_size | Integer | The number of rows sampled for calculating statistical tests. |
Coefficients Structure
The coefficients output field is an array of pairs, one pair per class. The first element in the pair is a class value, and the second element is a nested array of coefficients for the logistic model that gives the probability of that class. Each inner array within the nested array contains the group of coefficients that pertain to a single input field. The groups are listed in the same order as in input_fields, with a final singleton array corresponding to the bias term. The class-coefficient pairs are listed in the same order as the class values in the objective field summary. If the model was trained with missing values in the objective field, then a vector of coefficients will also be created for the missing class value, labeled with "", and listed last.
- Numeric fields correspond to two coefficients. The first predictor is the numeric value, and the second predictor is a binary value corresponding to missing values. For example, a numeric field value of 5 maps to a value of 5 in the first predictor, and 0 in the second, while a missing value maps to 0 in the first predictor, and 1 in the second. If the missing_numerics parameter is false, then only a single predictor will be generated for numeric fields.
- Categorical fields correspond to n+1 coefficients, where the first n coefficients correspond to class values, and the final coefficient corresponds to a binary missing value predictor.
- Text and items fields correspond to m+1 coefficients, where the first m coefficients correspond to each term in the field's tag cloud, listed in the same order as in the field summary. The final term corresponds to an empty string or itemset, or in the case of text fields, a string which does not contain any terms in the text analysis vocabulary.
- The final coefficient in the list corresponds to the bias term.
Significance Tests
If the compute_stats parameter is true, then the logistic regression output contains a number of statistical tests to assess the quality of the model's fit to the data. These are found under a field named stats. For each set of coefficients, the following statistics are computed:
- likelihood_ratio: the difference in log likelihood between the fitted model and an intercept-only model. Given as a pair [p-value, ratio]. This statistic tests whether the coefficients as a whole have any predictive power over an intercept-only model.
- standard_errors: the variance of the coefficient estimates.
- z_scores: those values in terms of number of standard deviations.
- p_values: from a 1-DOF Chi Squared test of z^2 (Wald test).
- confidence_intervals: the size of the 95% confidence interval for each coefficient estimate. That is, for a coefficient estimate x, and an interval value n, the value of the coefficient is x ± n with a confidence of 95%.
standard_errors, z_scores, p_values, confidence_intervals: These statistics test the significance of individual coefficient estimates, and are grouped in the same nested array fashion as the coefficients themselves.
To avoid lengthy computation times, statistics from large input datasets will be computed from a sub-sample of the dataset such that the number of coefficients * rows is less than or equal to 1E+8.
It is possible for null to appear among the values contained in stats. Wald test statistics cannot be computed for zero-value coefficients, and so their corresponding entries are null. Moreover, if the coefficients' information matrix is ill-conditioned, e.g. if there are fewer instances of the positive class than the number of coefficients, then it is impossible to perform the Wald test on the entire set of coefficients. In this case standard_errors, z_scores, p_values, and confidence_intervals will have a value of null.
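To obtain these statistics you only need to enable compute_stats when creating the logistic regression. A minimal hedged sketch (the dataset/id is a placeholder):
curl "https://bigml.io/logisticregression?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006", "compute_stats": true}'
> Creating a logistic regression with significance tests (sketch)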
Logistic Regression Status
Creating a logistic regression is a process that can take just a few seconds or a few days depending on the size of the dataset used as input and on the workload of BigML's systems. The logistic regression goes through a number of states until it is fully completed. Through the status field in the logistic regression you can determine when the logistic regression has been fully processed and is ready to be used to create predictions. These are the properties that a logistic regression's status has:
Property | Type | Description |
---|---|---|
code | Integer | A status code that reflects the status of the logistic regression creation. It can be any of those that are explained here. |
elapsed | Integer | Number of milliseconds that BigML.io took to process the logistic regression. |
message | String | A human readable message explaining the status. |
progress | Float, between 0 and 1 | How far BigML.io has progressed building the logistic regression. |
Once a logistic regression has been successfully created, it will look like:
{
"category":0,
"code":200,
"columns":5,
"created":"2015-09-28T06:03:17.128000",
"credits":0.01815032958984375,
"credits_per_prediction":0,
"dataset":"dataset/554e8fcf545e5f1474000010",
"dataset_field_types":{
"categorical":1,
"datetime":0,
"numeric":4,
"preferred":5,
"text":0,
"total":5
},
"dataset_status":true,
"description":"",
"excluded_fields":[],
"fields_meta":{
"count":5,
"limit":1000,
"offset":0,
"query_total":5,
"total":5
},
"input_fields":[
"000000",
"000001",
"000002",
"000003"
],
"locale":"en-US",
"logistic_regression":{
"balance_fields":false,
"bias":true,
"c":1,
"coefficients":[
[
"Iris-virginica",
[
-1.7725500512039691,
-2.0714411671485604,
1.9765289540667237,
1.2116274344668618,
-0.0006009280238702336
]
],
[
"Iris-setosa",
[
0.4234123201880313,
2.446210746782367,
-4.558271526802624,
-2.0557583244325253,
0.0004628272837137537
]
],
[
"Iris-versicolor",
[
-1.1362239209763645,
-1.658799944046014,
0.9215245039579112,
0.30082088849717076,
-0.00028554600426412813
]
]
],
"eps":0.345,
"missing_numerics":true,
"normalize":true,
"regularization":"l2"
},
"max_columns":5,
"max_rows":150,
"name":"iris' dataset's logistic regression",
"number_of_batchpredictions":0,
"number_of_evaluations":0,
"number_of_predictions":0,
"objective_field":"000004",
"objective_field_name":"species",
"objective_field_type":"categorical",
"objective_fields":[
"000004"
],
"out_of_bag":false,
"private":true,
"project":"project/54dc6d05545e5f822c00043f",
"range":[
1,
150
],
"replacement":false,
"resource":"logisticregression/55efc3564e1727d635000004",
"rows":150,
"sample_rate":1,
"shared":false,
"size":4758,
"source":"source/554e8fac545e5f1474000004",
"source_status":true,
"status":{
"code":5,
"elapsed":21,
"message":"The logistic regression has been created",
"progress":1
},
"subscription":false,
"tags":[
"species"
],
"updated":"2015-09-28T06:03:20.546000",
"white_box":false
}
< Example logistic regression JSON response
Filtering and Paginating Fields from a Logistic Regression
A logistic regression might be composed of hundreds or even thousands of fields. Thus when retrieving a logisticregression, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the logistic regression is the same as the one you would get without any filtering parameter above.
The fields_meta field can help you paginate fields: as in the example responses above, it reports the count of fields returned, the limit and offset used, the query_total of fields matching your filter, and the total number of fields in the resource.
Updating a Logistic Regression
To update a logistic regression, you need to PUT an object containing the fields that you want to update to the logistic regression's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated logistic regression.
For example, to update a logistic regression with a new name you can use curl like this:
curl "https://bigml.io/logisticregression/55efc3564e1727d635000004?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name"}' \
-H 'content-type: application/json'
$ Updating a logistic regression's name
If you want to update a logistic regression with a new label and description for a specific field you can use curl like this:
curl "https://bigml.io/logisticregression/55efc3564e1727d635000004?$BIGML_AUTH" \
-X PUT \
-d '{"fields": {"000000": {"label": "a longer name", "description": "an even longer description"}}}' \
-H 'content-type: application/json'
$ Updating a logistic regression's field, label, and description
Deleting a Logistic Regression
To delete a logistic regression, you need to issue an HTTP DELETE request to the logisticregression/id to be deleted.
Using curl you can do something like this to delete a logistic regression:
curl -X DELETE "https://bigml.io/logisticregression/55efc3564e1727d635000004?$BIGML_AUTH"
$ Deleting a logistic regression from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a logistic regression, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a logistic regression a second time, or a logistic regression that does not exist, you will receive a "404 not found" response.
However, if you try to delete a logistic regression that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Logistic Regressions
To list all the logistic regressions, you can use the logisticregression base URL. By default, only the 20 most recent logistic regressions will be returned. You can see below how to change this number using the limit parameter.
You can get your list of logistic regressions directly in your browser using your own username and API key with the following links.
https://bigml.io/logisticregression?$BIGML_AUTH
> Listing logistic regressions from a browser
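For instance, a minimal sketch that appends the limit parameter to the query string to retrieve only the five most recent logistic regressions:
# limit caps the number of resources returned (the default is 20)
curl "https://bigml.io/logisticregression?$BIGML_AUTH;limit=5"
> Listing only the 5 most recent logistic regressions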
Weights
BigML.io has two ways in which you can use weights to deal with imbalanced datasets:
- Objective Weights: submitting a specific weight for each class in classification models.
- Automatic Balancing: setting the balance_objective argument to true to let BigML automatically balance all the classes evenly.
Let's see each method in more detail.
Objective Weights
A set of objective_weights may be defined, one per objective class. Each instance will be weighted according to its class weight.
curl "https://bigml.io/logisticregression?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bcdc513c1920e4a300006e",
"objective_field": "000008",
"excluded_fields": ["000009"],
"objective_weights": [["yes", 10], ["no", 1]]
}'
> Using objective weights to create a new logistic regression
If a class is not listed in the objective_weights, it is assumed to have a weight of 1. This means the example below is equivalent to the example above.
curl "https://bigml.io/logisticregression?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bcdc513c1920e4a300006e",
"objective_field": "000008",
"excluded_fields": ["000009"],
"objective_weights": [["yes", 10]]
}'
> Using objective weights to create a new logistic regression
Weights of zero are valid as long as there are some positive-valued weights. If every weight ends up zero (this is possible with sampled datasets), then the logistic regression creation will fail.
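For instance, a minimal sketch (reusing the hypothetical dataset and field ids from the requests above) that assigns a zero weight to the "no" class, so those instances do not contribute to the fit:
# Instances labeled "no" get weight 0; "yes" instances keep weight 1
curl "https://bigml.io/logisticregression?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bcdc513c1920e4a300006e",
"objective_field": "000008",
"objective_weights": [["yes", 1], ["no", 0]]
}'
> Using a zero weight for one of the classes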
Automatic Balancing
We also provide a convenience shortcut for specifying weights for a classification objective that are inversely proportional to their category counts, by means of the balance_objective flag.
For instance, if the category counts of the objective field are, say:
[["Iris-versicolor", 20], ["Iris-virginica", 10], ["Iris-setosa", 5]]
Category counts
the request:
curl "https://bigml.io/logisticregression?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bb141a3c19200bdf00000c",
"balance_objective": true
}'
> Using balance_objective to create a new logistic regression
would be equivalent to:
curl "https://bigml.io/logisticregression?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bb141a3c19200bdf00000c",
"objective_weights": [
["Iris-versicolor", 1],
["Iris-virginica", 2],
["Iris-setosa", 4]]}'
> Using objective_weights to create a new logistic regression
The next table summarizes all the available arguments to use weights with logistic regressions.
{
"category":0,
"code":200,
"columns":5,
"created":"2015-09-28T06:03:17.128000",
"credits":0.01815032958984375,
"credits_per_prediction":0,
"dataset":"dataset/554e8fcf545e5f1474000010",
"dataset_field_types":{
"categorical":1,
"datetime":0,
"numeric":4,
"preferred":5,
"text":0,
"total":5
},
"dataset_status":true,
"description":"",
"excluded_fields":[],
"fields_meta":{
"count":5,
"limit":1000,
"offset":0,
"query_total":5,
"total":5
},
"input_fields":[
"000000",
"000001",
"000002",
"000003"
],
"locale":"en-US",
"logistic_regression":{
"balance_fields":false,
"bias":true,
"c":1,
"coefficients":[
[
"Iris-virginica",
[
-1.7725500512039691,
-2.0714411671485604,
1.9765289540667237,
1.2116274344668618,
-0.0006009280238702336
]
],
[
"Iris-setosa",
[
0.4234123201880313,
2.446210746782367,
-4.558271526802624,
-2.0557583244325253,
0.0004628272837137537
]
],
[
"Iris-versicolor",
[
-1.1362239209763645,
-1.658799944046014,
0.9215245039579112,
0.30082088849717076,
-0.00028554600426412813
]
]
],
"eps":0.345,
"missing_numerics":true,
"normalize":true,
"regularization":"l2"
},
"max_columns":5,
"max_rows":150,
"name":"iris' dataset's logistic regression",
"number_of_batchpredictions":0,
"number_of_evaluations":0,
"number_of_predictions":0,
"objective_field":"000004",
"objective_field_name":"species",
"objective_field_type":"categorical",
"objective_fields":[
"000004"
],
"objective_weights":
[
"Iris-versicolor",
2
],
[
"Iris-virginica",
1
],
[
"Iris-setosa",
1
]
],
"out_of_bag":false,
"private":true,
"project":"project/54dc6d05545e5f822c00043f",
"range":[
1,
150
],
"replacement":false,
"resource":"logisticregression/55efc3564e1727d635000004",
"rows":150,
"sample_rate":1,
"shared":false,
"size":4758,
"source":"source/554e8fac545e5f1474000004",
"source_status":true,
"status":{
"code":5,
"elapsed":21,
"message":"The logistic regression has been created",
"progress":1
},
"subscription":false,
"tags":[
"species"
],
"updated":"2015-09-28T06:03:20.546000",
"white_box":false
}
< Example weighted logistic regression JSON response
Clusters
Last Updated: Tuesday, 2018-03-13 12:20
A cluster is a set of groups (i.e., clusters) of instances of a dataset that have been automatically classified together according to a distance measure computed using the fields of the dataset. Clusters can handle numeric, categorical, text and items fields as inputs:
- Numeric fields: the Euclidean distance is computed between the instances' numeric values.
- Categorical fields: a common way to handle categorical data is to take each category as a new field and assign 0 or 1 depending on the category. So a field with 20 categories will become 20 separate binary fields. BigML uses a technique called k-prototypes which modifies the distance function to operate as though the categories were transformed to binary values.
- Text and item fields: each instance is assigned a vector of terms and then cosine similarity is computed to determine closeness between instances.
To create a cluster, you can select an arbitrary number of clusters (i.e., k) and also select an arbitrary subset of fields from your dataset as input_fields. You can use scales to select how each field influences the distance measure used to group instances together.
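For example, a minimal creation sketch that fixes k to 3 and doubles the influence of one field (this assumes field_scales accepts a map from field id to numeric scale, as suggested by the Cluster Arguments section below; the dataset id is the one used in the examples of this section):
# Hypothetical example: three clusters, with field 000002 weighted twice as much
curl "https://bigml.io/cluster?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/537639383c19207026000004",
"k": 3,
"field_scales": {"000002": 2}}'
> Creating a cluster with a fixed k and a custom field scale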

BigML.io allows you to create, retrieve, update, and delete your clusters. You can also list all of your clusters.
Jump to:
- Cluster Base URL
- Creating a Cluster
- Cluster Arguments
- Retrieving a Cluster
- Cluster Properties
- Create a Dataset Using a Cluster and a Centroid
- Create a Model Using a Cluster and a Centroid
- Filtering and Paginating Fields from a Cluster
- Updating a Cluster
- Deleting a Cluster
- Listing Clusters
- Sampling Your Dataset
Cluster Base URL
You can use the following base URL to create, retrieve, update, and delete clusters. https://bigml.io/cluster
Cluster base URL
All requests to manage your clusters must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Cluster
To create a new cluster, you need to POST to the cluster base URL an object containing at least the dataset/id that you want to use to create the cluster. The content-type must always be "application/json".
POST /cluster?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730 HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating cluster definition
curl "https://bigml.io/cluster?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/537639383c19207026000004"}'
> Creating a cluster
BigML.io will return the newly created cluster if the request succeeded.
{
"balance_fields": true,
"category": 0,
"cluster_datasets": {},
"cluster_models": {},
"cluster_seed": null,
"code": 201,
"columns": 0,
"created": "2014-05-17T15:54:02.419411",
"credits": 0.017578125,
"credits_per_prediction": 0.0,
"dataset": "dataset/537639383c19207026000004",
"dataset_field_types": {
"categorical": 1,
"datetime": 0,
"numeric": 4,
"preferred": 5,
"text": 0,
"total": 5
},
"dataset_status": true,
"dataset_type": 0,
"description": "",
"excluded_fields": [],
"field_scales": null,
"fields_meta": {
"count": 0,
"limit": 1000,
"offset": 0,
"total": 0
},
"input_fields": [],
"k": 8,
"locale": "en-US",
"max_columns": 5,
"max_rows": 150,
"model_clusters": false,
"name": "Iris' dataset cluster",
"number_of_batchcentroids": 0,
"number_of_centroids": 0,
"number_of_public_centroids": 0,
"out_of_bag": false,
"price": 0.0,
"private": true,
"project": null,
"range": [
1,
150
],
"replacement": false,
"resource": "cluster/5377861a3c1920126000001c",
"rows": 150,
"sample_rate": 1.0,
"scales": {},
"shared": false,
"size": 4608,
"source": "source/5341a53c3c19206725000000",
"source_status": true,
"status": {
"code": 1,
"message": "The cluster is being processed and will be created soon"
},
"subscription": false,
"summary_fields": [ ],
"tags": [],
"updated": "2014-05-17T15:54:02.419546",
"white_box": false
}
< Example cluster JSON response
Cluster Arguments
In addition to the dataset, you can also POST the following arguments.
Argument | Type | Description |
---|---|---|
balance_fields
optional |
Boolean, default is true. |
When this parameter is enabled, all the numeric fields will be scaled so that their standard deviations are 1. This makes each field have roughly equivalent influence.
Example: true |
category
optional |
Integer, default is the category of the dataset |
The category that best describes the cluster. See the category codes for the complete list of categories.
Example: 1 |
cluster_seed
optional |
String |
A string to generate deterministic clusters.
Example: "My Seed" |
critical_value
optional |
Integer, default is 5 |
The clustering algorithm G-means is parameter-free except for one parameter, the critical_value. G-means iteratively takes existing clusters and tests whether each cluster's neighborhood appears Gaussian. If it doesn't, the cluster is split in two. The critical_value sets how strict the test is when deciding whether data looks Gaussian. The default is 5, which works well in most cases. A range of 1 to 10 is acceptable. A critical_value of 1 means data must look very Gaussian to pass the test, which can lead to more clusters being detected; higher critical_value values will tend to find fewer clusters.
Example: 3 |
dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
description
optional |
String |
A description of the cluster up to 8192 characters long.
Example: "This is a description of my new cluster" |
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields that won't be included in the cluster.
Example:
|
field_scales
optional |
Object, default is {}, an empty dictionary. That is, no special scaling is used. |
With this argument you can pick your own scaling for each field. If a field isn't included in field_scales, BigML will treat its scale as 1 (no scale change). If both balance_fields and field_scales are present, then balance_fields will be applied first. This makes it easy for you to do things like balance age and salary, and then request that age be twice as important.
Example:
|
fields
optional |
Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed. |
This can be used to change the names of the fields in the cluster with respect to the original names in the dataset or to tell BigML that certain fields should be preferred. Include an entry keyed by the field id generated in the source for each field whose name you want to update.
Example:
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields to be considered to create the clusters.
Example:
|
k
optional |
Integer, default is null, which lets G-means choose the number of clusters |
The number of clusters. Must be null or a number greater than or equal to 1 and less than or equal to 300.
Example: 3 |
model_clusters
optional |
Boolean, default is false |
Whether a model for every cluster will be generated or not. Each model predicts whether or not an instance is part of its respective cluster.
Example: true |
name
optional |
String, default is dataset's name |
The name you want to give to the new cluster.
Example: "my new cluster" |
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
project |