BigML.io—The BigML API
Documentation
Quick Start
Last Updated: Thursday, 2020-10-08 20:05
This page helps you quickly create your first source, dataset, model, and prediction.
To get started with BigML.io you need:
- Your username and your API key.
- A terminal with curl or any other command-line tool that implements standard HTTPS methods.
- Some sample data. You can use:
  - A CSV file with some data. You can download the "Iris dataset" or "Diabetes dataset" from our servers.
  - Even easier, you can just use a URL that points to your data. For example, you can use https://static.bigml.com/csv/iris.csv or https://static.bigml.com/csv/diabetes.csv.
  - Easier still, you can just send some inline test data.
Jump to:
- Getting a Toy Data File
- Authentication
- Creating a Source
- Creating a Remote Source
- Creating an Inline Source
- Creating a Dataset
- Creating a Model
- Creating a Prediction
Getting a Toy Data File
If you do not have any dataset handy, you can download Fisher’s Iris dataset using the curl command below or by just clicking on the link.
curl -o iris.csv https://static.bigml.com/csv/iris.csv
$ Getting iris.csv
Authentication
The following snippet will help you set up an environment variable (i.e., BIGML_AUTH) to store your username and API key and avoid typing them again in the rest of the examples. See this section for more details.
Note: Use your own username and API Key.
export BIGML_USERNAME=alfred
export BIGML_API_KEY=79138a622755a2383660347f895444b1eb927730
export BIGML_AUTH="username=$BIGML_USERNAME;api_key=$BIGML_API_KEY"
$ Setting Alfred's Authentication Parameters
Creating a Source
To create a new source, POST the file containing your data to the source base URL.
curl "https://bigml.io/andromeda/source?$BIGML_AUTH" -F file=@iris.csv
> Creating a source
To create more sources simply repeat the curl command above using another file. Make sure to use the full path if the file is not in your current directory.
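For example, assuming a hypothetical file sitting in another directory, a second source could be created like this:
curl "https://bigml.io/andromeda/source?$BIGML_AUTH" -F file=@/home/alfred/data/diabetes.csv
> Creating a source from a file given by its full path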
Creating a Remote Source
You can also create a source using a valid URL that points to your data or some public data. For example:
curl "https://bigml.io/andromeda/source?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"remote": "https://static.bigml.com/csv/iris.csv"}'
> Creating a remote source
Creating an Inline Source
You can also create a source using some inline data. For example:
curl "https://bigml.io/andromeda/source?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"data": "a,b,c,d\n1,2,3,4\n5,6,7,8"}'
> Creating an inline source
{
"code": 201,
"content_type": "application/octet-stream",
"created": "2012-03-01T05:29:07.217968",
"credits": 0.0087890625,
"file_name": "iris.csv",
"md5": "d1175c032e1042bec7f974c91e4a65ae",
"name": "iris.csv",
"number_of_datasets": 0,
"number_of_models": 0,
"number_of_predictions": 0,
"private": true,
"resource": "source/4f52824203ce893c0a000053",
"size": 4608,
"source_parser": {
"header": true,
"locale": "en-US",
"missing_tokens": [
"N/A",
"n/a",
"NA",
"na",
"-",
"?"
],
"quote": "\"",
"separator": ",",
"trim": true
},
"status": {
"code": 2,
"elapsed": 0,
"message": "The source creation has been started"
},
"type": 0,
"updated": "2012-03-01T05:29:07.217990"
}
< Example source JSON response
Creating a Dataset
To create a dataset, POST the source/id from the previous step to the dataset base URL as follows.
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"source": "source/4f52824203ce893c0a000053"}'
> Creating a dataset
{
"code": 201,
"columns": 5,
"created": "2012-03-04T02:58:11.910363",
"credits": 0.0087890625,
"fields": {
"000000": {
"column_number": 0,
"name": "sepal length",
"optype": "numeric"
},
"000001": {
"column_number": 1,
"name": "sepal width",
"optype": "numeric"
},
"000002": {
"column_number": 2,
"name": "petal length",
"optype": "numeric"
},
"000003": {
"column_number": 3,
"name": "petal width",
"optype": "numeric"
},
"000004": {
"column_number": 4,
"name": "species",
"optype": "categorical"
}
},
"name": "iris' dataset",
"number_of_models": 0,
"number_of_predictions": 0,
"private": true,
"resource": "dataset/4f52da4303ce896fe3000000",
"rows": 0,
"size": 4608,
"source": "source/4f52824203ce893c0a000053",
"source_parser": {
"header": true,
"locale": "en-US",
"missing_tokens": [
"N/A",
"n/a",
"NA",
"na",
"-",
"?"
],
"quote": "\"",
"separator": ",",
"trim": true
},
"source_status": true,
"status": {
"code": 1,
"message": "The dataset is being processed and will be created soon"
},
"updated": "2012-03-04T02:58:11.910387"
}
< Dataset
Creating a Model
To create a model, POST the dataset/id from the previous step to the model base URL. By default BigML.io will include all fields as predictors and will treat the last non-text field as the objective. In the Models Section you will learn how to customize the input fields or the objective field.
curl "https://bigml.io/andromeda/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f52da4303ce896fe3000000"}'
> Creating a model
{
"code": 201,
"columns": 5,
"created": "2012-03-04T03:46:53.033372",
"credits": 0.03515625,
"dataset": "dataset/4f52da4303ce896fe3000000",
"dataset_status": true,
"holdout": 0.0,
"input_fields": [],
"max_columns": 5,
"max_rows": 150,
"name": "iris' dataset model",
"number_of_predictions": 0,
"objective_fields": [],
"private": true,
"range": [
1,
150
],
"resource": "model/4f52e5ad03ce898798000000",
"rows": 150,
"size": 4608,
"source": "source/4f52824203ce893c0a000053",
"source_status": true,
"status": {
"code": 1,
"message": "The model is being processed and will be created soon"
},
"updated": "2012-03-04T03:46:53.033396"
}
< Model
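For instance, if you wanted to set the objective field yourself instead of relying on the default, a request along the following lines should work. This is only a sketch; see the Models Section for the exact argument names and the full argument list. The field id shown is the species field from the dataset above.
# Sketch: the objective_field argument and the field id "000004" are illustrative
curl "https://bigml.io/andromeda/model?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"dataset": "dataset/4f52da4303ce896fe3000000", "objective_field": "000004"}'
> Creating a model with an explicit objective field (sketch)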
Creating a Prediction
To create a prediction, POST the model/id and some input data to the prediction base URL.
curl "https://bigml.io/andromeda/prediction?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"model": "model/4f52e5ad03ce898798000000", "input_data": {"000000": 5, "000001": 3}}'
> Creating a prediction
{
"code": 201,
"created": "2012-03-04T04:11:10.433996",
"credits": 0.01,
"dataset": "dataset/4f52da4303ce896fe3000000",
"dataset_status": true,
"fields": {
"000000": {
"column_number": 0,
"datatype": "double",
"name": "sepal length",
"optype": "numeric"
},
"000001": {
"column_number": 1,
"datatype": "double",
"name": "sepal width",
"optype": "numeric"
},
"000002": {
"column_number": 2,
"datatype": "double",
"name": "petal length",
"optype": "numeric"
},
"000004": {
"column_number": 4,
"datatype": "string",
"name": "species",
"optype": "categorical"
}
},
"input_data": {
"000000": 5,
"000001": 3
},
"model": "model/4f52e5ad03ce898798000000",
"model_status": true,
"name": "Prediction for species",
"objective_fields": [
"000004"
],
"prediction": {
"000004": "Iris-virginica"
},
"prediction_path": {
"bad_fields": [],
"next_predicates": [
{
"count": 100,
"field": "000002",
"operator": ">",
"value": 2.45
},
{
"count": 50,
"field": "000002",
"operator": "<=",
"value": 2.45
}
],
"path": [],
"unknown_fields": []
},
"private": true,
"resource": "prediction/4f52eb5e03ce898798000009",
"source": "source/4f52824203ce893c0a000053",
"source_status": true,
"status": {
"code": 5,
"message": "The prediction has been created"
},
"updated": "2012-03-04T04:11:10.434030"
}
< Prediction
Overview
Last Updated: Thursday, 2020-10-08 20:05
This page provides an introduction to BigML.io—The BigML API. A quick start guide for the impatient is here.
BigML.io is a Machine Learning REST API to easily build, run, and bring predictive models to your project. You can use BigML.io for basic supervised and unsupervised machine learning tasks and also to create sophisticated machine learning pipelines.
BigML.io is a REST-style API for creating and managing BigML resources programmatically. That is to say, using BigML.io you can create, retrieve, update and delete BigML resources using standard HTTP methods.
BigML.io gives you:
- Secure programmatic access to all your BigML resources.
- Fully white-box access to your datasets, models, clusters and anomaly detectors.
- Asynchronous creation of resources.
- Near real-time predictions.
Jump to:
- BigML Resources
- REST API
- HTTPS
- Base URL
- Version
- Summary of HTTP Methods
- Resource ID
- Libraries
- Limits
BigML Resources
BigML.io gives you access to the following resources: project, externalconnector, source, dataset, sample, correlation, statisticaltest, configuration, and composite.
The four original BigML resources are: source, dataset, model, and prediction.
As shown in the figure below, the most basic flow consists of using some local (or remote) training data to create a source, then using the source to create a dataset, later using the dataset to create a model, and, finally, using the model and new input data to create a prediction.
[Figure: the basic workflow from training data to source, dataset, model, and prediction]
The training data is usually in tabular format. Each row in the data represents an instance (or example) and each column a field (or attribute). These fields are also known as predictors or covariates.
When the machine learning task is supervised, one of the columns (usually the last column) represents a special attribute known as the objective field (or target) that assigns a label (or class) to each instance. Training data in this format is called labeled, and the machine learning task that learns from it is called supervised learning.
Once a source is created, it can be used to create multiple datasets. Likewise, a dataset can be used to create multiple models and a model can be used to create multiple predictions.
A model can be either a classification or a regression model depending on whether the objective field is respectively categorical or numeric.
Often an ensemble (or collection of models) can perform better than just a single model. Thus, a dataset can also be used to create an ensemble instead of a single model.
A dataset can also be used to create a cluster or an anomaly detector. Clusters and Anomaly Detectors are both built using unsupervised learning and therefore an objective field is not needed. In these cases, the training data is named unlabeled.
A centroid is to a cluster what a prediction is to a model. Likewise, an anomaly score is to an anomaly detector what a prediction is to a model.
There are scenarios where generating predictions for a relatively big collection of input data is very convenient. For these scenarios, BigML.io offers batch resources such as: batchprediction, batchcentroid, and batchanomalyscore. These resources take a dataset and, respectively, a model (or ensemble), a cluster, or an anomaly detector to create a new dataset that contains a new column with the corresponding prediction, centroid, or anomaly score computed for each instance in the dataset.
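As a rough sketch of what such a request might look like (the argument names follow the pattern of the examples in this document; see the Batch Predictions section for the actual argument list), a batch prediction takes a model and a dataset:
# Sketch: the resource ids are illustrative
curl "https://bigml.io/andromeda/batchprediction?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"model": "model/4f52e5ad03ce898798000000",
         "dataset": "dataset/4f52da4303ce896fe3000000"}'
> Creating a batch prediction (sketch)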
When dealing with multiple projects, it's better to keep the resources that belong to each project separate. Thus, BigML also has a resource named project that helps you group together all the other resources. As you will see, you just need to assign a source to a pre-existing project and all the subsequent resources will be created in that project (see the sketch below).
Note: In the snippets below you should substitute your own username and API key for Alfred's.
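For example, a new source could be assigned to a pre-existing project at creation time. This is a minimal sketch: it assumes a project argument is accepted on source creation, and the project id shown is illustrative.
# Sketch: the project id is illustrative
curl "https://bigml.io/andromeda/source?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"remote": "https://static.bigml.com/csv/iris.csv",
         "project": "project/54c8168df0a5eae58c000019"}'
> Creating a source in a project (sketch)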
REST API
BigML.io conforms to the design principles of Representational State Transfer (REST). BigML.io is entirely HTTPS-based.
You can create, read, update, and delete resources using the respective standard HTTP methods: POST, GET, PUT and DELETE.
All communication with BigML.io is JSON formatted except for source creation. Source creation is handled with an HTTP POST using the "multipart/form-data" content type.
HTTPS
All access to BigML.io must be performed over HTTPS. In this way communication between your application and BigML.io is encrypted and the integrity of traffic between both is verified.
Base URL
All BigML.io HTTP commands use the following base URL:
https://bigml.io/andromeda
Base URL
Version
The BigML.io API is versioned using code names instead of version numbers. The current version name is "andromeda", so URLs that require this specific version are written as follows: https://bigml.io/andromeda/
Version
Specifying the version name is optional. If you omit the version name in your API requests, you will always get access to the latest API version. While we will do our best to make future API versions backward compatible it is possible that a future API release could cause your application to fail.
Specifying the API version in your HTTP calls will ensure that your application continues to function for the life cycle of the API release.
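To make the difference concrete (a sketch assuming the version segment can simply be dropped from the URLs used throughout this document), both requests below list your sources; the first always uses the latest API version, while the second is pinned to andromeda:
curl "https://bigml.io/source?$BIGML_AUTH"
curl "https://bigml.io/andromeda/source?$BIGML_AUTH"
$ Unversioned vs. version-pinned requests (sketch)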
Summary of HTTP Methods
BigML.io uses the standard POST, GET, PUT, and DELETE HTTP methods to create, retrieve, update, and delete resources, respectively.
Operation | HTTP method | Semantics |
---|---|---|
CREATE | POST | Creates a new resource. Only certain fields are "postable". This method is not idempotent. Each valid POST request results in a new directly accessible resource. |
RETRIEVE | GET | Retrieves either a specific resource or a list of resources. This method is idempotent. The content type of the resources is always "application/json; charset=utf-8". |
UPDATE | PUT | Updates partial content of a resource. Only certain fields are "putable". This method is idempotent. |
DELETE | DELETE | Deletes a resource. This method is idempotent. |
Resource ID
All BigML resources are identified by a name composed of two parts separated by a slash "/". The first part is the type of the resource and the second part is a 24-char unique identifier. See the examples below:
source/4f510d2003ce895676000069
dataset/4f510cfc03ce895676000040
model/4f51473203ce89b7ef000005
ensemble/523e9017035d0772e600b285
prediction/4f51473b03ce89b7ef000008
evaluation/50a30a453c19200bd1000839
Examples of resource ids
Libraries
We have developed light-weight API bindings for Python, Node.js, and Java.
A number of libraries for many other languages have been developed by the growing BigML community: C#, Ruby, PHP, and iOS. If you are interested in library support for a particular language, let us know. Or, if you are motivated to develop a library, we will give you all the support that we can.
Limits
BigML.io is currently limited to 1,000,000 (one million) requests per API key per hour. Please email us if you have a specific use case that requires a higher rate limit.
Authentication
Last Updated: Thursday, 2020-10-08 20:05
All requests to BigML.io must be authenticated using your username and your API key, passed as parameters in the query string. For example:
https://bigml.io/andromeda/source?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730
Example URL to list your sources
Your BigML API Key is a unique identifier that is assigned exclusively to your account. You can manage your BigML API Key in your account settings. Remember to keep your API key secret.
To use BigML.io from the command line, we recommend setting your username and API key as environment variables. Using environment variables is also an easy way to keep your credentials out of your source code.
Note: Use your own username and API Key.
export BIGML_USERNAME=alfred
export BIGML_API_KEY=79138a622755a2383660347f895444b1eb927730
export BIGML_AUTH="username=$BIGML_USERNAME;api_key=$BIGML_API_KEY"
$ Setting Alfred's Authentication Parameters
set BIGML_USERNAME=alfred
set BIGML_API_KEY=79138a622755a2383660347f895444b1eb927730
set BIGML_AUTH=username^=%BIGML_USERNAME%;api_key^=%BIGML_API_KEY%
$ Setting Alfred's Authentication Parameters in Windows
Here is an example of an authenticated API request to list the sources in your account from a command line.
curl "https://bigml.io/andromeda/source?$BIGML_AUTH"
$ Example request to list your sources
Alternative Keys
Alternative Keys allow you to give fine-grained access to your BigML resources. To create an alternative key, you need to use BigML's web interface. There you can define which resources an alternative key can access and which operations (i.e., create, list, retrieve, update, or delete) are allowed with it. This is useful in scenarios where you want to grant different roles and privileges to different applications. For example, one application for the IT folks that collects data and creates sources in BigML, another that data scientists use to create and evaluate models, and a third that the marketing folks use to create predictions.
You can read more about alternative keys here.
Organizations
Last Updated: Wednesday, 2020-12-09 09:40
An organization is a permission-based grouping of resources that helps you centralize your organization's resources. The permissions can be managed in a company-specific dashboard, and a user can be a member of multiple organizations at the same time. All resources are created under a specific project in the organization. A project can be configured as private or public, and you can control who has access to your projects and to the resources under them.
Organization Member Types
There are 4 types of membership for an organization.
- A restricted member can create, retrieve, update, and delete resources in the organization project, and view public or private projects that the user has access to.
- A member has the restricted member privileges and can also create public or private projects in the organization. A public project can be accessed by any user of the organization, and a private project can be accessed only by those who have permission to the project. When a project is created or updated, certain organization users can be assigned the manage, write, or read permission. A user with the admin permission or an organization administrator can update and delete the project. A user with the write permission can create, retrieve, update, and delete resources in the project, and a user with the read permission can only read existing resources in the project. The user who creates the project automatically has the admin permission until that user is specifically removed from the project or the organization. For example, let's say a user with a member role, John, is in the sales department. John has created a private project, Sales Reports, and added users Amy and Mike to the write permission list. Now John has been transferred to the marketing department and should no longer have access to the Sales Reports project. John can grant Amy or another organization user the admin permission, allowing that user to update or delete the project in the future, and then remove himself from the list. If John has already been removed or is unavailable, this can also be done by any administrator. Any user with the write permission on the project can create, update, and delete resources and move their personal resources into the project. However, once a personal resource is moved under an organization project, it cannot be moved back to the personal account. Finally, users with the read permission can view all resources in the project, but they cannot update or delete them, or create new ones.
- An administrator has full access to all projects and resources in the organization, and can manage the users and their membership of the organization.
- The owner has all privileges that an administrator has plus billing, and is the only one who can update and delete the organization.
Each user can have only one role. If a user is assigned multiple roles, only the role with the highest privilege will be considered. For example, if a user is assigned both the member and restricted member roles, the user's final role in the organization will be member.
All resources created under the organization have the username and user_id properties filled with the owner's username and id, and a separate property creator which is the username of the user who actually created the resource.
Authentication
In addition to your username and api_key, all access to BigML organization resources requires an additional parameter in the query string to authenticate.
As explained above, an organization resource must be created under a project. In order to create, retrieve, update, and delete an organization resource, you must pass project in the query string. Thus, even if project is defined in the HTTP POST body, it will simply be ignored in favor of the project property in the query string. For HTTP GET requests for retrieving a list of resources, the project property is used as a filter, so the response contains only the resources under the specified project. For HTTP GET requests for retrieving an individual resource, HTTP PUT, or HTTP DELETE requests, the project property will only be used for authentication to the organization. It means the resource doesn't have to exist in the authenticated project. If you have the read permission for the specific resource, you can retrieve the resource even if the resource is not in the project defined in the query string. Likewise, if you have the read-write permission for the resource, you can update or delete the resource. There is one exception though.
For scripts or libraries, if the resource is shared across all projects in the organization (i.e., public_in_organization: true), then only the creator of the resource or administrators can update or delete the resource. Note that such resources will also be included in responses to the HTTP GET requests for retrieving a list of resources, regardless of which project the resources actually belong to.
Finally, in order to retrieve organization project resources, you need to pass the organization parameter instead of the project parameter. See the examples below.
https://bigml.io/andromeda/source?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730;project=project/5948be694e17273079000000
Example URL to list your sources in an organization project
https://bigml.io/andromeda/project?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730;organization=organization/5728cce44e1727587a000000
Example URL to list your projects in an organization
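From the command line, the same requests can be made with curl using the BIGML_AUTH variable defined earlier (the project and organization ids are the example ids above):
curl "https://bigml.io/andromeda/source?$BIGML_AUTH;project=project/5948be694e17273079000000"
curl "https://bigml.io/andromeda/project?$BIGML_AUTH;organization=organization/5728cce44e1727587a000000"
$ Listing organization resources from the command line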
Requests
Last Updated: Thursday, 2020-10-08 20:05
BigML.io uses the standard POST, GET, PUT, and DELETE HTTP methods to create, retrieve, update, and delete individual resources, respectively. You can also list all your resources for each resource type.
Jump to:
- Creating a Resource
- Retrieving a Resource
- Updating a Resource
- Deleting a Resource
- Listing Resources
- Paginating Resources
- Filtering Resources
- Ordering Resources
- Webhooks
Creating a Resource
To create a new resource, you need to POST an object to the resource's base URL. The content-type must always be "application/json". The only exception is source creation which requires the "multipart/form-data" content type.
For example, to create a model with a dataset, you can use curl like this:
curl "https://bigml.io/andromeda/model/?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006"}'
> Creating a model
The following is an example of what a request header would look like for the request:
POST /model?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
> Example model create request
If the request succeeds, BigML.io will return the newly created resource document.
A number of required and optional arguments exist for each type of resource. You can see a detailed arguments list for each resource in their respective sections: project, external connector, source, dataset, sample, correlation, statistical test, configuration, and composite.
Retrieving a Resource
To retrieve a resource, you need to issue an HTTP GET request to the resource/id to be retrieved. Each resource has a unique identifier in the form resource/id, where resource is the type of the resource (such as dataset or model) and id is a string of 24 alphanumeric characters that you can use to retrieve the resource or as a parameter to create other resources from it.
For example, using curl you can do something like this to retrieve a dataset:
curl "https://bigml.io/andromeda/dataset/54d86680f0a5ea5fc0000011?$BIGML_AUTH"
$ Retrieving a dataset from the command line
The following is an example of what a request header would look like for a dataset GET request:
GET /dataset/54d86680f0a5ea5fc0000011?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
> Example dataset retrieve request
Once a resource has been successfully created, it will have properties. A number of properties exist for each type of resource. You can see a detailed property list for each resource in their respective sections: projects, externalconnectors, sources, datasets, samples, correlations, statisticaltests, configurations, and composites.
Updating a Resource
To update a resource, you need to PUT an object containing the fields that you want to update to the resource's base URL. The content-type must always be: "application/json".
If the request succeeds, BigML.io will respond with an HTTP 202 (Accepted) code and the updated resource in the body of the message.
For example, to update a project with a new name, a new category, a new description, and new tags you can use curl like this:
curl "https://bigml.io/andromeda/project/54d9553bf0a5ea5fc0000016?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "My new Project",
"category": 3,
"description": "My first BigML Project",
"tags": ["fraud", "detection"]}'
$ Updating a project
The following is an example of what a request header would look like for the request:
PUT /project/54d9553bf0a5ea5fc0000016?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
> Example project update request
Deleting a Resource
To delete a resource, you need to issue an HTTP DELETE request to the resource/id to be deleted.
For example, using curl you can do something like this to delete a dataset:
curl -X DELETE "https://bigml.io/andromeda/dataset/54d86680f0a5ea5fc0000011?$BIGML_AUTH"
$ Deleting a dataset from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return HTTP 204 responses with no body.
HTTP/1.1 204 NO CONTENT
Content-Length: 0
< Successful response
Once you delete a resource, it is permanently deleted. That is, a delete request cannot be undone.
For example, if you try to delete a dataset a second time, or a dataset that does not exist, you will receive an error like this:
{
"code": 404,
"status": {
"code": -1201,
"extra": [
"A dataset matching the provided arguments could not be found"
],
"message": "Id does not exist"
}
}
< Error trying to delete a dataset that does not exist
The following is an example of what a request header would look like for a dataset DELETE request:
DELETE /dataset/54d86680f0a5ea5fc0000011?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
> Example dataset delete request
Listing Resources
To list all the resources of a given type, you can use its base URL. By default, only the 20 most recent resources will be returned. You can see below how to change this number using the limit parameter.
You can get the list of each resource type directly in your browser using your own username and API key with the following links.
https://bigml.io/andromeda/project?$BIGML_AUTH
https://bigml.io/andromeda/externalconnector?$BIGML_AUTH
https://bigml.io/andromeda/source?$BIGML_AUTH
https://bigml.io/andromeda/dataset?$BIGML_AUTH
https://bigml.io/andromeda/sample?$BIGML_AUTH
https://bigml.io/andromeda/correlation?$BIGML_AUTH
https://bigml.io/andromeda/statisticaltest?$BIGML_AUTH
https://bigml.io/andromeda/configuration?$BIGML_AUTH
https://bigml.io/andromeda/composite?$BIGML_AUTH
> Listing resources from a browser
You can also easily list them from the command line using curl as follows:
curl "https://bigml.io/andromeda/project?$BIGML_AUTH"
curl "https://bigml.io/andromeda/externalconnector?$BIGML_AUTH"
curl "https://bigml.io/andromeda/source?$BIGML_AUTH"
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH"
curl "https://bigml.io/andromeda/sample?$BIGML_AUTH"
curl "https://bigml.io/andromeda/correlation?$BIGML_AUTH"
curl "https://bigml.io/andromeda/statisticaltest?$BIGML_AUTH"
curl "https://bigml.io/andromeda/configuration?$BIGML_AUTH"
curl "https://bigml.io/andromeda/composite?$BIGML_AUTH"
$ Listing resources from the command line
The following is an example of what a request header would look like when you request a list of models:
GET /model?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730
Host: bigml.io
> Example model list request
A successful listing request returns a JSON object with the following properties:
Property | Type | Description |
---|---|---|
meta | Object | Specifies which page of the listing you are on, how to get to the previous page and the next page, and the total number of resources. |
objects | Array of resources | A list of resources filtered and ordered according to the criteria that you supply in your request. See the filtering and ordering options for more details. |
Meta objects have the following properties: limit, next, offset, previous, and total_count.
For example, when you list your projects, they will be displayed as below:
{
"meta": {
"limit": 20,
"next": "/?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730&offset=20"
"offset": 0,
"previous": null,
"total_count": 54
},
"objects": [
{
"category": 0,
"code": 200,
"created": "2015-01-27T22:51:57.488000",
"description": "",
"name": "Project 1",
"private": true,
"resource": "project/54c8168df0a5eae58c000019",
...
},
{
"category": 0,
"code": 200,
"created": "2015-01-29T04:08:12.696000",
"description": "",
"name": "Project 2",
"private": true,
"resource": "project/54c9b22cf0a5ea7765000000",
...
},
...
]
}
< Listing of projects template
Paginating Resources
There are two parameters, limit and offset, that can help you retrieve just a portion of your resources and paginate them.
If a limit is given, no more than that many resources will be returned, but possibly fewer if the request itself yields fewer resources. The offset indicates how many resources to skip before the listing starts.
For example, if you want to retrieve only the third and fourth latest projects:
curl "https://bigml.io/andromeda/project?$BIGML_AUTH;limit=2;offset=2"
$ Paginating projects from the command line
To paginate results, you need to start off with an offset of zero, then increment it by whatever value you use for the limit each time. So if you wanted to return resources 1-10, then 11-20, then 21-30, etc., you would use "limit=10;offset=0", "limit=10;offset=10", and "limit=10;offset=20", respectively.
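Following that pattern, the first three pages of ten projects each would be requested as follows:
curl "https://bigml.io/andromeda/project?$BIGML_AUTH;limit=10;offset=0"
curl "https://bigml.io/andromeda/project?$BIGML_AUTH;limit=10;offset=10"
curl "https://bigml.io/andromeda/project?$BIGML_AUTH;limit=10;offset=20"
$ Paginating projects ten at a time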
Filtering Resources
The listings of resources can be filtered by any of the fields that we labeled as filterable in the table describing the properties of a resource type. For example, to retrieve all the projects tagged with "fraud":
https://bigml.io/andromeda/project?$BIGML_AUTH;tags__in=fraud
> Filtering projects by tag from a browser
curl "https://bigml.io/andromeda/project?$BIGML_AUTH;tags__in=fraud"
$ Filtering projects by tag from the command line
In addition to exact match, there are more filters that you can use. To add one of these filters to your request you just need to append one of the suffixes in the following table to the name of the property that you want to use as a filter.
Filter | Description |
---|---|
! optional | Not. Example: !size=1048576 (<>1MB) |
__gt optional | Greater than. Example: size__gt=1048576 (>1MB) |
__gte optional | Greater than or equal to. Example: size__gte=1048576 (>=1MB) |
__contains optional | Case-sensitive word match. Example: name__contains=test |
__icontains optional | Case-insensitive word match. Example: name__icontains=test |
__in optional | Case-sensitive list word match. Example: tags__in=fraud,test |
__lt optional | Less than. Example: created__lt=2016-08-20T00:00:00.000000 (before 2016-08-20) |
__lte optional | Less than or equal to. Example: created__lte=2016-08-20T00:00:00.000000 (before or on 2016-08-20) |
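Assuming these suffixes can be chained in the query string the same way other parameters are combined above, a request like the following would list the datasets larger than 1MB that were created before 2016-08-20:
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH;size__gt=1048576;created__lt=2016-08-20T00:00:00.000000"
$ Combining filters from the command line (sketch)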
Ordering Resources
A list of resources can also be ordered by any of the fields that we labeled as sortable in the table describing the properties of a resource type.
For example, you can list your projects ordered by descending name directly in your browser, using your own username and API key, with the following link.
https://bigml.io/andromeda/project?$BIGML_AUTH;order_by=-name
> Listing projects ordered by name from a browser
You can do the same thing from the command line using curl as follows:
curl "https://bigml.io/andromeda/project?$BIGML_AUTH;order_by=-name"
$ Listing projects ordered by name from the command line
Webhooks
Webhooks allow you to build or set up apps which subscribe to the events triggered when the resource creation is complete or halted with an error. When the finished or error event is triggered, BigML.io can send an HTTP POST payload to the webhook's configured URL.
When you create a resource, you can specify the webhook parameter in the POST payload. For example, to create a model with a dataset, you can use curl like this:
curl "https://bigml.io/andromeda/model/?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006",
"webhook":{"url": "http://myhost/path/to/webhook"}}'
> Creating a model with a webhook
When the resource creation is complete, BigML.io calls the provided URL, sending an HTTP POST payload, and expects to receive an HTTP 201 status code. Optionally, you can provide the secret parameter to secure your webhook. If provided, the value of secret will be used as the key to generate the HMAC hex digest of the request body in the X-BigML-Signature header. It uses the sha1 hash function and the value will always have the prefix sha1=. The headers also contain X-BigML-Delivery, a GUID that identifies the delivery, and User-Agent, which is BigML.io. The payload of the POST request is in JSON format, so be sure to accept Content-Type: application/json.
"webhook":{
"url": "http://myhost/path/to/webhook",
"secret": "mysecret"
}
> Example webhook parameter with secret
The following is an example of a POST request to the webhook server. Note that the headers contain X-BigML-Signature when secret is provided.
POST /path/to/webhook HTTP/1.1
Host: localhost:800
X-BigML-Delivery: dd04ace6-c2c7-4c62-afff-d6514c016ad7
X-BigML-Signature: sha1=b7f0e0b9401f85ab00c8c8c575a5d71006788eec
User-Agent: BigML.io
Content-Type: application/json;charset=utf-8
Content-Length: 162
{
"event": "finished",
"message": "The model has been created",
"resource": "model/5ba2ccc54e172745a0000000",
"timestamp": "2018-09-19 22:25:11 GMT"
}
> Example POST request to the webhook server
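On the receiving end, a webhook handler can check the signature by recomputing the HMAC-SHA1 hex digest of the raw request body with the shared secret and comparing it to the value after the sha1= prefix in the X-BigML-Signature header. A minimal command-line sketch, assuming the openssl tool is available and the raw body has been saved to a hypothetical payload.json file:
# Recompute the HMAC-SHA1 hex digest of the exact raw body with the shared secret,
# then compare it to the X-BigML-Signature value after its "sha1=" prefix.
openssl dgst -sha1 -hmac "mysecret" payload.json
$ Verifying a webhook signature (sketch)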
The following is an example of the webhook property in the response body of a model resource.
"webhook": {
"delivery": {
"confirmation_id": "dd04ace6-c2c7-4c62-afff-d6514c016ad7",
"method": "queue",
"status": "delivered"
},
"event": "finished",
"secret": "mysecret",
"signature": "sha1=b7f0e0b9401f85ab00c8c8c575a5d71006788eec",
"timestamp": "2018-09-19T22:25:11.536000",
"url": "http://myhost/path/to/webhook"
}
> Example webhook property in a response
Responses
Last Updated: Thursday, 2020-10-08 20:05
Every response from BigML.io uses conventional HTTP headers and a JSON body. For example, here are the headers and body returned after creating a dataset:
HTTP/1.1 201 CREATED
Server: nginx/1.0.5
Date: Sat, 03 Mar 2012 23:28:59 GMT
Content-Type: application/json; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Location: https://bigml.io/andromeda/dataset/4f5a59b203ce8945c200000a
< Example HTTP response
{
"code": 201,
"columns": 5,
"created": "2012-03-03T23:28:59.404542",
"credits": 0.0087890625,
"fields": {
"000000": {
"column_number": 0,
"name": "sepal length",
"optype": "numeric"
},
"000001": {
"column_number": 1,
"name": "sepal width",
"optype": "numeric"
},
"000002": {
"column_number": 2,
"name": "petal length",
"optype": "numeric"
},
"000003": {
"column_number": 3,
"name": "petal width",
"optype": "numeric"
},
"000004": {
"column_number": 4,
"name": "species",
"optype": "categorical"
}
},
"name": "iris' dataset",
"number_of_models": 0,
"number_of_predictions": 0,
"private": true,
"resource": "dataset/4f5a59b203ce8945c200000a",
"rows": 0,
"size": 4608,
"source": "source/4f52824203ce893c0a000053",
"source_parser": {
"header": true,
"locale": "en-US",
"missing_tokens": [
"N/A",
"n/a",
"NA",
"na",
"-",
"?"
],
"quote": "\"",
"separator": ",",
"trim": true
},
"source_status": true,
"status": {
"code": 1,
"message": "The dataset is being processed and will be created soon"
},
"updated": "2012-03-03T23:28:59.404561"
}
< Example JSON response
Error Codes
Errors also use conventional HTTP response headers. For example, here is the header for a 404 response:
HTTP/1.1 404 NOT FOUND
Content-Type: application/json; charset=utf-8
Date: Fri, 03 Mar 2012 23:29:18 GMT
Server: nginx/1.1.11
Content-Length: 169
Connection: keep-alive
< Example HTTP error response
{
"code": 404,
"status": {
"code": -1201,
"extra": [
"4f5157f1035d07306600005b"
],
"message": "Id does not exist"
}
}
< Example JSON error response
Status Codes
Last Updated: Thursday, 2020-10-08 20:05
This section lists the different status codes BigML.io sends in responses. First, we list the HTTP status codes, then the codes that define a resource creation status, and finally detailed error codes for every resource.
Jump to:
- HTTP Status Code Summary
- Resource Status Code Summary
- Error Code Summary
- Source Error Code Summary
- Dataset Error Code Summary
- Download Dataset Unsuccessful Requests
- Sample Error Code Summary
- Correlation Error Code Summary
- Statistical Test Error Code Summary
- Model Error Code Summary
- Ensemble Error Code Summary
- Linear Regression Error Code Summary
- Logistic Regression Error Code Summary
- Cluster Error Code Summary
- Anomaly Error Code Summary
- Association Error Code Summary
- Topic Model Error Code Summary
- PCA Error Code Summary
- Time Series Error Code Summary
- Deepnet Error Code Summary
- Composite Error Code Summary
- Fusion Error Code Summary
- OptiML Error Code Summary
- Prediction Error Code Summary
- Centroid Error Code Summary
- Anomaly Score Error Code Summary
- Association Set Error Code Summary
- Topic Distribution Error Code Summary
- Projection Error Code Summary
- Forecast Error Code Summary
- Batch Prediction Error Code Summary
- Batch Centroid Error Code Summary
- Batch Anomaly Score Error Code Summary
- Batch Topic Distribution Error Code Summary
- Batch Projection Error Code Summary
- Evaluation Error Code Summary
- Whizzml Library Error Code Summary
- Whizzml Script Error Code Summary
- Whizzml Execution Error Code Summary
HTTP Status Code Summary
BigML.io returns meaningful HTTP status codes for every request. The same status code is returned in both the HTTP header of the response and in the JSON body.
Code | Status | Semantics |
---|---|---|
200 | OK | Your request was successful and the JSON response should include the resource that you requested. |
201 | Created | A new resource was created. You can get the new resource complete location through the HTTP headers or the resource/id through the resource key of the JSON response. |
202 | Accepted | Received after sending a request to update a resource if it was processed successfully. |
204 | No Content | Received after sending a request to delete a resource if it was processed successfully. |
400 | Bad Request | Your request is malformed, missed a required parameter, or used an invalid value as parameter. |
401 | Unauthorized | Your request used the wrong username or API key. |
402 | Payment Required | Your subscription plan does not allow you to perform this action because you have exceeded your subscription limit. Please wait until your running tasks complete or upgrade your plan. |
403 | Forbidden | Your request is trying to access a resource that you do not own. |
404 | Not Found | The resource that you requested or used as a parameter in a request does not exist anymore. |
405 | Not Allowed | Your request is trying to use an HTTP method that is not supported or to change fields of a resource that cannot be modified. |
411 | Length Required | Your request is trying to PUT or POST without sending any content or specifying its length. |
413 | Request Entity Too Large | The size of the content in your request is greater than what we support for PUT or POST. |
415 | Unsupported Media Type | Your request is trying to POST 'multipart/form-data' content but is actually sending the wrong content-type. |
429 | Too Many Requests | You have sent too many requests in a given amount of time. |
500 | Internal Server Error | Your request could not be processed because something went wrong on BigML's end. |
503 | Service Unavailable | BigML.io is undergoing maintenance. |
Resource Status Code Summary
The creation of resources involves a computational task that can last a few seconds or a few days depending on the size of the data. Consequently, some HTTP POST requests to create a resource may launch an asynchronous task and return immediately. In order to know the completion status of this task, each resource has a status field that reports the current state of the request. This status is useful to monitor the progress during their creation. The possible states for a task are:
Code | Status | Semantics | |
---|---|---|---|
0 | Waiting | The resource is waiting for another resource to be finished before BigML.io can start processing it. | |
1 | Queued | The task that is going to create the resource has been accepted but has been queued because there are other tasks using the system. | |
2 | Started | The task to create the resource has started and you should expect partial results soon. | |
3 | In Progress | The task has computed the first partial resource but still needs to do more computations. | |
4 | Summarized | This status is specific to datasets. It happens when the dataset has been computed but its data has not been serialized yet. The dataset is final, but you cannot use it yet to create a model; if you do, the model will wait until the dataset is finished. | |
5 | Finished | The task is completed and the resource is final. | |
-1 | Faulty | The task has failed. We either could not process the task as you requested it or have an internal issue. | |
-2 | Unknown | The task has reached a state that we cannot verify at this time. This is a status you should never see unless BigML.io suffers a major outage. |
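Because resource creation is asynchronous, a client typically polls the resource until its status code reaches 5 (Finished) or becomes negative (Faulty). A minimal polling sketch, assuming the jq command-line JSON processor is installed and using an illustrative dataset id:
# Sketch: poll every two seconds until the resource is finished (5) or faulty (negative)
RESOURCE="dataset/4f52da4303ce896fe3000000"
while true; do
  CODE=$(curl -s "https://bigml.io/andromeda/$RESOURCE?$BIGML_AUTH" | jq '.status.code')
  if [ "$CODE" -eq 5 ] || [ "$CODE" -lt 0 ]; then break; fi
  sleep 2
done
$ Polling a resource until it is finished or faulty (sketch)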
Error Code Summary
This is the list of possible general error codes you can receive from BigML.io when managing any type of resource.
Error Code | Semantics |
---|---|
-1100 | Unauthorized use |
-1101 | Not enough credits |
-1102 | Wrong resource |
-1104 | Cloned resources cannot be public |
-1105 | Price cannot be changed |
-1107 | Too many projects |
-1108 | Too many tasks |
-1109 | Subscription required |
-1200 | Missing parameter |
-1201 | Invalid Id |
-1203 | Field Error |
-1204 | Bad Request |
-1205 | Value Error |
-1206 | Validation Error |
-1207 | Unsupported Format |
-1208 | Invalid Sort Error |
Source Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing sources.
Error Code | Semantics |
---|---|
-2000 | This source cannot be read properly |
-2001 | Bad request to create a source |
-2002 | The source could not be created |
-2003 | The source cannot be retrieved |
-2004 | The source cannot be deleted now |
-2005 | Faulty source |
Dataset Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing datasets.
Error Code | Semantics |
---|---|
-3000 | The source is not ready yet |
-3001 | Bad request to create a dataset |
-3002 | The dataset cannot be created |
-30021 | The dataset cannot be created now |
-3003 | The dataset cannot be retrieved |
-3004 | The dataset cannot be deleted now |
-3005 | Faulty dataset |
-3008 | The dataset could not be cloned properly |
Download Dataset Unsuccessful Requests
This is the list of possible specific error codes you can receive from BigML.io managing downloads.
Error Code | Semantics |
---|---|
-9000 | The dataset export is not ready yet |
-9001 | Bad request to perform a dataset export |
-9002 | The dataset export cannot be performed |
-90021 | The dataset export cannot be performed now |
-9003 | The dataset export cannot be retrieved now |
-9004 | The dataset export cannot be deleted now |
-9005 | The dataset export could not be performed |
-9006 | Dataset exports aren't available for cloned datasets |
Sample Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing samples.
Error Code | Semantics |
---|---|
-16000 | The sample is not ready yet |
-16001 | Bad request to create a sample |
-16002 | Your sample cannot be created |
-16021 | Your sample cannot be created now |
-16003 | The sample cannot be retrieved now |
-16004 | Cannot delete sample now |
-16005 | The sample could not be created |
Correlation Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing correlations.
Error Code | Semantics |
---|---|
-18000 | The correlation is not ready yet |
-18001 | Bad request to create a correlation |
-18002 | Your correlation cannot be created |
-18021 | Your correlation cannot be created now |
-18003 | The correlation cannot be retrieved now |
-18004 | Cannot delete correlation now |
-18005 | The correlation could not be created |
Statistical Test Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing statistical tests.
Error Code | Semantics |
---|---|
-17000 | The statistical test is not ready yet |
-17001 | Bad request to create a statistical test |
-17002 | Your statistical test cannot be created |
-17021 | Your statistical test cannot be created now |
-17003 | The statistical test cannot be retrieved now |
-17004 | Cannot delete statistical test now |
-17005 | The statistical test could not be created |
Model Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing models.
Error Code | Semantics |
---|---|
-4000 | The dataset is not ready. A one-click model has been requested but the corresponding dataset is not ready yet |
-4001 | Bad request to create a model |
-4002 | The model cannot be created |
-40021 | The model cannot be created now |
-4003 | The model cannot be retrieved |
-4004 | The model cannot be deleted now |
-4005 | Faulty model |
-4008 | The model could not be cloned properly |
Ensemble Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing ensembles.
Error Code | Semantics |
---|---|
-8001 | Bad request to create an ensemble |
-8002 | The ensemble cannot be created |
-80021 | The ensemble cannot be created now |
-8003 | The ensemble cannot be retrieved now |
-8004 | The ensemble cannot be deleted now |
-8005 | The ensemble could not be created |
-8008 | The ensemble could not be cloned properly |
Logistic Regression Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing logistic regressions.
Error Code | Semantics |
---|---|
-22000 | The logistic regression is not ready yet |
-22001 | Bad request to create a logistic regression |
-22002 | Your logistic regression cannot be created |
-22021 | Your logistic regression cannot be created now |
-22003 | The logistic regression cannot be retrieved now |
-22004 | Cannot delete logistic regression now |
-22005 | The logistic regression could not be created |
Cluster Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing clusters.
Error Code | Semantics |
---|---|
-10000 | The cluster is not ready yet |
-10001 | Bad request to create a cluster |
-10002 | The cluster cannot be created |
-10003 | The cluster cannot be created now |
-10004 | The cluster cannot be retrieved now |
-10005 | The cluster cannot be deleted now |
-10008 | The cluster could not be cloned properly |
Anomaly Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing anomaly detectors.
Error Code | Semantics |
---|---|
-13000 | The anomaly detector is not ready yet |
-13001 | Bad request to create an anomaly detector |
-13002 | The anomaly detector cannot be created |
-13021 | The anomaly detector cannot be created now |
-13003 | The anomaly detector cannot be retrieved now |
-13004 | The anomaly detector cannot be deleted now |
-13005 | The anomaly detector could not be created |
-13008 | The anomaly detector could not be cloned properly |
Association Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing associations.
Error Code | Semantics |
---|---|
-23000 | The association is not ready yet |
-23001 | Bad request to create an association |
-23002 | Your association cannot be created |
-23021 | Your association cannot be created now |
-23003 | The association cannot be retrieved now |
-23004 | Cannot delete association now |
-23005 | The association could not be created |
Topic Model Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing topic models.
Error Code | Semantics |
---|---|
-26000 | The topic model is not ready yet |
-26001 | Bad request to create a topic model |
-26002 | Your topic model cannot be created |
-26021 | Your topic model cannot be created now |
-26003 | The topic model cannot be retrieved now |
-26004 | Cannot delete topic model now |
-26005 | The topic model could not be created |
PCA Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing PCAs.
Error Code | Semantics |
---|---|
-37000 | The PCA is not ready yet |
-37001 | Bad request to create a PCA |
-37002 | Your PCA cannot be created |
-37021 | Your PCA cannot be created now |
-37003 | The PCA cannot be retrieved now |
-37004 | Cannot delete PCA now |
-37005 | The PCA could not be created |
Time Series Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing time series.
Error Code | Semantics |
---|---|
-30000 | The time series is not ready yet |
-30001 | Bad request to create a time series |
-30002 | Your time series cannot be created |
-30021 | Your time series cannot be created now |
-30003 | The time series cannot be retrieved now |
-30004 | Cannot delete time series now |
-30005 | The time series could not be created |
Deepnet Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing deepnets.
Error Code | Semantics |
---|---|
-33001 | Bad request to create a deepnet |
-33002 | The deepnet cannot be created |
-330021 | The deepnet cannot be created now |
-33003 | The deepnet cannot be retrieved now |
-33004 | The deepnet cannot be deleted now |
-33005 | The deepnet could not be created |
Composite Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing composites.
Error Code | Semantics |
---|---|
-34001 | Bad request to create a composite |
-34002 | The composite cannot be created |
-340021 | The composite cannot be created now |
-34003 | The composite cannot be retrieved now |
-34004 | The composite cannot be deleted now |
-34005 | The composite could not be created |
Fusion Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing fusions.
Error Code | Semantics |
---|---|
-35001 | Bad request to create a fusion |
-35002 | The fusion cannot be created |
-350021 | The fusion cannot be created now |
-35003 | The fusion cannot be retrieved now |
-35004 | The fusion cannot be deleted now |
-35005 | The fusion could not be created |
OptiML Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing optimls.
Error Code | Semantics |
---|---|
-36001 | Bad request to create an optiml |
-36002 | The optiml cannot be created |
-360021 | The optiml cannot be created now |
-36003 | The optiml cannot be retrieved now |
-36004 | The optiml cannot be deleted now |
-36005 | The optiml could not be created |
Prediction Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing predictions.
Error Code | Semantics |
---|---|
-5000 | This model is not ready yet |
-5001 | Bad request to create a prediction |
-5002 | The prediction can not be created |
-5003 | The prediction cannot be retrieved |
-5004 | The prediction cannot be deleted now |
-5005 | The prediction could not be created |
Centroid Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing centroids.
Error Code | Semantics |
---|---|
-11001 | Bad request to create a centroid |
-11002 | Your centroid cannot be created now |
-11003 | The centroid cannot be retrieved now |
-11004 | Cannot delete centroid now |
-11005 | The centroid could not be created |
Anomaly Score Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing anomaly scores.
Error Code | Semantics |
---|---|
-14001 | Bad request to create an anomaly score |
-14002 | Your anomaly score cannot be created now |
-14003 | The anomaly score cannot be retrieved now |
-14004 | Cannot delete anomaly score now |
-14005 | The anomaly score could not be created |
Association Set Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing association sets.
Error Code | Semantics |
---|---|
-24001 | Bad request to create an association set |
-24002 | Your association set cannot be created now |
-24003 | The association set cannot be retrieved now |
-24004 | Cannot delete association set now |
-24005 | The association set could not be created |
Topic Distribution Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing topic distributions.
Error Code | Semantics |
---|---|
-27001 | Bad request to create a topic distribution |
-27002 | Your topic distribution cannot be created now |
-27003 | The topic distribution cannot be retrieved now |
-27004 | Cannot delete topic distribution now |
-27005 | The topic distribution could not be created |
Projection Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing projections.
Error Code | Semantics |
---|---|
-38001 | Bad request to create a projection |
-38002 | Your projection cannot be created now |
-38003 | The projection cannot be retrieved now |
-38004 | Cannot delete projection now |
-38005 | The projection could not be created |
Forecast Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing forecasts.
Error Code | Semantics |
---|---|
-31001 | Bad request to create a forecast |
-31002 | Your forecast cannot be created now |
-31003 | The forecast cannot be retrieved now |
-31004 | Cannot delete forecast now |
-31005 | The forecast could not be created |
Batch Prediction Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing batch predictions.
Error Code | Semantics |
---|---|
-6001 | Bad request to perform a batch prediction |
-6002 | The batch prediction cannot be performed |
-60021 | The batch prediction cannot be performed now |
-6003 | The batch prediction cannot be retrieved now |
-6004 | The batch prediction cannot be deleted now |
-6005 | The batch prediction could not be performed |
Batch Centroid Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing batch centroids.
Error Code | Semantics |
---|---|
-12001 | Bad request to perform a batch centroid |
-12002 | The batch centroid cannot be performed |
-12021 | The batch centroid cannot be performed now |
-12003 | The batch centroid cannot be retrieved now |
-12004 | The batch centroid cannot be deleted now |
-12005 | The batch centroid could not be performed |
Batch Anomaly Score Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing batch anomaly scores.
Error Code | Semantics |
---|---|
-15001 | Bad request to perform a batch anomaly score |
-15002 | The batch anomaly score cannot be performed |
-15021 | The batch anomaly score cannot be performed now |
-15003 | The batch anomaly score cannot be retrieved now |
-15004 | The batch anomaly score cannot be deleted now |
-15005 | The batch anomaly score could not be performed |
Batch Topic Distribution Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing batch topic distributions.
Error Code | Semantics |
---|---|
-28001 | Bad request to perform a batch topic distribution |
-28002 | The batch topic distribution cannot be performed |
-28021 | The batch topic distribution cannot be performed now |
-28003 | The batch topic distribution cannot be retrieved now |
-28004 | The batch topic distribution cannot be deleted now |
-28005 | The batch topic distribution could not be performed |
Batch Projection Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing batch projections.
Error Code | Semantics |
---|---|
-39001 | Bad request to perform a batch projection |
-39002 | The batch projection cannot be performed |
-39021 | The batch projection cannot be performed now |
-39003 | The batch projection cannot be retrieved now |
-39004 | The batch projection cannot be deleted now |
-39005 | The batch projection could not be performed |
Evaluation Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing evaluations.
Error Code | Semantics |
---|---|
-7001 | Bad request to perform an evaluation |
-7002 | The evaluation cannot be performed |
-70021 | The evaluation cannot be performed now |
-7003 | The evaluation cannot be retrieved now |
-7004 | The evaluation cannot be deleted now |
-7005 | The evaluation could not be performed |
WhizzML Library Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing libraries.
Error Code | Semantics |
---|---|
-19000 | The library is not ready yet |
-19001 | Bad request to create a library |
-19002 | Your library cannot be created |
-19021 | Your library cannot be created now |
-19003 | The library cannot be retrieved now |
-19004 | Cannot delete library now |
-19005 | The library could not be created |
WhizzML Script Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing scripts.
Error Code | Semantics |
---|---|
-20000 | The script is not ready yet |
-20001 | Bad request to create a script |
-20002 | Your script cannot be created |
-20021 | Your script cannot be created now |
-20003 | The script cannot be retrieved now |
-20004 | Cannot delete script now |
-20005 | The script could not be created |
-20008 | The script could not be cloned properly |
WhizzML Execution Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing executions.
Error Code | Semantics |
---|---|
-21000 | The execution is not ready yet |
-21001 | Bad request to create an execution |
-21002 | Your execution cannot be created |
-21021 | Your execution cannot be created now |
-21003 | The execution cannot be retrieved now |
-21004 | Cannot delete execution now |
-21005 | The execution could not be created |
Category Codes
Last Updated: Thursday, 2020-10-08 20:05
Category | Description |
---|---|
-1 | Uncategorized |
0 | Miscellaneous |
1 | Automotive, Engineering & Manufacturing |
2 | Energy, Oil & Gas |
3 | Banking & Finance |
4 | Fraud & Crime |
5 | Healthcare |
6 | Physical, Earth & Life Sciences |
7 | Consumer & Retail |
8 | Sports & Games |
9 | Demographics & Surveys |
10 | Aerospace & Defense |
11 | Chemical & Pharmaceutical |
12 | Higher Education & Scientific Research |
13 | Human Resources & Psychology |
14 | Insurance |
15 | Law & Order |
16 | Media, Marketing & Advertising |
17 | Public Sector & Nonprofit |
18 | Professional Services |
19 | Technology & Communications |
20 | Transportation & Logistics |
21 | Travel & Leisure |
22 | Utilities |
Category | Description |
---|---|
-1 | Uncategorized |
0 | Miscellaneous |
1 | Advanced Workflow |
2 | Anomaly Detection |
3 | Association Discovery |
4 | Basic Workflow |
5 | Boosting |
6 | Classification |
7 | Classification/Regression |
8 | Correlations |
9 | Cluster Analysis |
10 | Data Transformation |
11 | Evaluation |
12 | Feature Engineering |
13 | Feature Extraction |
14 | Feature Selection |
15 | Hyperparameter Optimization |
16 | Model Selection |
17 | Prediction and Scoring |
18 | Regression |
19 | Stacking |
20 | Statistical Test |
Projects
Last Updated: Thursday, 2020-10-08 20:05
A project is an abstract resource that helps you group related BigML resources together.
A project must have a name and optionally a category, description, and multiple tags to help you organize and retrieve your projects.
When you create a new source you can assign it to a pre-existing project. All the subsequent resources created using that source will belong to the same project.
All the resources created within a project will inherit the name, description, and tags of the project unless you change them when you create the resources or update them later.
When you select a project on your BigML dashboard, you will only see the BigML resources related to that project. Using your BigML dashboard you can also create, update, and delete projects (and all their associated resources).
BigML.io allows you to create, retrieve, update, and delete your projects. You can also list all of your projects.
Jump to:
- Project Base URL
- Creating a Project
- Project Arguments
- Retrieving a Project
- Project Properties
- Updating a Project
- Deleting a Project
- Listing Projects
Project Base URL
You can use the following base URL to create, retrieve, update, and delete projects. https://bigml.io/andromeda/project
Project base URL
All requests to manage your projects must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Project
To create a new project, you just need to POST the name you want to give to the new project to the project base URL.
You can easily do this using curl.
curl "https://bigml.io/andromeda/project?$BIGML_AUTH" \
-H 'content-type: application/json' \
-d '{"name": "My First Project"}'
> Creating a project
BigML.io will return a newly created project document, if the request succeeded.
{
"category":0,
"created":"2015-02-02T07:49:20.226764",
"description":"",
"name":"My First Project",
"private":true,
"resource":"project/54d9553bf0a5ea5fc0000016",
"stats":{
"anomalies":{
"count":0
},
"anomalyscores":{
"count":0
},
"batchanomalyscores":{
"count":0
},
"batchcentroids":{
"count":0
},
"batchpredictions":{
"count":0
},
"batchtopicdistributions":{
"count":0
},
"centroids":{
"count":0
},
"clusters":{
"count":0
},
"configurations":{
"count":0
},
"correlations":{
"count":0
},
"datasets":{
"count":0
},
"ensembles":{
"count":0
},
"evaluations":{
"count":0
},
"models":{
"count":0
},
"predictions":{
"count":0
},
"sources":{
"count":0
},
"statisticaltests":{
"count":0
},
"topicmodels":{
"count":0
},
"topicdistributions":{
"count":0
}
},
"status":{
"code":5,
"message":"The project has been created"
},
"tags":[],
"updated":"2015-02-02T07:49:20.226781"
}
< Example project JSON response
In addition to the name, you can also use the following arguments.
Project Arguments
Argument | Type | Description |
---|---|---|
category (optional) | Integer, default is 0 | The category that best describes the project. See the category codes for the complete list of categories. Example: 1 |
description (optional) | String | A description of the project up to 8192 characters long. Example: "This is a description of my new project" |
name (optional) | String, default is Project Number | The name you want to give to the new project. Example: "my new project" |
tags (optional) | Array of Strings | A list of strings that help classify and index your project. Example: ["best customers", "2018"] |
webhook (optional) | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
You can also use curl to customize your new project with a category, description, or tags. For example, you can create a new project with all those arguments as follows:
curl "https://bigml.io/andromeda/project?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{
"name": "Fraud Detection",
"category": 4,
"description": "Detecting fraud in bank transactions",
"tags": ["fraud", "detection"]
}'
> Creating a project with arguments
Retrieving a Project
Each project has a unique identifier in the form "project/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the project.
To retrieve a project with curl:
curl "https://bigml.io/andromeda/project/54d9553bf0a5ea5fc0000016?$BIGML_AUTH"
$ Retrieving a project from the command line
Project Properties
Once a project has been successfully created it will have the following properties.
Property | Type | Description |
---|---|---|
category (filterable, sortable, updatable) | Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
code | Integer | HTTP status code. This will be 201 upon successful creation of the project and 200 afterwards. Check the code that comes with the status attribute to make sure that the project creation has been completed without errors. |
created (filterable, sortable) | ISO-8601 Datetime | This is the date and time in which the project was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
description (updatable) | String | A text describing the project. It can contain restricted markdown to decorate the text. |
name (filterable, sortable, updatable) | String | The name of the project as provided. |
private (filterable, sortable) | Boolean | Whether the project is public or not. |
resource | String | The project/id. |
stats | Object | An object keyed by resource that informs of the number of resources created. |
status | Object | A description of the status of the project. It includes a code, a message, and some extra information. See the table below. |
tags (filterable, updatable) | Array of Strings | A list of user tags that can help classify and index this resource. |
updated (filterable, sortable) | ISO-8601 Datetime | This is the date and time in which the project was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
Updating a Project
To update a project, you need to PUT an object containing the fields that you want to update to the project's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated project.
For example, to update a project with a new name, a new category, a new description, and new tags you can use curl like this:
curl "https://bigml.io/andromeda/project/54d9553bf0a5ea5fc0000016?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "My New Project",
"category": 3,
"description": "My first BigML project",
"tags": ["fraud", "detection"]}'
$ Updating a project
Deleting a Project
To delete a project, you need to issue an HTTP DELETE request to the project/id to be deleted.
Using curl you can do something like this to delete a project:
curl -X DELETE "https://bigml.io/andromeda/project/54d9553bf0a5ea5fc0000016?$BIGML_AUTH"
$ Deleting a project from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a project, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a project a second time, or a project that does not exist, you will receive a "404 not found" response.
However, if you try to delete a project that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Projects
To list all the projects, you can use the project base URL. By default, only the 20 most recent projects will be returned. You can see below how to change this number using the limit parameter.
You can get your list of projects directly in your browser using your own username and API key with the following links.
https://bigml.io/andromeda/project?$BIGML_AUTH
> Listing projects from a browser
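For example, to list only the five most recent projects you can append the limit pagination parameter to the same URL. A minimal sketch (limit is covered with the rest of the listing parameters later in this document; the semicolon separator matches how $BIGML_AUTH is appended in the other examples):
curl "https://bigml.io/andromeda/project?$BIGML_AUTH;limit=5"
$ Listing the five most recent projects from the command line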
External Connectors
Last Updated: Thursday, 2020-10-08 20:05
An external connector is an abstract resource that helps you create sources from several external data stores, such as relational databases or an Elasticsearch engine.
An external connector must have a name and optionally a category, description, and multiple tags to help you organize and retrieve your external connectors.
An external connector stores all the data required to access an external data source and create a new source from it.
BigML.io allows you to create, retrieve, update, and delete your external connectors. You can also list all of your external connectors.
Jump to:
- External Connector Base URL
- Creating an External Connector
- External Connector Arguments
- Retrieving an External Connector
- External Connector Properties
- Updating an External Connector
- Deleting an External Connector
- Listing External Connectors
External Connector Base URL
You can use the following base URL to create, retrieve, update, and delete external connectors. https://bigml.io/andromeda/externalconnector
External Connector base URL
All requests to manage your external connectors must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating an External Connector
To create a new external connector, you just need to POST to the external connector base URL.
You can easily do this using curl.
curl "https://bigml.io/andromeda/externalconnector?$BIGML_AUTH" \
-H 'content-type: application/json' \
-d '{"source": "elasticsearch"}'
> Creating an external connector
BigML.io will return a newly created external connector document, if the request succeeded.
You can also use the following arguments.
External Connector Arguments
Argument | Type | Description |
---|---|---|
category (optional) | Integer, default is 0 | The category that best describes the external connector. See the category codes for the complete list of categories. Example: 1 |
connection | Object | Set of parameters that describes the connection to the data source to generate the external connector. See Creating a Source Using External Data for more detail. |
description (optional) | String | A description of the external connector up to 8192 characters long. Example: "This is a description of my new external connector" |
engine | String | Name of the external data source to connect to. Available options are: "elasticsearch", "postgresql", "mysql", "sqlserver". Example: "elasticsearch" |
name (optional) | String, default is External Connector | The name you want to give to the new external connector. Example: "my new external connector" |
project (optional) | String | The project/id you want the external connector to belong to. Example: "project/54d98718f0a5ea0b16000000" |
public_in_organization (optional) | Boolean, default is false | Whether the external connector is public within the organization. Example: true |
source (optional) | String | Name of the external data source to connect to. Available options are: "elasticsearch", "postgresql", "mysql", "sqlserver". Example: "elasticsearch" DEPRECATED |
tags (optional) | Array of Strings | A list of strings that help classify and index your external connector. Example: ["best customers", "2018"] |
webhook (optional) | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
You can also use curl to customize your new external connector with a category, source or connection. For example, you can create a new external connector with all those arguments as follows:
curl "https://bigml.io/andromeda/externalconnector?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{
"name": "My external connector",
"category": 4,
"source": "elasticsearch",
"connection": {"hosts": ["localhost:9200"]}
}'
> Creating an external connector with arguments
Retrieving an External Connector
Each external connector has a unique identifier in the form "externalconnector/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the external connector.
To retrieve an external connector with curl:
curl "https://bigml.io/andromeda/externalconnector/5e2e96931f386f11e1000002?$BIGML_AUTH"
$ Retrieving an external connector from the command line
You can also use your browser to visualize the external connector using the full BigML.io URL or pasting the externalconnector/id into the BigML.com dashboard.
External Connector Properties
Once an external connector has been successfully created it will have the following properties.
Property | Type | Description |
---|---|---|
category (filterable, sortable, updatable) | Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
code | Integer | HTTP status code. This will be 201 upon successful creation of the external connector and 200 afterwards. Check the code that comes with the status attribute to make sure that the external connector creation has been completed without errors. |
connection (filterable, sortable, updatable) | Object | Set of parameters that describes the connection to the external data source. |
created (filterable, sortable) | ISO-8601 Datetime | This is the date and time in which the external connector was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
description (updatable) | String | A text describing the external connector. It can contain restricted markdown to decorate the text. |
name (filterable, sortable, updatable) | String | The name of the external connector as provided. |
project (filterable, sortable, updatable) | String | The project/id the resource belongs to. |
public_in_organization | Boolean | Whether the external connector is public within the organization. |
resource | String | The externalconnector/id. |
source (filterable, sortable) | String | The name of the external data source to connect to. |
status | Object | A description of the status of the external connector. It includes a code, a message, and some extra information. See the table below. |
subscription (filterable, sortable) | Boolean | Whether the external connector was created using a subscription plan or not. |
tags (filterable, updatable) | Array of Strings | A list of user tags that can help classify and index this resource. |
updated (filterable, sortable) | ISO-8601 Datetime | This is the date and time in which the external connector was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
Updating an External Connector
To update an external connector, you need to PUT an object containing the fields that you want to update to the external connector's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated external connector.
For example, to update an external connector with a new name, a new category, a new description, and new tags you can use curl like this:
curl "https://bigml.io/andromeda/externalconnector/5e2e96931f386f11e1000002?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "My New External Connector",
"category": 3,
"description": "My first External Connector",
"tags": ["connector"]}'
$ Updating an external connector
Deleting an External Connector
To delete an external connector, you need to issue an HTTP DELETE request to the externalconnector/id to be deleted.
Using curl you can do something like this to delete an external connector:
curl -X DELETE "https://bigml.io/andromeda/externalconnector/5e2e96931f386f11e1000002?$BIGML_AUTH"
$ Deleting an external connector from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete an external connector, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete an external connector a second time, or an external connector that does not exist, you will receive a "404 not found" response.
However, if you try to delete an external connector that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing External Connectors
To list all the external connectors, you can use the externalconnector base URL. By default, only the 20 most recent external connectors will be returned. You can see below how to change this number using the limit parameter.
You can get your list of external connectors directly in your browser using your own username and API key with the following links.
https://bigml.io/andromeda/externalconnector?$BIGML_AUTH
> Listing external connectors from a browser
Sources
Last Updated: Wednesday, 2020-12-09 09:40
A source is the raw data that you want to use to create a predictive model. A source is usually a (big) file in comma-separated values (CSV) format. See the example below. Each row represents an instance (or example). Each column in the file represents a feature or field. The last column usually represents the class or objective field. The file may have a first header row with a name for each field.
Plan,Talk,Text,Purchases,Data,Age,Churn?
family,148,72,0,33.6,50,TRUE
business,85,66,0,26.6,31,FALSE
business,83,64,0,23.3,32,TRUE
individual,9,66,94,28.1,21,FALSE
family,15,0,0,35.3,29,FALSE
individual,66,72,175,25.8,51,TRUE
business,0,0,0,30,32,TRUE
family,18,84,230,45.8,31,TRUE
individual,71,110,240,45.4,54,TRUE
family,59,64,0,27.4,40,FALSE
Example CSV file
A source:
- Should be a comma-separated values (CSV) file. Spaces, tabs, and semicolons are also valid separators.
- Weka's ARFF files are also supported.
- JSON in a few formats is also supported. See below for more details.
- Microsoft Excel files or macOS Numbers files should also work most of the time, but it is better to export them to CSV (comma-separated values) first.
- Cannot be bigger than 64GB.
- Can be gzipped (.gz) or compressed (.bz2). It can be zipped (.zip), but only if the archive contains one single file.
You can also create sources from remote locations using a variety of protocols like https, hdfs, s3, asv, odata/odatas, dropbox, gcs, gdrive, etc. See below for more details.
BigML.io allows you to create, retrieve, update, and delete your sources. You can also list all of your sources.
Jump to:
- JSON Sources
- Source Base URL
- Creating a Source
- Creating a Source Using a Local File
- Creating a Source Using a URL
- Creating a Source Using Inline Data
- Creating a Source with Automatically Generated Synthetic Data
- Creating a Source Using External Data
- Text Processing
- Items Detection
- Datetime Detection
- Source Arguments
- Retrieving a Source
- Source Properties
- Filtering and Paginating Fields from a Source
- Updating a Source
- Deleting a Source
- Listing Sources
JSON Sources
BigML.io can parse JSON data in one of the following formats:
-
A top-level list of lists of atomic values, each one defining a row.
Valid JSON Source format (a list of lists)
[
  ["length","width","height","weight","type"],
  [5.1,3.5,1.4,0.2,"A"],
  [4.9,3.0,1.4,0.2,"B"],
  ...
]
-
A top-level list of dictionaries,
where each dictionary's values represent the row values and the corresponding keys the column names.
The first dictionary defines the keys that will be selected.
Valid JSON Source format (a list of dictionaries)
[
  {"length":5.1,"width":3.5,"height":1.4,"weight":0.2,"type":"A"},
  {"length":4.9,"width":3.0,"height":1.4,"weight":0.2,"type":"B"},
  ...
]
-
A top-level list of dictionaries with the request parameter
json_key defined under source_parser
with the value of one of its keys having any of the two formats above.
For the following example, you can set "source_parser": {"json_key": "data"} (see the request sketch after this list).
Valid JSON Source format (a list of dictionaries with json_key)
{
  "name": "Shipping Class",
  "data": [
    {"length":5.1,"width":3.5,"height":1.4,"weight":0.2,"type":"A"},
    {"length":4.9,"width":3.0,"height":1.4,"weight":0.2,"type":"B"},
    ...
  ],
  "size": 5148
}
-
A nested dictionary key, with the final value having any of the formats already described.
For the following example, you can set "source_parser": {"json_key": "results.data"}.
Valid JSON Source format (a nested dictionary key)
{
  "name": "Shipping Class",
  "results": {
    "meta": "Shipping class info",
    "data": [
      {"length":5.1,"width":3.5,"height":1.4,"weight":0.2,"type":"A"},
      {"length":4.9,"width":3.0,"height":1.4,"weight":0.2,"type":"B"},
      ...
    ]
  },
  "size": 5148
}
-
A top-level dictionary of dictionaries whose values represent rows.
Valid JSON Source format (a dictionary of dictionaries)
{
  "GnCC": {"length":5.1,"width":3.5,"height":1.4,"weight":0.2,"type":"A"},
  "4R3R": {"length":4.9,"width":3.0,"height":1.4,"weight":0.2,"type":"B"},
  ...
}
-
Rows of JSON dictionaries where the full file is not a valid JSON document, but each individual line is.
Each line in the file (or input stream) must be a separate JSON list or map,
and is thus parsed individually as a top-level JSON document and interpreted as a row of data.
To be a valid JSON row, each line must fall into one of two categories:
* A JSON dictionary, with at least some of its values being atomic. In that case, the keys in the dictionary are taken as the field names and the corresponding atomic values as the actual columns of the row. Keys with values that are composite are just ignored. The first map in the file determines the fields that will be extracted in subsequent rows, unless they are specified via the request parameter json_fields defined under source_parser or the query string parameter bigml_json_fields, just as with the json_key explained above.
* A JSON list, again with at least some of its values being atomic. Here, the field names can be inferred from the values of the first list in the file, using the same heuristics that are used to auto-detect headers in CSVs. Or you can set the header flag as well as use json_fields under source_parser (or bigml_json_fields in the query string) to give explicit names in the creation request. Non-atomic values appearing in the lists are translated to missing values.
Here's a snippet of JSON rows using maps:
JSON Rows using Maps
{"length":5.1,"width":3.5,"height":1.4,"weight":0.2,"type":"A"}
{"length":4.9,"width":3.0,"height":1.4,"weight":0.2,"type":"B"}
{"length":4.7,"width":3.2,"height":1.3,"weight":0.2,"type":"B"}
{"length":4.6,"width":3.1,"height":1.5,"weight":0.2,"type":"C"}
{"length":5.0,"width":3.6,"height":1.4,"weight":0.2,"type":"A"}
{"length":5.4,"width":3.9,"height":1.7,"weight":0.4,"type":"C"}
...
and here's JSON rows with lists:
JSON Rows using Lists
["length","width","height","weight","type"]
[5.1,3.5,1.4,0.2,"A"]
[4.9,3.0,1.4,0.2,"B"]
[4.7,3.2,1.3,0.2,"B"]
[4.6,3.1,1.5,0.2,"C"]
[5.0,3.6,1.4,0.2,"A"]
[5.4,3.9,1.7,0.4,"C"]
...
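As noted above, json_key is passed under source_parser in the creation request. A minimal sketch of creating a remote source from a JSON file in that format (the URL below is only a placeholder for your own file):
curl "https://bigml.io/andromeda/source?$BIGML_AUTH" \
 -X POST \
 -H 'content-type: application/json' \
 -d '{"remote": "https://example.com/shipping_class.json",
      "source_parser": {"json_key": "data"}}'
> Creating a remote source from a JSON file using json_key (sketch)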
Source Base URL
You can use the following base URL to create, retrieve, update, and delete sources. https://bigml.io/andromeda/source
Source base URL
All requests to manage your sources must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Source
You can create a new source in any of the following five ways:
- Local Sources: Using a local file. You need to post the file content in "multipart/form-data". The maximum size allowed is 64 GB per file.
- Remote Sources: Using a URL that points to your data. The maximum size allowed is 64 GB or 5 TB if you use a file stored in Amazon S3.
- Inline Sources: Using some inline data. The content type must be "application/json". The maximum size in this case is limited to 10 MB per POST.
- Synthetic Sources: Automatically generate synthetic data sources, presumably for activities such as testing, prototyping, and benchmarking.
- External Data Sources: Using connection data for an external database or document repository.
Creating a Source Using a Local File
To create a new source, you need to POST the file containing your data to the source base URL. The file must be attached in the post as a file upload. The Content-Type in your HTTP request must be "multipart/form-data" according to RFC2388. This allows you to upload binary files in a compressed format (.Z, .gz, etc.) that will upload faster.
You can easily do this using curl. The option -F (--form) lets curl emulate a filled-in form in which a user has pressed the submit button. You need to prefix the file path name with "@".
curl "https://bigml.io/andromeda/source?$BIGML_AUTH" -F file=@iris.csv
> Creating a source
Creating a Source Using a URL
To create a new remote source you need a URL that points to the data file that you want BigML to download for you.
You can easily do this using curl. The option -H lets curl set the content type header while the option -X sets the HTTP method. You can send the URL within a JSON object as follows:
curl "https://bigml.io/andromeda/source?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"remote": "https://static.bigml.com/csv/iris.csv"}'
> Creating a remote source
You can use the following types of URLs to create remote sources:
- HTTP or HTTPS. They can also include basic realm authorization.
Example URLs
https://test:test@static.bigml.com/csv/iris.csv
http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
- Public or private files in Amazon S3.
Example Amazon S3 URLs
s3://bigml-public/csv/iris.csv
s3://bigml-test/csv/iris.csv?access-key=AKIAIF6IUYDYUQ7BALJQ&secret-key=XgrQV/hHBVymD75AhFOzveX4qz7DYrO6q8WsM6ny
Creating a remote source from Google Drive and Google Storage
You have two options to create a remote data source from Google Drive and Google Storage via the API:
- Using BigML:
Allow BigML to access your Google Drive or Google Storage from the Cloud Storages section of your Account or from your Dashboard sources list. You will get the access token and the refresh token.
Google Drive example:
- Select the option to create a source from Google Drive.
- Allow BigML access to your Google Drive.
- Get the access token and refresh token.
You can easily create the remote source using curl as in the examples below:
curl "https://bigml.io/andromeda/source?$BIGML_AUTH" \
 -X POST \
 -H 'content-type: application/json' \
 -d '{"remote":"gdrive://noserver/0BxGbAMhJezOScTFBUVFPMy1xT1E?token=ya29.AQGn2MO578Fso0fVF0hGb0Q60ILCagAwvFEZoiovK5kvPWSt5z2QMbXDyAvqUtQeYdJbA39YkyRq3A&refresh-token=1/00qQ1040yDSyh_0DRLYZRnkC_62gE8M5Tpb68sfvmj8"}'
> Creating a remote source from Google Drive
curl "https://bigml.io/andromeda/source?$BIGML_AUTH" \
 -X POST \
 -H 'content-type: application/json' \
 -d '{"remote":"gcs://company_bucket/Iris.csv?token=ya29.AQGn2MO578Fso0fVF0hGb0Q60ILCagAwvFEZoiovK5kvPWSt5z2QMbXDyAvqUtQeYdJbA39YkyRq3A&refresh-token=1/00qQ1040yDSyh_0DRLYZRnkC_62gE8M5Tpb68sfvmj8"}'
> Creating a remote source from Google Cloud Storage
- Using your own app:
You can also create a remote source from your own App. You first need to authorize BigML access from your own Google Apps application. BigML only needs authorization for the read-only authentication scopes (https://www.googleapis.com/auth/devstorage.read_only, https://www.googleapis.com/auth/drive.readonly), but you can use any of the other available scopes (find the authentication scopes available for Google Drive and Google Storage). After the authorization process you will get your access token and refresh token from the Google Authorization Server.
Then the process is the same as creating a remote source using the BigML application described above. You need to POST to the source endpoint an object containing at least the file ID (for Google Drive) or the bucket and the file name (for Google Storage) and the access token, but in this case you will also need to include the app secret and app client from your App. Again, including the refresh token is optional.
Your values for app client and app secret appear as Client ID and Client secret in the Google developers console, respectively.
curl "https://bigml.io/andromeda/source?$BIGML_AUTH" \
 -X POST \
 -H 'content-type: application/json' \
 -d '{"remote":"gdrive://noserver/0BxGbAMhJezOSXy1oRU5MSU90SUU?token=ya29.AQGn2MO578Fso0fVF0hGb0Q60ILCagAwvFEZoiovK5kvPWSt5z2QMbXDyAvqUtQeYdJbA39YkyRq3A&refresh-token=1/00qQ1040yDSyh_0DRLYZRnkC_62gE8M5Tpb68sfvmj8&app-secret=AvFake1Secretjt27HQWTm4h&app-client=667300000007-07gjg5o912o1v422hfake2cli3nt3no6.apps.googleusercontent.com"}'
> Creating a remote source from Google Drive using your app
curl "https://bigml.io/andromeda/source?$BIGML_AUTH" \
 -X POST \
 -H 'content-type: application/json' \
 -d '{"remote":"gcs://company_bucket/Iris.csv?token=ya29.AQGn2MO578Fso0fVF0hGb0Q60ILCagAwvFEZoiovK5kvPWSt5z2QMbXDyAvqUtQeYdJbA39YkyRq3A&refresh-token=1/00qQ1040yDSyh_0DRLYZRnkC_62gE8M5Tpb68sfvmj8&app-secret=AvFake1Secretjt27HQWTm4h&app-client=667300000007-07gjg5o912o1v422hfake2cli3nt3no6.apps.googleusercontent.com"}'
> Creating a remote source from Google Cloud Storage using your app
Creating a Source Using Inline Data
You can also create sources by sending some inline data within the body of a POST HTTP request. This is especially useful if you want to model small amounts of data generated by an application.
To create an inline source using curl you can use the following example:
curl "https://bigml.io/andromeda/source?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"data": "a,b,c,d\n1,2,3,4\n5,6,7,8"}'
> Creating an inline source
Creating a Source with Automatically Generated Synthetic Data
You can also synthetically create sources using automatically generated data for activities such as testing, prototyping, and benchmarking.
To create a synthetic source using curl you can use the following example:
curl "https://bigml.io/andromeda/source?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"synthetic": {"fields": 10, "rows": 10}}'
> Creating a synthetic source
Creating a Source Using External Data
You can also create sources by importing data from external data stores like databases or document repositories. You will need to pass connection data for those repositories, with permission to access the data to be imported into BigML, as well as query information to filter the data to include in the source.
Note that for RDBMS connectors, the 'table', 'offset', 'sort', 'fields', and 'fields_exclude' parameters are ignored if a 'query' is provided. With respect to the 'limit', any limit within the provided 'query' will be applied first, before that of the 'limit' parameter value.
curl "https://bigml.io/andromeda/source?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"external_data": {
"source": "mysql",
"connection": {
"host":"mysql.host.com",
"port": 3306,
"database": "dbname",
"user": "dbuser",
"password": "dbpwd"
},
"query": "SELECT * FROM table_name"
}}'
> Creating a source from MySQL
PostgreSQL, MySQL, SQL Server
All RDBMS connectors share connection parameters.
Elasticsearch
Elasticsearch exposes several parameters to specify the connection to a server, including the hosts, authentication info, SSL config, etc. See Elasticsearch for a complete reference.
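For instance, reusing the hosts connection parameter shown in the external connector example earlier, a sketch of a source created from an Elasticsearch server could look like the request below (index and query filtering options are omitted; see the Elasticsearch reference above for the full set of connection parameters):
curl "https://bigml.io/andromeda/source?$BIGML_AUTH" \
 -X POST \
 -H 'content-type: application/json' \
 -d '{"external_data": {
 "source": "elasticsearch",
 "connection": {"hosts": ["localhost:9200"]}
 }}'
> Creating a source from Elasticsearch (sketch)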
Independently of how you create a new source (local, remote, inline, synthetic or external data) BigML.io will return a newly created source document, if the request succeeded.
{
"category": 0,
"code": 201,
"content_type": "application/octet-stream",
"created": "2012-11-15T02:24:59.686739",
"credits": 0.0,
"description": "",
"disable_datetime": false,
"fields_meta": {
"count": 0,
"limit": 200,
"offset": 0,
"total": 0
},
"file_name": "iris.csv",
"md5": "d1175c032e1042bec7f974c91e4a65ae",
"name": "iris.csv",
"number_of_datasets": 0,
"number_of_models": 0,
"number_of_predictions": 0,
"private": true,
"project": null,
"resource": "source/4f603fe203ce89bb2d000000",
"size": 4608,
"source_parser": {},
"status": {
"code": 1,
"message": "The request has been queued and will be processed soon"
},
"tags": [],
"type": 0,
"updated": "2012-11-15T02:24:59.686758"
}
< Example source JSON response
Source Arguments
In addition to the file, you can also use the following arguments.
Argument | Type | Description |
---|---|---|
category (optional) | Integer, default is 0 | The category that best describes the data. See the category codes for the complete list of categories. Example: 1 |
data (optional) | String | Data for inline source creation. Example: "a,b,c,d\n1,2,3,4\n5,6,7,8" |
description (optional) | String | A description of the source up to 8192 characters long. Example: "This is a description of my new source" |
disable_datetime (optional) | Boolean, default is false | Whether to disable the automatic generation of new fields from existing date-time fields. Example: true |
external_data (optional) | Object | Set of parameters to generate a source from external data. See this section. |
file (optional) | multipart/form-data; charset=utf-8 | File containing your data in CSV format. It can be compressed, gzipped, or zipped if the archive contains only one file. |
item_analysis (optional) | Object, default is shown in the table below | Set of parameters to activate item analysis for the source. |
name (optional) | String, default is Unnamed source | The name you want to give to the new source. Example: "my new source" |
project (optional) | String | The project/id you want the source to belong to. Example: "project/54d98718f0a5ea0b16000000" |
remote (optional) | String | A URL pointing to a file containing your data in CSV format. It can be compressed, gzipped, or zipped. Example: https://static.bigml.com/csv/iris.csv |
source_parser (optional) | Object, default is shown in the table below | Set of parameters to parse the source. |
synthetic (optional) | Object, default is shown in the table below | Set of parameters to generate a synthetic source, presumably for activities such as testing, prototyping, and benchmarking. |
tags (optional) | Array of Strings | A list of strings that help classify and index your source. Example: ["best customers", "2018"] |
term_analysis (optional) | Object, default is shown in the table below | Set of parameters to activate text analysis for the source. |
webhook (optional) | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
A source parser object is composed of any combination of the following properties.
Property | Type | Description |
---|---|---|
header (optional) | Boolean, default is true | Whether the source contains a header or not. Example: true |
json_fields | Array of Strings | The columns to be used when the source is in the JSON format and the rows are a list of dictionaries. See JSON Sources for more information. Example: ["age", "height", "weight"] |
json_key | String | A top-level dictionary key containing the rows when the source is in the JSON format. See JSON Sources for more information. Example: "data" |
locale (optional) | String, default is "en-US" | The locale of the source. Example: "es-ES" |
missing_tokens (optional) | Array of Strings, default is ["", "N/A", "n/a", "NULL", "null", "-", "#DIV/0", "#REF!", "#NAME?", "NIL", "nil", "NA", "na", "#VALUE!", "#NULL!", "NaN", "#N/A", "#NUM!", "?"] | Tokens that represent a missing value. Example: ["?"] |
quote (optional) | Char, default is """ | The source quote character. Example: "'" |
separator (optional) | Char, default is "," | The source separator character. Empty string if the source has only a single column. Example: ";" |
trim (optional) | Boolean, default is true | Whether to trim field strings or not. Example: true |
You can also use curl to customize your new source with a name and a different parser. For example, to create a new source named "my source", without a header and with "x" as the only missing token:
curl "https://bigml.io/andromeda/source?$BIGML_AUTH" \
-F file=@iris.csv \
-F 'name=my source' \
-F 'source_parser={"header": false, "missing_tokens":["x"]}'
> Creating a source with arguments
If you do not specify a name, BigML.io will assign to the source the same name as the file that you uploaded. If you do not specify a source_parser, BigML.io will do its best to automatically select the parsing parameters for you. However, if you do specify it, BigML.io will not try to second-guess you.
An item_analysis object is composed of any combination of the following properties.
A term_analysis object is composed of any combination of the following properties.
Property | Type | Description |
---|---|---|
bigrams (optional) | Boolean, default is false | Whether to include a contiguous sequence of two items from a given sequence of text. See n-gram for more information. This argument is deprecated in favor of ngrams and is equivalent to ngrams=2. Example: true DEPRECATED |
case_sensitive (optional) | Boolean, default is false | Whether text analysis should be case sensitive or not. Example: true |
enabled (optional) | Boolean, default is true | Whether text processing should be enabled or not. Example: true |
excluded_terms (optional) | Array of Strings, default is [] (an empty list) | Specifies a list of terms to ignore when performing term analysis. |
language (optional) | String, default is "en" | The default language of text fields in a two-letter language code, which will change the resulting stemming and tokenization. Available options are: "ar", "ca", "cs", "da", "de", "en", "es", "fa", "fi", "fr", "hu", "it", "ja", "ko", "nl", "pl", "pt", "ro", "ru", "sv", "tr", "zh", "none", or null for auto-detect. Example: "es" |
ngrams (optional) | Integer, default is 1 | A positive integer n specifying that all sequences of consecutive tokens of length n should be considered as terms, in addition to their constituent tokens (when separated by a single space and no stopwords). See n-gram for more information. The minimum value is 1 and the maximum value is 5. Example: 5 |
stem_words (optional) | Boolean, default is true | Whether lemmatization (stemming) of terms should be done, according to linguistic rules in the provided language. Note that if the language is, for example, zh, even English words will not be lemmatized as with English rules. Example: true |
stopword_diligence (optional) | String, default is "light" | The aggressiveness of stopword removal; the levels are light, normal, or aggressive, in order, where each level is a superset of the words in the previous ones. The most common languages will add stopwords at each level, but less common languages may not. Example: "normal" |
stopword_removal (optional) | String, default is "selected_language" | A string or keyword specifying the type of stopword removal to perform: none (remove no stopwords), selected_language (remove stopwords from the provided language), or all_languages (remove stopwords from all languages). Note that this parameter supersedes use_stopwords if provided. Also note that the null language does have a non-empty stopword list, such as single numeric digits. Example: "all_languages" |
term_filters (optional) | Array of Strings | Filters that should be applied to the chosen terms. Example: "html_keywords" |
term_regexps (optional) | Array | A list of strings specifying regular expressions to be matched against input documents. If present, these regular expressions will automatically be chosen for the final term list, and their per-document occurrence counts will be the number of matches of the expression in that document. |
token_mode (optional) | String, default is "all" | Whether tokens_only, full_terms_only or all should be tokenized. Example: "tokens_only" |
use_stopwords (optional) | Boolean, default is true | Whether to use stop words or not. This field is deprecated in favor of stopword_removal. Example: true DEPRECATED |
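For example, combining a few of the arguments above, you can create a source with its own text-analysis settings; they will apply to any text fields detected in the data. A minimal sketch (the URL is a placeholder for a CSV with text fields, and the values are illustrative only):
curl "https://bigml.io/andromeda/source?$BIGML_AUTH" \
 -X POST \
 -H 'content-type: application/json' \
 -d '{"remote": "https://example.com/reviews.csv",
 "term_analysis": {"language": "en",
 "ngrams": 2,
 "case_sensitive": false,
 "stopword_removal": "selected_language"}}'
> Creating a source with term_analysis arguments (sketch)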
A synthetic object is composed of the following properties.
Text Processing
While the handling of numeric, categorical, or items fields within a decision tree framework is fairly straightforward, the handling of text fields can be done in a number of different ways. BigML.io takes a basic and reasonably robust approach, leveraging some basic NLP techniques along with a simple bag-of-words style method of feature generation.
At the source level, BigML.io attempts to do basic language detection. Initially the language can be English ("en"), Spanish ("es"), Catalan/Valencian ("ca"), Dutch ("nl"), French ("fr"), German ("de"), Portuguese ("pt"), or "none" if no language is detected. In the near future, BigML.io will support many more languages.
For text fields, BigML.io adds potentially five keys to the detected fields, all of which are placed in a map under term_analysis.
The first is language, which is mapped to the detected language.
There are also three boolean keys, case_sensitive, use_stopwords, and stem_words. The case_sensitive key is false by default. use_stopwords should be true if we should include stopwords in the vocabulary for the detected field during text summarization. stem_words should be true if BigML.io should perform word stemming on this field, which maps forms of the same term to the same key when summarizing or generating models. By default, use_stopwords is false and stem_words is true for languages other than "none" and they are not present otherwise.
Finally, token_mode determines the tokenization strategy. It may be set to tokens_only, full_terms_only, or all. When set to tokens_only, individual words are used as terms. For example, "ML for all" becomes ["ML", "for", "all"]. However, when full_terms_only is selected, the entire field is treated as a single term as long as it is shorter than 256 characters. In this case "ML for all" stays ["ML for all"]. If all is selected, then both full terms and tokenized terms are used. In this case ["ML for all"] becomes ["ML", "for", "all", "ML for all"]. The default for token_mode is all.
There are a few details to note:
- If full_terms_only is selected, then no stemming will occur even if stem_words is true.
- Also, when either all or tokens_only are selected, a term must appear at least twice to be selected for the tag cloud. However full_terms_only lowers this limit to a single occurrence.
- Finally, if the language is "none", or if a language does not have an algorithm available for stopword removal or stemming, the use_stopwords and stem_words keys will have no effect.
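These per-field settings can also be adjusted after the source has been created by updating the term_analysis map of the corresponding text field, following the same fields update pattern used for datetime fields below. A minimal sketch, assuming the text field was assigned the id "000005" (the field id is hypothetical):
curl "https://bigml.io/andromeda/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
 -X PUT \
 -H 'content-type: application/json' \
 -d '{"fields": {"000005": {"term_analysis": {"case_sensitive": true,
 "token_mode": "full_terms_only"}}}}'
> Updating the term_analysis settings of a text field (sketch)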
Items Detection
BigML automatically detects as items those fields that have many different categorical values per instance separated by non-alphanumeric characters, so that they can be considered neither categorical nor text fields.
These kinds of fields can be found in transactional datasets where each instance is associated with a different set of products contained within one field. For example, datasets containing all the products bought by users, or prescription datasets where each patient is associated with different treatments. These datasets are commonly used for Association Discovery to find relationships between different items.
The two CSV examples below contain fields that would be considered items fields:
User, Prescription
John Doe, medicine 1; medicine 2
Jane Roe, medicine 1; medicine 3; medicine 4; medicine 6
Transaction, Product
12345, product 1; product 2; product 5; product 6; product 7
67890, product 1; product 3; product 4
In the examples above, the Prescription and Product fields will be considered items fields, and each different value will be a unique item.
Once a field has been detected as items, BigML tries to automatically detect which is the best separator for your items. For example, for the following itemset {hot dog; milk, skimmed; chocolate}, the best separator is the semicolon which yields three different items: 'hot dog', 'milk, skimmed' and 'chocolate'.
For items fields, there are five different parameters you can configure under the item_analysis property group, including separator, which allows you to specify the separator you want to set for your items.
Note that items fields are not eligible as target fields for models, logistic regressions, and ensembles, but they can be used as predictors. For anomaly detection, they can't be included as an input field to calculate the anomaly score, although they can be selected as summary fields.
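If the separator that BigML picks automatically is not the one you want, a sketch of overriding it on a specific items field could look like the update below, which follows the same fields update pattern shown in the next section (the field id and separator are illustrative):
curl "https://bigml.io/andromeda/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
 -X PUT \
 -H 'content-type: application/json' \
 -d '{"fields": {"000001": {"item_analysis": {"separator": ";"}}}}'
> Setting the items separator of a field (sketch)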
Datetime Detection
During the source pre-scan BigML tries to determine the data type of each field in your file. This process automatically detects datetime fields and, unless disable_datetime is explicitly set to true, BigML will generate additional fields with their components.
For instance, if a field named "date" has been identified as a datetime with format "YYYY-MM-dd", four new fields will be automatically added to the source, namely "date.year", "date.month", "date.day-of-month" and "date.day-of-week". For each row, these new fields will be filled in automatically by parsing the value of their parent field, "date". For example, if the latter contains the value "1969-07-14", the autogenerated columns in that row will have the values 1969, 7, 14 and 1 (because that day was Monday). As noted before, autogeneration can be disabled by setting the disable_datetime option to true, either in the create source request or later in an update source operation.
When a field is detected as datetime, BigML tries to determine its format for parsing the values and generate the fields with their components. By default, BigML accepts ISO 8601 time formats (YYYY-MM-DD) as well as a number of other common European and US formats, as seen in the table below:
time_format Name | Example |
---|---|
basic-date-time | 19690714T173639.592Z |
basic-date-time-no-ms | 19690714T173639Z |
basic-iso-date | 19690714Z |
basic-ordinal-date-time | 1969195T173639.592Z |
basic-ordinal-date-time-no-ms | 1969195T173639Z |
basic-t-time | T173639.592Z |
basic-t-time-no-ms | T173639Z |
basic-time | 173639.592Z |
basic-time-no-ms | 173639Z |
basic-week-date | 1969W291 |
basic-week-date-time | 1969W291T173639.592Z |
basic-week-date-time-no-ms | 1969W291T173639Z |
clock-minute | 5:36 PM |
clock-minute-nospace | 5:36PM |
clock-second | 5:36:39 PM |
clock-second-nospace | 5:36:39PM |
date | 1969-07-14 |
date-hour | 1969-07-14T17 |
date-hour-minute | 1969-07-14T17:36 |
date-hour-minute-second | 1969-07-14T17:36:39 |
date-hour-minute-second-fraction | 1969-07-14T17:36:39.592 |
date-hour-minute-second-fraction-with-solidus | 1969/07/14T17:36:39.592 |
date-hour-minute-second-ms | 1969-07-14T17:36:39.592 |
date-hour-minute-second-ms-with-solidus | 1969/07/14T17:36:39.592 |
date-hour-minute-second-with-solidus | 1969/07/14T17:36:39 |
date-hour-minute-with-solidus | 1969/07/14T17:36 |
date-hour-with-solidus | 1969/07/14T17 |
date-time | 1969-07-14T17:36:39.592Z |
date-time-no-ms | 1969-07-14T17:36:39Z |
date-time-no-ms-with-solidus | 1969/07/14T17:36:39Z |
date-time-with-solidus | 1969/07/14T17:36:39.592Z |
date-with-solidus | 1969/07/14 |
eu-date | 14/7/1969 |
eu-date-clock-minute | 14/7/1969 5:36 PM |
eu-date-clock-minute-nospace | 14/7/1969 5:36PM |
eu-date-clock-second | 14/7/1969 5:36:39 PM |
eu-date-clock-second-nospace | 14/7/1969 5:36:39PM |
eu-date-millisecond | 14/7/1969 17:36:39.592 |
eu-date-minute | 14/7/1969 17:36 |
eu-date-second | 14/7/1969 17:36:39 |
eu-ddate | 14.7.1969 |
eu-ddate-clock-minute | 14.7.1969 5:36 PM |
eu-ddate-clock-minute-nospace | 14.7.1969 5:36PM |
eu-ddate-clock-second | 14.7.1969 5:36:39 PM |
eu-ddate-clock-second-nospace | 14.7.1969 5:36:39PM |
eu-ddate-millisecond | 14.7.1969 17:36:39.592 |
eu-ddate-minute | 14.7.1969 17:36 |
eu-ddate-second | 14.7.1969 17:36:39 |
eu-sdate | 14-7-1969 |
eu-sdate-clock-minute | 14-7-1969 5:36 PM |
eu-sdate-clock-minute-nospace | 14-7-1969 5:36PM |
eu-sdate-clock-second | 14-7-1969 5:36:39 PM |
eu-sdate-clock-second-nospace | 14-7-1969 5:36:39PM |
eu-sdate-millisecond | 14-7-1969 17:36:39.592 |
eu-sdate-minute | 14-7-1969 17:36 |
eu-sdate-second | 14-7-1969 17:36:39 |
hour-minute | 17:36 |
hour-minute-second | 17:36:39 |
hour-minute-second-fraction | 17:36:39.592 |
hour-minute-second-ms | 17:36:39.592 |
iso-date | 1969-07-14Z |
iso-date-time | 1969-07-14T17:36:39.592Z |
iso-instant | 1969-07-14T17:36:39.592Z |
iso-local-date | 1969-07-14 |
iso-local-date-time | 1969-07-14T17:36:39.592 |
iso-local-time | 17:36:39.592 |
iso-offset-date | 1969-07-14Z |
iso-offset-date-time | 1969-07-14T17:36:39.592Z |
iso-offset-time | 17:36:39.592Z |
iso-ordinal-date | 1969-195Z |
iso-time | 17:36:39.592Z |
iso-week-date | 1969-W29-1Z |
iso-zoned-date-time | 1969-07-14T17:36:39.592Z |
mysql | 1969-07-14 17:36:39 |
no-t-date-hour-minute | 1969-7-14 17:36 |
odata-format | /Date(-14711000408)/ |
ordinal-date-time | 1969-195T17:36:39.592Z |
ordinal-date-time-no-ms | 1969-195T17:36:39Z |
rfc-1123-date-time | Mon, 14 Jul 1969 17:36:39 GMT |
rfc822 | Mon, 14 Jul 1969 17:36:39 +0000 |
t-time | T17:36:39.592Z |
t-time-no-ms | T17:36:39Z |
time | 17:36:39.592Z |
time-no-ms | 17:36:39Z |
timestamp | -14711000 |
timestamp-msecs | -14711000408 |
twitter-time | Mon Jul 14 17:36:39 +0000 1969 |
twitter-time-alt | 1969-7-14 17:36:39 +0000 |
twitter-time-alt-2 | 1969-7-14 17:36 +0000 |
twitter-time-alt-3 | Mon Jul 14 17:36 +0000 1969 |
us-date | 7/14/1969 |
us-date-clock-minute | 7/14/1969 5:36 PM |
us-date-clock-minute-nospace | 7/14/1969 5:36PM |
us-date-clock-second | 7/14/1969 5:36:39 PM |
us-date-clock-second-nospace | 7/14/1969 5:36:39PM |
us-date-millisecond | 7/14/1969 17:36:39.592 |
us-date-minute | 7/14/1969 17:36 |
us-date-second | 7/14/1969 17:36:39 |
us-sdate | 7-14-1969 |
us-sdate-clock-minute | 7-14-1969 5:36 PM |
us-sdate-clock-minute-nospace | 7-14-1969 5:36PM |
us-sdate-clock-second | 7-14-1969 5:36:39 PM |
us-sdate-clock-second-nospace | 7-14-1969 5:36:39PM |
us-sdate-millisecond | 7-14-1969 17:36:39.592 |
us-sdate-minute | 7-14-1969 17:36 |
us-sdate-second | 7-14-1969 17:36:39 |
week-date | 1969-W29-1 |
week-date-time | 1969-W29-1T17:36:39.592Z |
week-date-time-no-ms | 1969-W29-1T17:36:39Z |
weekyear-week | 1969-W29 |
weekyear-week-day | 1969-W29-1 |
year-month | 1969-07 |
year-month-day | 1969-07-14 |
It might happen that BigML is not able to determine the right format of your datetime field. In that case, it will be considered either a text or a categorical field. You can override that assignment by setting the optype of the field to datetime and passing the appropriate format in time_formats. For instance:
curl "https://bigml.io/andromeda/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
-X PUT \
-d '{"fields": {"000004": {"optype": "datetime", "time_formats": ["date"]}}}' \
-H 'content-type: application/json'
> Updating a source field with optype "datetime"
curl "https://bigml.io/andromeda/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
-X PUT \
-d '{"fields": {"000004": {"optype": "datetime", "time_formats": ["YYYY-MM-dd"]}}}' \
-H 'content-type: application/json'
> Updating a source field with custom "time_formats"
Retrieving a Source
Each source has a unique identifier in the form "source/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the source.
To retrieve a source with curl:
curl "https://bigml.io/andromeda/source/4f603fe203ce89bb2d000000?$BIGML_AUTH"
$ Retrieving a source from the command line
You can also use your browser to visualize the source using the full BigML.io URL or pasting the source/id into the BigML.com dashboard.
Source Properties
Once a source has been successfully created it will have the following properties.
Property | Type | Description |
---|---|---|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
code | Integer | HTTP status code. This will be 201 upon successful creation of the source and 200 afterwards. Also check the code that comes with the status attribute to verify that the source creation completed without errors. |
content_type
filterable, sortable |
String | This is the MIME content-type as provided by your HTTP client. The content-type can help BigML.io to better parse your file. For example, if you use curl, you can alter it using the type option "-F file=@iris.csv;type=text/csv". |
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the source was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
credits
filterable, sortable |
Float | The number of credits it cost you to create this source. |
description
updatable |
String | A text describing the source. It can contain restricted markdown to decorate the text. |
disable_datetime
updatable |
Boolean | Whether the automatic generation of new fields from existing date-time fields has been disabled. |
fields
updatable |
Object |
A dictionary with an entry per field (column) in your data. Each entry includes the column number, the name of the field, the type of the field, a specific locale if it differs from the source's, and specific missing tokens if they differ from the source's. This property is very handy for updating sources according to your own parsing preferences.
Example:
|
fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset and limit, and the number of fields (count) returned. |
file_name
filterable, sortable |
String | The name of the file as you submitted it. |
md5 | String | The file MD5 Message-Digest Algorithm as specified by RFC 1321. |
name
filterable, sortable, updatable |
String | The name of the source as you provided it, or the name of the file by default. |
number_of_anomalies
filterable, sortable |
Integer | The current number of anomalies that use this source. |
number_of_anomalyscores
filterable, sortable |
Integer | The current number of anomaly scores that use this source. |
number_of_associations
filterable, sortable |
Integer | The current number of associations that use this source. |
number_of_associationsets
filterable, sortable |
Integer | The current number of association sets that use this source. |
number_of_centroids
filterable, sortable |
Integer | The current number of centroids that use this source. |
number_of_clusters
filterable, sortable |
Integer | The current number of clusters that use this source. |
number_of_correlations
filterable, sortable |
Integer | The current number of correlations that use this source. |
number_of_datasets
filterable, sortable |
Integer | The current number of datasets that use this source. |
number_of_ensembles
filterable, sortable |
Integer | The current number of ensembles that use this source. |
number_of_forecasts
filterable, sortable |
Integer | The current number of forecasts that use this source. |
number_of_linearregressions
filterable, sortable |
Integer | The current number of linear regressions that use this source. |
number_of_logisticregressions
filterable, sortable |
Integer | The current number of logistic regressions that use this source. |
number_of_models
filterable, sortable |
Integer | The current number of models that use this source. |
number_of_optimls
filterable, sortable |
Integer | The current number of OptiMLs that use this source. |
number_of_predictions
filterable, sortable |
Integer | The current number of predictions that use this source. |
number_of_statisticaltests
filterable, sortable |
Integer | The current number of statistical tests that use this source. |
number_of_timeseries
filterable, sortable |
Integer | The current number of time series that use this source. |
number_of_topicdistributions
filterable, sortable |
Integer | The current number of topic distributions that use this source. |
number_of_topicmodels
filterable, sortable |
Integer | The current number of topic models that use this source. |
private
filterable, sortable |
Boolean | Whether the source is public or not. |
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
remote | String | URL of the remote data source. |
resource | String | The source/id. |
shared
filterable, sortable |
Boolean | Whether the source is shared using a private link or not. |
shared_hash | String | The hash that gives access to this source if it has been shared using a private link. |
sharing_key | String | The alternative key that gives read access to this source. |
size
filterable, sortable |
Integer | The number of bytes of the source. |
source_parser
updatable |
Object | Set of parameters to parse the source. |
status | Object | A description of the status of the source. It includes a code, a message, and some extra information. See the table below. |
subscription
filterable, sortable |
Boolean | Whether the source was created using a subscription plan or not. |
synthetic | Object | Set of parameters to generate a synthetic source presumably for activities such as testing, prototyping and benchmarking. |
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
term_analysis
updatable |
Object | Set of parameters that define how text analysis should work for text fields. |
type
filterable, sortable |
Integer |
The type of source.
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the source was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
Source Fields
The property fields is a dictionary keyed by an auto-generated id for each field in the source. Each entry's value is an object with the following properties:
For fields classified with optype "text", the default values specified in the term_analysis at the top-level of the source are used.
Flags not provided in term_analysis take their default values, i.e., false for booleans and none for language.
Besides these global default values, which apply to all text fields (and potential text fields, such as categorical ones that might overflow to text during dataset creation), it's possible to specify term_analysis flags on a per-field basis.
For fields classified with optype "items", the default values specified in the item_analysis at the top-level of the source are used.
As with term_analysis, flags not provided in item_analysis take their default values, and it's possible to specify item_analysis flags on a per-field basis in addition to the global level.
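For instance, a minimal sketch of setting a per-field term_analysis flag, assuming field 000005 is a text field in the source used in previous examples (the field id and flag value are just illustrative):
curl "https://bigml.io/andromeda/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
-X PUT \
-d '{"fields": {"000005": {"term_analysis": {"case_sensitive": true}}}}' \
-H 'content-type: application/json'
> Setting per-field term_analysis options (sketch)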
Source Status
Before a source is successfully created, BigML.io makes sure that it has been uploaded in an understandable format, that the data that it contains is parseable, and that the types for each column in the data can be inferred successfully. The source goes through a number of states until all these analyses are completed. Through the status field in the source you can determine when the source has been fully processed and is ready to be used to create a dataset. These are the fields that a source's status has:
Property | Type | Description |
---|---|---|
code | Integer | A status code that reflects the status of the source creation. It can be any of those that are explained here. |
elapsed | Integer | Number of milliseconds that BigML.io took to process the source. |
message | String | A human readable message explaining the status. |
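Since the source goes through several states before it is ready, you may want to poll the status until it reaches the finished code (5 in the example response below). A minimal sketch, assuming the jq command-line JSON processor is available:
# NOTE: a production script should also stop on error status codes
until [ "$(curl -s "https://bigml.io/andromeda/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" | jq '.status.code')" -eq 5 ]; do
sleep 2
done
$ Polling a source until it is ready (sketch)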
Once a source has been successfully created, it will look like:
{
"category": 0,
"code": 200,
"content_type": "application/octet-stream",
"created": "2012-11-15T02:24:59.686000",
"credits": 0.0,
"description": "",
"fields": {
"000000": {
"column_number": 0,
"name": "sepal length",
"optype": "numeric",
"order": 0
},
"000001": {
"column_number": 1,
"name": "sepal width",
"optype": "numeric",
"order": 1
},
"000002": {
"column_number": 2,
"name": "petal length",
"optype": "numeric",
"order": 2
},
"000003": {
"column_number": 3,
"name": "petal width",
"optype": "numeric",
"order": 3
},
"000004": {
"column_number": 4,
"name": "species",
"optype": "categorical",
"order": 4
}
},
"fields_meta": {
"count": 5,
"limit": 200,
"offset": 0,
"total": 5
},
"file_name": "iris.csv",
"md5": "d1175c032e1042bec7f974c91e4a65ae",
"name": "iris.csv",
"number_of_datasets": 0,
"number_of_models": 0,
"number_of_predictions": 0,
"private": true,
"project": null,
"resource": "source/4f603fe203ce89bb2d000000",
"size": 4608,
"source_parser": {
"header": true,
"locale": "en_US",
"missing_tokens": [
"",
"N/A",
"n/a",
"NULL",
"null",
"-",
"#DIV/0",
"#REF!",
"#NAME?",
"NIL",
"nil",
"NA",
"na",
"#VALUE!",
"#NULL!",
"NaN",
"#N/A",
"#NUM!",
"?"
],
"quote": "\"",
"separator": ","
},
"status": {
"code": 5,
"elapsed": 244,
"message": "The source has been created"
},
"tags": [],
"type": 0,
"updated": "2012-11-15T02:25:00.001000"
}
< Example source JSON response
Filtering and Paginating Fields from a Source
A source might be composed of hundreds or even thousands of fields. Thus when retrieving a source, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the source is the same as the one you would get without any filtering parameter above.
The fields_meta field can help you paginate fields; its structure is described in the Source Properties table above.
Updating a Source
To update a source, you need to PUT an object containing the fields that you want to update to the source's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated source.
For example, to update a source with a new name and a new locale you can use curl like this:
curl "https://bigml.io/andromeda/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name", "source_parser": {"locale": "es-ES"}}' \
-H 'content-type: application/json'
$ Updating a source's name and locale
Deleting a Source
To delete a source, you need to issue a HTTP DELETE request to the source/id to be deleted.
Using curl you can do something like this to delete a source:
curl -X DELETE "https://bigml.io/andromeda/source/4f603fe203ce89bb2d000000?$BIGML_AUTH"
$ Deleting a source from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a source, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a source a second time, or a source that does not exist, you will receive a "404 not found" response.
However, if you try to delete a source that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Sources
To list all the sources, you can use the source base URL. By default, only the 20 most recent sources will be returned. You can see below how to change this number using the limit parameter.
You can get your list of sources directly in your browser using your own username and API key with the following links.
https://bigml.io/andromeda/source?$BIGML_AUTH
> Listing sources from a browser
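The limit parameter mentioned above can be appended to the query string to change how many sources are returned; a sketch, following the same query-string separator convention used in other examples of this document (the value 5 is arbitrary):
curl "https://bigml.io/andromeda/source?$BIGML_AUTH;limit=5"
$ Listing only the 5 most recent sources (sketch)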
Datasets
Last Updated: Wednesday, 2020-12-09 09:40
A dataset is a structured version of a source where each field has been processed and serialized according to its type. The possible field types are numeric, categorical, text, date-time, or items. For each field, you can also get the number of errors that were encountered processing it. Errors are mostly missing values or values that do not match with the type assigned to the column.
When you create a new dataset, histograms of the field values are created for the categorical and numeric fields. In addition, for the numeric fields, a collection of statistics about the field distribution such as minimum, maximum, sum, and sum of squares are also computed.
For date-time fields, BigML attempts to parse the format and automatically generate the related subfields (year, month, day, and so on) present in the format.
For items fields which have many different categorical values per instance separated by non-alphanumeric characters, BigML tries to automatically detect which is the best separator for your items.
Finally, for text fields, BigML handles plain text fields with some light-weight natural language processing; BigML separates the field into words using punctuation and whitespace, attempts to detect the language, groups word forms together using word stemming, and eliminates words that are too common or too rare to be useful. We are then left with somewhere between a few dozen and a few hundred interesting words per text field, the occurrences of which can be features in a model.

BigML.io allows you to create, retrieve, update, and delete your datasets. You can also list all of your datasets.
Jump to:
- Dataset Base URL
- Creating a Dataset
- Dataset Arguments
- Filtering Rows
- Retrieving a Dataset
- Dataset Properties
- Filtering and Paginating Fields from a Dataset
- Updating a Dataset
- Deleting a Dataset
- Listing Datasets
- Multi-Datasets
- Resources Accepting Multi-Datasets Input
- Creating a Dataset using SQL
- Transformations
- Cloning a Dataset
- Sampling a Dataset
- Filtering a Dataset
- Extending a Dataset
- Filtering the New Fields Output
- Discretization of a Continuous Field
- Outlier Elimination
- Lisp and JSON Syntaxes
- Final Remarks
Dataset Base URL
You can use the following base URL to create, retrieve, update, and delete datasets. https://bigml.io/andromeda/dataset
Dataset base URL
All requests to manage your datasets must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Dataset
To create a new dataset, you need to POST to the dataset base URL an object containing at least the source/id that you want to use to create the dataset. The content-type must always be "application/json".
You can easily create a new dataset using curl as follows. All you need is a valid source/id and your authentication variable set up as shown above.
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"source": "source/50a4527b3c1920186d000041"}'
> Creating a dataset
BigML.io will return the newly created dataset if the request succeeded.
{
"category": 0,
"code": 201,
"columns": 0,
"created": "2012-11-15T02:29:09.293711",
"credits": 0.00439453125,
"description": "",
"excluded_fields": [],
"fields": {},
"fields_meta": {
"count": 0,
"limit": 200,
"offset": 0,
"total": 0
},
"input_fields": [],
"locale": "en-US",
"name": "iris' dataset",
"number_of_evaluations": 0,
"number_of_models": 0,
"number_of_predictions": 0,
"price": 0.0,
"private": true,
"project": null,
"resource": "dataset/52b9359a3c19205ff100002a",
"rows": 0,
"size": 4608,
"source": "source/50a4527b3c1920186d000041",
"source_status": true,
"status": {
"code": 1,
"message": "The dataset is being processed and will be created soon"
},
"tags": [],
"updated": "2012-11-15T02:29:09.293733",
"views": 0
}
< Example dataset JSON response
Dataset Arguments
By default, the dataset will include all fields in the corresponding source; but this behaviour can be fine-tuned via the input_fields and excluded_fields lists of identifiers. The former specifies the list of fields to be included in the dataset, and defaults to all fields in the source when empty. To specify excluded fields, you can use excluded_fields: identifiers in that list are removed from the list constructed using input_fields.
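For instance, a hypothetical request that excludes the field with id 000004 from the new dataset could look like this (the source id is the one used in the examples above):
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"source": "source/50a4527b3c1920186d000041", "excluded_fields": ["000004"]}'
> Creating a dataset excluding a field (sketch)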
See below the full list of arguments that you can POST to create a dataset.
Argument | Type | Description |
---|---|---|
category
optional |
Integer, default is the category of the source |
The category that best describes the dataset. See the category codes for the complete list of categories.
Example: 1 |
description
optional |
String |
A description of the dataset up to 8192 characters long.
Example: "This is a description of my new dataset" |
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the source is excluded. |
Specifies the fields that won't be included in the dataset.
Example:
|
fields
optional |
Object, default is {}, an empty dictionary. That is, no names, labels or descriptions are changed. |
Updates the names, labels, and descriptions of the fields in the dataset with respect to the original names in the source. An entry keyed with the field id generated in the source for each field that you want the name updated.
Example:
|
input_fields
optional |
Array, default is []. All the fields in the source. |
Specifies the fields to be included in the dataset.
Example:
|
json_filter
optional |
Array |
A JSON list representing a filter over the rows in the datasource. The first element is an operator and the rest of the elements its arguments. See the section below for more details.
Example: [">", 3.14, ["field", "000002"]] |
lisp_filter
optional |
String |
A string representing a Lisp s-expression to filter rows from the datasource.
Example: "(> 3.14 (field 2))" |
name
optional |
String, default is source's name |
The name you want to give to the new dataset.
Example: "my new dataset" |
objective_field
optional |
Object, default is the last non-auto-generated field in the dataset. |
Specifies the default objective field.
Example:
|
origin | String |
The dataset/id of the gallery dataset to be cloned. The price of the dataset must be 0 to be cloned via API.
Example: "dataset/5b9ab8474e172785e3000003" |
origin_dataset
optional |
String |
The dataset/id of dataset to be transformed. See the Section on Transformations for more details.
Example: "dataset/5b9ab8474e172785e3000003" |
origin_datasets
optional |
Array |
A list of dataset ids or objects to be merged. See the Section on Multi-Datasets for more details.
Example:
|
project
optional |
String |
The project/id you want the dataset to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
refresh_field_types
optional |
Boolean, default is false |
Specifies whether field types need to be recomputed or not.
Example: true |
refresh_objective
optional |
Boolean, default is false |
Specifies whether the default objective field of the dataset needs to be recomputed or not.
Example: true |
refresh_preferred
optional |
Boolean, default is false |
Specifies whether preferred field flags need to be recomputed or not.
Example: true |
shared_hash | String |
The shared hash of the shared dataset to be cloned.
Example: "kpY46mNuNVReITw0Z1mAqoQ9ySW" |
size
optional |
Integer, default is the source's size |
The number of bytes from the source that you want to use.
Example: 1073741824 |
source | String |
A valid source/id.
Example: source/4f665b8103ce8920bb000006 |
tags
optional |
Array of Strings |
A list of strings that help classify and index your dataset.
Example: ["best customers", "2018"] |
term_limit
optional |
Integer |
The maximum total number of terms to be used in text analysis.
Example: 500 |
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can also use curl to customize a new dataset with a name, a different size, and only a few fields from the original source. For example, to create a new dataset named "my dataset", with only 500 bytes, and with only two fields:
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"source": "source/4f665b8103ce8920bb000006", "name": "my dataset", "size": 500, "fields": {"000001": {"name": "width_1"}, "000003": {"name": "width_2"}}}'
> Creating a customized dataset
If you do not specify a name, BigML.io will assign the source's name to the new dataset. If you do not specify a size, BigML.io will use the source's full size. If you do not specify any fields, BigML.io will include all the fields in the source with their corresponding names.
Filtering Rows
The dataset creation request can include an argument, json_filter, specifying a predicate that the input rows from the source have to satisfy in order to be included in the dataset. This predicate is specified as a (possibly nested) JSON list whose first element is an operator and the rest of the elements its arguments. Here's an example of a filter specification to choose only those rows whose field "000002" is less than 3.14:
[">", 3.14, ["field", "000002"]]
Filter Example
As you see, the list starts with the operator we want to use, ">", followed by its operands: the number 3.14, and the value of the field with identifier "000002", which is denoted by the operator "field". As another example, this filter:
["=", ["field", "000002"], ["field", "000003"], ["field", "000004"]]
Filter Example
selects rows for which the three fields with identifiers "000002", "000003" and "000004" have identical values. Note how you're not limited to two arguments. It's also worth noting that for a filter like that one to be accepted, all three fields must have the same optype (e.g. numeric), otherwise they cannot be compared.
The field operator also accepts as arguments the field's name (as a string) or the row column (as an integer). For instance, if field "000002" had column number 12, and field "000003" was named "Stock prize", our previous query could have been written:
["=", ["field", 12], ["field", "Stock prize"], ["field", "000004"]]
Filter Example
If the name is not unique, the first matching field found is picked, consistently over the whole filter formula. If you have duplicated field names, the best thing to do is to use either column numbers or field identifiers in your filters, to avoid ambiguities.
Besides a field's value, one can also ask whether it's missing or not. For instance, to include only those rows for which field "000002" contains a missing token, you would use:
["missing", "000002"]
Filter Example
Conversely, to keep only rows where neither of two fields is missing, you can negate and combine missing checks:
["and", ["not", ["missing", 12]]
      , ["not", ["missing", "Stock prize"]]]
Filter Example
Predicates can be nested and combined with the logical operators and, or, and not, as in this more complex example:
["or", ["=", 3, ["field", "000001"]]
     , [">", "1969-07-14T06:10", ["field", "000111"]]
     , ["and", ["missing", 23]
             , ["=", "Cat", ["field", "000002"]]
             , ["<", 2, ["field", "000003"], 4]]]
Filter Example
In the examples above, you can also see how dates are allowed and can be compared as numerical values (provided the implied fields are of the correct optype).
Finally, it's also possible to use the arithmetic operators +, -, * and / with numeric fields and constants, as in the following example:
[">", ["/", ["+", ["-", ["field", "000000"]
, 4.4]
, ["field", "000003"]
, ["*", 2
, ["field", "Class"]
, ["field", "000004"]]]
, 3]
, 5.5]
Filter Example
These are all the accepted operators:
=, !=, >, >=, <, <=, and, or, not, field, missing, +, -, *, /. To be accepted by the API, the filter must evaluate to a boolean value and contain at least one operator. So, for instance, a constant or a formula evaluating to a number will be rejected.
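For illustration, an expression like ["+", 1, ["field", "000001"]] evaluates to a number and would be rejected on its own, while wrapping it in a comparison yields an acceptable filter (field "000001" is used purely as an example):
[">", ["+", 1, ["field", "000001"]], 3]
Filter Example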
Since writing and reading the arithmetic formula above in pure JSON might be a bit involved, you can also send your query to the server as a string representing a Lisp s-expression, using the argument lisp_filter, e.g.
(> (/ (+ (- (field "000000") 4.4)
(field 23)
(* 2 (field "Class") (field "000004")))
3)
5.5)
Filter Example
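For instance, a minimal sketch of a dataset creation request carrying a Lisp filter, reusing the source id from the examples above and an arbitrary threshold:
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"source": "source/50a4527b3c1920186d000041", "lisp_filter": "(> (field \"000002\") 3.14)"}'
> Creating a dataset with a lisp_filter (sketch)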
Retrieving a Dataset
Each dataset has a unique identifier in the form "dataset/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the dataset. Notice that to download the dataset file in the CSV format, you will need to append "/download" to the resource id, and in the Tableau tde format, append "/download?format=tde".
To retrieve a dataset with curl:
curl "https://bigml.io/andromeda/dataset/52b9359a3c19205ff100002a?$BIGML_AUTH"
$ Retrieving a dataset from the command line
To download the dataset file in the CSV format with curl:
curl "https://bigml.io/andromeda/dataset/52b9359a3c19205ff100002a/download?$BIGML_AUTH"
$ Downloading a dataset csv file from the command line
To download the dataset file in the Tableau tde format with curl:
curl "https://bigml.io/andromeda/dataset/52b9359a3c19205ff100002a/download?format=tde;$BIGML_AUTH"
$ Downloading a dataset tde file from the command line
You can also use your browser to visualize the dataset using the full BigML.io URL or pasting the dataset/id into the BigML.com dashboard.
Dataset Properties
Once a dataset has been successfully created it will have the following properties.
Property | Type | Description |
---|---|---|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
code | Integer | HTTP status code. This will be 201 upon successful creation of the dataset and 200 afterwards. Also check the code that comes with the status attribute to verify that the dataset creation has been completed without errors. |
columns
filterable, sortable |
Integer | The number of fields in the dataset. |
correlations | Object |
A dictionary where each entry represents a field (column) in your data with the last calculated correlation/id for it.
Example:
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the dataset was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
credits
filterable, sortable |
Float | The number of credits it cost you to create this dataset. |
description
updatable |
String | A text describing the dataset. It can contain restricted markdown to decorate the text. |
excluded_fields | Array | The list of ids of the fields that were excluded when building the dataset. |
field_types | Object | A dictionary that informs about the number of fields of each type. It has an entry for each field type (categorical, datetime, numeric, and text), an entry for preferred fields, and an entry for the total number of fields. In new datasets, it uses the key effective_fields to report the effective number of fields, that is, the total number of fields including those created under the hood to support text fields. |
fields
updatable |
Object | A dictionary with an entry per field (column) in your data. Each entry includes the column number, the name of the field, the type of the field, and the summary. |
fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset and limit, and the number of fields (count) returned. |
input_fields | Array | The list of input fields' ids used to create the dataset. |
json_query | Object | A dictionary specifying each of the parts of the executed SQL query that was used to create this dataset. |
json_query_parsed | Object | The canonical representation of the SQL query as a JSON map. |
juxtapose
filterable, sortable |
Boolean | Whether juxtaposition has been performed during creation |
juxtapose_input_fields | Object | A dictionary keyed by dataset/id and an array of field names and/or ids that specifies the input fields to use for each dataset during merge. |
locale | String | The source's locale. |
name
filterable, sortable, updatable |
String | The name of the dataset as you provided it, or based on the name of the source by default. |
number_of_anomalies
filterable, sortable |
Integer | The current number of anomalies that use this dataset. |
number_of_anomalyscores
filterable, sortable |
Integer | The current number of anomaly scores that use this dataset. |
number_of_associations
filterable, sortable |
Integer | The current number of associations that use this dataset. |
number_of_associationsets
filterable, sortable |
Integer | The current number of association sets that use this dataset. |
number_of_batchanomalyscores
filterable, sortable |
Integer | The current number of batch anomaly scores that use this dataset. |
number_of_batchcentroids
filterable, sortable |
Integer | The current number of batch centroids that use this dataset. |
number_of_batchpredictions
filterable, sortable |
Integer | The current number of batch predictions that use this dataset. |
number_of_batchtopicdistributions
filterable, sortable |
Integer | The current number of batch topic distributions that use this dataset. |
number_of_centroids
filterable, sortable |
Integer | The current number of centroids that use this dataset. |
number_of_clusters
filterable, sortable |
Integer | The current number of clusters that use this dataset. |
number_of_correlations
filterable, sortable |
Integer | The current number of correlations that use this dataset. |
number_of_ensembles
filterable, sortable |
Integer | The current number of ensembles that use this dataset. |
number_of_evaluations
filterable, sortable |
Integer | The current number of evaluations that use this dataset. |
number_of_forecasts
filterable, sortable |
Integer | The current number of forecasts that use this dataset. |
number_of_linearregressions
filterable, sortable |
Integer | The current number of linear regressions that use this dataset. |
number_of_logisticregressions
filterable, sortable |
Integer | The current number of logistic regressions that use this dataset. |
number_of_models
filterable, sortable |
Integer | The current number of models that use this dataset. |
number_of_optimls
filterable, sortable |
Integer | The current number of OptiMLs that use this dataset. |
number_of_predictions
filterable, sortable |
Integer | The current number of predictions that use this dataset. |
number_of_statisticaltests
filterable, sortable |
Integer | The current number of statistical tests that use this dataset. |
number_of_timeseries
filterable, sortable |
Integer | The current number of time series that use this dataset. |
number_of_topicdistributions
filterable, sortable |
Integer | The current number of topic distributions that use this dataset. |
number_of_topicmodels
filterable, sortable |
Integer | The current number of topic models that use this dataset. |
objective_field
updatable |
Object | The default objective field. |
optiml
filterable, sortable |
String | The optiml/id that created this dataset. |
optiml_status
filterable, sortable |
Boolean | Whether the OptiML is still available or has been deleted. |
origin
filterable, sortable |
String | The dataset/id of the original gallery dataset. |
origin_dataset
filterable, sortable |
String | The dataset/id of the original dataset. See the Section on Transformations for more details. |
origin_datasets | Array | A list of original dataset ids or objects. See the Section on Multi-Datasets for more details. |
out_of_bag
filterable, sortable |
Boolean | Whether the out-of-bag instances were used to clone the dataset instead of the sampled instances. |
price
filterable, sortable, updatable |
Float | The price other users must pay to clone your dataset. |
private
filterable, sortable, updatable |
Boolean | Whether the dataset is public or not. |
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
range | Array | The range of instances used to clone the dataset. |
refresh_field_types
filterable, sortable |
Boolean | Whether the field types of the dataset have been recomputed or not. |
refresh_objective
filterable, sortable |
Boolean | Whether the default objective field of the dataset has been recomputed or not. |
refresh_preferred
filterable, sortable |
Boolean | Whether the preferred flags of the dataset fields have been recomputed or not. |
replacement
filterable, sortable |
Boolean | Whether the instances sampled to clone the dataset were selected using replacement or not. |
resource | String | The dataset/id. |
rows
filterable, sortable |
Integer | The total number of rows in the dataset. |
sample_rate
filterable, sortable |
Float | The sample rate used to select instances from the dataset. |
seed
filterable, sortable |
String | The string that was used to generate the sample. |
shared
filterable, sortable, updatable |
Boolean | Whether the dataset is shared using a private link or not. |
shared_clonable
filterable, sortable, updatable |
Boolean | Whether the shared dataset can be cloned or not. |
shared_hash | String | The hash that gives access to this dataset if it has been shared using a private link. |
sharing_key | String | The alternative key that gives read access to this dataset. |
size
filterable, sortable |
Integer | The number of bytes of the source that were used to create this dataset. |
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
sql_output_fields | Array of Objects | A list of dictionaries containing some of the properties of the fields generated by the given sql_query or json_query. |
sql_query | String | The SQL query that was executed to create this dataset. |
sql_query_parsed | String | The canonical form of the query as a SQL prepared statement. |
statisticaltest
filterable, sortable |
String | The last statisticaltest/id that was generated for this dataset. |
status | Object | A description of the status of the dataset. It includes a code, a message, and some extra information. See the table below. |
subscription
filterable, sortable |
Boolean | Whether the dataset was created using a subscription plan or not. |
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
term_limit
filterable, sortable |
Integer | The maximum total number of terms used by all the text fields. |
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the dataset was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
Dataset Fields
The property fields is a dictionary keyed by each field's id in the source. Each entry's value is an object with the following properties:
Numeric Summary
Numeric summaries come with all the fields described below. If the number of unique values in the data is greater than 32, then 'bins' will be used for the summary. If not, 'counts' will be available.
Property | Type | Description |
---|---|---|
bins | Array | An array that represents an approximate histogram of the distribution. It consists of value pairs, where the first value is the mean of a histogram bin and the second value is the bin population. bins is only available when the number of distinct values is greater than 32. For more information, see our blog post or read this paper. |
counts | Array | An array of pairs where the first element of each pair is one of the unique values found in the field and the second element is the count. Only available when the number of distinct values is less than or equal to 32. |
kurtosis | Number | The sample kurtosis. A measure of 'peakiness' or heavy tails in the field's distribution. |
maximum | Number | The maximum value found in this field. |
mean | Number | The arithmetic mean of non-missing field values. |
median | Number | The approximate median of the non-missing values in this field. |
minimum | Number | The minimum value found in this field. |
missing_count | Integer | Number of instances missing this field. |
population | Integer | The number of instances containing data for this field. |
skewness | Number | The sample skewness. A measure of asymmetry in the field's distribution. |
standard_deviation | Number | The unbiased sample standard deviation. |
sum | String | Sum of all values for this field (for mean calculation). |
sum_squares | String | Sum of squared values (for variance calculation). |
variance | Number | The unbiased sample variance. |
Categorical Summary
Categorical summaries give you a count for each category and a missing count in case any of the instances contain missing values.
Text Summary
Text summaries give statistics about the vocabulary of a text field, and the number of instances containing missing values.
Dataset Status
Before a dataset is successfully created, BigML.io makes sure that it has been uploaded in an understandable format, that the data that it contains is parseable, and that the types for each column in the data can be inferred successfully. The dataset goes through a number of states until all these analyses are completed. Through the status field in the dataset you can determine when the dataset has been fully processed and is ready to be used to create a model. These are the fields that a dataset's status has:
Property | Type | Description |
---|---|---|
bytes | Integer | Number of bytes processed so far. |
code | Integer | A status code that reflects the status of the dataset creation. It can be any of those that are explained here. |
elapsed | Integer | Number of milliseconds that BigML.io took to process the dataset. |
field_errors | Object |
Information about ill-formatted fields that includes the total format errors for the field and a sample of the ill-formatted tokens.
Example:
|
message | String | A human readable message explaining the status. |
row_format_errors | Array | Information about ill-formatted rows. It includes the total row-format errors and a sampling of the ill-formatted rows. |
serialized_rows | Integer | The number of rows serialized so far. |
Once a dataset has been successfully created, it will look like:
{
"category": 0,
"code": 200,
"columns": 5,
"created": "2012-11-15T02:29:09.293000",
"credits": 0.00439453125,
"description": "",
"excluded_fields": [],
"fields": {
"000000": {
"column_number": 0,
"datatype": "double",
"name": "sepal length",
"optype": "numeric",
"order": 0,
"preferred": true,
"summary": {
"bins": [
[
4.3,
1
],
[
4.425,
4
],
[
4.6,
4
],
[
4.7,
2
],
[
4.8,
5
],
[
4.9,
6
],
[
5,
10
],
[
5.1,
9
],
[
5.2,
4
],
[
5.3,
1
],
[
5.4,
6
],
[
5.5,
7
],
[
5.6,
6
],
[
5.7,
8
],
[
5.8,
7
],
[
5.9,
3
],
[
6,
6
],
[
6.1,
6
],
[
6.2,
4
],
[
6.3,
9
],
[
6.44167,
12
],
[
6.6,
2
],
[
6.7,
8
],
[
6.8,
3
],
[
6.92,
5
],
[
7.1,
1
],
[
7.2,
3
],
[
7.3,
1
],
[
7.4,
1
],
[
7.6,
1
],
[
7.7,
4
],
[
7.9,
1
]
],
"maximum": 7.9,
"mean": 5.84333,
"median": 5.77889,
"minimum": 4.3,
"missing_count": 0,
"population": 150,
"splits": [
4.51526,
4.67252,
4.81113,
4.89582,
4.96139,
5.01131,
5.05992,
5.11148,
5.18177,
5.35681,
5.44129,
5.5108,
5.58255,
5.65532,
5.71658,
5.77889,
5.85381,
5.97078,
6.05104,
6.13074,
6.23023,
6.29578,
6.35078,
6.41459,
6.49383,
6.63013,
6.70719,
6.79218,
6.92597,
7.20423,
7.64746
],
"standard_deviation": 0.82807,
"sum": 876.5,
"sum_squares": 5223.85,
"variance": 0.68569
}
},
"000001": {
"column_number": 1,
"datatype": "double",
"name": "sepal width",
"optype": "numeric",
"order": 1,
"preferred": true,
"summary": {
"counts": [
[
2,
1
],
[
2.2,
3
],
[
2.3,
4
],
[
2.4,
3
],
[
2.5,
8
],
[
2.6,
5
],
[
2.7,
9
],
[
2.8,
14
],
[
2.9,
10
],
[
3,
26
],
[
3.1,
11
],
[
3.2,
13
],
[
3.3,
6
],
[
3.4,
12
],
[
3.5,
6
],
[
3.6,
4
],
[
3.7,
3
],
[
3.8,
6
],
[
3.9,
2
],
[
4,
1
],
[
4.1,
1
],
[
4.2,
1
],
[
4.4,
1
]
],
"maximum": 4.4,
"mean": 3.05733,
"median": 3.02044,
"minimum": 2,
"missing_count": 0,
"population": 150,
"standard_deviation": 0.43587,
"sum": 458.6,
"sum_squares": 1430.4,
"variance": 0.18998
}
},
"000002": {
"column_number": 2,
"datatype": "double",
"name": "petal length",
"optype": "numeric",
"order": 2,
"preferred": true,
"summary": {
"bins": [
[
1,
1
],
[
1.1,
1
],
[
1.2,
2
],
[
1.3,
7
],
[
1.4,
13
],
[
1.5,
13
],
[
1.63636,
11
],
[
1.9,
2
],
[
3,
1
],
[
3.3,
2
],
[
3.5,
2
],
[
3.6,
1
],
[
3.75,
2
],
[
3.9,
3
],
[
4.0375,
8
],
[
4.23333,
6
],
[
4.46667,
12
],
[
4.6,
3
],
[
4.74444,
9
],
[
4.94444,
9
],
[
5.1,
8
],
[
5.25,
4
],
[
5.46,
5
],
[
5.6,
6
],
[
5.75,
6
],
[
5.95,
4
],
[
6.1,
3
],
[
6.3,
1
],
[
6.4,
1
],
[
6.6,
1
],
[
6.7,
2
],
[
6.9,
1
]
],
"maximum": 6.9,
"mean": 3.758,
"median": 4.34142,
"minimum": 1,
"missing_count": 0,
"population": 150,
"splits": [
1.25138,
1.32426,
1.37171,
1.40962,
1.44567,
1.48173,
1.51859,
1.56301,
1.6255,
1.74645,
3.23033,
3.675,
3.94203,
4.0469,
4.18243,
4.34142,
4.45309,
4.51823,
4.61771,
4.72566,
4.83445,
4.93363,
5.03807,
5.1064,
5.20938,
5.43979,
5.5744,
5.6646,
5.81496,
6.02913,
6.38125
],
"standard_deviation": 1.7653,
"sum": 563.7,
"sum_squares": 2582.71,
"variance": 3.11628
}
},
"000003": {
"column_number": 3,
"datatype": "double",
"name": "petal width",
"optype": "numeric",
"order": 3,
"preferred": true,
"summary": {
"counts": [
[
0.1,
5
],
[
0.2,
29
],
[
0.3,
7
],
[
0.4,
7
],
[
0.5,
1
],
[
0.6,
1
],
[
1,
7
],
[
1.1,
3
],
[
1.2,
5
],
[
1.3,
13
],
[
1.4,
8
],
[
1.5,
12
],
[
1.6,
4
],
[
1.7,
2
],
[
1.8,
12
],
[
1.9,
5
],
[
2,
6
],
[
2.1,
6
],
[
2.2,
3
],
[
2.3,
8
],
[
2.4,
3
],
[
2.5,
3
]
],
"maximum": 2.5,
"mean": 1.19933,
"median": 1.32848,
"minimum": 0.1,
"missing_count": 0,
"population": 150,
"standard_deviation": 0.76224,
"sum": 179.9,
"sum_squares": 302.33,
"variance": 0.58101
}
},
"000004": {
"column_number": 4,
"datatype": "string",
"name": "species",
"optype": "categorical",
"order": 4,
"preferred": true,
"summary": {
"categories": [
[
"Iris-versicolor",
50
],
[
"Iris-setosa",
50
],
[
"Iris-virginica",
50
]
],
"missing_count": 0
}
}
},
"fields_meta": {
"count": 5,
"limit": 200,
"offset": 0,
"total": 5
},
"input_fields": [
"000000",
"000001",
"000002",
"000003",
"000004"
],
"locale": "en_US",
"name": "iris' dataset",
"number_of_evaluations": 0,
"number_of_models": 0,
"number_of_predictions": 0,
"price": 0.0,
"private": true,
"project": null,
"resource": "dataset/52b9359a3c19205ff100002a",
"rows": 150,
"size": 4608,
"source": "source/50a4527b3c1920186d000041",
"source_status": true,
"status": {
"bytes": 4608,
"code": 5,
"elapsed": 163,
"field_errors": [],
"message": "The dataset has been created",
"row_format_errors": [],
"serialized_rows": 150
},
"tags": [],
"updated": "2012-11-15T02:29:10.537000",
"views": 0
}
< Example dataset JSON response
Filtering and Paginating Fields from a Dataset
A dataset might be composed of hundreds or even thousands of fields. Thus when retrieving a dataset, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the dataset is the same as the one you would get without any filtering parameter above.
The fields_meta field can help you paginate fields; its structure is described in the Dataset Properties table above.
Updating a Dataset
To update a dataset, you need to PUT an object containing the fields that you want to update to the dataset's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated dataset.
For example, to update a dataset with a new name you can use curl like this:
curl "https://bigml.io/andromeda/dataset/52b9359a3c19205ff100002a?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name"}' \
-H 'content-type: application/json'
$ Updating a dataset's name
Deleting a Dataset
To delete a dataset, you need to issue a HTTP DELETE request to the dataset/id to be deleted.
Using curl you can do something like this to delete a dataset:
curl -X DELETE "https://bigml.io/andromeda/dataset/52b9359a3c19205ff100002a?$BIGML_AUTH"
$ Deleting a dataset from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a dataset, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a dataset a second time, or a dataset that does not exist, you will receive a "404 not found" response.
However, if you try to delete a dataset that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Datasets
To list all the datasets, you can use the dataset base URL. By default, only the 20 most recent datasets will be returned. You can see below how to change this number using the limit parameter.
You can get your list of datasets directly in your browser using your own username and API key with the following links.
https://bigml.io/andromeda/dataset?$BIGML_AUTH
> Listing datasets from a browser
Multi-Datasets
BigML.io allows you to create a new dataset merging multiple datasets. This functionality can be very useful when you use multiple sources of data, and in online scenarios as well. Imagine, for example, that you collect data on an hourly basis and want to create a dataset aggregating the data collected over the whole day. You only need to send the newly generated data each hour to BigML, create a source and a dataset for each batch, and then merge all the individual datasets into one at the end of the day.
We usually call a dataset created in this way a multi-dataset. BigML.io allows you to aggregate up to 32 datasets in the same API request. You can merge multi-datasets as well, so basically you can grow a dataset as much as you want.
To create a multi-dataset, you can specify a list of dataset ids as input using the argument origin_datasets. The example below will construct a new dataset that is the concatenation of three other datasets.
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
"dataset/52bc7fc83c1920e4a3000012",
"dataset/52bc7fd03c1920e4a3000016",
"dataset/52bc80233c1920e4a300001a"]}'
> Creating a multi-dataset with dataset ids
Alternatively, you can specify a list of dataset objects, each containing the id of the dataset plus optional arguments such as name, sample_rate, out_of_bag, replacement, seed, fields_map, range, and juxtapose_input_fields. You can also mix the two formats. The next two examples are equivalent to the first example above.
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
{"id": "dataset/52bc7fc83c1920e4a3000012"},
{"id": "dataset/52bc7fd03c1920e4a3000016"},
{"id": "dataset/52bc80233c1920e4a300001a"}]}'
> Creating a multi-dataset with dataset objects
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
"dataset/52bc7fc83c1920e4a3000012",
"dataset/52bc7fd03c1920e4a3000016",
{"id": "dataset/52bc80233c1920e4a300001a"}]}'
> Creating a multi-dataset with mixed formats
By convention, the first dataset defines the final dataset fields. However, there can be cases where each dataset might come from a different source and therefore have different field ids. In these cases, you might need to use a fields_map argument to match each field in a dataset to the fields of the first dataset.
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
"dataset/52bc7fc83c1920e4a3000012",
{
"id": "dataset/52bc7fd03c1920e4a3000016",
"fields_map": {
"000000":"000023",
"000001":"000024",
"000002":"00003a"
}
},
{
"id": "dataset/52bc80233c1920e4a300001a",
"fields_map": {
"000000":"000023",
"000001":"000004",
"000002":"00000f"
}
},
"dataset/52bc851b3c1920e4a3000022"]}'
> Creating a multi-dataset mapping fields
In the request above, we use four datasets as input. The first one defines the final dataset fields. Let's say that the dataset dataset/52bc7fc83c1920e4a3000012 in this example has three fields with identifiers 000000, 000001 and 000002. Those will be the default resulting fields, together with their data types and so on. Then we need to specify, for each of the remaining datasets in the list, a mapping from the "standard" fields to those in the corresponding dataset. In our example, we're saying that the fields of the second dataset to be used during the concatenation are 000023, 000024 and 00003a, which correspond to the final fields having them as keys. In the case of the third dataset, the fields used will be 000023, 000004 and 00000f. For the last one, since there's no entry in fields_map, we'll try to use the same identifiers as those of the first dataset.
The optypes of the paired fields should match, and for the case of categorical fields, be a proper subset. If a final field has optype text, however, all values are converted to strings.
You can achieve the same result as the example above with the following command.
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
"dataset/52bc7fc83c1920e4a3000012",
"dataset/52bc7fd03c1920e4a3000016",
"dataset/52bc80233c1920e4a300001a",
"dataset/52bc851b3c1920e4a3000022"],
"fields_maps": {
"dataset/52bc7fd03c1920e4a3000016": {
"000000":"000023",
"000001":"000024",
"000002":"00003a"},
"dataset/52bc80233c1920e4a300001a": {
"000000":"000023",
"000001":"000004",
"000002":"00000f"}}}'
> Creating a multi-dataset mapping fields (Deprecated)
Note that the top-level fields_maps argument is deprecated in favor of the self-contained dataset object format, along with the other top-level plural arguments (i.e., origin_dataset_names, sample_rates, out_of_bags, replacements, seeds, ranges, and juxtapose_input_fields), which are dictionaries keyed by the dataset id. When origin_datasets contains at least one self-contained dataset object, those arguments are simply ignored. A self-contained dataset object must have id and may have name, fields_map, sample_rate, out_of_bag, replacement, seed, range, and juxtapose_input_fields. This format also allows you to list the same dataset more than once. However, if origin_datasets contains a list of string dataset ids only, then each dataset id must be unique for backward compatibility.
BigML.io also allows you to sample each dataset individually before merging it. You can specify the sample options for each dataset object using the arguments sample_rate, replacement, seed, and out_of_bag. Likewise, you can define the range of the dataset by using the range option. The next request will create a multi-dataset sampling the two input datasets differently. Note that you can still apply the top-level sampling arguments to the merged dataset.
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
{
"id": "dataset/52bc7fc83c1920e4a3000012",
"sample_rate": 0.5,
"replacement": true
},
{
"id": "dataset/52bc851b3c1920e4a3000022",
"sample_rate": 0.8
}]}'
> Creating a multi-dataset with sampling
BigML.io also allows you to create a new dataset merging multiple datasets using juxtaposition instead of concatenating the datasets passed in the argument origin_datasets. In its simplest form, a juxtaposition request would look like this:
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
"dataset/52bc7fc83c1920e4a3000012",
"dataset/52bc7fd03c1920e4a3000016",
"dataset/52bc80233c1920e4a300001a"],
"juxtapose": true}'
> Creating a multi-dataset using juxtaposition
In the example above, we are asking for the generation of a new dataset where each row is constructed by putting side by side the corresponding rows of the three origin datasets. The new dataset will thus contain as many rows as the shortest input dataset and as many fields as the sum of the number of fields of the input datasets.
Unless otherwise specified, all fields of each of the datasets in origin_datasets are used in the juxtaposition. If you want to use a subset of any of them, specify it using juxtapose_input_fields. This creation request field must be an object where each entry specifies the input fields to use for the corresponding dataset. The values must be lists of field names or identifiers. For instance:
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
{
"id": "dataset/52bc7fc83c1920e4a3000012",
"juxtapose_input_fields": ["000001", "species"]
},
"dataset/52bc7fd03c1920e4a3000016",
{
"id": "dataset/52bc80233c1920e4a300001a",
"juxtapose_input_fields": ["age", "000002", "000003"]
}],
"juxtapose": true}'
> Creating a multi-dataset using juxtaposition
It will juxtapose two fields of the first dataset, all the fields of the second dataset, and three fields of the last dataset. We also show in the example how fields can be identified by either id or name.
This is the list of all the arguments that you can use to create a multi-dataset. Note again that many of them are deprecated in favor of the self-contained dataset object arguments.
Argument | Type | Description |
---|---|---|
fields_maps
optional |
Object |
A dictionary keyed by dataset/id and object values. Each entry maps fields in the first dataset to fields in the dataset referenced by the key.
Example:
DEPRECATED
|
json_query
optional |
Object | A dictionary specifying each of the parts of the executed SQL query separately. See the Section on Creating a Dataset using SQL for more details. |
juxtapose
optional |
Boolean, default is false |
Whether juxtaposition should be performed on multi-dataset merging.
Example: true |
juxtapose_input_fields
optional |
Object |
A dictionary keyed by dataset/id and an array of field names and/or ids that specifies the input fields to use for each dataset during merge.
Example:
DEPRECATED
|
origin_dataset_names
optional |
Object | A dictionary keyed by dataset/id and value a string that represents the name to be used as its table name in the SQL query to be performed for the dataset. See the Section on Creating a Dataset using SQL for more details. DEPRECATED |
out_of_bags
optional |
Object |
A dictionary keyed by dataset/id and boolean values. Setting this parameter to true for a dataset will return a dataset containing the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example:
DEPRECATED
|
ranges
optional |
Object |
A dictionary keyed by dataset/id and range values.
Example:
DEPRECATED
|
replacements
optional |
Object |
A dictionary keyed by dataset/id and boolean values indicating whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example:
DEPRECATED
|
sample_rates
optional |
Object |
A dictionary keyed by dataset/id and float values. Each value is a number between 0 and 1 specifying the sample rate for the dataset. See the Section on Sampling for more details.
Example:
DEPRECATED
|
seeds
optional |
Object |
A dictionary keyed by dataset/id and string values indicating the seed to be used for each dataset to generate deterministic samples. See the Section on Sampling for more details.
Example:
DEPRECATED
|
sql_output_fields
optional |
Array of Objects | A list of dictionaries containing some of the properties of the fields generated by the given sql_query or json_query. See the Section on Creating a Dataset using SQL for more details. |
sql_query
optional |
String | The SQL query to be executed. See the Section on Creating a Dataset using SQL for more details. |
When creating a multi-dataset, BigML.io performs the following steps (a combined example is sketched after this list):
- Sample each individual dataset object according to the specifications provided in the arguments sample_rate, replacement, seed, out_of_bag, and range.
- Merge all the datasets together using the fields_map argument to match fields in case they come from different sources (i.e., have different field ids).
- When juxtapose is true, all arguments in the table above except juxtapose_input_fields are ignored; however, the rules below still apply.
- Sample the merged dataset as in regular dataset sampling, using the top-level arguments sample_rate, replacement, seed, and out_of_bag.
- Filter the sampled dataset using input_fields, excluded_fields, and either a json_filter or lisp_filter.
- Extend the dataset with new fields according to the specifications provided in the new_fields argument.
- Filter the output of the new fields using either an output_json_filter or output_lisp_filter.
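For instance, a sketch of a multi-dataset request combining per-dataset and top-level sampling (the dataset ids here are hypothetical) could look like this:
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"origin_datasets": [
            {
              "id": "dataset/52bc7fc83c1920e4a3000012",
              "sample_rate": 0.5,
              "out_of_bag": true
            },
            {
              "id": "dataset/52bc851b3c1920e4a3000022",
              "sample_rate": 0.8,
              "replacement": true
            }
        ],
        "sample_rate": 0.9}'
> Creating a multi-dataset with per-dataset and top-level sampling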
Creating a Dataset using SQL
BigML.io now allows you to create a new dataset by performing an SQL-style query over a list of input datasets, which are treated as SQL tables. To that end, the POST request JSON should contain the following fields:
- origin_datasets: a list of dataset identifiers or objects for the input datasets that are going to be used as the input tables of the query. This is identical to the field used to specify a multi-dataset when merging datasets.
- origin_dataset_names: a dictionary keyed by dataset/id whose values are strings representing the name to be used as the table name for each dataset in the SQL query. This will typically be a list of short names so that the SQL query is readable (i.e., you want to write "SELECT d0.field1" rather than "SELECT a_long_dataset_name.field1"). Note this field is deprecated in favor of the name key of the corresponding dataset object under origin_datasets.
- sql_query, a string with the SQL query to be executed or json_query, a map specifying each of the parts of the SQL query separately.
- sql_output_fields, a list of dictionaries containing some of the properties of the fields generated by the given sql_query or json_query.
The platform will parse the query, converting its field names to identifiers if needed, and will return (for informational purposes) two additional fields, namely sql_query_parsed and json_query_parsed. The first is the canonical form of the query as a SQL prepared statement (i.e., a list with a string that can contain wild-cards and, if needed, some arguments, as in ["SELECT * FROM A WHERE A.000000 > ?", 2]), and json_query_parsed is the canonical representation of the query as a JSON map.
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
{
"id": "dataset/5bab8d6b1f386f7c20000000",
"name": "A"
},
{
"id": "dataset/5bab8d6e1f386f7c20000003",
"name": "B"
}
],
"sql_query": "select A.`000000` as x, A.`00000a` as z, A.`00000c` from A, B where A.id = B.id",
"sql_output_fields": [
{
"column": 0,
"name": "name, a text",
"optype": "text",
"term_analysis": {
"enabled": true,
"case_sensitive": true
}
},
{
"column": 1,
"optype": "items",
"item_analysis": {
"separator": ";"
}
}
]
}'
> Creating a dataset using SQL
A SELECT specification can be provided either as a SQL string or as a map possibly containing the following keys:
-
SELECT: A list of strings,
each one specifying one of the new fields in the generated dataset.
This corresponds to the "selected columns" part of a "SELECT FROM ..." SQL statement,
and will use the names in origin_dataset_names
to refer to input datasets as SQL tables.
Each table has one column per dataset field, and its canonical name is the field identifier;
but for convenience one can refer to input columns using field names
and BigML will translate them automatically to identifiers.
For instance, say we have origin_dataset_names:
{"dataset/5bab8d6b1f386f7c20000000": "d0"}, i.e. one input dataset with,
say, fields 000000, 000001 and 000002 named field1, field2 and field3.
One could select only the first field of the first dataset either via
"SELECT d0.000000" or via "SELECT d0.field1", or using the maps:
{"select": ["d0.`000000`"], ...}
{"select": ["d0.field1"], ...}
or the second and first columns of the dataset with, for instance, "SELECT d0.000001, d0.field1", with the map form: {"select": ["d0.`000001`", "d0.field1"], ...}
In an SQL query specified as a string, you can name the output columns of your query using "AS", for instance: "SELECT d0.field2 AS age" will pick the third column of the dataset and the field of the generated dataset will be named age. When the query is specified using a JSON dictionary, the corresponding element in the select list will be a pair, with the first element the left hand operand of "AS" and the second element the right hand one. So the previous query would be translated as: {"select": [["d0.field2", "age"]], ...}
It is also possible to specify SQL function calls in a select list element, using prefix notation for the operators.
- DISTINCT: A boolean flag that corresponds to the distinct SQL keyword when set to true. Defaults to false.
- LIMIT: An integer with the maximum number of rows to select, as in the SQL string "SELECT * LIMIT 3".
- OFFSET: The offset of the selected rows, as an integer. Corresponds to the offset SQL keyword.
-
WHERE: A JSON rendition, as a list, of a SQL where clause.
The first element in the list is the SQL operand to apply
(one of and, or, count,
avg, sum, min,
max, =, <>,
!=, >, >=,
<, <=, and between),
and the rest are its operands (possibly including nested operators).
So for instance, the SQL clause "d.f0 < 3 and e.f1 = e.f2" is written as
["and", ["<", "d.f0", 3], ["=", "e.f1", "e.f2"]]
- HAVING: A translation of a SQL having clause using prefix notation, as in where.
- GROUP_BY: A list of field identifiers (as in select) to perform a SQL group by operation.
-
ORDER_BY: A list of field identifiers to perform a SQL order by operation.
Each element in the list can be either a field identifier string,
or a pair of a field identifier and either ASC or
DESC to denote ascending or descending ordering.
["t.field1", ["t.field2", "DESC"], ["e.field0" "ASC"]]
-
JOIN, LEFT_JOIN, RIGHT_JOIN, FULL_JOIN:
An SQL JOIN clause is used to combine rows from two or more datasets (tables).
The specification consists of a list starting with the name of the dataset to join on,
followed by the operation that one writes in the SQL ON specification.
Thus, for instance, the SQL string "JOIN foo ON foo.id = bar.id" would be translated to the JSON specification
{"join": ["foo", ["=", "foo.id", "bar.id"]]}
and likewise for the other joins.
Here's an example of a complicated query combining most of the elements above:
{
"select": ["f.*", "b.baz", "c.quux", ["b.bla", "bla-bla"], ["now"]],
"distinct": true,
"having": ["<", 0, "f.e"],
"where": ["or", ["and", ["=", "f.a", "bort"], ["!=", "b.baz", "param1"]],
["<", 1, 2, 3],
["in", "f.e", [1, 19, 3]],
["between", "f.e", 10, 20]],
"limit": 50,
"group_by": ["f.a"],
"offset": 10,
"join": ["draq", ["=", "f.b", "draq.x"]],
"left_join": ["clod", ["=", "f.a", "clod.d"]],
"order_by": [["b.baz", "desc"], "c.quux", "f.a"]
}
which would correspond to the SQL query string:
SELECT DISTINCT f.*, b.baz, c.quux, b.bla AS bla_bla, now()
INNER JOIN draq ON f.b = draq.x
LEFT JOIN clod c ON f.a = c.d
WHERE ((f.a = "bort" AND b.baz <> "param1")
OR (1 < 2 AND 2 < 3)
OR (f.e in (1, 10, 3))
OR f.e BETWEEN 10 AND 20)
GROUP BY f.a
HAVING 0 < f.e
ORDER BY b.baz DESC, c.quux, f.a
LIMIT 50
OFFSET 10
Although the above examples contain field names, you can also use field IDs in a SQL query as long as you wrap them in back quotes, e.g., "SELECT `100006`, sum(`000001`) AS sum_field1 FROM DS GROUP BY `100006`"
You can submit either form as your query. If you use the string form, BigML will parse it into the standard map format shown above, discarding any unsupported SQL constructs appearing in the query string.
Note that, as is conventional in SQL, we freely mix upper and lowercase keywords in the above examples for a sql_query value. BigML accepts both cases, although the recommended style is not to mix them in a single request. The keywords in json_query, however, must be lowercase.
You can find a list of other non-standard SQL functions supported by BigML in this document.
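For example, a sketch of an aggregation query using back-quoted field ids (the dataset id and field ids here are hypothetical) could be:
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"origin_datasets": [
            {"id": "dataset/5bab8d6b1f386f7c20000000", "name": "DS"}],
        "sql_query": "SELECT `000006`, sum(`000001`) AS sum_field1 FROM DS GROUP BY `000006`"}'
> Creating an aggregated dataset using back-quoted field ids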
Next, we'll list all the arguments that can be used to fine-tune the properties of the SQL-generated fields.
Argument | Type | Description |
---|---|---|
column | Integer |
Column number denoting the output field. Zero-based numbering.
Example: 1 |
item_analysis
optional |
Object |
Set of parameters to activate item analysis for the dataset. See the Section on Item Analysis for more details.
Example:
|
name
optional |
String |
Name of the new field.
Example: "Price" |
optype
optional |
String |
Optype of the new field. Available optypes are "numeric", "categorical", "text", "datetime", and "items".
Example: "text" |
refresh_field_type
optional |
Boolean, default is false |
Whether the optype of the field needs to be recomputed or not.
Example: true |
refresh_preferred
optional |
Boolean, default is false |
Whether the preferred flag of the field needs to be recomputed or not.
Example: true |
term_analysis
optional |
Object |
Set of parameters to activate text analysis for the dataset. See the Section on Term Analysis for more details.
Example:
|
Column Concatenation Example
Say you have 2 datasets and want to join them using the field named "id". A query request would look like this:
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
{
"id": "dataset/5bab8d6b1f386f7c20000000",
"name": "A"
},
{
"id": "dataset/5bab8d6e1f386f7c20000003",
"name": "B"
}
],
"sql_query": "select * from A join B on A.id=B.id"
}'
> Creating a dataset using sql_query
or, using the JSON dictionary form of the query:
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
{
"id": "dataset/5bab8d6b1f386f7c20000000",
"name": "A"
},
{
"id": "dataset/5bab8d6e1f386f7c20000003",
"name": "B"
}
],
"json_query": {
"select": ["*"],
"from": ["A"],
"join": ["B", ["=", "A.id", "B.id"]]}
}'
> Creating a dataset using json_query
Resources Accepting Multi-Datasets Input
You can also create an anomaly, association, cluster, correlation, deepnet, ensemble, evaluation, linear regression, logistic regression, model, PCA, statistical test, time series, and topic model using multiple datasets as input at once, that is, without first merging all the datasets together into a new dataset. All the multi-dataset arguments above can be used except those related to juxtaposition and SQL queries, i.e., juxtapose, juxtapose_input_fields, and origin_dataset_names (name for a dataset object). You just need to use the datasets argument instead of the regular dataset. Note that for time series none of the sampling arguments except range is allowed, as it requires sequential data. See the examples below to create a multi-dataset model, a multi-dataset ensemble, and a multi-dataset evaluation.
curl "https://bigml.io/andromeda/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"datasets": [
{
"id": "dataset/52bc7fc83c1920e4a3000012",
"sample_rate": 0.5,
"out_of_bag": true
},
{
"id": "dataset/52bc851b3c1920e4a3000022",
"sample_rate": 0.8,
"replacement": true
}]}'
> Creating a multi-dataset model
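Multi-dataset ensembles and evaluations follow the same pattern, using the datasets argument on their respective endpoints. Here is a minimal sketch for an ensemble (the dataset ids are hypothetical):
curl "https://bigml.io/andromeda/ensemble?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"datasets": [
            "dataset/52bc7fc83c1920e4a3000012",
            "dataset/52bc851b3c1920e4a3000022"]}'
> Creating a multi-dataset ensemble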
Transformations
Once you have created a dataset, BigML.io allows you to derive new datasets from it by sampling, filtering, adding new fields, or concatenating it to other datasets. We apply the term dataset transformations (or just transformations for short) to this set of operations for creating new, modified versions of your original dataset.
We use the term:
- Cloning for the general operation of generating a new dataset.
- Sampling when the original dataset is sampled.
- Filtering when the original dataset is filtered.
- Extending when new fields are generated.
- Merging when a multi-dataset is created.
Keep in mind that you can sample, filter and extend a dataset all at once in only one API request.
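For instance, a single request could combine sampling, a row filter, and a new field. The sketch below uses a hypothetical dataset id and field ids, with the individual arguments explained in the sections that follow:
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"origin_dataset": "dataset/52b8fdff3c19205ff100001e",
        "sample_rate": 0.8,
        "lisp_filter": "(> (f \"000002\") 0)",
        "new_fields": [{
            "field": "(* 2 (f \"000002\"))",
            "name": "double value"}]}'
> Sampling, filtering, and extending a dataset in a single request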
So let's start with the most basic transformation: cloning a dataset.
Cloning a Dataset
To clone a dataset you just need to use the origin_dataset argument to send the dataset/id of the dataset that you want to clone. For example:
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b8fdff3c19205ff100001e"}'
> Cloning a dataset
Argument | Type | Description |
---|---|---|
category
optional |
Integer |
The category that best describes the dataset. See the category codes for the complete list of categories.
Example: 1 |
fields
optional |
Object |
Updates the names, labels, and descriptions of the fields in the new dataset. An entry keyed with the field id of the original dataset for each field that will be updated.
Example:
|
name
optional |
String |
The name you want to give to the new dataset.
Example: "my new dataset" |
origin_dataset | String |
The dataset/id of the dataset to be cloned.
Example: "dataset/52694b59035d0737c201ac68" |
Sampling a Dataset
It is also possible to provide a sampling specification to be used when cloning the dataset. The sample will be applied to the origin_dataset rows. For example:
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b8fdff3c19205ff100001e",
"sample_rate": 0.8,
"replacement": true,
"seed": "myseed"}'
> Sampling a dataset
Argument | Type | Description |
---|---|---|
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a dataset containing a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true |
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5 |
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
Filtering a Dataset
A dataset can be filtered in different ways:
- Excluding a few fields using the excluded_fields argument.
- Selecting only a few fields using the input_fields argument.
- Filtering rows using a json_filter or lisp_filter similarly to the way you can filter a source.
- Specifying a range of rows.
As illustrated in the following example, it's possible to provide a list of input fields, selecting the fields from the filtered input dataset that will be created. Filtering happens before field picking and, therefore, the row filter can use fields that won't end up in the cloned dataset.
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b8fdff3c19205ff100001e",
"input_fields": ["000000", "000001", "000003"],
"json_filter": [">", 3.14, ["field", "000002"]],
"range": [50, 100]}'
> Filtering a dataset
Argument | Type | Description |
---|---|---|
excluded_fields
optional |
Array |
Specifies the fields that won't be included in the new dataset.
Example:
|
input_fields
optional |
Array |
Specifies the fields to be included in the dataset.
Example:
|
json_filter
optional |
Array |
A JSON list representing a filter over the rows in the origin dataset. The first element is an operator and the rest of the elements its arguments. See the Section on filtering sources for more details.
Example: [">", 3.14, ["field", "000002"]] |
lisp_filter
optional |
String |
A string representing a Lisp s-expression to filter rows from the origin dataset.
Example: "(> 3.14 (field 2))" |
range
optional |
Array |
The range of successive instances to create the new dataset.
Example: [100, 200] |
Extending a Dataset
You can clone a dataset and extend it with brand new fields using the new_fields argument. Each new field is created using a Flatline formula and optionally a name, label, and description.
A Flatline formula is a lisp-like expression that allows you to reference and process columns and rows of the origin dataset. See the full Flatline reference here. Let's see a first example that clones a dataset and adds a new field named "Celsius" to it, using a formula that converts the values from the "Fahrenheit" field to Celsius.
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb64713c192015e3000010",
"new_fields": [{
"field": "(/ (* 5 (- (f Fahrenheit) 32)) 9)",
"name": "Celsius"}]}'
> Extending a dataset
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb64713c192015e3000010",
"all_fields": false,
"new_fields": [{
"fields": "(fields 0 1)",
"names": ["Day", "Temperature"]}]}'
> Extending a dataset
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb64713c192015e3000010",
"new_fields": [
{"field": "(avg (window Fahrenheit -6 0))",
"name": "Weekly AVG",
"label":"Weekly Average",
"description": "Temperature average over the last seven days"},
{"fields": "(list (f 0 -1) (f 0 1))",
"names": ["Yesterday", "Tomorrow"],
"labels": ["Yesterday prediction", "Tomorrow prediction"],
"descriptions": ["Prediction for the previous day", "Prediction for the next day"]}]}'
> Extending a dataset
Filtering the New Fields Output
The generation of new fields works by traversing the input dataset row by row and applying the Flatline formula of each new field to each row in turn. The list of values generated from each input row that way constitutes an output row of the generated dataset.
It is possible to limit the number of input rows that the generator sees by means of filters and/or sample specifications, for example:
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb2c263c192015e3000004",
"lisp_filter": "(not (= 0 (f 000001)))",
"new_fields": [
{"field": "(/ 1 (f 000001))",
"name": "Inverse value"}]}'
> Extending a dataset
And, as an additional convenience, it is also possible to specify either an output_lisp_filter or an output_json_filter, that is, a Flatline row filter that acts on the generated rows instead of on the input data:
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb2c263c192015e3000004",
"lisp_filter": "(not (= 0 (f 000001)))",
"new_fields": [
{"field": "(/ 1 (f 000001))",
"name": "Inverse value"}],
"output_lisp_filter": "(< 0.25 (f \"Inverse value\"))"}'
> Extending a dataset
And if all you need after the traversal is the last row, possibly because you are using cells to accumulate values that will be available in the metadata of the final dataset, you can set reduce to true, and the resulting dataset will contain only one row.
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb2c263c192015e3000004",
"reduce": true,
"lisp_filter": "(not (= 0 (f 000001)))",
"new_fields": [
{"name": "Inverse value",
"field": "(set-cell "sum" (+ (cell "sum" 0) (/ 1 (f 000001))))"}
]}'
> Extending a dataset
Of course, the same effect can be accomplished with an output row filter that checks (row-number) against its maximum value; reduce is just a shortcut.
You can also skip any number of rows in the input, starting the generation at an offset given by row_offset, and traverse the input rows by any step specified by row_step. For instance, the following request will generate a dataset whose rows are created by putting together every three consecutive values of the input field "Price":
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b7f0ba3c19208c13000131",
"row_offset": 2,
"row_step": 3,
"new_fields": [
{"fields": "(window \"Price\" -2 0)",
"names": ["Price-2", "Price-1", "Price"]}]}'
> Extending a dataset
With the specification above, the new field will start with the third row in the input dataset, generate an output row (which uses values from the current input row as well as from the two previous ones), skip to the 6th input row, generate a new output, and so on and so forth.
Next, we'll list all the arguments that can be used to extend a dataset.
Argument | Type | Description |
---|---|---|
all_but
optional |
Array |
Specifies the fields to be excluded from the new dataset: all fields but these are included.
Example: ["000001", "000003"] |
all_fields
optional |
Boolean |
Whether all fields should be included in the new dataset or not.
Example: false |
new_fields
optional |
Array |
Specifies the new fields to be included in the dataset. See the table below for more details.
Example: [{"field": "(log10 (field "000001"))", "name": "log"}] |
output_json_filter
optional |
Array |
A JSON list representing a filter over the rows of the dataset once the new fields have been generated. The first element is an operator and the rest of the elements its arguments. See the Section on filtering rows for more details.
Example: [">", 3.14, ["field", "000002"]] |
output_lisp_filter
optional |
String |
A string representing a Lisp s-expression to filter rows after the new fields have been generated.
Example: "(> 3.14 (field 2))" |
reduce
optional |
Boolean |
Whether only the last row should be returned. It is only used in field generation (not in filters).
Example: false |
row_offset
optional |
Integer |
The initial number of rows to skip from the input dataset before starting to process rows.
Example: 100 |
row_step
optional |
Integer |
The number of rows to skip in every step.
Example: 5 |
Argument | Type | Description |
---|---|---|
description
optional |
String |
A description for the new field.
Example: "This field is a transformation" |
descriptions
optional |
Array |
A description for each of the new fields generated.
Example: ["Price 3 days ago", "Price 2 days ago", "Price 1 day ago"] |
field | Flatline expression |
Either a json-like or lisp-like Flatline expression to generate a new field.
Example: "(* (field 5) 100)" |
fields | Flatline expression |
Either a json-like or lisp-like Flatline expression to generate a number of fields.
Example: "(window Price -2 0)" |
item_analyses
optional |
Array | List of item_analyses for each of the new fields generated. |
item_analysis
optional |
Object |
Set of parameters to activate item analysis for the dataset. See the Section on Item Analysis for more details.
Example:
|
label
optional |
String |
Label of the new field.
Example: "New price" |
labels | Array |
Labels for each of the new fields generated.
Example: ["Price-3", "Price-2", "Price-1"] |
name
optional |
String |
Name of the new field.
Example: "Price" |
names
optional |
Array |
Names for each of the new fields generated.
Example: ["P3", "P2", "P1"] |
optype
optional |
String |
Optype of the new field. Available optypes are "numeric", "categorical", "text", "datetime", and "items".
Example: "text" |
optypes
optional |
Array |
Optypes for each of the new fields generated.
Example: ["numeric", "categorical", "text"] |
refresh_field_type
optional |
Boolean, default is false |
Specifies whether the new field type needs to be recomputed or not.
Example: true |
term_analyses
optional |
Array | List of term_analyses for each of the new fields generated. |
term_analysis
optional |
Object |
Set of parameters to activate text analysis for the dataset. See the Section on Term Analysis for more details.
Example:
|
Discretization of a Continuous Field
Here's an example discretizing the "temp" field into three homogeneous levels:
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb64713c192015e3000010",
"new_fields": [{
"field": "(cond (< (f \"temp\") 0) \"SOLID\"
(< (f \"temp\") 100) \"LIQUID\"
\"GAS\")",
"name":"Discrete Temp"}]}'
Discretizing a field
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b9359a3c19205ff100002a",
"new_fields": [{
"field": "(cond (> (percentile \"age\" 0.1) (f \"age\")) \"baby\"
(> (percentile \"age\" 0.2) (f \"age\")) \"child\"
(> (percentile \"age\" 0.6) (f \"age\")) \"adult\"
(> (percentile \"age\" 0.9) (f \"age\")) \"old\"
\"elder\")",
"name":"Discrete Age"}]}'
Discretizing a field
Outlier Elimination
You can use, for instance, the following predicate in a filter to remove outliers:
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b9359a3c19205ff100002a",
"lisp_filter": "(< (percentile \"age\" 0.5) (f \"age\") (percentile \"age\" 0.95))"}'
Eliminating outliers
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b9359a3c19205ff100002a",
"lisp_filter": "(within-percentiles? "age" 0.5 0.95)"}'
Eliminating outliers
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb64713c192015e3000010",
"new_fields": [{
"field": "(if (missing? \"temp\") (mean \"temp\") (field \"temp\"))",
"name": "no missing temp"}]}'
Changing missing values
Lisp and JSON Syntaxes
Flatline also has a json-like flavor with exactly the same semantics as the lisp-like version. Basically, a Flatline formula can easily be translated to its json-like variant and vice versa by just changing parentheses to brackets, symbols to quoted strings, and adding commas to separate each sub-formula. For example, the following two formulas are the same for BigML.io.
"(/ (* 5 (- (f Fahrenheit) 32)) 9)"
Lisp-like formula
["/", ["*", 5, ["-", ["f", "Fahrenheit"], 32]], 9]
Json-like formula
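Either flavor can be used wherever a Flatline expression is expected. For instance, the Celsius example above could be written with the json-like form (same hypothetical dataset id as in that example):
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"origin_dataset": "dataset/52bb64713c192015e3000010",
        "new_fields": [{
            "field": ["/", ["*", 5, ["-", ["f", "Fahrenheit"], 32]], 9],
            "name": "Celsius"}]}'
> Extending a dataset with a json-like formula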
Final Remarks
A few important details that you should keep in mind:
- Cloning a dataset also implies creating a copy of its serialized form, so you get an asynchronous resource with a status that evolves from the Summarized (4) to the Finished (5) state.
- If you specify both sampling and filtering arguments, the former are applied first.
- As with filters applied to datasources, dataset filters can use the full Flatline language to specify the boolean formula to use when sifting the input.
- Flatline performs type inference, and will in general figure out the proper optype for the generated fields, which are subsequently summarized by the dataset creation process, reaching then their final datatype (just as with a regular dataset created from a datasource). In case you need to fine-tune Flatline's inferences, you can provide an optype (or optypes) key and value in the corresponding output field entry (together with generator and names), but in general this shouldn't be needed.
- Please check the Flatline reference manual for a full description of the language for field generation and the many pre-built functions it provides.
Samples
Last Updated: Thursday, 2020-10-08 20:05
A sample provides fast-access to the raw data of a dataset on an on-demand basis.
When a new sample is requested, a copy of the dataset is stored in a special format in an in-memory cache. Multiple and different samples of the data can then be extracted using HTTPS parameterized requests by sampling sizes and simple query string filters.
Samples are ephemeral. That is to say, a sample will be available as long as GETs are requested within periods smaller than a pre-established TTL (Time to Live). The expiration timer of a sample is reset every time a new GET is received.
If requested, a sample can also perform linear regression and compute Pearson's and Spearman's correlations for either one numeric field against all other numeric fields or between two specific numeric fields.
BigML.io allows you to create, retrieve, update, delete your sample. You can also list all of your samples.
Jump to:
- Sample Base URL
- Creating a Sample
- Sample Arguments
- Retrieving a Sample
- Sample Properties
- Filtering and Paginating Fields from a Sample
- Filtering Rows from a Sample
- Updating a Sample
- Deleting a Sample
- Listing Samples
Sample Base URL
You can use the following base URL to create, retrieve, update, and delete samples. https://bigml.io/andromeda/sample
Sample base URL
All requests to manage your samples must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Sample
To create a new sample, you need to POST to the sample base URL an object containing at least the dataset/id that you want to use to create the sample. The content-type must always be "application/json".
You can easily create a new sample using curl as follows. All you need is a valid dataset/id and your authentication variable set up as shown above.
curl "https://bigml.io/andromeda/sample?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/5484b109f0a5ea59a6000018"}'
> Creating a sample
BigML.io will return the newly created sample if the request succeeded.
{
"category":0,
"code":201,
"created":"2015-02-03T08:53:08.782775",
"credits":0,
"dataset":"dataset/5484b109f0a5ea59a6000018",
"description":"",
"excluded_fields": [
"00000e",
"00000f"
],
"fields_meta":{
"count":0,
"limit":1000,
"offset":0,
"total":0
},
"input_fields":[
"000000",
"000001",
"000002",
"000003",
"000004",
"000005",
"000006",
"000007",
"000008",
"000009",
"00000a",
"00000b",
"00000c",
"00000d"
],
"max_columns":14,
"max_rows":32561,
"name":"census' dataset sample",
"private":true,
"project":null,
"resource":"sample/54d9c6f4f0a5ea0b1600003a",
"seed":"c30d76cd14e24ef7ab7d28f98b3c8488",
"size":3292068,
"status":{
"code":1,
"message":"The sample is being processed and will be created soon"
},
"subscription":false,
"tags":[],
"updated":"2015-02-03T08:53:08.782792"
}
< Example sample JSON response
Sample Arguments
See below the full list of arguments that you can POST to create a sample.
Argument | Type | Description |
---|---|---|
category
optional |
Integer, default is the category of the dataset |
The category that best describes the sample. See the category codes for the complete list of categories.
Example: 1 |
dataset | String |
A valid dataset/id.
Example: dataset/4f665b8103ce8920bb000006 |
description
optional |
String |
A description of the sample up to 8192 characters long.
Example: "This is a description of my new sample" |
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields in the dataset that won't be included in the sample
Example:
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields in the dataset to be considered to create the sample.
Example:
|
name
optional |
String, default is dataset's name sample |
The name you want to give to the new sample.
Example: "my new sample" |
project
optional |
String |
The project/id you want the sample to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
tags
optional |
Array of Strings |
A list of strings that help classify and index your sample.
Example: ["best customers", "2018"] |
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can also use curl to customize a new sample with a name. For example, to create a new sample named "my sample" with some tags:
curl "https://bigml.io/andromeda/sample?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/5484b109f0a5ea59a6000018",
"name": "my sample",
"tags": ["potential customers", "2015"]}'
> Creating a customized sample
If you do not specify a name, BigML.io will assign to the new sample the dataset's name.
Retrieving a Sample
Each sample has a unique identifier in the form "sample/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the sample.
To retrieve a sample with curl:
curl "https://bigml.io/andromeda/sample/54d9c6f4f0a5ea0b1600003a?$BIGML_AUTH"
$ Retrieving a sample from the command line
Sample Properties
Once a sample has been successfully created it will have the following properties.
Property | Type | Description |
---|---|---|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
code | Integer | HTTP status code. This will be 201 upon successful creation of the sample and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the sample creation has been completed without errors. |
columns
filterable, sortable |
Integer | The number of fields returned in the sample's fields. |
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the sample was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
credits
filterable, sortable |
Float | The number of credits it cost you to create this sample. |
dataset
filterable, sortable |
String | The dataset/id that was used to create the sample. |
description
updatable |
String | A text describing the sample. It can contain restricted markdown to decorate the text. |
excluded_fields | Array | The list of fields's ids that were excluded to build the sample. |
fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned. |
input_fields | Array | The list of input fields' ids available to filter the sample |
locale | String | The source's locale. |
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the sample. |
max_rows
filterable, sortable |
Integer | The max number of rows in the sample. |
name
filterable, sortable, updatable |
String | The name of the sample as provided or based on the name of the dataset by default. |
private
filterable, sortable |
Boolean | Whether the sample is public or not. |
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
resource | String | The sample/id. |
rows
filterable, sortable |
Integer | The total number of rows in the sample. |
sample | Object | All the information that you need to analyze the sample on your own. It includes the fields' dictionary describing the fields and their summaries and the rows. |
seed
filterable, sortable |
String | The string that was used to generate the sample. |
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this sample. |
status | Object | A description of the status of the sample. It includes a code, a message, and some extra information. See the table below. |
subscription
filterable, sortable |
Boolean | Whether the sample was created using a subscription plan or not. |
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the sample was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
A Sample Object has the following properties:
Property | Type | Description |
---|---|---|
rows | Array of Arrays | A list of lists representing the rows of the sample. Values in each list are ordered according to the fields list. |
Sample Status
Through the status field in the sample you can determine when the sample has been fully processed and ready to be used. These are the fields that a sample's status has:
Property | Type | Description |
---|---|---|
code | Integer | A status code that reflects the status of the sample creation. It can be any of those that are explained here. |
elapsed | Integer | Number of milliseconds that BigML.io took to process the sample. |
message | String | A human readable message explaining the status. |
Once a sample has been successfully created, it will look like:
{
"category":0,
"code":200,
"columns":2,
"created":"2015-02-03T18:21:07.001000",
"credits":0,
"dataset":"dataset/5484b109f0a5ea59a6000018",
"description":"",
"excluded_fields": [
"00000e",
"00000f"
],
"fields_meta":{
"count":2,
"limit":2,
"offset":0,
"query_total":14,
"total":14
},
"input_fields":[
"000000",
"000001",
"000002",
"000003",
"000004",
"000005",
"000006",
"000007",
"000008",
"000009",
"00000a",
"00000b",
"00000c",
"00000d"
],
"locale":"en-US",
"max_columns":14,
"max_rows":32561,
"name":"my dataset",
"private":true,
"project":null,
"resource":"sample/54d9c6f4f0a5ea0b1600003a",
"rows":2,
"sample":{
"fields":[
{
"column_number":0,
"datatype":"int8",
"id":"000000",
"input_column":0,
"name":"age",
"optype":"numeric",
"order":0,
"preferred":true,
"summary":{
"bins":[
[
18.75643,
2410
],
[
21.51515,
1485
],
[
23.47642,
1675
],
[
25.48278,
1626
],
[
27.5094,
1702
],
[
29.51434,
1674
],
[
31.48252,
1716
],
[
33.50312,
1761
],
[
35.5062,
1774
],
[
37.4908,
1685
],
[
39.49317,
1610
],
[
41.49118,
1588
],
[
43.48461,
1494
],
[
46.38942,
2722
],
[
50.4325,
2252
],
[
53.47213,
879
],
[
55.46624,
785
],
[
57.50552,
724
],
[
59.46777,
667
],
[
61.46237,
558
],
[
63.47489,
438
],
[
65.45732,
328
],
[
67.4428,
271
],
[
69.45178,
197
],
[
71.48201,
139
],
[
73.44348,
115
],
[
75.50549,
91
],
[
77.44231,
52
],
[
80.28947,
76
],
[
83.95,
20
],
[
87.75,
4
],
[
90,
43
]
],
"maximum":90,
"mean":38.58165,
"median":37.03324,
"minimum":17,
"missing_count":0,
"population":32561,
"splits":[
18.58199,
20.00208,
21.38779,
22.6937,
23.89609,
25.137,
26.40151,
27.62339,
28.8206,
30.03925,
31.20051,
32.40167,
33.57212,
34.72468,
35.87617,
37.03324,
38.24651,
39.49294,
40.76573,
42.0444,
43.3639,
44.75256,
46.13703,
47.60107,
49.39145,
51.09725,
53.14627,
55.56526,
58.35547,
61.50785,
66.43583
],
"standard_deviation":13.64043,
"sum":1256257,
"sum_squares":54526623,
"variance":186.0614
}
},
{
"column_number":1,
"datatype":"string",
"id":"000001",
"input_column":1,
"name":"workclass",
"optype":"categorical",
"order":1,
"preferred":true,
"summary":{
"categories":[
[
"Private",
22696
],
[
"Self-emp-not-inc",
2541
],
[
"Local-gov",
2093
],
[
"State-gov",
1298
],
[
"Self-emp-inc",
1116
],
[
"Federal-gov",
960
],
[
"Without-pay",
14
],
[
"Never-worked",
7
]
],
"missing_count":1836
},
"term_analysis":{
"enabled":true
}
}
],
"rows":[
[
48,
"Private",
"HS-grad",
9,
"Divorced",
"Transport-moving",
"Not-in-family",
"White",
"Male",
0,
0,
65,
"United-States",
"<=50K"
],
[
71,
"Private",
"9th",
5,
"Married-civ-spouse",
"Other-service",
"Husband",
"White",
"Male",
0,
0,
40,
"United-States",
"<=50K"
]
]
},
"seed":"0493a6f8ca7aeb2aaccca22560e4b8cb",
"size":3292068,
"status":{
"code":5,
"elapsed":1,
"message":"The sample has been created",
"progress":1
},
"subscription":false,
"tags":[
"potential customers",
"2015"
],
"updated":"2015-02-03T18:21:14.537000"
}
< Example sample JSON response
Filtering and Paginating Fields from a Sample
A sample might be composed of hundreds or even thousands of fields. Thus when retrieving a sample, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the sample is the same as the one you would get without any of the filtering parameters above.
The fields_meta field can help you paginate fields: it specifies the total number of fields, the current offset and limit, and the number of fields (count) returned.
Filtering Rows from a Sample
A sample might be composed of thousands or even millions of rows. Thus, when retrieving a sample, it's possible to specify that only a subset of rows be retrieved by using any combination of the following parameters in the query string (unrecognized parameters are ignored). BigML will never return more than 1000 rows in the same response. However, you can send additional requests to get different random samples.
Parameter | Type | Description |
---|---|---|
!field=
optional |
Blank |
With field the identifier of a field, select only those rows where field is not missing (i.e., it has a definite value).
Example:
|
!field=from,to
optional |
List |
With field the identifier of a numeric field, returns the values not in the specified interval. As with inclusion, it's possible to include or exclude the boundaries of the specified interval using square or round brackets
Example:
|
!field=value
optional |
List |
With field the identifier of a numeric field, returns rows for which the field doesn't equal that value.
Example:
|
!field=value1&!field=value2&...
optional |
String |
With field the identifier of a categorical field, select only those rows with the value of that field not one of the provided categories (when the parameter is repeated).
Example:
|
field=
optional |
Blank |
With field the identifier of a field, select only those rows where field is missing.
Example:
|
field=from,to
optional |
List |
With field the identifier of a numeric field and from, to optional numbers, specifies a filter for the numeric values of that field in the range [from, to]. One of the limits can be omitted.
Example:
|
field=value
optional |
List |
With field the identifier of a numeric field, returns rows for which the field equals that value.
Example:
|
field=value1&field=value2&...
optional |
String |
With field the identifier of a categorical field, select only those rows with the value of that field one of the provided categories (when the parameter is repeated).
Example:
|
index
optional |
Boolean |
When set to true, every returned row will have a first extra value which is the absolute row number, i.e., a unique row identifier. This can be useful, for instance, when you're performing various GET requests and want to compute the union of the returned regions.
Example: index=true |
mode
optional |
String |
One amongst deterministic, random, or linear. The way we sample the resulting rows, if needed; random means a random sample, deterministic is also random but uses a fixed seed so that it's repeatable, and linear means that BigML just returns the first rows after filtering, up to the requested number. Defaults to "deterministic".
Example: mode=random |
occurrence
optional |
Boolean |
When set to true, each row is prepended with a value denoting the number of times that row was present in the sample. You'll want this only when unique is set to true; otherwise all those extra values will be equal to 1. When index is also set to true (see above), the multiplicity column is added after the row index.
Example: occurrence=true |
precision
optional |
Integer |
The number of decimal digits to keep in the returned values, for fields of type float or double. For instance, if you set precision=0, all returned numeric values will be truncated to their integral part.
Example: precision=2 |
row_fields
optional |
List |
You can provide a list of field identifiers to be present in the sample's rows, specifying which ones you actually want to see and in which order.
Example: row_fields=000000,000002 |
row_offset
optional |
Integer |
Skip the given number of rows. Useful when paginating over the sample in linear mode.
Example: row_offset=300 |
row_order_by
optional |
String |
A field that causes the returned rows to be sorted by the value of the given field, in ascending order or, when the - prefix is used, in descending order.
Example: row_order_by=-000000 |
rows
optional |
Integer |
The total number of rows to be returned; if this is less than the number resulting from the rest of the filter parameters, the latter will be sampled according to mode.
Example: rows=300 |
seed
optional |
String |
When mode is random, you can specify your own seed in this parameter; otherwise, we choose it at random, and return the value we've used in the body of the response: that way you can make a random sampling deterministic if you happen to like a particular result.
Example: seed=mysample |
stat_field
optional |
String |
A field_id that corresponds to the identifier of a numeric field will cause the answer to include the Pearson's and Spearman's correlations, and linear regression terms of this field with all other numeric fields in the sample. Those values will be returned in maps keyed by "other" field id and named spearman_correlations, pearson_correlations, slopes, and intercepts.
Example: stat_field=000000 |
stat_fields
optional |
String |
Two field_ids that correspond to the identifiers of numeric fields will cause the answer to include the Pearson's and Spearman's correlations, and linear regression terms between the two fields. Those values will be returned in fields named spearman_correlation, pearson_correlation, slope, and intercept.
Example: stat_fields=000000,000003 |
unique
optional |
Boolean |
When set to true, repeated rows will be removed from the sample.
Example: unique=true |
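Putting a few of these parameters together, the following sketch asks for ten random rows, restricted to two fields and to rows whose first field lies in a numeric interval (the sample id and field ids are hypothetical):
curl "https://bigml.io/andromeda/sample/54d9c6f4f0a5ea0b1600003a?$BIGML_AUTH&rows=10&mode=random&row_fields=000000,000001&000000=20,40"
$ Retrieving a filtered subset of rows from a sample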
Updating a Sample
To update a sample, you need to PUT an object containing the fields that you want to update to the sample's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return with an HTTP 202 response with the updated sample.
For example, to update a sample with a new name you can use curl like this:
curl "https://bigml.io/andromeda/sample/54d9c6f4f0a5ea0b1600003a?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating a sample's name
Deleting a Sample
To delete a sample, you need to issue a HTTP DELETE request to the sample/id to be deleted.
Using curl you can do something like this to delete a sample:
curl -X DELETE "https://bigml.io/andromeda/sample/54d9c6f4f0a5ea0b1600003a?$BIGML_AUTH"
$ Deleting a sample from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a sample, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a sample a second time, or a sample that does not exist, you will receive a "404 not found" response.
However, if you try to delete a sample that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Samples
To list all the samples, you can use the sample base URL. By default, only the 20 most recent samples will be returned. You can see below how to change this number using the limit parameter.
You can get your list of samples directly in your browser using your own username and API key with the following links.
https://bigml.io/andromeda/sample?$BIGML_AUTH
> Listing samples from a browser
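From the command line, and assuming the limit parameter described below, a sketch to list only your five most recent samples could be:
curl "https://bigml.io/andromeda/sample?$BIGML_AUTH;limit=5"
$ Listing the five most recent samples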
Correlations
Last Updated: Thursday, 2020-10-08 20:05
A correlation resource allows you to compute advanced statistics for the fields in your dataset by applying various exploratory data analysis techniques to compare the distributions of the fields in your dataset against an objective_field.
BigML.io allows you to create, retrieve, update, delete your correlation. You can also list all of your correlations.
Jump to:
- Correlation Base URL
- Creating a Correlation
- Correlation Arguments
- Retrieving a Correlation
- Correlation Properties
- Filtering and Paginating Fields from a Correlation
- Updating a Correlation
- Deleting a Correlation
- Listing Correlations
Correlation Base URL
You can use the following base URL to create, retrieve, update, and delete correlations. https://bigml.io/andromeda/correlation
Correlation base URL
All requests to manage your correlations must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Correlation
To create a new correlation, you need to POST to the correlation base URL an object containing at least the dataset/id that you want to use to create the correlation. The content-type must always be "application/json".
You can easily create a new correlation using curl as follows. All you need is a valid dataset/id and your authentication variable set up as shown above.
curl "https://bigml.io/andromeda/correlation?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/54222a14f0a5eaaab000000c"}'
> Creating a correlation
BigML.io will return the newly created correlation if the request succeeded.
{
"category": 0,
"clones": 0,
"code": 201,
"columns": 0,
"correlations": null,
"created": "2015-06-23T21:45:24.002925",
"credits": 15.161365509033203,
"dataset": "dataset/55806fc2545e5f09b400002b",
"dataset_field_types": {
"categorical": 9,
"datetime": 0,
"numeric": 6,
"preferred": 14,
"text": 0,
"total": 15
},
"dataset_status": true,
"dataset_type": 0,
"description": "",
"excluded_fields": [ ],
"fields_meta": {
"count": 0,
"limit": 1000,
"offset": 0,
"total": 0
},
"input_fields": [ ],
"locale": "en-US",
"max_columns": 15,
"max_rows": 32561,
"name": "adult's dataset correlation",
"objective_field": "000000",
"out_of_bag": false,
"price": 0,
"private": true,
"project": "project/5578cfd8545e5f6a17000000",
"range": [
1,
32561
],
"replacement": false,
"resource": "correlation/5589d374545e5f37fa000000",
"rows": 32561,
"sample_rate": 1,
"shared": false,
"size": 3974461,
"source": "source/5578d034545e5f6a17000006",
"source_status": true,
"status": {
"code": 1,
"message": "The correlation is being processed and will be created soon"
},
"subscription": false,
"tags": [ ],
"updated": "2015-06-23T21:45:24.003040",
"white_box": false
}
< Example correlation JSON response
Correlation Arguments
In addition to the dataset, you can also POST the following arguments.
Argument | Type | Description |
---|---|---|
categories
optional |
Object, default is {}, an empty dictionary. That is no categories are specified. |
A dictionary between input field id and an array of categories to limit the analysis to. Each array must contain 2 or more unique and valid categories in string format. If omitted, each categorical field is limited to its 100 most frequent categorical values. This argument has no impact on input fields that are non-categorical.
Example:
|
category
optional |
Integer, default is the category of the dataset |
The category that best describes the correlation. See the category codes for the complete list of categories.
Example: 1 |
dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
datasets
optional |
Array |
A list of dataset ids or objects to be used to build the new correlation. See the Section on Multi-Datasets and Section on Resources Accepting Multi-Datasets Input for more details.
Example:
|
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
description
optional |
String |
A description of the correlation up to 8192 characters long.
Example: "This is a description of my new correlation" |
discretization | Object | Global numeric field transformation parameters. See the discretization table below. |
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields that won't be included in the correlation.
Example:
|
field_discretizations | Object | Per-field numeric field transformation parameters, taking precedence over discretization. See the field_discretizations table below. |
fields
optional |
Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed. |
This can be used to change the names of the fields in the correlation with respect to the original names in the dataset or to tell BigML that certain fields should be preferred. An entry keyed with the field id generated in the source for each field that you want the name updated.
Example:
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields to be considered to create the correlation.
Example:
|
name
optional |
String, default is dataset's name |
The name you want to give to the new correlation.
Example: "my new correlation" |
objective_field
optional |
String, default is dataset's pre-defined objective field |
The id of the field to be used as the objective for correlation tests.
Example: "000001" |
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
project
optional |
String |
The project/id you want the correlation to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
range
optional |
Array, default is [1, max rows in the dataset] |
The range of successive instances to build the correlation.
Example: [1, 150] |
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true |
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5 |
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
shared_hash | String |
The shared hash of the shared model to be cloned. Set deep to true to clone the dataset used to build the correlation too. Note that the dataset can be cloned only if it is already shared and set clonable. If multiple datasets have been used to create the correlation, only the first dataset will be cloned.
Example: "kpY46mNuNVReITw0Z1mAqoQ9ySW" |
significance_levels
optional |
Array, default is [0.01, 0.05, 0.1] |
An array of significance levels between 0 and 1 to test against p_values.
Example: [0.01, 0.025, 0.05, 0.075, 0.1] |
tags
optional |
Array of Strings |
A list of strings that help classify and index your correlation.
Example: ["best customers", "2018"] |
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
Discretization is used to transform numeric input fields to categoricals before further processing. It is applied globally to all input fields. A Discretization object is composed of any combination of the following properties.
For example, let's say type is set to "width", size is 7, trim is 0.05, and pretty is false. This requests that numeric input fields be discretized into 7 bins of equal width, trimming the outer 5% of counts, and not rounding bin boundaries.
Field Discretizations is also used to transform numeric input fields to categoricals before further processing. However, it allows the user to specify parameters on a per field basis, taking precedence over the global discretization. It is a map whose keys are field ids and whose values are maps with the same format as discretization. It also accepts edges, which is a numeric array manually specifying edge boundary locations. If this parameter is present, the corresponding field will be discretized according to those defined bins, and the remaining discretization parameters will be ignored. The maximum value of the field's distribution is automatically set as the last value in the edges array. A value object of a Field Discretizations object is composed of any combination of the following properties.
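As a sketch, the width-based discretization described above, together with a per-field override using edges, might be requested like this (the dataset id and field id are hypothetical):
curl "https://bigml.io/andromeda/correlation?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"dataset": "dataset/54222a14f0a5eaaab000000c",
        "discretization": {"type": "width", "size": 7, "trim": 0.05, "pretty": false},
        "field_discretizations": {"000002": {"edges": [10, 20, 30, 40]}}}'
> Creating a correlation with discretization parameters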
You can also use curl to customize a new correlation. For example, to create a new correlation named "my correlation", with only certain rows, and with only three fields:
curl "https://bigml.io/andromeda/correlation?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/54222a14f0a5eaaab000000c",
"objective_field": "000001",
"input_fields": ["000001", "000002", "000003"],
"name": "my correlation",
"range": [25, 125]}'
> Creating customized correlation
If you do not specify a name, BigML.io will assign to the new correlation the dataset's name. If you do not specify a range of instances, BigML.io will use all the instances in the dataset. If you do not specify any input fields, BigML.io will include all the input fields in the dataset.
Read the Section on Sampling Your Dataset to learn how to sample your dataset. Here's an example of a correlation request with range and sampling specifications:
curl "https://bigml.io/andromeda/correlation?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/505f43223c1920eccc000297",
"range": [1, 5000],
"sample_rate": 0.5}'
> Creating a correlation using sampling
Retrieving a Correlation
Each correlation has a unique identifier in the form "correlation/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the correlation.
To retrieve a correlation with curl:
curl "https://bigml.io/andromeda/correlation/5589d374545e5f37fa000000?$BIGML_AUTH"
$ Retrieving a correlation from the command line
Correlation Properties
Once a correlation has been successfully created it will have the following properties.
Property | Type | Description |
---|---|---|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
code | Integer | HTTP status code. This will be 201 upon successful creation of the correlation and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the correlation creation has been completed without errors. |
columns
filterable, sortable |
Integer | The number of fields in the correlation. |
correlations | Object | All the information that you need to recreate the correlation. It includes the field's dictionary describing the fields and their summaries, and the correlations. See the Correlations Object definition below. |
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the correlation was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
credits
filterable, sortable |
Float | The number of credits it cost you to create this correlation. |
dataset
filterable, sortable |
String | The dataset/id that was used to build the correlation. |
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
datasets | Array | A list of dataset ids or objects used to build the correlation. |
description
updatable |
String | A text describing the correlation. It can contain restricted markdown to decorate the text. |
excluded_fields | Array | The list of field ids that were excluded to build the correlation. |
fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned. |
input_fields | Array | The list of input fields' ids used to build the models of the correlation. |
locale | String | The dataset's locale. |
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the correlation. |
max_rows
filterable, sortable |
Integer | The maximum number of instances in the dataset that can be used to build the correlation. |
name
filterable, sortable, updatable |
String | The name of the correlation as you provided or based on the name of the dataset by default. |
objective_field |
String, default is dataset's pre-defined objective field |
The id of the field to be used as the objective for a correlations test.
Example: "000001" |
objective_field_details | Object | The details of the objective fields. See the Objective Field Details. |
out_of_bag
filterable, sortable |
Boolean | Whether the out-of-bag instances were used to create the correlation instead of the sampled instances. |
price
filterable, sortable, updatable |
Float | The price other users must pay to clone your correlation. |
private
filterable, sortable, updatable |
Boolean | Whether the correlation is public or not. |
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
range | Array | The range of instances used to build the correlation. |
replacement
filterable, sortable |
Boolean | Whether the instances sampled to build the correlation were selected using replacement or not. |
resource | String | The correlation/id. |
rows
filterable, sortable |
Integer | The total number of instances used to build the correlation. |
sample_rate
filterable, sortable |
Float | The sample rate used to select instances from the dataset to build the correlation. |
seed
filterable, sortable |
String | The string that was used to generate the sample. |
shared
filterable, sortable, updatable |
Boolean | Whether the correlation is shared using a private link or not. |
shared_clonable
filterable, sortable, updatable |
Boolean | Whether the shared correlation can be cloned or not. |
shared_hash | String | The hash that gives access to this correlation if it has been shared using a private link. |
sharing_key | String | The alternative key that gives read access to this correlation. |
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this correlation. |
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
status | Object | A description of the status of the correlation. It includes a code, a message, and some extra information. See the table below. |
subscription
filterable, sortable |
Boolean | Whether the correlation was created using a subscription plan or not. |
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the correlation was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
white_box
filterable, sortable |
Boolean | Whether the correlation is publicly shared as a white-box. |
The Correlations Object of a correlation has the following properties. Some correlation results will contain a p-value and a significant boolean array, indicating whether the p_value is less than the provided significance_levels (by default, [0.01, 0.05, 0.10] is used if not provided). If the p-value is greater than the accepted significance level, then the test fails to reject the null hypothesis, meaning there is no statistically significant difference between the treatment groups. For example, if the significance levels are [0.01, 0.025, 0.05, 0.075, 0.1] and the p-value is 0.05, then significant is [false, false, false, true, true].
Property | Type | Description |
---|---|---|
categories | Object | A dictionary between input field id and arrays of category names selected for correlations. |
correlations | Array | Correlation results. See Correlation Results Object. |
fields | Object | A dictionary with an entry per field in the dataset used to build the test. Fields are paginated according to the field_meta attribute. Each entry includes the column number in original source, the name of the field, the type of the field, and the summary. See this Section for more details. |
significance_levels | Array | An array of user provided significance levels to test against p_values. |
The Correlation Results Object has the following properties.
Property | Type | Description |
---|---|---|
name | String | Name of the correlation. Available values are coefficients, contingency_tables, and one_way_anova. |
result | Object | A correlation result which is a dictionary between field ids and the result. The type of result object varies based on the name of the correlation. When name is coefficients, it returns Coefficients Result Object, when contingency_tables, Contingency Tables Result Object, and when one_way_anova, One-way ANOVA Result Object. |
The Coefficients Result Object contains the correlation measures between objective_field and each of the input_fields when the two fields are numeric-numeric pairs. It has the following properties:
Property | Type | Description |
---|---|---|
pearson | Float | A measure of the linear correlation between two variables, giving a value between +1 and -1, where 1 is total positive correlation, 0 is no correlation, and -1 is total negative correlation. See Pearson's correlation coefficients for more information. |
pearson_p_value | Float |
A function used in the context of null hypothesis testing for pearson correlations in order to quantify the idea of statistical significance of evidence.
Example: 0.015 |
spearman | Float | A nonparametric measure of statistical dependence between two variables (nonparametric meaning that the parameters are determined by the training data rather than by the model, so the number of parameters grows with the amount of training data). It assesses how well the relationship between two variables can be described using a monotonic function. If there are no repeated data values, a perfect correlation of +1 or -1 occurs when each of the variables is a perfect monotone function of the other. See Spearman's correlation coefficients for more information. |
spearman_p_value | Float |
A function used in the context of null hypothesis testing for spearman correlations in order to quantify the idea of statistical significance of evidence.
Example: 0.015 |
The Contingency Tables Result Object contains the correlation measures between objective_field and each of the input_fields when the two fields are both categorical. It has the following properties:
Property | Type | Description |
---|---|---|
chi_square | Object | See Chi-Square Object. |
cramer | Float | A measure of association between two nominal variables. Its value ranges between 0 (no association between the variables) and 1 (complete association), and can reach 1 only when the two variables are equal to each other. It is based on Pearson's chi-squared statistic. See Cramer's V for more information. |
tschuprow | Float | A measure of association between two nominal variables. Its value ranges between 0 (no association between the variables) and 1 (complete association). It is closely related to Cramer's V, coinciding with it for square contingency tables. See Tschuprow's T for more information. |
two_way_table | Array |
Contingency Table as a nested row-major array with the frequency distribution of the variables. In other words, the table summarizes the distribution of values in the sample.
Example: [[2514, 362, 78, 38, 23], [889, 53, 39, 2, 1]] |
The Chi-Square Object contains the chi-square statistic used to investigate whether distributions of categorical variables differ from one another. This test is used to compare a collection of categorical data with some theoretical expected distribution. The object has the following properties.
The One-way ANOVA Result Object contains correlation measures between objective_field and each of the input_fields when the two fields are categorical-numerical pairs. ANOVA is used to compare the means of numerical data samples. The ANOVA tests the null hypothesis that samples in two or more groups are drawn from populations with the same mean values. See One-way Analysis of Variance for more information. The object has the following properties:
Property | Type | Description |
---|---|---|
eta_square | Float | A measure of effect size, a measure of the strength of the relationship between two variables, for use in ANOVA. Its value ranges between 0 and 1. A rule of thumb is: 0.02 ~ small, 0.13 ~ medium, and 0.26 ~ large. See eta-squared for more information. |
f_ratio | Float | The value of the F statistic, which is used to assess whether the expected values of a quantitative variable within several pre-defined groups differ from each other. It is the ratio of the variance calculated among the means to the variance within the samples. |
p_value | Float |
A function used in the context of null hypothesis testing in order to quantify the idea of statistical significance of evidence.
Example: 0.015 |
significant | Array |
A boolean array indicating whether the test produced a significant result at each of the significance_levels. If the p_value is less than the significance_level, the result is significant. The default significance_levels are [0.01, 0.05, 0.1].
Example: [false, true, true] |
An Objective Field Details Object has the following properties.
Correlation Status
Creating a correlation is a process that can take just a few seconds or a few days depending on the size of the dataset used as input and on the workload of BigML's systems. The correlation goes through a number of states until it is fully completed. Through the status field in the correlation you can determine when the correlation has been fully processed and is ready to be used. These are the properties of a correlation's status:
Property | Type | Description |
---|---|---|
code | Integer | A status code that reflects the status of the correlation creation. It can be any of those that are explained here. |
elapsed | Integer | Number of milliseconds that BigML.io took to process the correlation. |
message | String | A human readable message explaining the status. |
progress | Float, between 0 and 1 | How far BigML.io has progressed building the correlation. |
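Since creation is asynchronous, you typically re-fetch the correlation until its status code reaches 5, as in the example response below. A minimal polling sketch; the grep pattern is a rough assumption about the JSON layout, not a robust parser:
# Poll until the correlation's status code is 5 (finished), then stop.
CORRELATION="correlation/5589d374545e5f37fa000000"
until curl -s "https://bigml.io/andromeda/$CORRELATION?$BIGML_AUTH" | grep -q '"code": 5'; do
  sleep 2
done
$ Waiting for a correlation to finish (sketch)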
Once a correlation has been successfully created, it will look like:
{
"category": 0,
"clones": 0,
"code": 200,
"columns": 14,
"correlations": {
"categories": {
"000003": [
"Bachelors",
"Some-college",
"HS-grad"
],
"000005": [
"Divorced",
"Separated",
"Widowed"
]
},
"correlations": [
{
"name": "coefficients",
"result": {
"000002": {
"pearson": -0.07665,
"pearson_p_value": 0,
"spearman": -0.07814,
"spearman_p_value": 0
},
"000004": { … },
"00000a": { … },
"00000b": { … },
"00000c": { … }
}
},
{
"name": "one_way_anova",
"result": {
"000001": {
"eta_square": 0.05254,
"f_ratio": 243.34988,
"p_value": 0,
"significant": [
true,
true
]
},
"000003": { … },
"000005": { … },
"000006": { … },
"000007": { … },
"000008": { … },
"000009": { … },
"00000e": { … }
}
}
],
"fields": { … },
"significance_levels": [
0.025,
0.01
]
},
"created": "2015-06-23T21:45:24.002000",
"credits": 15.161365509033203,
"dataset": "dataset/55806fc2545e5f09b400002b",
"dataset_status": true,
"dataset_type": 0,
"description": "",
"excluded_fields": [
],
"fields_meta": {
"count": 14,
"limit": 1000,
"offset": 0,
"query_total": 14,
"total": 14
},
"input_fields": [
"000001",
"000002",
"000003",
"000004",
"000005",
"000006",
"000007",
"000008",
"000009",
"00000a",
"00000b",
"00000c",
"00000e"
],
"locale": "en-US",
"max_columns": 15,
"max_rows": 32561,
"name": "Sample correlation",
"objective_field": "000000",
"objective_field_details": {
"column_number": 0,
"datatype": "int8",
"name": "age",
"optype": "numeric",
"order": 0
},
"out_of_bag": false,
"price": 0,
"private": true,
"project": "project/5578cfd8545e5f6a17000000",
"range": [
1,
32561
],
"replacement": false,
"resource": "correlation/5589d374545e5f37fa000000",
"rows": 32561,
"sample_rate": 1,
"shared": false,
"size": 3974461,
"source": "source/5578d034545e5f6a17000006",
"source_status": true,
"status": {
"code": 5,
"elapsed": 11504,
"message": "The correlation has been created",
"progress": 1
},
"subscription": false,
"tags": [
],
"updated": "2015-06-23T21:45:56.066000",
"white_box": false
}
< Example correlation JSON response
Filtering and Paginating Fields from a Correlation
A correlation might be composed of hundreds or even thousands of fields. Thus when retrieving a correlation, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the source is the same as the one you would get without any filtering parameter above.
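For illustration, a sketch of retrieving only a couple of the correlation's fields at a time; the limit and offset query-string parameters are assumptions based on the standard field-pagination parameters referenced above:
curl "https://bigml.io/andromeda/correlation/5589d374545e5f37fa000000?$BIGML_AUTH;limit=2;offset=0"
$ Retrieving a correlation with only the first two fields (sketch)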
The fields_meta field can help you paginate fields. Its structure is as follows:
Updating a Correlation
To update a correlation, you need to PUT an object containing the fields that you want to update to the correlation's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return with an HTTP 202 response with the updated correlation.
For example, to update correlation with a new name you can use curl like this:
curl "https://bigml.io/andromeda/correlation/5589d374545e5f37fa000000?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating a correlation's name
If you want to update correlation with a new label and description for a specific field you can use curl like this:
curl "https://bigml.io/andromeda/correlation/5589d374545e5f37fa000000?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"fields": {"000000": {
"label": "a longer name",
"description": "an even longer description"}}}'
$ Updating correlation's field,
label, and description
Deleting a Correlation
To delete a correlation, you need to issue a HTTP DELETE request to the correlation/id to be deleted.
Using curl you can do something like this to delete a correlation:
curl -X DELETE "https://bigml.io/andromeda/correlation/5589d374545e5f37fa000000?$BIGML_AUTH"
$ Deleting a correlation from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a correlation, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a correlation a second time, or a correlation that does not exist, you will receive a "404 not found" response.
However, if you try to delete a correlation that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Correlations
To list all the correlations, you can use the correlation base URL. By default, only the 20 most recent correlations will be returned. You can see below how to change this number using the limit parameter.
You can get your list of correlations directly in your browser using your own username and API key with the following links.
https://bigml.io/andromeda/correlation?$BIGML_AUTH
> Listing correlations from a browser
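For example, a minimal sketch of listing only the 5 most recent correlations from the command line using the limit parameter mentioned above:
curl "https://bigml.io/andromeda/correlation?$BIGML_AUTH;limit=5"
$ Listing the 5 most recent correlations (sketch)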
Statistical Tests
Last Updated: Thursday, 2020-10-08 20:05
A statistical test resource automatically runs some advanced statistical tests on the numeric fields of a dataset. The goal of these tests is to check whether the values of individual fields conform to or differ from some distribution patterns. Statistical tests are useful in tasks such as fraud, normality, or outlier detection.
The tests are grouped in the following three categories:
-
Fraud Detection Tests:
- Benford: This statistical test performs a comparison of the distribution of first significant digits (FSDs) of each value of the field to Benford's law distribution. Benford's law applies to numerical distributions spanning several orders of magnitude, such as the values found on financial balance sheets. It states that the frequency distribution of leading, or first significant, digits (FSDs) in such distributions is not uniform. On the contrary, lower digits like 1 and 2 occur disproportionately often as leading significant digits. The test compares the distribution in the field to Benford's distribution using a Chi-square goodness-of-fit test and a Cho-Gaines d test. If a field has a dissimilar distribution, it may contain anomalous or fraudulent values.
-
Normality tests: These tests can be used to confirm the assumption that the data in each field of a dataset is distributed
according to a normal distribution. The results are relevant because many statistical and machine learning techniques rely on this assumption.
- Anderson-Darling: The Anderson-Darling test computes a test statistic based on the difference between the observed cumulative distribution function (CDF) to that of a normal distribution. A significant result indicates that the assumption of normality is rejected.
- Jarque-Bera: The Jarque-Bera test computes a test statistic based on the third and fourth central moments (skewness and kurtosis) of the data. Again, a significant result indicates that the normality assumption is rejected.
- Z-score: For a given sample size, the maximum deviation from the mean that would be expected in a sampling of a normal distribution can be computed based on the 68-95-99.7 rule. This test simply reports this expected deviation and the actual deviation observed in the data, as a sort of sanity check.
-
Outlier tests:
- Grubbs: When the values of a field are normally distributed, a few values may still deviate from the mean distribution. The outlier test reports whether at least one value in each numeric field differs significantly from the mean, using Grubbs' test for outliers. If an outlier is found, then its value will be returned.
Note that both the number of tests within each category and the categories may increase in the near future.
BigML.io allows you to create, retrieve, update, delete your statistical test. You can also list all of your statistical tests.
Jump to:
- Statistical Test Base URL
- Creating a Statistical Test
- Statistical Test Arguments
- Retrieving a Statistical Test
- Statistical Test Properties
- Filtering and Paginating Fields from a Statistical Test
- Updating a Statistical Test
- Deleting a Statistical Test
- Listing Statistical Tests
Statistical Test Base URL
You can use the following base URL to create, retrieve, update, and delete statistical tests. https://bigml.io/andromeda/statisticaltest
Statistical Test base URL
All requests to manage your statistical tests must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Statistical Test
To create a new statistical test, you need to POST to the statistical test base URL an object containing at least the dataset/id that you want to use to create the statistical test. The content-type must always be "application/json".
You can easily create a new statistical test using curl as follows. All you need is a valid dataset/id and your authentication variable set up as shown above.
curl "https://bigml.io/andromeda/statisticaltest?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/54222a14f0a5eaaab000000c"}'
> Creating a statistical test
BigML.io will return the newly created statistical test if the request succeeded.
{
"category": 0,
"clones": 0,
"code": 201,
"columns": 0,
"created": "2015-06-23T06:14:49.583473",
"credits": 0.09991455078125,
"dataset": "dataset/5579abc3545e5f4f8a000000",
"dataset_field_types": {
"categorical": 1,
"datetime": 0,
"numeric": 8,
"preferred": 9,
"text": 0,
"total": 9
},
"dataset_status": true,
"dataset_type": 0,
"description": "",
"excluded_fields": [ ],
"fields_meta": {
"count": 0,
"limit": 1000,
"offset": 0,
"total": 0
},
"input_fields": [ ],
"locale": "en-US",
"max_columns": 9,
"max_rows": 768,
"name": "Diabetes (all numeric) test",
"out_of_bag": false,
"price": 0,
"private": true,
"project": "project/5578cfd8545e5f6a17000000",
"range": [
1,
768
],
"replacement": false,
"resource": "statisticaltest/5588f959545e5fdc1e000007",
"rows": 768,
"sample_rate": 1,
"shared": false,
"size": 26192,
"source": "source/5578d077545e5f6a17000011",
"source_status": true,
"statistical_tests": null,
"status": {
"code": 1,
"message": "The statistical test is being processed and will be created soon"
},
"subscription": false,
"tags": [ ],
"updated": "2015-06-23T06:14:49.583623",
"white_box": false
}
< Example statistical test JSON response
Statistical Test Arguments
In addition to the dataset, you can also POST the following arguments.
Argument | Type | Description |
---|---|---|
ad_sample_size
optional |
Integer, default is 1024 |
The Anderson-Darling normality test is computed from a sample from the values of each field. This parameter specifies the number of samples to be used during the normality test. If not given, defaults to 1024.
Example: 128 |
ad_seed
optional |
String |
A string to be hashed to generate deterministic samples for the Anderson-Darling normality test.
Example: "MyADSeed" |
category
optional |
Integer, default is the category of the dataset |
The category that best describes the statistical test. See the category codes for the complete list of categories.
Example: 1 |
dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
datasets
optional |
Array |
A list of dataset ids or objects to be used to build the new statistical test. See the Section on Multi-Datasets and Section on Resources Accepting Multi-Datasets Input for more details.
Example:
|
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
description
optional |
String |
A description of the statistical test up to 8192 characters long.
Example: "This is a description of my new statistical test" |
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields that won't be included in the statistical test.
Example:
|
fields
optional |
Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed. |
This can be used to change the names of the fields in the statistical test with respect to the original names in the dataset or to tell BigML that certain fields should be preferred. An entry keyed with the field id generated in the source for each field that you want the name updated.
Example:
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields to be considered to create the statistical test.
Example:
|
name
optional |
String, default is dataset's name |
The name you want to give to the new statistical test.
Example: "my new statistical test" |
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
project
optional |
String |
The project/id you want the statistical test to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
range
optional |
Array, default is [1, max rows in the dataset] |
The range of successive instances to build the statistical test.
Example: [1, 150] |
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true |
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5 |
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
shared_hash | String |
The shared hash of the shared model to be cloned. Set deep to true to clone the dataset used to build the statistical test too. Note that the dataset can be cloned only if it is already shared and set clonable. If multiple datasets have been used to create the statistical test, only the first dataset will be cloned.
Example: "kpY46mNuNVReITw0Z1mAqoQ9ySW" |
significance_levels
optional |
Array, default is [0.01, 0.05, 0.1] |
An array of significance levels between 0 and 1 to test against p_values.
Example: [0.01, 0.025, 0.05, 0.075, 0.1] |
tags
optional |
Array of Strings |
A list of strings that help classify and index your statistical test.
Example: ["best customers", "2018"] |
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can also use curl to customize a new statistical test. For example, to create a new statistical test named "my statistical test", with only certain rows, and with only three fields:
curl "https://bigml.io/andromeda/statisticaltest?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/54222a14f0a5eaaab000000c",
"input_fields": ["000001", "000002", "000003"],
"name": "my statistical test",
"range": [25, 125]}'
> Creating a customized statistical test
If you do not specify a name, BigML.io will assign to the new statistical test the dataset's name. If you do not specify a range of instances, BigML.io will use all the instances in the dataset. If you do not specify any input fields, BigML.io will include all the input fields in the dataset.
Read the Section on Sampling Your Dataset to learn how to sample your dataset. Here's an example of a statistical test request with range and sampling specifications:
curl "https://bigml.io/andromeda/statisticaltest?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/505f43223c1920eccc000297",
"range": [1, 5000],
"sample_rate": 0.5}'
> Creating a statistical test using sampling
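The Anderson-Darling sampling arguments and significance levels from the table above can be combined in the same request. A sketch, reusing the dataset id from the earlier examples and the values shown in the sample response below:
curl "https://bigml.io/andromeda/statisticaltest?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/54222a14f0a5eaaab000000c",
"ad_sample_size": 2048,
"ad_seed": "MyADSeed",
"significance_levels": [0.025, 0.01]}'
> Creating a statistical test with Anderson-Darling options (sketch)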
Retrieving a Statistical Test
Each statistical test has a unique identifier in the form "statisticaltest/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the statistical test.
To retrieve a statistical test with curl:
curl "https://bigml.io/andromeda/statisticaltest/5588f959545e5fdc1e000007?$BIGML_AUTH"
$ Retrieving a statistical test from the command line
Statistical Test Properties
Once a statistical test has been successfully created it will have the following properties.
Property | Type | Description |
---|---|---|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
code | Integer | HTTP status code. This will be 201 upon successful creation of the statistical test and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the statistical test creation has been completed without errors. |
columns
filterable, sortable |
Integer | The number of fields in the statistical test. |
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the statistical test was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
credits
filterable, sortable |
Float | The number of credits it cost you to create this statistical test. |
dataset
filterable, sortable |
String | The dataset/id that was used to build the statistical test. |
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
datasets | Array | A list of dataset ids or objects used to build the statistical test. |
description
updatable |
String | A text describing the statistical test. It can contain restricted markdown to decorate the text. |
excluded_fields | Array | The list of field ids that were excluded to build the statistical test. |
fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned. |
input_fields | Array | The list of input fields' ids used to build the models of the statistical test. |
locale | String | The dataset's locale. |
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the statistical test. |
max_rows
filterable, sortable |
Integer | The maximum number of instances in the dataset that can be used to build the statistical test. |
name
filterable, sortable, updatable |
String | The name of the statistical test as you provided or based on the name of the dataset by default. |
out_of_bag
filterable, sortable |
Boolean | Whether the out-of-bag instances were used to create the statistical test instead of the sampled instances. |
price
filterable, sortable, updatable |
Float | The price other users must pay to clone your statistical test. |
private
filterable, sortable, updatable |
Boolean | Whether the statistical test is public or not. |
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
range | Array | The range of instances used to build the statistical test. |
replacement
filterable, sortable |
Boolean | Whether the instances sampled to build the statistical test were selected using replacement or not. |
resource | String | The statisticaltest/id. |
rows
filterable, sortable |
Integer | The total number of instances used to build the statistical test. |
sample_rate
filterable, sortable |
Float | The sample rate used to select instances from the dataset to build the statistical test. |
seed
filterable, sortable |
String | The string that was used to generate the sample. |
shared
filterable, sortable, updatable |
Boolean | Whether the statistical test is shared using a private link or not. |
shared_clonable
filterable, sortable, updatable |
Boolean | Whether the shared statistical test can be cloned or not. |
shared_hash | String | The hash that gives access to this statistical test if it has been shared using a private link. |
sharing_key | String | The alternative key that gives read access to this statistical test. |
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this statistical test. |
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
statistical_tests | Object | All the information that you need to recreate the statistical test. It includes the field's dictionary describing the fields and their summaries, and the statistical tests. See the Statistical Tests Object definition below. |
status | Object | A description of the status of the statistical test. It includes a code, a message, and some extra information. See the table below. |
subscription
filterable, sortable |
Boolean | Whether the statistical test was created using a subscription plan or not. |
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the statistical test was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
white_box
filterable, sortable |
Boolean | Whether the statistical test is publicly shared as a white-box. |
The Statistical Tests Object of a statistical test has the following properties. Many statistical tests will contain a p-value and a significant boolean array, indicating whether the p_value is less than the provided significance_levels (by default, [0.01, 0.05, 0.10] is used if not provided). If the p-value is greater than the accepted significance level, then the test fails to reject the null hypothesis, meaning there is no statistically significant difference between the treatment groups. For example, if the significance levels are [0.01, 0.025, 0.05, 0.075, 0.1] and the p-value is 0.05, then significant is [false, false, false, true, true].
Property | Type | Description |
---|---|---|
ad_sample_size | Integer | The sample test size used for the Anderson-Darling normality test |
ad_seed | String | A seed used to generate deterministic samples for the Anderson-Darling normality test. |
fields | Object | A dictionary with an entry per field in the dataset used to build the test. Fields are paginated according to the field_meta attribute. Each entry includes the column number in original source, the name of the field, the type of the field, and the summary. See this Section for more details. |
fraud | Array | An array of anomalous fields detection test results for each numeric field. See Fraud Object. |
normality | Array | An array of data normality test results for each numeric field. See Normality Object. |
outliers | Array | An array of outlier detection test results for each numeric field. See Outliers Object. |
significance_levels | Array | An array of user provided significance levels to test against p_values. |
The Fraud Object has the following properties.
Property | Type | Description |
---|---|---|
name | String | Name of the fraud test. Currently the only available value is benford. |
result | Object | A test result which is a dictionary between field ids and test result. The type of result object varies based on the name of the test. When name is benford, it returns Benford Result Object. |
The Benford Result Object has the following properties. Benford's Law is a simple yet powerful tool allowing quick screening of data for anomalies.
Property | Type | Description |
---|---|---|
chi_square | Object | See Chi-Square Object. |
cho_gaines | Object | See Cho-Gaines Object. |
distribution | Array |
The distribution of first significant digits (FSDs) of the field's values, compared to Benford's law distribution. For example, the FSD for 2015 is 2, and for 0.00609 it is 6. The array represents the number of occurrences of each digit from 1 to 9.
Example: [0, 0, 0, 22, 61, 54, 0, 0, 0] |
negatives | Integer | The number of negative values. |
zeros | Integer | The number of values exactly equal to 0. |
The Chi-Square Object contains the chi-square statistic used to investigate whether distributions of categorical variables differ from one another. This test is used to compare a collection of categorical data with some theoretical expected distribution. The object has the following properties.
The Cho-Gaines Object has the following properties.
Property | Type | Description |
---|---|---|
d_statistic | Float | A value based on the Euclidean distance from Benford's distribution in the 9-dimensional space occupied by any first-digit vector, used in the Cho-Gaines d test. |
significant | Array |
A boolean array indicating whether the test produced a significant result at each of the significance_levels. If the p_value is less than the significance_level, the result is significant. This test does not respect the values passed in significance_levels, but always uses [0.01, 0.05, 0.1].
Example: [false, true, true] |
The Normality Object has the following properties.
Property | Type | Description |
---|---|---|
name | String | Name of the normality test. Available values are anderson_darling, jarque_bera, and z_score. |
result | Object | A test result which is a dictionary between field ids and test result. The type of result object varies based on the name of the test. When name is anderson_darling, it returns Anderson-Darling Result Object, when jarque_bera, Jarque-Bera Result Object, and when z_score, Z-Score Result Object. |
The Anderson-Darling Result Object has the following properties. See Anderson-Darling Test for more information.
The Jarque-Bera Result Object has the following properties. See Jarque-Bera Test for more information.
The Z-Score Object has the following properties. A positive standard score indicates a datum above the mean, while a negative standard score indicates a datum below the mean. See z-score for more information.
Property | Type | Description |
---|---|---|
expected_max_z | Float | The expected maximum z-score for the sample size. |
max_z | Float | The maximum z-score. |
The Outliers Object has the following properties.
Property | Type | Description |
---|---|---|
name | String | Name of the outlier detection test. Currently the only available value is grubbs. |
result | Object | A test result which is a dictionary between field ids and test result. The type of result object varies based on the name of the test. When name is grubbs, it returns Grubbs Result Object. |
The Grubbs' Test for Outliers Result Object has the following properties. It computes a t-test based on the maximum deviation from the mean. A significant result indicates that at least one outlier is present in the data. If an outlier is found, its value is also returned. Note that this test assumes that the data are normally distributed. See Grubbs' test for outliers for more information.
Statistical Test Status
Creating a statistical test is a process that can take just a few seconds or a few days depending on the size of the dataset used as input and on the workload of BigML's systems. The statistical test goes through a number of states until it is fully completed. Through the status field in the statistical test you can determine when the test has been fully processed and is ready to be used. These are the properties of a statistical test's status:
Property | Type | Description |
---|---|---|
code | Integer | A status code that reflects the status of the statistical test creation. It can be any of those that are explained here. |
elapsed | Integer | Number of milliseconds that BigML.io took to process the statistical test. |
message | String | A human readable message explaining the status. |
progress | Float, between 0 and 1 | How far BigML.io has progressed building the statistical test. |
Once a statistical test has been successfully created, it will look like:
{
"category": 0,
"clones": 0,
"code": 200,
"columns": 9,
"created": "2015-06-23T06:14:49.583000",
"credits": 0.09991455078125,
"dataset": "dataset/5579abc3545e5f4f8a000000",
"dataset_status": true,
"dataset_type": 0,
"description": "",
"excluded_fields": [ ],
"fields_meta": {
"count": 9,
"limit": 1000,
"offset": 0,
"query_total": 9,
"total": 9
},
"input_fields": [
"000000",
"000001",
"000002",
"000003",
"000004",
"000005",
"000006",
"000007"
],
"locale": "en-US",
"max_columns": 9,
"max_rows": 768,
"name": "Diabetes test",
"out_of_bag": false,
"price": 0,
"private": true,
"project": "project/5578cfd8545e5f6a17000000",
"range": [
1,
768
],
"replacement": false,
"resource": "statisticaltest/5588f959545e5fdc1e000007",
"rows": 768,
"sample_rate": 1,
"shared": false,
"size": 26192,
"source": "source/5578d077545e5f6a17000011",
"source_status": true,
"statistical_tests": {
"ad_sample_size": 2048,
"ad_seed": "MyADSeed",
"fields": { … },
"fraud": [
{
"name": "benford",
"result": {
"000000": {
"chi_square": {
"chi_square_value": 5.67791,
"p_value": 0.68326,
"significant": [
false,
false
]
},
"cho_gaines": {
"d_statistic": 0.7654738225941359,
"significant": [
false,
false,
false
]
},
"distribution": [
193,
103,
75,
68,
57,
50,
45,
38,
28
],
"negatives": 0,
"zeros": 111
},
"000001": { … },
"000002": { … },
"000003": { … },
"000004": { … },
"000005": { … },
"000006": { … },
"000007": { … }
}
}
],
"normality": [
{
"name": "anderson_darling",
"result": {
"000000": {
"p_value": 0,
"significant": [
true,
true
]
},
"000001": { … },
"000002": { … },
"000003": { … },
"000004": { … },
"000005": { … },
"000006": { … },
"000007": { … }
}
},
{
"name": "jarque_bera",
"result": {
"000000": {
"p_value": 0,
"significant": [
true,
true
]
},
"000001": { … },
"000002": { … },
"000003": { … },
"000004": { … },
"000005": { … },
"000006": { … },
"000007": { … }
}
},
{
"name": "z_score",
"result": {
"000000": {
"expected_max_z": 3.21552,
"max_z": 3.90403
},
"000001": { … },
"000002": { … },
"000003": { … },
"000004": { … },
"000005": { … },
"000006": { … },
"000007": { … }
}
}
],
"outliers": [
{
"name": "grubbs",
"result": {
"000000": {
"p_value": 0.06734,
"significant": [
false,
false
]
},
"000001": { … },
"000002": { … },
"000003": { … },
"000004": { … },
"000005": { … },
"000006": { … },
"000007": { … }
}
}
],
"significance_levels": [
0.025,
0.01
]
},
"status": {
"code": 5,
"elapsed": 2244,
"message": "The statistical test has been created",
"progress": 1
},
"subscription": false,
"tags": [ ],
"updated": "2015-06-23T06:15:18.908000",
"white_box": false
}
< Example statistical test JSON response
Filtering and Paginating Fields from a Statistical Test
A statistical test might be composed of hundreds or even thousands of fields. Thus when retrieving a statisticaltest, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the source is the same as the one you would get without any filtering parameter above.
The fields_meta field can help you paginate fields. Its structure is as follows:
Updating a Statistical Test
To update a statistical test, you need to PUT an object containing the fields that you want to update to the statistical test's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return with an HTTP 202 response with the updated statistical test.
For example, to update a statistical test with a new name you can use curl like this:
curl "https://bigml.io/andromeda/statisticaltest/5588f959545e5fdc1e000007?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating a statistical test's name
If you want to update a statistical test with a new label and description for a specific field you can use curl like this:
curl "https://bigml.io/andromeda/statisticaltest/5588f959545e5fdc1e000007?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"fields": {"000000": {
"label": "a longer name",
"description": "an even longer description"}}}'
$ Updating statistical test's field,
label, and description
Deleting a Statistical Test
To delete a statistical test, you need to issue a HTTP DELETE request to the statisticaltest/id to be deleted.
Using curl you can do something like this to delete a statistical test:
curl -X DELETE "https://bigml.io/andromeda/statisticaltest/5588f959545e5fdc1e000007?$BIGML_AUTH"
$ Deleting a statistical test from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a statistical test, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a statistical test a second time, or a statistical test that does not exist, you will receive a "404 not found" response.
However, if you try to delete a statistical test that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Statistical Tests
To list all the statistical tests, you can use the statisticaltest base URL. By default, only the 20 most recent statistical tests will be returned. You can see below how to change this number using the limit parameter.
You can get your list of statistical tests directly in your browser using your own username and API key with the following links.
https://bigml.io/andromeda/statisticaltest?$BIGML_AUTH
> Listing statistical tests from a browser
Configurations
Last Updated: Thursday, 2020-10-08 20:05
A configuration is a helper resource that provides an easy way to reuse the same arguments during the resource creation.
A configuration must have a name and optionally a category, description, and multiple tags to help you organize and retrieve your configurations.
BigML.io allows you to create, retrieve, update, delete your configuration. You can also list all of your configurations.
Jump to:
- Configuration Base URL
- Creating a Configuration
- Configuration Arguments
- Retrieving a Configuration
- Configuration Properties
- Updating a Configuration
- Deleting a Configuration
- Listing Configurations
Configuration Base URL
You can use the following base URL to create, retrieve, update, and delete configurations. https://bigml.io/andromeda/configuration
Configuration base URL
All requests to manage your configurations must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Configuration
To create a new configuration, you just need to POST to the configuration base URL the name you want to give to the new configuration and a configurations object that contains settings for individual resource types or for any resource.
You can easily do this using curl.
curl "https://bigml.io/andromeda/configuration?$BIGML_AUTH" \
-H 'content-type: application/json' \
-d '{
"name": "My First Configuration",
"configurations": {
"dataset": {
"name": "Customer FAQ dataset"
},
"ensemble": {
"description": "Customer FAQ ensemble with 10 models",
"number_of_models": 10
},
"any": {
"project": "project/55eeed1f1f386fc29520000a"
}
}
}'
> Creating a configuration
BigML.io will return a newly created configuration document, if the request succeeded.
{
"category":0,
"code":201,
"configurations": {
"any": {
"project": "project/55eeed1f1f386fc29500000a"
},
"dataset": {
"name": "Customer FAQ dataset"
},
"ensemble": {
"description": "Customer FAQ ensemble with 10 models"
"number_of_models": 10
}
},
"created":"2016-10-07T19:35:22.533289",
"credits":0,
"description":"",
"name":"Configuration 1",
"private":true,
"project":null,
"resource":"configuration/57db8107b8aa0940d5b61138",
"shared":false,
"stats":null,
"status":{
"code":5,
"message":"The configuration has been created"
},
"tags":[],
"updated":"2016-10-07T19:35:22.533391"
}
< Example configuration JSON response
The following arguments are available for you to use.
Configuration Arguments
Argument | Type | Description |
---|---|---|
category
optional |
Integer, default is 0 |
The category that best describes the configuration. See the category codes for the complete list of categories.
Example: 1 |
configurations | Object |
Default arguments for individual resource types, or any to apply the arguments to all resources. For more information, see the Configurations below.
Example:
|
description
optional |
String |
A description of the configuration up to 8192 characters long.
Example: "This is a description of my new configuration" |
name
optional |
String |
The name you want to give to the new configuration.
Example: "my new configuration" |
tags
optional |
Array of Strings |
A list of strings that help classify and index your configuration.
Example: ["best customers", "2018"] |
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
Configurations
Under configurations, the keys can be any or any of the specific resource names BigML supports (excluding the configuration resource itself), such as dataset, anomaly, model, etc.
Once a configuration is successfully created, you can pass a configuration argument to any resource as part of POST requests. For example, "configuration": "configuration/5776b2a64e1727b72c000007".
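For illustration, a sketch of creating a dataset that picks up defaults from a configuration by passing the configuration argument described above (both ids are reused from earlier examples and are hypothetical for your account):
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"source": "source/5578d034545e5f6a17000006",
"configuration": "configuration/57db8107b8aa0940d5b61138"}'
> Creating a dataset using a configuration (sketch)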
The order of precedence of applying the default values when using a configuration is
- User input
- Specific resource type from the configuration
- any from the configuration
For example, if you use the following configuration for creating a model,
"configurations": {
"model": {
"name": "model name",
"description": "model description"
},
"any" : {
"name": "any name",
"description": "any description",
"tags": ["any tags"]
}
}
Configurations example
and pass "name": "my custom name" as a POST argument, your new model will have
"name": "my custom name",
"description": "model description",
"tags": ["any tags"]
New model properties
Any element under a resource name will be validated against its validator when the configuration is created. Elements under any, however, are validated at runtime. For example, the following input
"configurations": {
"anomaly": {
"forest_size" : 32,
"dataset": "dataset/5776b19e4e1727b72c000002",
"anomaly_seed": true,
"top_n": "10"
},
"model": {
"objective_field": "000004",
"out_of_bag": true,
"model_seed": "my model seed"
},
"any" : {
"tag" : ["sample"],
}
}
Invalid configuration example
will return the following errors:
{
"code": 400,
"status": {
"code": -1204,
"extra": {
"configurations": {
"anomaly": {
"anomaly_seed": [
"This field must be a string no longer than 256 chars"
],
"top_n": [
"This field must be a number between 1 and 1024"
]
},
"model": {
"model_seed": [
"This field is not postable"
]
}
}
},
"message": "Bad request"
}
}
Errors for an invalid configuration example
Note that any has a field called tag instead of tags, which isn't supported by any resource. This won't raise an error until you use the configuration to create other resources.
Retrieving a Configuration
Each configuration has a unique identifier in the form "configuration/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the configuration.
To retrieve a configuration with curl:
curl "https://bigml.io/andromeda/configuration/57db8107b8aa0940d5b61138?$BIGML_AUTH"
$ Retrieving a configuration from the command line
Configuration Properties
Once a configuration has been successfully created it will have the following properties.
Property | Type | Description |
---|---|---|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
code | Integer | HTTP status code. This will be 201 upon successful creation of the configuration and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the configuration creation has been completed without errors. |
configurations | Object | Configuration object. For more information, see the Configurations above. |
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the configuration was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
description
updatable |
String | A text describing the configuration. It can contain restricted markdown to decorate the text. |
name
filterable, sortable, updatable |
String | The name of the configuration as provided. |
private
filterable, sortable |
Boolean | Whether the configuration is public or not. |
resource | String | The configuration/id. |
status | Object | A description of the status of the configuration. It includes a code, a message, and some extra information. See the table below. |
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the configuration was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
Updating a Configuration
To update a configuration, you need to PUT an object containing the fields that you want to update to the configuration's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return with an HTTP 202 response with the updated configuration.
For example, to update a configuration with a new configurations and a new category, you can use curl like this:
curl "https://bigml.io/andromeda/configuration/57db8107b8aa0940d5b61138?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{
"category": 3,
"configurations": {
"dataset": {
"name": "Customer FAQ dataset"
},
"ensemble": {
"description": "Customer FAQ ensemble with 10 models"
"number_of_models": 10
},
"any": {
"project": "project/55eeed1f1f386fc29520000a"
"tags": ["FAQ", "Sample"]
}
}
}'
$ Updating a configuration
Deleting a Configuration
To delete a configuration, you need to issue a HTTP DELETE request to the configuration/id to be deleted.
Using curl you can do something like this to delete a configuration:
curl -X DELETE "https://bigml.io/andromeda/configuration/57db8107b8aa0940d5b61138?$BIGML_AUTH"
$ Deleting a configuration from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a configuration, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a configuration a second time, or a configuration that does not exist, you will receive a "404 not found" response.
However, if you try to delete a configuration that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Configurations
To list all the configurations, you can use the configuration base URL. By default, only the 20 most recent configurations will be returned. You can see below how to change this number using the limit parameter.
You can get your list of configurations directly in your browser using your own username and API key with the following links.
https://bigml.io/andromeda/configuration?$BIGML_AUTH
> Listing configurations from a browser
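The limit parameter mentioned above controls how many configurations are returned. As a quick sketch (the value 5 is just an illustration), a request like the following should return only the 5 most recent configurations:
curl "https://bigml.io/andromeda/configuration?$BIGML_AUTH;limit=5"
$ Listing only the 5 most recent configurations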
Composites
Last Updated: Thursday, 2020-10-08 20:05
Composite models are an aggregate of individual model resources grouped by the user in an ad-hoc fashion. Each submodel in a composite has been previously built independently by the service; the composite simply groups them together. However, a composite cannot be used for predictions or evaluations.
Any model type (anomaly, association, cluster, composite, deepnet, ensemble, fusion, model, linear regression, logistic regression, optiml, time series, and topic model) can be a submodel of a composite.
BigML.io allows you to create, retrieve, update, delete your composite. You can also list all of your composites.
Jump to:
- Composite Base URL
- Creating a Composite
- Composite Arguments
- Retrieving a Composite
- Composite Properties
- Filtering and Paginating Models from a Composite
- Updating a Composite
- Deleting a Composite
- Listing Composites
Composite Base URL
You can use the following base URL to create, retrieve, update, and delete composites. https://bigml.io/andromeda/composite
Composite base URL
All requests to manage your composites must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Composite
To create a new composite, you need to POST to the composite base URL an object containing at least a list of model ids that you want to use to create the composite. The content-type must always be "application/json".
POST /composite?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating composite definition
curl "https://bigml.io/andromeda/composite?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"models": ["model/4f66a80803ce8940c5000006", "logisticregression/5a95d5664e17271473000000", "cluster/5aec0b9e4e17275dab000401"]}'
> Creating a composite
BigML.io will return the newly created composite if the request succeeded.
{
"category": 0,
"code": 201,
"composite": {},
"configuration": null,
"configuration_status": false,
"created": "2018-05-09T05:52:16.673433",
"description": "",
"model_count": {
"cluster": 1,
"logisticregression": 1,
"model": 1,
"total": 3
},
"models": [
"model/5948beb44e17273079000003",
"logisticregression/5a95d5664e17271473000000",
"cluster/5aec0b9e4e17275dab000401"
],
"name": "Iris models composite",
"name_options": "3 total models (cluster: 1, logisticregression: 1, model: 1)",
"private": true,
"project": null,
"resource": "composite/59af8107b8aa0965d5b61138",
"shared": false,
"status": {
"code": 1,
"message": "The composite creation request has been queued and will be processed soon"
},
"subscription": false,
"tags": [],
"type": 0,
"updated": "2018-05-09T05:52:16.677393"
}
< Example composite JSON response
Composite Arguments
In addition to the models, you can also POST the following arguments.
Argument | Type | Description |
---|---|---|
category
optional |
Integer, default is 0 |
The category that best describes the composite. See the category codes for the complete list of categories.
Example: 1 |
description
optional |
String |
A description of the composite up to 8192 characters long.
Example: "This is a description of my new composite" |
models | Array |
A list with composite submodel resource/ids or a list of maps using the key id for each submodel resource/id, and any other key/values for additional meta-information on the model. Available submodel types are anomaly, association, cluster, composite, deepnet, ensemble, fusion, model, linear regression, logistic regression, optiml, time series, and topic model. The maximum number of submodels is 1000.
Example: or
|
name
optional |
String, default is composite's name |
The name you want to give to the new composite.
Example: "my new composite" |
project
optional |
String |
The project/id you want the composite to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
tags
optional |
Array of Strings |
A list of strings that help classify and index your composite.
Example: ["best customers", "2018"] |
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
If you do not specify a name, BigML.io will assign one to the new composite.
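As with other resources, you can combine several of the arguments above in a single request. The following sketch (reusing ids that appear elsewhere in this document) creates a composite with a custom name, project, and tags:
curl "https://bigml.io/andromeda/composite?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"models": ["model/4f66a80803ce8940c5000006",
                    "cluster/5aec0b9e4e17275dab000401"],
         "name": "my new composite",
         "project": "project/54d98718f0a5ea0b16000000",
         "tags": ["best customers", "2018"]}'
> Creating a customized composite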
Retrieving a Composite
Each composite has a unique identifier in the form "composite/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the composite.
To retrieve a composite with curl:
curl "https://bigml.io/andromeda/composite/59af8107b8aa0965d5b61138?$BIGML_AUTH"
$ Retrieving a composite from the command line
Composite Properties
Once a composite has been successfully created it will have the following properties.
Property | Type | Description |
---|---|---|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
code | Integer | HTTP status code. This will be 201 upon successful creation of the composite and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the composite creation has been completed without errors. |
composite
filterable, sortable |
Object | Composite object. For more information, see the Composite below. |
composites
filterable, sortable |
Array of Strings | The list of composite ids that reference this model. |
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the composite was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
credits
filterable, sortable |
Float | The number of credits it cost you to create this composite. |
description
updatable |
String | A text describing the composite. It can contain restricted markdown to decorate the text. |
model_count
filterable, sortable |
Object |
A dictionary that informs about the number of submodels of each type in the composite.
Example:
|
models | Array |
A list of all submodels ids regardless of how models are filtered and paged.
Example:
|
models_meta | Object | A dictionary with meta information about the models filtered. It specifies the total number of models, the current offset, and limit. |
name
filterable, sortable, updatable |
String | The name of the composite as you provided. |
private
filterable, sortable |
Boolean | Whether the composite is public or not. |
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
resource | String | The composite/id. |
shared
filterable |
Boolean | Whether the composite is shared using a private link or not. |
shared_clonable
filterable |
Boolean | Whether the shared composite can be cloned or not. |
shared_hash
filterable |
String | The hash that gives access to this composite if it has been shared using a private link. |
sharing_key | String | The alternative key that gives read access to this composite. |
status | Object | A description of the status of the composite. It includes a code, a message, and some extra information. See the table below. |
subscription
filterable, sortable |
Boolean | Whether the composite was created using a subscription plan or not. |
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
type
filterable, sortable |
Integer | Reserved for future use. |
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the composite was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
The Composite object has the following properties.
Composite Status
Creating a composite is a process that can take just a few seconds or a few hours depending on the size of the models used as input and on the workload of BigML's systems. The composite goes through a number of states until it is fully completed. Through the status field in the composite you can determine when the composite has been fully processed and is ready to be used. These are the properties that a composite's status has:
Property | Type | Description |
---|---|---|
code | Integer | A status code that reflects the status of the composite creation. It can be any of those that are explained here. |
elapsed | Integer | Number of milliseconds that BigML.io took to process the composite. |
message | String | A human readable message explaining the status. |
progress | Float, between 0 and 1 | How far BigML.io has progressed building the composite. |
Once a composite has been successfully created, it will look like:
{
"category": 0,
"code": 200,
"composite": {
"models": [
{
"id": "model/5948beb44e17273079000003",
"kind": "model",
"name": "Iris tree",
"name_options": "512-node, pruned, deterministic order"
},
{
"id": "logisticregression/5a95d5664e17271473000000",
"kind": "logisticregression",
"name": "Iris LR",
"name_options": "L2 regularized (c=1), bias, auto-scaled, missing values"
},
{
"id": "cluster/5aec0b9e4e17275dab000401",
"kind": "cluster",
"name": "Flower colors cluster",
"name_options": "K-means, k=10"
}
]
},
"configuration": null,
"configuration_status": false,
"created": "2018-05-09T05:52:16.673000",
"description": "",
"model_count": {
"cluster": 1,
"logisticregression": 1,
"model": 1,
"total": 3
},
"models": [
"model/5948beb44e17273079000003",
"logisticregression/5a95d5664e17271473000000",
"cluster/5aec0b9e4e17275dab000401"
],
"models_meta": {
"count": 3,
"offset": 0,
"limit": 1000,
"total": 3
},
"name": "Iris models composite",
"name_options": "3 total models (cluster: 1, logisticregression: 1, model: 1)",
"private": true,
"project": null,
"resource": "composite/59af8107b8aa0965d5b61138",
"shared": false,
"status": {
"code": 5,
"elapsed": 4658,
"message": "The composite has been created",
"progress": 1
},
"subscription": false,
"tags": [],
"type": 0,
"updated": "2018-05-09T05:52:21.359000"
}
< Example composite JSON response
Filtering and Paginating Models from a Composite
Since model lists can grow large, we offer paginations of the models list in the response when GETting it via HTTP. Pagination is specified using the following query string parameters:
- models_limit: A non-negative integer indicating how many elements in models to return. If not provided, we return at most 1000. If passed a negative value (say, -1), we return all of them.
- models_offset: The offset in the list of models (i.e., how many models are discarded before we take limit of them).
- models_sort_by: Sorting criteria, specified by any of the keys the user provided in the models maps during creation. Sorting is ascending, unless you prefix the key name with a minus sign. For instance, let's say your models have a property, rank. You can use a query string of the form models_sort_by=rank to sort them by rank in ascending order, and one of the form models_sort_by=-rank to sort them in descending order. It is possible to provide more than one ordering criterion, separating them by commas, in which case the second and subsequent ones are used to break ties in the ordering generated by the previous ones.
Sorting happens before limit and offset are applied. When pagination is active, the models_meta property is included at the top level of the returned resource. This property will contain offset, limit, count, and total.
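For instance, assuming you stored a rank key in each of the models maps when creating the composite (rank here is just a hypothetical example key), a request like the following sketch would return the second page of 10 models sorted by descending rank:
curl "https://bigml.io/andromeda/composite/59af8107b8aa0965d5b61138?$BIGML_AUTH;models_limit=10&models_offset=10&models_sort_by=-rank"
$ Paginating and sorting the models of a composite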
Updating a Composite
To update a composite, you need to PUT an object containing the fields that you want to update to the composite's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated composite.
For example, to update a composite with a new name you can use curl like this:
curl "https://bigml.io/andromeda/composite/59af8107b8aa0965d5b61138?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name"}' \
-H 'content-type: application/json'
$ Updating a composite's name
Deleting a Composite
To delete a composite, you need to issue an HTTP DELETE request to the composite/id to be deleted.
Using curl you can do something like this to delete a composite:
curl -X DELETE "https://bigml.io/andromeda/composite/59af8107b8aa0965d5b61138?$BIGML_AUTH"
$ Deleting a composite from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a composite, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a composite a second time, or a composite that does not exist, you will receive a "404 not found" response.
However, if you try to delete a composite that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Composites
To list all the composites, you can use the composite base URL. By default, only the 20 most recent composites will be returned. You can see below how to change this number using the limit parameter.
You can get your list of composites directly in your browser using your own username and API key with the following links.
https://bigml.io/andromeda/composite?$BIGML_AUTH
> Listing composites from a browser
Models
Last Updated: Wednesday, 2020-12-09 09:40
A model is a tree-like representation of your dataset with predictive power. You can create a model selecting which fields from your dataset you want to use as input fields (or predictors) and which field you want to predict, the objective field.
Each node in the model corresponds to one of the input fields. Each node has an incoming branch, except the top node, also known as the root, which has none. Each node has a number of outgoing branches, except those at the bottom (the "leaves"), which have none.
Each branch represents a possible value for the input field where it originates. A leaf represents the value of the objective field given all the values for each input field in the chain of branches that goes from the root to that leaf.
When you create a new model, BigML.io will automatically compute a classification model or regression model depending on whether the objective field that you want to predict is categorical or numeric, respectively.

BigML.io allows you to create, retrieve, update, delete your model. You can also list all of your models.
Jump to:
- Model Base URL
- Creating a Model
- Model Arguments
- Shuffling the Rows of Your Dataset
- Sampling Your Dataset
- Random Decision Forests
- Retrieving a Model
- Model Properties
- Filtering a Model
- PMML
- Filtering and Paginating Fields from a Model
- Updating a Model
- Deleting a Model
- Listing Models
- Weights
- Weight Field
- Objective Weights
- Automatic Balancing
Model Base URL
You can use the following base URL to create, retrieve, update, and delete models. https://bigml.io/andromeda/model
Model base URL
All requests to manage your models must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Model
To create a new model, you need to POST to the model base URL an object containing at least the dataset/id that you want to use to create the model. The content-type must always be "application/json".
POST /model?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating model definition
curl "https://bigml.io/andromeda/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006"}'
> Creating a model
BigML.io will return the newly created model if the request succeeded.
{
"category": 0,
"code": 201,
"columns": 1,
"created": "2012-11-15T02:32:48.763534",
"credits": 0.017578125,
"credits_per_prediction": 0.0,
"dataset": "dataset/50a453753c1920186d000045",
"dataset_status": true,
"description": "",
"excluded_fields": [],
"fields_meta": {
"count": 0,
"limit": 200,
"offset": 0,
"total": 0
},
"input_fields": [],
"locale": "en-US",
"max_columns": 5,
"max_rows": 150,
"missing_splits": false,
"name": "iris' dataset model",
"number_of_evaluations": 0,
"number_of_predictions": 0,
"number_of_public_predictions": 0,
"objective_field": null,
"objective_fields": [],
"ordering": 0,
"out_of_bag": false,
"price": 0.0,
"private": true,
"project": null,
"randomize": false,
"range": [
1,
150
],
"replacement": false,
"resource": "model/50a454503c1920186d000049",
"rows": 150,
"sample_rate": 1.0,
"size": 4608,
"source": "source/50a4527b3c1920186d000041",
"source_status": true,
"status": {
"code": 1,
"message": "The model is being processed and will be created soon"
},
"tags": [
"species"
],
"updated": "2012-11-15T02:32:48.763566",
"views": 0,
"white_box": false
}
< Example model JSON response
Model Arguments
In addition to the dataset, you can also POST the following arguments.
Argument | Type | Description |
---|---|---|
category
optional |
Integer, default is the category of the dataset |
The category that best describes the model. See the category codes for the complete list of categories.
Example: 1 |
dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
datasets
optional |
Array |
A list of dataset ids or objects to be used to build the new model. See the Section on Multi-Datasets and Section on Resources Accepting Multi-Datasets Input for more details.
Example:
|
deep
optional |
Boolean, default is false |
Clone the dataset used to build the model too.
Example: true |
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
depth_threshold
optional |
Integer, default is 512 |
When the depth in the tree exceeds this value, the tree stops growing. It has no effect if it's bigger than the node_threshold.
Example: 128 |
description
optional |
String |
A description of the model up to 8192 characters long.
Example: "This is a description of my new model" |
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields that won't be included in the model.
Example:
|
fields
optional |
Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed. |
This can be used to change the names of the fields in the model with respect to the original names in the dataset, or to tell BigML that certain fields should be preferred. Add an entry keyed by the field id generated in the source for each field whose name you want to update.
Example:
|
focus_field
optional |
String |
A field name or identifier for a categorical field. If set, the resulting tree will split first, in a cascade, on all categories of the given field. We are still splitting first on the field, but all nodes are kept binary (i.e., having two children). This is for the convenience of clients that don't know how to handle non-binary splits.
Example: "000001" |
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields to be considered to create the model.
Example:
|
max_training_time
optional |
Integer, default is 1800 |
The maximum training time allowed for the optimization, in seconds, as a strictly positive integer. Applicable only when optimize is set to true.
Example: 3600 |
missing_splits
optional |
Boolean, default is false |
Defines whether to explicitly include missing field values when choosing a split. When this option is enabled, generates predicates whose operators include an asterisk, such as >*, <=*, =*, or !=*. The presence of an asterisk means "or missing". So a split with the operator >* and the value 8 can be read as "x > 8 or x is missing". When using missing_splits there may also be predicates with operators = or !=, but with a null value. This means "x is missing" and "x is not missing" respectively.
Example: true |
name
optional |
String, default is dataset's name |
The name you want to give to the new model.
Example: "my new model" |
node_threshold
optional |
Integer, default is 512 |
When the number of nodes in the tree exceeds this value, the tree stops growing.
Example: 1000 |
number_of_model_candidates
optional |
Integer, default is 128 |
The number of model candidates evaluated over the course of the optimization. Applicable only when optimize is set to true. Maximum 200 candidates.
Example: 100 |
objective_field
optional |
String, default is dataset's pre-defined objective field |
Specifies the id of the field that you want to predict.
Example: "000003" |
objective_fields
optional |
Array, default is an array with the id of the last field in the dataset |
Specifies the id of the field that you want to predict. Even if this is an array, BigML.io only accepts one objective field in the current version. If both objective_field and objective_fields are specified, then objective_field takes preference.
Example: ["000003"] |
optimize
optional |
Boolean, default is false |
Whether the model should be built with the automatic optimization. When it is set to true, only the following modeling properties are applied: default_numeric_value, excluded_fields, input_fields, max_training_time, missing_splits, number_of_model_candidates, objective_field, objective_weights, sample_rate, and weight_field
Example: true |
ordering
optional |
Integer, default is 0 (deterministic). |
Specifies the type of ordering followed to build the model. There are three different types that you can specify:
Example: 1 |
origin | String |
The model/id of the gallery model to be cloned. The price of the model must be 0 to be cloned via API. Set deep to true to clone the dataset used to build the model too. Note that the dataset can be cloned only if it is already in the public gallery and free. If multiple datasets have been used to create the model, only the first dataset will be cloned.
Example: "model/5b9ab8474e172785e3000003" |
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
project
optional |
String |
The project/id you want the model to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
random_candidate_ratio
optional |
Float |
A real number between 0 and 1. When randomize is true and random_candidate_ratio is given, BigML randomizes the tree and uses random_candidate_ratio * total fields (counting the number of terms in text fields as fields). To get the final number of candidate fields we round down to the nearest integer, but if the result is 0 we'll use 1 instead. If both random_candidates and random_candidate_ratio are given, BigML ignores random_candidate_ratio.
Example: 0.2 |
random_candidates
optional |
Integer, default is the square root of the total number of input fields. |
Sets the number of random fields considered when randomize is true.
Example: 10 |
randomize
optional |
Boolean, default is false |
Setting this parameter to true will consider only a subset of the possible fields when choosing a split. See the Section on Random Decision Forests below.
Example: true |
range
optional |
Array, default is [1, max rows in the dataset] |
The range of successive instances to build the model.
Example: [1, 150] |
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true |
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5 |
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
shared_hash | String |
The shared hash of the shared model to be cloned. Set deep to true to clone the dataset used to build the model too. Note that the dataset can be cloned only if it is already shared and set clonable. If multiple datasets have been used to create the model, only the first dataset will be cloned.
Example: "kpY46mNuNVReITw0Z1mAqoQ9ySW" |
split_candidates
optional |
Integer, default is 32 |
The number of split points that are considered whenever the tree evaluates a numeric field. Minimum 1 and maximum 1024
Example: 128 |
split_field
optional |
String |
A field name or identifier for a categorical field. If set, the first split of the decision tree will use this field and have a child per category (i.e., if there are n categories, the first node will have n children).
Example: "000001" |
stat_pruning
optional |
Boolean |
Activates statistical pruning on your decision tree model.
Example: true |
support_threshold
optional |
Float, default is 0 |
This parameter controls the minimum amount of support each child node must contain to be valid as a possible split. So, if it is 3, then both children of a new split must have at least 3 instances supporting them. Since instances may have non-integer weights, non-integer values are valid.
Example: 16 |
tags
optional |
Array of Strings |
A list of strings that help classify and index your model.
Example: ["best customers", "2018"] |
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can also use curl to customize a new model. For example, to create a new model named "my model", with only certain rows, and with only three fields:
curl "https://bigml.io/andromeda/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006",
"input_fields": ["000001", "000003"],
"name": "my model",
"range": [25, 125]}'
> Creating a customized model
If you do not specify a name, BigML.io will assign to the new model the dataset's name. If you do not specify a range of instances, BigML.io will use all the instances in the dataset. If you do not specify any input fields, BigML.io will include all the input fields in the dataset, and if you do not specify an objective field, BigML.io will use the last field in your dataset.
Shuffling the Rows of Your Dataset
By default, rows from the input dataset are deterministically shuffled before being processed, to avoid inaccurate models caused by ordered fields in the input rows. Since the shuffling is deterministic, i.e., always the same for a given dataset, retraining a model for the same dataset will always yield the same result.
However, you can modify this default behaviour by including the ordering argument in the model creation request, where "ordering" here is a shortcut for "ordering for the traversal of input rows". When this property is absent or set to 0, deterministic shuffling takes place; otherwise, you can set it to:
- Linear: If you know that your input is already in random order. Setting "ordering" to 1 in your model request tells BigML to traverse the dataset in a linear fashion, without performing any shuffling (and therefore operating faster).
- Random: If you'd like to perform a really random shuffling, most probably different from any other one attempted before. Setting "ordering" to 2 will shuffle the input rows non-deterministically.
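For example, the following sketch asks for linear ordering, i.e., no shuffling at all, when building the model (the dataset id is the sample id used elsewhere in this document):
curl "https://bigml.io/andromeda/model?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"dataset": "dataset/4f66a80803ce8940c5000006", "ordering": 1}'
> Creating a model with linear ordering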
Sampling Your Dataset
You can limit the dataset rows that are used to create a model in two ways (which can be combined), namely, by specifying a row range and by asking for a sample of the (already clipped) input rows.
The row range is specified with the range argument defined in the Section on Arguments above.
To specify a sample, which is taken over the row range or over the whole dataset if a range is not provided, you can add the following arguments to the creation request:
- sample_rate : A positive number that specifies the sampling rate, i.e., how often we pick a row from the range. In other words, the final number of rows will be the size of the range multiplied by the sample_rate, unless "out_of_bag" is true (see below).
- replacement : A boolean indicating whether sampling should be performed with or without replacement, i.e., the same instance may be selected multiple times for inclusion in the result set. Defaults to false.
- out_of_bag : If an instance isn't selected as part of a sampling, it's called out of bag. Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. This can be useful when paired with "seed". When replacement is false, the final number of rows returned is the size of the range multiplied by one minus the sample_rate. Out-of-bag sampling with replacement gives rise to variable-size samples. Defaults to false.
- seed : Rows are sampled probabilistically using a random string, which means that, in general, two identical samples of the same row range of the same dataset will be different. If you provide a seed (as an arbitrary string), its hash value will be used as the seed, and it'll be possible for you to generate deterministic samples.
Finally, note that the "ordering" of the dataset described in the previous subsection is used on the result of the sampling.
Here's an example of a model request with range and sampling specifications:
curl https://bigml.io/andromeda/model?$BIGML_AUTH \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/505f43223c1920eccc000297", "range": [1, 5000], "sample_rate": 0.5, "replacement": true}'
Creating a model using sampling
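Pairing seed with out_of_bag lets you train on a deterministic sample and keep the complementary rows aside. The following sketch requests the out-of-bag instances of a seeded 80% sample of the same dataset:
curl "https://bigml.io/andromeda/model?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"dataset": "dataset/505f43223c1920eccc000297", "sample_rate": 0.8, "seed": "MySample", "out_of_bag": true}'
Creating a model from the out-of-bag instances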
Random Decision Forests
A model can be randomized by setting the randomize parameter to true. The default is false.
When randomized, the model considers only a subset of the possible fields when choosing a split. The size of the subset will be the square root of the total number of input fields. So if there are 100 input fields, each split will only consider 10 fields randomly chosen from the 100. Every split will choose a new subset of fields.
Although randomize could be used for other purposes, it's intended for growing random decision forests. To grow tree models for a random forest, set randomize to true and select a sample from the dataset. Traditionally this is a 1.0 sample rate with replacement, but we suggest a 0.63 sample rate without replacement.
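Following that suggestion, a single randomized tree for a random decision forest could be requested as in this sketch (the dataset id is the sample id used elsewhere in this document):
curl "https://bigml.io/andromeda/model?$BIGML_AUTH" \
    -X POST \
    -H 'content-type: application/json' \
    -d '{"dataset": "dataset/4f66a80803ce8940c5000006", "randomize": true, "sample_rate": 0.63, "replacement": false}'
> Creating a randomized tree for a random decision forest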
Retrieving a Model
Each model has a unique identifier in the form "model/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the model.
To retrieve a model with curl:
curl "https://bigml.io/andromeda/model/50a454503c1920186d000049?$BIGML_AUTH"
$ Retrieving a model from the command line
You can also use your browser to visualize the model using the full BigML.io URL or pasting the model/id into the BigML.com dashboard.
Model Properties
Once a model has been successfully created it will have the following properties.
Property | Type | Description |
---|---|---|
boosted_ensemble
filterable, sortable |
Boolean | Whether the model was built as part of an ensemble with boosted trees. |
boosting | Object |
Boosting attribute for the boosted tree. See the Gradient Boosting section for more information.
Example:
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
code | Integer | HTTP status code. This will be 201 upon successful creation of the model and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the model creation has been completed without errors. |
columns
filterable, sortable |
Integer | The number of fields in the model. |
composites
filterable, sortable |
Array of Strings | The list of composite ids that reference this model. |
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the model was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
credits
filterable, sortable |
Float | The number of credits it cost you to create this model. |
credits_per_prediction
filterable, sortable, updatable |
Float | This is the number of credits that other users will consume to make a prediction with your model if you made it public. |
dataset
filterable, sortable |
String | The dataset/id that was used to build the model. |
dataset_field_types | Object | A dictionary that informs about the number of fields of each type in the dataset used to create the model. It has an entry per each field type (categorical, datetime, numeric, and text), an entry for preferred fields, and an entry for the total number of fields. |
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
datasets | Array | A list of dataset ids or objects used to build the model. |
deep | Boolean | Whether the dataset used to build the original model is also requested to be cloned or not. |
description
updatable |
String | A text describing the model. It can contain restricted markdown to decorate the text. |
ensemble
filterable, sortable |
Boolean | Whether the model was built as part of an ensemble or not. |
ensemble_id
filterable, sortable |
String | The ensemble id. |
ensemble_index
filterable, sortable |
Integer | The order of the model in the ensemble. |
excluded_fields | Array | The list of field ids that were excluded to build the model. |
fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned. |
focus_field | String |
Specifies the id of the focus field in the model.
Example: "000001" |
focus_field_name | String | The name of the focus field in the model. |
fusions
filterable, sortable |
Array of Strings | The list of fusion ids that reference this model. |
input_fields | Array | The list of input fields' ids used to build the model. |
locale | String | The dataset's locale. |
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the model. |
max_rows
filterable, sortable |
Integer | The maximum number of instances in the dataset that can be used to build the model. |
max_training_time | Integer | The maximum training time allowed for the optimization, in seconds. |
missing_splits
filterable, sortable |
Boolean | Whether to explicitly include missing field values when choosing a split while growing a model. |
model | Object | All the information that you need to recreate or use the model on your own. It includes a very intuitive description of the tree-like structure that makes the model up and the field's dictionary describing the fields and their summaries. |
name
filterable, sortable, updatable |
String | The name of the model as you provided or based on the name of the dataset by default. |
node_threshold
filterable, sortable |
String | The maximum number of nodes that the model will grow. |
number_of_batchpredictions
filterable, sortable |
Integer | The current number of batch predictions that use this model. |
number_of_evaluations
filterable, sortable |
Integer | The current number of evaluations that use this model. |
number_of_model_candidates | Integer | The number of model candidates evaluated over the course of the optimization. |
number_of_predictions
filterable, sortable |
Integer | The current number of predictions that use this model. |
number_of_public_predictions
filterable, sortable |
Integer | The current number of public predictions that use this model. |
objective_field | String | The id of the field that the model predicts. |
objective_fields | Array | Specifies the list of ids of the field that the model predicts. Even if this is an array BigML.io only accepts one objective field in the current version. |
optimize | Boolean | Whether the model was built with the automatic optimization. |
optiml
filterable, sortable |
String | The optiml/id that created this model. |
optiml_status
filterable, sortable |
Boolean | Whether the OptiML is still available or has been deleted. |
ordering
filterable, sortable |
Integer |
The order used to choose instances from the dataset to build the model. There are three different types:
|
origin
filterable, sortable |
String | The model/id of the original gallery model. |
out_of_bag
filterable, sortable |
Boolean | Whether the out-of-bag instances were used to create the model instead of the sampled instances. |
price
filterable, sortable, updatable |
Float | The price other users must pay to clone your model. |
private
filterable, sortable, updatable |
Boolean | Whether the model is public or not. |
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
random_candidate_ratio
filterable, sortable |
Float | The random candidate ratio considered when randomize is true. |
random_candidates
filterable, sortable |
Integer | The number of random fields considered when randomize is true. |
randomize
filterable, sortable |
Boolean | Whether the model splits considered only a random subset of the fields or all the fields available. |
range | Array | The range of instances used to build the model. |
replacement
filterable, sortable |
Boolean | Whether the instances sampled to build the model were selected using replacement or not. |
resource | String | The model/id. |
rows
filterable, sortable |
Integer | The total number of instances used to build the model |
sample_rate
filterable, sortable |
Float | The sample rate used to select instances from the dataset to build the model. |
seed
filterable, sortable |
String | The string that was used to generate the sample. |
selective_pruning
filterable, sortable |
Boolean | If true, selective pruning throttled the strength of the statistical pruning depending on the size of the dataset. |
shared
filterable, sortable, updatable |
Boolean | Whether the model is shared using a private link or not. |
shared_clonable
filterable, sortable, updatable |
Boolean | Whether the shared model can be cloned or not. |
shared_hash | String | The hash that gives access to this model if it has been shared using a private link. |
sharing_key | String | The alternative key that gives read access to this model. |
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this model. |
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
split_candidates
filterable, sortable |
Integer | The number of split points that are considered whenever the tree evaluates a numeric field. Minimum 1 and maximum 1024 |
split_field | String |
Specifies the id of the split field in the model.
Example: "000001" |
split_field_name | String | The name of the split field in the model. |
stat_pruning
filterable, sortable |
Boolean | Whether statistical pruning was used when building the model. |
status | Object | A description of the status of the model. It includes a code, a message, and some extra information. See the table below. |
subscription
filterable, sortable |
Boolean | Whether the model was created using a subscription plan or not. |
support_threshold
filterable, sortable |
Float | The parameter controls the minimum amount of support each child node must contain to be valid as a possible split. |
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the model was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
white_box
filterable, sortable |
Boolean | Whether the model is publicly shared as a white-box. |
A Model Object has the following properties:
Property | Type | Description |
---|---|---|
depth_threshold | Integer | The depth, or generation, limit for a tree. |
distribution | Object | This dictionary gives information about how the training data is distributed across the tree leaves. More concretely, it contains the training data distribution with key training, and the distribution for the actual prediction values of the tree with key predictions. The former is just the objective_summary of the tree root (see below), copied for easier individual retrieval, and both have the format of the objective summary in the tree nodes. |
fields | Object | A dictionary with an entry per field in the dataset used to build the model. Fields are paginated according to the field_meta attribute. Each entry includes the column number in original source, the name of the field, the type of the field, and the summary. See this Section for more details. |
importance | Array of Arrays | A list of pairs [field_id, importance]. Importance is the amount by which each field in the model reduces prediction error, normalized to be between zero and one. Note that fields with an importance of zero may still be correlated with the objective; they were just not used in the model. |
kind | String | The type of model. Currently, only stree is supported. |
missing_strategy | String | Default strategy followed by the model when it finds a missing value. Currently, last_prediction. At prediction time you can opt for using proportional. See this Section for more details. |
model_fields | Object | A dictionary with an entry per field used by the model (not all the fields that were available in the dataset). They follow the same structure as the fields attribute above except that the summary is not present. |
root | Object | A Node Object, a tree-like recursive structure representing the model. |
split_criterion | Integer | Method of choosing best attribute and split point for a given node. DEPRECATED |
support_threshold | Float | A number between 0 and 1. For a split to be valid, each child's support (instances / total instances) must be greater than this threshold. |
Node Objects have the following properties:
Property | Type | Description |
---|---|---|
children | Array | Array of Node Objects. |
confidence | Float | For classification models, a number between 0 and 1 that expresses how certain the model is of the prediction. For regression models, a number mapped to the top end of a 95% confidence interval around the expected error at that node (measured using the variance of the output at the node). See the Section on Confidence for more details. Note that for models you might have created using the first versions of BigML this value might be null. |
count | Integer | Number of instances classified by this node. |
objective_summary | Object | An Objective Summary Object summarizes the objective field's distribution at this node. |
output | Number or String | Prediction at this node. |
predicate | Boolean or Object | Predicate structure to make a decision at this node. |
Objective Summary Objects have the following properties:
Property | Type | Description |
---|---|---|
bins | Array | If the objective field is numeric and the number of distinct values is greater than 32, an array that represents an approximate histogram of the distribution. It consists of value pairs, where the first value is the mean of a histogram bin and the second value is the bin population. For more information, see our blog post or read this paper. |
categories | Array | If the objective field is categorical, an array of pairs where the first element of each pair is one of the unique categories and the second element is the count for that category. |
counts | Array | If the objective field is numeric and the number of distinct values is less than or equal to 32, an array of pairs where the first element of each pair is one of the unique values found in the field and the second element is the count. |
maximum | Number | The maximum of the objective field's values. Available when 'bins' is present. |
minimum | Number | The minimum of the objective field's values. Available when 'bins' is present. |
Predicate Objects have the following properties:
Model Status
Creating a model is a process that can take just a few seconds or a few days depending on the size of the dataset used as input and on the workload of BigML's systems. The model goes through a number of states until it is fully completed. Through the status field in the model you can determine when the model has been fully processed and ready to be used to create predictions. These are the properties that a model's status has:
Property | Type | Description |
---|---|---|
code | Integer | A status code that reflects the status of the model creation. It can be any of those that are explained here. |
elapsed | Integer | Number of milliseconds that BigML.io took to process the model. |
message | String | A human readable message explaining the status. |
progress | Float, between 0 and 1 | How far BigML.io has progressed building the model. |
Once a model has been successfully created, it will look like:
{
"category": 0,
"code": 200,
"columns": 5,
"created": "2012-11-15T02:32:48.763000",
"credits": 0.017578125,
"credits_per_prediction": 0.0,
"dataset": "dataset/50a453753c1920186d000045",
"dataset_status": true,
"description": "",
"excluded_fields": [],
"fields_meta": {
"count": 5,
"limit": 200,
"offset": 0,
"total": 5
},
"input_fields": [
"000000",
"000001",
"000002",
"000003"
],
"locale": "en_US",
"max_columns": 5,
"max_rows": 150,
"missing_splits": false,
"model": {
"depth_threshold": 20,
"fields": {
"000000": {
"column_number": 0,
"datatype": "double",
"name": "sepal length",
"optype": "numeric",
"order": 0,
"preferred": true,
"summary": {
"bins": [
[
4.3,
1
],
[
4.425,
4
],
[
4.6,
4
],
[
4.7,
2
],
[
4.8,
5
],
[
4.9,
6
],
[
5,
10
],
[
5.1,
9
],
[
5.2,
4
],
[
5.3,
1
],
[
5.4,
6
],
[
5.5,
7
],
[
5.6,
6
],
[
5.7,
8
],
[
5.8,
7
],
[
5.9,
3
],
[
6,
6
],
[
6.1,
6
],
[
6.2,
4
],
[
6.3,
9
],
[
6.44167,
12
],
[
6.6,
2
],
[
6.7,
8
],
[
6.8,
3
],
[
6.92,
5
],
[
7.1,
1
],
[
7.2,
3
],
[
7.3,
1
],
[
7.4,
1
],
[
7.6,
1
],
[
7.7,
4
],
[
7.9,
1
]
],
"maximum": 7.9,
"mean": 5.84333,
"median": 5.77889,
"minimum": 4.3,
"missing_count": 0,
"population": 150,
"splits": [
4.51526,
4.67252,
4.81113,
4.89582,
4.96139,
5.01131,
5.05992,
5.11148,
5.18177,
5.35681,
5.44129,
5.5108,
5.58255,
5.65532,
5.71658,
5.77889,
5.85381,
5.97078,
6.05104,
6.13074,
6.23023,
6.29578,
6.35078,
6.41459,
6.49383,
6.63013,
6.70719,
6.79218,
6.92597,
7.20423,
7.64746
],
"standard_deviation": 0.82807,
"sum": 876.5,
"sum_squares": 5223.85,
"variance": 0.68569
}
},
"000001": {
"column_number": 1,
"datatype": "double",
"name": "sepal width",
"optype": "numeric",
"order": 1,
"preferred": true,
"summary": {
"counts": [
[
2,
1
],
[
2.2,
3
],
[
2.3,
4
],
[
2.4,
3
],
[
2.5,
8
],
[
2.6,
5
],
[
2.7,
9
],
[
2.8,
14
],
[
2.9,
10
],
[
3,
26
],
[
3.1,
11
],
[
3.2,
13
],
[
3.3,
6
],
[
3.4,
12
],
[
3.5,
6
],
[
3.6,
4
],
[
3.7,
3
],
[
3.8,
6
],
[
3.9,
2
],
[
4,
1
],
[
4.1,
1
],
[
4.2,
1
],
[
4.4,
1
]
],
"maximum": 4.4,
"mean": 3.05733,
"median": 3.02044,
"minimum": 2,
"missing_count": 0,
"population": 150,
"standard_deviation": 0.43587,
"sum": 458.6,
"sum_squares": 1430.4,
"variance": 0.18998
}
},
"000002": {
"column_number": 2,
"datatype": "double",
"name": "petal length",
"optype": "numeric",
"order": 2,
"preferred": true,
"summary": {
"bins": [
[
1,
1
],
[
1.1,
1
],
[
1.2,
2
],
[
1.3,
7
],
[
1.4,
13
],
[
1.5,
13
],
[
1.63636,
11
],
[
1.9,
2
],
[
3,
1
],
[
3.3,
2
],
[
3.5,
2
],
[
3.6,
1
],
[
3.75,
2
],
[
3.9,
3
],
[
4.0375,
8
],
[
4.23333,
6
],
[
4.46667,
12
],
[
4.6,
3
],
[
4.74444,
9
],
[
4.94444,
9
],
[
5.1,
8
],
[
5.25,
4
],
[
5.46,
5
],
[
5.6,
6
],
[
5.75,
6
],
[
5.95,
4
],
[
6.1,
3
],
[
6.3,
1
],
[
6.4,
1
],
[
6.6,
1
],
[
6.7,
2
],
[
6.9,
1
]
],
"maximum": 6.9,
"mean": 3.758,
"median": 4.34142,
"minimum": 1,
"missing_count": 0,
"population": 150,
"splits": [
1.25138,
1.32426,
1.37171,
1.40962,
1.44567,
1.48173,
1.51859,
1.56301,
1.6255,
1.74645,
3.23033,
3.675,
3.94203,
4.0469,
4.18243,
4.34142,
4.45309,
4.51823,
4.61771,
4.72566,
4.83445,
4.93363,
5.03807,
5.1064,
5.20938,
5.43979,
5.5744,
5.6646,
5.81496,
6.02913,
6.38125
],
"standard_deviation": 1.7653,
"sum": 563.7,
"sum_squares": 2582.71,
"variance": 3.11628
}
},
"000003": {
"column_number": 3,
"datatype": "double",
"name": "petal width",
"optype": "numeric",
"order": 3,
"preferred": true,
"summary": {
"counts": [
[
0.1,
5
],
[
0.2,
29
],
[
0.3,
7
],
[
0.4,
7
],
[
0.5,
1
],
[
0.6,
1
],
[
1,
7
],
[
1.1,
3
],
[
1.2,
5
],
[
1.3,
13
],
[
1.4,
8
],
[
1.5,
12
],
[
1.6,
4
],
[
1.7,
2
],
[
1.8,
12
],
[
1.9,
5
],
[
2,
6
],
[
2.1,
6
],
[
2.2,
3
],
[
2.3,
8
],
[
2.4,
3
],
[
2.5,
3
]
],
"maximum": 2.5,
"mean": 1.19933,
"median": 1.32848,
"minimum": 0.1,
"missing_count": 0,
"population": 150,
"standard_deviation": 0.76224,
"sum": 179.9,
"sum_squares": 302.33,
"variance": 0.58101
}
},
"000004": {
"column_number": 4,
"datatype": "string",
"name": "species",
"optype": "categorical",
"order": 4,
"preferred": true,
"summary": {
"categories": [
[
"Iris-versicolor",
50
],
[
"Iris-setosa",
50
],
[
"Iris-virginica",
50
]
],
"missing_count": 0
}
}
},
"importance": [
[
"000002",
0.53159
],
[
"000003",
0.4633
],
[
"000000",
0.00511
],
[
"000001",
0
]
],
"kind": "stree",
"missing_strategy": "last_prediction",
"model_fields": {
"000000": {
"column_number": 0,
"datatype": "double",
"name": "sepal length",
"optype": "numeric",
"order": 0,
"preferred": true
},
"000001": {
"column_number": 1,
"datatype": "double",
"name": "sepal width",
"optype": "numeric",
"order": 1,
"preferred": true
},
"000002": {
"column_number": 2,
"datatype": "double",
"name": "petal length",
"optype": "numeric",
"order": 2,
"preferred": true
},
"000003": {
"column_number": 3,
"datatype": "double",
"name": "petal width",
"optype": "numeric",
"order": 3,
"preferred": true
},
"000004": {
"column_number": 4,
"datatype": "string",
"name": "species",
"optype": "categorical",
"order": 4,
"preferred": true
}
},
"root": {
"children": [
{
"confidence": 0.92865,
"count": 50,
"objective_summary": {
"categories": [
[
"Iris-setosa",
50
]
]
},
"output": "Iris-setosa",
"predicate": {
"field": "000002",
"operator": "<=",
"value": 2.45
}
},
{
"children": [
{
"children": [
{
"children": [
{
"children": [
{
"children": [
{
"confidence": 0.34237,
"count": 2,
"objective_summary": {
"categories": [
[
"Iris-virginica",
2
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000000",
"operator": ">",
"value": 5.95
}
},
{
"confidence": 0.20654,
"count": 1,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
1
]
]
},
"output": "Iris-versicolor",
"predicate": {
"field": "000000",
"operator": "<=",
"value": 5.95
}
}
],
"confidence": 0.20765,
"count": 3,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
1
],
[
"Iris-virginica",
2
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000000",
"operator": "<=",
"value": 6.4
}
},
{
"confidence": 0.20654,
"count": 1,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
1
]
]
},
"output": "Iris-versicolor",
"predicate": {
"field": "000000",
"operator": ">",
"value": 6.4
}
}
],
"confidence": 0.15004,
"count": 4,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
2
],
[
"Iris-virginica",
2
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000001",
"operator": ">",
"value": 2.9
}
},
{
"confidence": 0.60966,
"count": 6,
"objective_summary": {
"categories": [
[
"Iris-virginica",
6
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000001",
"operator": "<=",
"value": 2.9
}
}
],
"confidence": 0.49016,
"count": 10,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
2
],
[
"Iris-virginica",
8
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000002",
"operator": "<=",
"value": 5.05
}
},
{
"confidence": 0.90819,
"count": 38,
"objective_summary": {
"categories": [
[
"Iris-virginica",
38
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000002",
"operator": ">",
"value": 5.05
}
}
],
"confidence": 0.86024,
"count": 48,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
2
],
[
"Iris-virginica",
46
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000003",
"operator": ">",
"value": 1.65
}
},
{
"children": [
{
"confidence": 0.92444,
"count": 47,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
47
]
]
},
"output": "Iris-versicolor",
"predicate": {
"field": "000002",
"operator": "<=",
"value": 4.95
}
},
{
"children": [
{
"confidence": 0.43849,
"count": 3,
"objective_summary": {
"categories": [
[
"Iris-virginica",
3
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000000",
"operator": ">",
"value": 6.05
}
},
{
"children": [
{
"confidence": 0.20654,
"count": 1,
"objective_summary": {
"categories": [
[
"Iris-virginica",
1
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000003",
"operator": "<=",
"value": 1.55
}
},
{
"confidence": 0.20654,
"count": 1,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
1
]
]
},
"output": "Iris-versicolor",
"predicate": {
"field": "000003",
"operator": ">",
"value": 1.55
}
}
],
"confidence": 0.09453,
"count": 2,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
1
],
[
"Iris-virginica",
1
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000000",
"operator": "<=",
"value": 6.05
}
}
],
"confidence": 0.37553,
"count": 5,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
1
],
[
"Iris-virginica",
4
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000002",
"operator": ">",
"value": 4.95
}
}
],
"confidence": 0.81826,
"count": 52,
"objective_summary": {
"categories": [
[
"Iris-virginica",
4
],
[
"Iris-versicolor",
48
]
]
},
"output": "Iris-versicolor",
"predicate": {
"field": "000003",
"operator": "<=",
"value": 1.65
}
}
],
"confidence": 0.40383,
"count": 100,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
50
],
[
"Iris-virginica",
50
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000002",
"operator": ">",
"value": 2.45
}
}
],
"confidence": 0.26289,
"count": 150,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
50
],
[
"Iris-setosa",
50
],
[
"Iris-virginica",
50
]
]
},
"output": "Iris-virginica",
"predicate": true
},
"split_criterion": "information_gain_mix",
"support_threshold": 0
},
"name": "iris' dataset model",
"number_of_evaluations": 0,
"number_of_predictions": 0,
"number_of_public_predictions": 0,
"objective_field": "000004",
"objective_fields": [
"000004"
],
"ordering": 0,
"out_of_bag": false,
"price": 0.0,
"private": true,
"project": null,
"randomize": false,
"range": [
1,
150
],
"replacement": false,
"resource": "model/50a454503c1920186d000049",
"rows": 150,
"sample_rate": 1.0,
"selective_pruning": true,
"size": 4608,
"source": "source/50a4527b3c1920186d000041",
"source_status": true,
"stat_pruning": true,
"status": {
"code": 5,
"elapsed": 413,
"message": "The model has been created",
"progress": 1.0
},
"tags": [
"species"
],
"updated": "2012-11-15T02:32:50.149000",
"views": 0,
"white_box": false
}
< Example model JSON response
Filtering a Model
It is possible to filter the tree returned by a GET to the model location by means of two optional query string parameters, namely support and value.
Filter by Support
Support is a number from 0 to 1 that specifies the minimum fraction of the total number of instances that a given branch must cover to be retained in the resulting tree. Thus, asking for a (minimum) support of 0 is just asking for the whole tree, while something like:
curl "https://bigml.io/andromeda/model/50a454503c1920186d000049?$BIGML_AUTH;support=1.0"
Filter Example
will return just the root node, that being the only one that covers all instances. If you repeat the support parameter in the query string, the last one is used. Non-parseable support values are ignored.
Filter by Values and Value Intervals
Value is a concrete value or interval of values (for regression trees) that a leaf must predict to be kept in the returning tree. For instance:
curl "https://bigml.io/andromeda/model/50a454503c1920186d000049?$BIGML_AUTH;value=Iris-setosa"
Filter Example
will return only those branches in the tree whose leaves predict "Iris-setosa" as the value of the (categorical) objective field, while something like:
curl "https://bigml.io/andromeda/model/50a454503c1920186d000049?$BIGML_AUTH;value=[10,20]"
Filter Example
for a regression model will include only those leaves predicting an objective value between 10 and 20. You can also specify exact values for regression models:
curl "https://bigml.io/andromeda/model/50a454503c1920186d000049?$BIGML_AUTH;value=23.2"
Filter Example
will retrieve only those branches whose predictions are exactly 23.2. It is possible to specify multiple values, as in:
curl "https://bigml.io/andromeda/model/50a454503c1920186d000049?$BIGML_AUTH;value=Iris-setosa&value=Iris-versicolor"
curl "https://bigml.io/andromeda/model/50a454503c1920186d000049?$BIGML_AUTH;value=(10,20]&value=[-1.234,3.3)"
curl "https://bigml.io/andromeda/model/50a454503c1920186d000049?$BIGML_AUTH;value=(10.2,20)&value=28.1&value=0.1"
Filter Example
in which case the union of the different predicates is used (i.e., the first query will return a tree with all leaves predicting "Iris-setosa" and all leaves predicting "Iris-versicolor").
Intervals can be closed or open at either end. For example, "(-2,10]", "[1,2)" or "(-1.234,0)". The values of the left or right limits can be omitted, in which case they're taken as negative and positive infinity, respectively; thus "(,3]" denotes all values less than or equal to three, as does "[,3]" (infinity not being a valid value for a numeric prediction), while "(0,)" accepts any positive value.
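For instance, an illustrative request for a regression model (the model ID below is reused from the examples above purely for illustration) that keeps only the leaves predicting a value of 3 or less:
curl "https://bigml.io/andromeda/model/50a454503c1920186d000049?$BIGML_AUTH;value=(,3]"
Filter Example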
Filter by Confidence / Probability / Expected Error
Confidence is a concrete value or interval of values that a leaf must have to be kept in the returned tree. The specification of intervals follows the same conventions as those of value. Since confidence is a continuous value, the most common case will be asking for a range, but the service will also accept individual values. It's also possible to specify both a value and a confidence. For instance:
curl "https://bigml.io/andromeda/model/50a454503c1920186d000049?$BIGML_AUTH;value=Iris-setosa&confidence=[0.3,]"
Filter Example
asks for a tree with only those leaves that predict "Iris-setosa" with a confidence greater than or equal to 0.3, while
curl "https://bigml.io/andromeda/model/50a454503c1920186d000049?$BIGML_AUTH;confidence=[,0.25)"
Filter Example
returns a model keeping only those leaves whose confidence is strictly less than 0.25. Confidence filters work for both classification and regression problems, since the regression expected error is called confidence in our JSON. If desired (and only for regression), one can specify a filter using expected_error instead:
curl "https://bigml.io/andromeda/model/50a454503c1920186d000049?$BIGML_AUTH;expected_error=[,0.25)"
Filter Example
If you specify both confidence and expected_error, only one of them will be used: confidence for classifications, expected_error for regressions. If only confidence is specified, it will always be used (confidence is an alias for the expected error in regressions). If only expected_error is specified, it will only be used if the model is a regression.
Filters by probability work exactly like filters by confidence, substituting probability for confidence. As a consequence, they only have an effect on classification problems.
Finally, note that it is also possible to specify support, value, confidence, probability, and expected_error parameters in the same query.
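For example, this illustrative request combines a support filter with a value and a confidence filter on the same model:
curl "https://bigml.io/andromeda/model/50a454503c1920186d000049?$BIGML_AUTH;support=0.1&value=Iris-setosa&confidence=[0.3,]"
Filter Example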
PMML
The default model output format is JSON. However, the pmml parameter allows you to include a PMML version of the model. The response will then include an XML document that conforms to PMML v4.1. For example:
curl "https://bigml.io/andromeda/model/50a454503c1920186d000049?$BIGML_AUTH;pmml=yes"
PMML Example
curl "https://bigml.io/andromeda/model/50a454503c1920186d000049?$BIGML_AUTH;pmml=only"
PMML Example
Filtering and Paginating Fields from a Model
A model might be composed of hundreds or even thousands of fields. Thus, when retrieving a model, it's possible to specify that only a subset of fields be retrieved, by using field filtering and pagination parameters in the query string (unrecognized parameters are ignored).
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the model is the same as the one you would get without any of the filtering parameters above.
The fields_meta field can also help you paginate fields.
Updating a Model
To update a model, you need to PUT an object containing the fields that you want to update to the model's base URL. The content-type must always be "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated model.
For example, to update a model with a new name you can use curl like this:
curl "https://bigml.io/andromeda/model/50a454503c1920186d000049?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name"}' \
-H 'content-type: application/json'
$ Updating a model's name
If you want to update a model with a new label and description for a specific field you can use curl like this:
curl "https://bigml.io/andromeda/model/50a454503c1920186d000049?$BIGML_AUTH" \
-X PUT \
-d '{"fields": {"000000": {"label": "a longer name", "description": "an even longer description"}}}' \
-H 'content-type: application/json'
$ Updating a model's field, label, and description
Deleting a Model
To delete a model, you need to issue an HTTP DELETE request to the model/id to be deleted.
Using curl you can do something like this to delete a model:
curl -X DELETE "https://bigml.io/andromeda/model/50a454503c1920186d000049?$BIGML_AUTH"
$ Deleting a model from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a model, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a model a second time, or a model that does not exist, you will receive a "404 not found" response.
However, if you try to delete a model that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Models
To list all the models, you can use the model base URL. By default, only the 20 most recent models will be returned. You can see below how to change this number using the limit parameter.
You can get your list of models directly in your browser using your own username and API key with the following links.
https://bigml.io/andromeda/model?$BIGML_AUTH
> Listing models from a browser
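For example, to change the default of 20 and retrieve up to 50 of your most recent models, you can add the limit parameter to the query string:
curl "https://bigml.io/andromeda/model?$BIGML_AUTH;limit=50"
> Listing up to 50 models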
Weights
BigML.io has added three new ways in which you can use weights to deal with imbalanced datasets:
- Weight Field: considering the values of one of the fields in the dataset as weights for the instances. This is valid for both regression and classification models.
- Objective Weights: submitting a specific weight for each class in classification models.
- Automatic Balancing: setting the balance_objective argument to true to let BigML automatically balance all the classes evenly.
Let's see each method in more detail.
Weight Field
A weight_field may be declared for either regression or classification models. Any numeric field with no negative or missing values is valid as a weight field. Each instance will be weighted individually according to the weight field's value. See the toy dataset for credit card transactions below.
online, transaction, pending transactions, days since last transaction, distance, transactions today, balance, mtd, fraud, weight
yes, 10, 3, 31, low, 3, -3250, -1500, no, 1
no, 20, 30, 1, high, 0, 0, -300, no, 1
no, 40, 13, 210, low, 1, -19890, -30, no, 1
yes, 500, 0, 1, high, 0, 0, 0, yes, 10
no, 10, 1, 32, low, 0, -2500, -7891, no, 1
yes, 100, 0, 3, low, 0, -5194, -120, no, 1
yes, 100, 1, 4, low, 0, 0, 1500, no, 1
yes, 1000, 0, 1, high, 0, 0, 0, yes, 10
no, 150, 3, 1, low, 5, -3250, 1500, no, 1
no, 75, 5, 1, high, 1, -3250, 1500, no, 1
yes, 10, 23, 0, low, 1, -3250, 1500, no, 1
yes, 10, 3, 31, low, 3, -3250, -1500, no, 1
Example CSV file
curl "https://bigml.io/andromeda/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bcdc513c1920e4a300006e",
"objective_field": "000008",
"weight_field": "000009"
}'
> Using a weight field to create a new model
With Flatline, you can define arbitrarily complex functions to produce weight fields, making this the most flexible and powerful way to produce weighted models.
For instance, the request below would create a new dataset from the example above, adding a new weight field that takes the previous weight and multiplies it by two when the transaction is fraudulent and its amount is higher than 500.
curl "https://bigml.io/andromeda/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bcdc513c1920e4a300006e",
"new_fields": [{
"field": "(if (and (= (f fraud) \"yes\") (> (f transaction) 500)) (* (f weight) 2) (f weight))",
"name": "new weight"}]
}'
> Creating a new weight field
Objective Weights
The second method for adding weights only applies to classification models. A set of objective_weights may be defined, one per objective class. Each instance will be weighted according to its class weight.
curl "https://bigml.io/andromeda/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bcdc513c1920e4a300006e",
"objective_field": "000008",
"excluded_fields": ["000009"],
"objective_weights": [["yes", 10], ["no", 1]]
}'
> Using objective_weights to create a new model
If a class is not listed in the objective_weights, it is assumed to have a weight of 1. This means the example below is equivalent to the example above.
curl "https://bigml.io/andromeda/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bcdc513c1920e4a300006e",
"objective_field": "000008",
"excluded_fields": ["000009"],
"objective_weights": [["yes", 10]]
}'
> Using objective_weights to create a new model
Weights of zero are valid as long as there are some positive-valued weights. If every weight ends up zero (which is possible with sampled datasets), then the resulting model will have a single node with a nil output.
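As an illustration (reusing the dataset and field IDs from the examples above), the following sketch gives all the weight to the "yes" class and none to the "no" class:
curl "https://bigml.io/andromeda/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bcdc513c1920e4a300006e",
"objective_field": "000008",
"excluded_fields": ["000009"],
"objective_weights": [["yes", 1], ["no", 0]]
}'
> Using a zero weight for a class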
Automatic Balancing
Finally, we provide a convenience shortcut, the balance_objective flag, for specifying weights for a classification objective that are inversely proportional to their category counts.
For instance, if the category counts of the objective field are, say:
[["Iris-versicolor", 20], ["Iris-virginica", 10], ["Iris-setosa", 5]]
Category counts
the request:
curl "https://bigml.io/andromeda/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bb141a3c19200bdf00000c",
"balance_objective": true
}'
> Using balance_objective to create a new model
would be equivalent to the following, where each class weight is inversely proportional to its category count (so that 20 × 1 = 10 × 2 = 5 × 4 = 20):
curl "https://bigml.io/andromeda/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bb141a3c19200bdf00000c",
"objective_weights": [
["Iris-versicolor", 1],
["Iris-virginica", 2],
["Iris-setosa", 4]]}'
> Using objective_weights to create a new model
The next table summarizes all the available arguments to use weights.
The nodes of a weighted tree will include a weight and a weighted_objective_summary, which are the weighted analogs of count and objective_summary. Confidence, importance, and pruning calculations also take weights into account.
{
"id":0,
"children":[
{
"id":1,
"children":[
{
"output":"Iris-virginica",
"count":10,
"objective_summary":{
"categories":[
[
"Iris-virginica",
10
]
]
},
"predicate":{
"value":1.7,
"operator":">",
"field":"000003"
},
"weighted_objective_summary":{
"categories":[
[
"Iris-virginica",
10
]
]
},
"weight":10,
"confidence":0.72246,
"id":2
},
{
"output":"Iris-versicolor",
"count":20,
"objective_summary":{
"categories":[
[
"Iris-versicolor",
20
]
]
},
"predicate":{
"value":1.7,
"operator":"<=",
"field":"000003"
},
"weighted_objective_summary":{
"categories":[
[
"Iris-versicolor",
20
]
]
},
"weight":20,
"confidence":0.83887,
"id":3
}
],
"weighted_objective_summary":{
"categories":[
[
"Iris-versicolor",
20
],
[
"Iris-virginica",
10
]
]
},
"weight":30,
"predicate":{
"value":0.6,
"operator":">",
"field":"000003"
},
"confidence":0.4878,
"count":30,
"output":"Iris-versicolor",
"objective_summary":{
"categories":[
[
"Iris-versicolor",
20
],
[
"Iris-virginica",
10
]
]
}
},
{
"output":"Iris-setosa",
"count":5,
"objective_summary":{
"categories":[
[
"Iris-setosa",
5
]
]
},
"predicate":{
"value":0.6,
"operator":"<=",
"field":"000003"
},
"weighted_objective_summary":{
"categories":[
[
"Iris-setosa",
100
]
]
},
"weight":100,
"confidence":0.56551,
"id":4
}
],
"weighted_objective_summary":{
"categories":[
[
"Iris-setosa",
100
],
[
"Iris-versicolor",
20
],
[
"Iris-virginica",
10
]
]
},
"weight":130,
"predicate":true,
"confidence":0.60745,
"count":35,
"output":"Iris-setosa",
"objective_summary":{
"categories":[
[
"Iris-versicolor",
20
],
[
"Iris-virginica",
10
],
[
"Iris-setosa",
5
]
]
}
}
< Example weighted model JSON response
Ensembles
Last Updated: Thursday, 2020-10-08 20:05
Depending on the nature of your data and the specific parameters of the ensemble, you can significantly boost predictive performance over single models built on exactly the same data.
You can create an ensemble just as you would create a model, using any of the following three basic machine learning techniques: bagging, random decision forests, and gradient tree boosting.
Bagging, also known as bootstrap aggregating, is one of the simplest ensemble-based strategies, yet it often outperforms strategies that are more complex. The basic idea is to use a different random subset of the original dataset for each model in the ensemble. Specifically, BigML by default uses a sampling rate of 100% with replacement for each model. You can read more about bagging here.
Random decision forests are the second ensemble-based strategy that BigML provides. Essentially, they consist of selecting a new random subset of the input fields at each split while an individual model is being built, instead of considering all the input fields. To create a random decision forest, you just need to set the randomize argument to true, as sketched below. You can read more about random decision forests here.
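A minimal sketch, reusing the dataset ID from the balancing examples above (any valid dataset ID of yours would do), that POSTs to the ensemble base URL with randomize set to true:
curl "https://bigml.io/andromeda/ensemble?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bb141a3c19200bdf00000c",
"randomize": true}'
> Creating a random decision forest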
Gradient tree boosting is the third strategy; its predictions are additive, with each tree correcting the predictions of the previously grown trees. You must specify the boosting argument in order to apply this technique.
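A hedged sketch of a boosted ensemble request follows; the boosting argument is passed as a map, and the iterations sub-option shown here (capping the number of boosting rounds) is only illustrative of that map's contents:
curl "https://bigml.io/andromeda/ensemble?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bb141a3c19200bdf00000c",
"boosting": {"iterations": 10}}'
> Creating a boosted ensemble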

BigML.io allows you to create, retrieve, update, and delete your ensemble. You can also list all of your ensembles.