Clustering

Applications of Clustering

Cluster Analysis is commonly used to solve business problems such as market and customer segmentation, competitive intelligence, and portfolio management, as well as for image analysis, information retrieval, bioinformatics, and data compression. In addition, clusters can enhance your classification and regression models by supplying new cluster-specific features.

Best-in-class algorithm

BigML clusters use optimized versions of the K-means and G-means algorithms to group instances according to a distance measure computed from the values of the input fields. Each cluster is represented by its center (or centroid). All BigML field types (categorical, numeric, text, and items) are valid inputs for Cluster Analysis, with two caveats. First, numeric fields are automatically scaled so that their different magnitudes do not bias the distance calculation. Second, Cluster Analysis does not tolerate missing values in numeric fields: BigML provides several strategies for handling them, or the affected instances can be excluded entirely when computing the clusters. BigML clusters can be built using two different unsupervised learning algorithms:

K-means: the user needs to specify the number of clusters in advance.
G-means: the algorithm automatically learns the number of different clusters by iteratively taking existing cluster groups and testing whether the cluster's neighborhood appears Gaussian in its distribution.
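
To make the scaling caveat and the K-means option concrete, here is a minimal sketch of textbook K-means (Lloyd's algorithm) applied to standardized numeric fields. It is not BigML's optimized implementation, and it does not cover G-means. The function names, the sample data, and the choice of zero-mean, unit-variance scaling are assumptions made for illustration.

```python
import math
import random


def standardize(rows):
    """Scale each numeric column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = []
    for col, mean in zip(columns, means):
        variance = sum((v - mean) ** 2 for v in col) / len(col)
        stds.append(math.sqrt(variance) or 1.0)  # constant columns map to 0
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stds))
        for row in rows
    ]


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm: assign to nearest centroid, then recompute."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignments = []
    for _ in range(iterations):
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        updated = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                updated.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return centroids, assignments


# Hypothetical customers: (age in years, annual spend in dollars).
customers = [(25, 20000), (27, 22000), (26, 21000),
             (55, 90000), (57, 95000), (56, 88000)]
centroids, labels = kmeans(standardize(customers), k=2)
```

Without the `standardize` step, the spend column (tens of thousands) would dominate the age column (tens) in every distance, which is the bias that automatic scaling is meant to avoid.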

Highly interpretable results

Cluster Analysis is well suited to exploratory analysis of your data, but its results are not always easy to interpret. BigML provides a planets view visualization that depicts your clusters in two-dimensional space, placing clusters closer together when they are more similar and farther apart when they are dissimilar. The size of each planet is proportional to the number of data instances it contains. In addition, each cluster can be inspected separately through a data panel, which contains a histogram of the distances from the cluster's member instances to its centroid, along with summary statistics for the centroid. Furthermore, the "model clusters" configuration option automatically creates a decision tree for every cluster, which can help you understand the most important features that determine membership of that cluster. Finally, you can download a Summary Report for your clusters, which describes the distribution of data across your clusters as well as the associated features and data distances.
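
The distance histogram in the data panel can be approximated with a few lines of code. The sketch below measures the Euclidean distance from each member to the centroid and counts the distances in equal-width bins. The binning scheme, the function name, and the sample points are assumptions; BigML's own histogram may be computed differently.

```python
import math


def distance_histogram(members, centroid, bins=4):
    """Count member-to-centroid distances in equal-width bins from 0 to max."""
    distances = [math.dist(p, centroid) for p in members]
    top = max(distances)
    width = top / bins if top > 0 else 1.0
    counts = [0] * bins
    for d in distances:
        index = min(int(d / width), bins - 1)  # the maximum goes in the last bin
        counts[index] += 1
    return counts, width


# Hypothetical, already-scaled members of one cluster and its centroid.
members = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0), (0.0, 2.0), (6.0, 8.0)]
centroid = (0.0, 0.0)
counts, width = distance_histogram(members, centroid, bins=5)
print(counts, width)
```

A histogram concentrated in the first bins suggests a tight cluster, while a long tail points to members that sit far from the centroid.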

Real-time or customizable Centroid Predictions

Using a cluster model, you can predict the closest centroid for a new data instance in real time. For instance, you can assign a new customer to the most relevant customer segment so that they start receiving the right offers. You can also compute centroid predictions in batches using an existing cluster model and a new dataset.
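
Conceptually, a centroid prediction is a nearest-centroid lookup. The following sketch shows the idea for a single instance and for a batch, assuming numeric fields that have already been scaled the same way as the training data. The centroid names and coordinates are hypothetical, and the code does not call BigML.

```python
import math

# Hypothetical centroids of an existing cluster model, on scaled fields.
CENTROIDS = {
    "Cluster 0": (-1.0, -1.0),
    "Cluster 1": (1.0, 1.0),
}


def predict_centroid(instance, centroids=CENTROIDS):
    """Return (name, distance) of the centroid closest to the instance."""
    name = min(centroids, key=lambda n: math.dist(instance, centroids[n]))
    return name, math.dist(instance, centroids[name])


def batch_predict(instances, centroids=CENTROIDS):
    """Assign every instance in a batch to its closest centroid."""
    return [predict_centroid(i, centroids)[0] for i in instances]


new_customer = (0.8, 1.3)
segment, distance = predict_centroid(new_customer)
batch = batch_predict([(0.8, 1.3), (-0.5, -2.0)])
```

Here `segment` identifies the customer segment for the new instance, and `distance` indicates how typical that instance is for the segment.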

Fully programmable Clustering

In addition to the point-and-click mode of the BigML Dashboard, any cluster model can be built programmatically via BigML's REST API and its bindings for popular languages, including Python, Node.js, Java, Swift, and C#. Clusters are also supported by WhizzML, our domain-specific language for automating Machine Learning workflows, implementing high-level Machine Learning algorithms, and sharing them with others.
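
As an illustration of the programmatic route, here is a short sketch using the BigML Python bindings (the `bigml` package). The file name `customers.csv` and the field names `age` and `annual_spend` are hypothetical, and the exact arguments accepted when creating a cluster should be checked against the API documentation.

```python
def build_cluster(api, csv_path, k=3):
    """Create source -> dataset -> cluster, waiting for each to finish."""
    source = api.create_source(csv_path)
    api.ok(source)
    dataset = api.create_dataset(source)
    api.ok(dataset)
    cluster = api.create_cluster(dataset, {"k": k})  # K-means with k clusters
    api.ok(cluster)
    return cluster


def main():
    # Imported here so the sketch can be read without the package installed.
    from bigml.api import BigML

    api = BigML()  # credentials come from BIGML_USERNAME and BIGML_API_KEY
    cluster = build_cluster(api, "customers.csv", k=3)  # hypothetical file
    # Hypothetical field names; they must match the fields in your dataset.
    centroid = api.create_centroid(cluster, {"age": 31, "annual_spend": 42000})
    api.ok(centroid)
    return centroid
```

Calling `main()` requires a BigML account and network access. `build_cluster` takes the API client as a parameter so that the workflow can be exercised with a stub client in tests.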

Automatically segment your data into separate groups

Clustering Training Video

Cluster Analysis Documentation

Dashboard Documentation

Grouping your Data by Similarity
