Clustering

Automatically segment your data into separate groups

Cluster Analysis is a Machine Learning task whose goal is to segment your data into separate groups. The instances in the same group, called a cluster, are more similar to each other than to those in other groups. Because Cluster Analysis does not require previously labeled data, it falls under the category of unsupervised learning. As a main component of exploratory data mining, Cluster Analysis is often an iterative process that requires some trial and error until you find the most useful grouping of your data instances.

Applications of Clustering

Cluster Analysis is commonly used to solve business problems such as market and customer segmentation, competitive intelligence, and portfolio management, as well as for image analysis, information retrieval, bioinformatics, and data compression. In addition, clusters can enhance your classification and regression models with new cluster-specific features.

Best-in-class algorithm

BigML clusters use optimized versions of the K-means and G-means algorithms to group instances according to a distance measure computed from the values of the input fields. Each cluster is represented by its center (or centroid). All BigML field types (categorical, numeric, text, and items) are valid inputs for Cluster Analysis, with two caveats. First, numeric fields are automatically scaled so that their different magnitudes do not bias the distance calculation. Second, Cluster Analysis does not tolerate missing values in numeric fields: BigML provides several strategies for handling them, or the affected instances can be excluded entirely when computing the clusters. BigML clusters can be built using two different unsupervised learning algorithms:

  • K-means: the user needs to specify the number of clusters in advance.
  • G-means: the algorithm automatically learns the number of different clusters by iteratively taking existing cluster groups and testing whether the cluster's neighborhood appears Gaussian in its distribution.
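To make the K-means option concrete, here is a minimal textbook sketch (Lloyd's algorithm) in plain Python. It is not BigML's optimized implementation; the function names, the toy data, and the choice of zero-mean, unit-variance scaling are assumptions made purely for illustration of why numeric fields are scaled before distances are computed.

```python
import math
import random


def standardize(rows):
    """Scale each numeric column to zero mean and unit variance (illustrative choice)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = []
    for col, mean in zip(cols, means):
        variance = sum((v - mean) ** 2 for v in col) / len(col)
        stds.append(math.sqrt(variance) or 1.0)  # constant columns: avoid dividing by zero
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm: assign to nearest centroid, then recompute centroids."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    assignments = None
    for _ in range(iterations):
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(row, centroids[j]))
            for row in rows
        ]
        if new_assignments == assignments:
            break  # assignments are stable, so the centroids will not move again
        assignments = new_assignments
        for j in range(k):
            members = [row for row, a in zip(rows, assignments) if a == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Toy data: the second field is about 1000x larger than the first.
data = [[1.0, 1000.0], [1.2, 1100.0], [0.9, 900.0],
        [8.0, 9000.0], [8.3, 9100.0], [7.9, 8800.0]]
centroids, labels = kmeans(standardize(data), k=2)
print(labels)
```

Without the scaling step, the second field would dominate every distance simply because its values are larger, which is the bias the automatic scaling is meant to avoid.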

Highly interpretable results

Cluster Analysis tasks are great for exploratory analysis of your data, but their results are not always easy to interpret. BigML provides an insightful planets view visualization that depicts your clusters in two-dimensional space, where clusters are placed closer to one another if they are more similar and farther apart if they are very dissimilar. The size of each planet is proportional to the number of data instances it contains. In addition, each cluster can be inspected separately through a data panel containing a distance histogram, which shows the distribution of distances between the cluster's member data points and its centroid. The data panel also includes the summary statistics of the cluster centroid. Furthermore, the "model clusters" configuration option automatically creates a decision tree for every cluster, which can help you understand the high-importance features that define the rules of membership of any given cluster. Finally, you can download a Summary Report for your clusters. This report tells you how your data is distributed across the clusters, along with the associated features and data distances.
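The distance histogram described above can be approximated with a few lines of Python. This is only a sketch of the idea: the function name, the use of Euclidean distance, the equal-width binning, and the sample points are all assumptions, not a description of how the Dashboard computes its histogram.

```python
import math


def distance_histogram(members, centroid, bins=5):
    """Bin the Euclidean distances of cluster members from their centroid."""
    distances = [math.dist(point, centroid) for point in members]
    lo, hi = min(distances), max(distances)
    width = (hi - lo) / bins or 1.0  # all distances equal: fall back to a unit width
    counts = [0] * bins
    for d in distances:
        index = min(int((d - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


# Hypothetical cluster whose members are symmetric around the centroid (0, 0).
centroid = (0.0, 0.0)
members = [(3.0, 4.0), (-3.0, -4.0), (0.0, 1.0), (0.0, -1.0), (6.0, 8.0), (-6.0, -8.0)]
edges, counts = distance_histogram(members, centroid)
print(counts)
```

A histogram concentrated in the low bins suggests a tight cluster, while a long tail points to members that sit far from the centroid.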

Real-time or customizable Centroid Predictions

Using a cluster model, you can predict the closest centroid for a new data instance in real time. For instance, you can assign a new customer to the most relevant customer segment so that they start receiving the right offers. You can also compute centroids in batch by applying an existing cluster model to a new dataset.
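Conceptually, a centroid prediction is a nearest-neighbor lookup against the cluster centers. The sketch below illustrates this for a single instance and for a batch; the segment names, the two already-scaled numeric fields, and the Euclidean distance are hypothetical choices for the example, not BigML's internals.

```python
import math


def nearest_centroid(instance, centroids):
    """Return the name of the closest centroid and the distance to it."""
    name = min(centroids, key=lambda n: math.dist(instance, centroids[n]))
    return name, math.dist(instance, centroids[name])


# Hypothetical customer segments, described by two numeric fields that are
# assumed to be already scaled.
segments = {
    "budget": (-1.0, -0.8),
    "regular": (0.0, 0.1),
    "premium": (1.5, 1.2),
}

# Single prediction for one new customer.
segment, distance = nearest_centroid((1.2, 0.9), segments)

# Batch prediction: apply the same lookup to every row of a new dataset.
new_customers = [(-0.9, -1.0), (0.1, 0.0)]
batch_segments = [nearest_centroid(row, segments)[0] for row in new_customers]
print(segment, batch_segments)
```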

Fully programmable Clustering

In addition to the point-and-click mode of the BigML Dashboard, any cluster model can be built programmatically via BigML's REST API and its bindings for popular languages. You can choose to use BigML with Python, Node.js, Java, Swift, C#, or other languages. Clusters are also supported by WhizzML, our domain-specific language for automating Machine Learning workflows, implementing high-level Machine Learning algorithms, and sharing them with others.
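As a rough sketch of the programmatic route, the function below uses the BigML Python bindings to go from a CSV file to a centroid prediction. It assumes the `bigml` package is installed and that credentials are available in the `BIGML_USERNAME` and `BIGML_API_KEY` environment variables; the function name, the file path, the value of `k`, and the input field names are hypothetical and should be adapted to your own data.

```python
def build_cluster_and_predict(csv_path, new_instance, k=3):
    """Create source -> dataset -> cluster, then ask for the closest centroid."""
    # Imported here so this module can be loaded without the bindings installed.
    from bigml.api import BigML

    api = BigML()  # picks up credentials from the environment

    source = api.create_source(csv_path)
    api.ok(source)  # wait until the resource is finished

    dataset = api.create_dataset(source)
    api.ok(dataset)

    # Passing "k" fixes the number of clusters in advance (K-means).
    cluster = api.create_cluster(dataset, {"k": k})
    api.ok(cluster)

    centroid = api.create_centroid(cluster, new_instance)
    api.ok(centroid)
    return centroid["object"]  # the API's description of the centroid resource


# Example call (hypothetical file and field names; requires valid credentials):
# result = build_cluster_and_predict(
#     "customers.csv", {"age": 34, "monthly spend": 120.5}, k=4)
```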
