This script takes as inputs a cluster identifier, an instance p (a map with values for all the fields used by the cluster), and a positive count n. It then:
Finds the centroid in the cluster closest to the given instance p.
Selects within that centroid's dataset the n instances that are closest to p.
If there are fewer than n rows in the centroid's dataset, reads the missing instances from the next closest centroid.
This workflow uses Flatline to compute the distance between p and the rows of the centroid datasets (via the row-distance-squared Flatline function), adds that distance to the dataset as an extra column, and then creates a sample of the result ordered by the computed distance.
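The selection logic described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the script's actual WhizzML or Flatline code: the names squared_distance and closest_instances are hypothetical, all fields are assumed numeric, a plain squared Euclidean distance stands in for row-distance-squared, and the sketch keeps moving on to further centroids if the second closest one is still not enough.

```python
def squared_distance(p, row, fields):
    """Squared Euclidean distance over the given numeric fields (assumed stand-in)."""
    return sum((p[f] - row[f]) ** 2 for f in fields)


def closest_instances(p, centroids, datasets, n):
    """Return up to n rows closest to p, visiting centroids nearest-first.

    centroids: {name: centre point as a dict of field values}
    datasets:  {name: list of row dicts assigned to that centroid}
    """
    fields = sorted(p)
    # Rank centroids by their distance to p, nearest first.
    ranked = sorted(
        centroids, key=lambda c: squared_distance(p, centroids[c], fields)
    )
    selected = []
    for name in ranked:
        missing = n - len(selected)
        if missing <= 0:
            break
        # Order this centroid's rows by distance to p and take what is still missing.
        rows = sorted(
            datasets[name], key=lambda r: squared_distance(p, r, fields)
        )
        selected.extend(rows[:missing])
    return selected
```

For example, if the closest centroid holds only two rows and n is 3, the call returns those two rows ordered by distance, followed by the nearest row of the next closest centroid.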
The input instance can be specified using either field identifiers or field names.
This script implements a variation on the k-means-- algorithm proposed by Sanjay Chawla and Aristides Gionis in their paper "k-means--: A unified approach to clustering and outlier detection".
Given a dataset, a number of clusters k, and a number of anomalies l, this script creates a BigML k-means cluster. The l instances that are farthest from their centroids are then removed and another BigML k-means cluster is created. This process is repeated until the Jaccard index between successive sets of anomalies reaches the given threshold, or until the maximum number of iterations is reached.
Inputs:
dataset: the dataset of interest
k: the number of clusters desired
l: the number of anomalies to be removed at each step
threshold: the minimum Jaccard index between the anomaly sets of consecutive iterations at which the process stops
maximum: the maximum number of iterations
Outputs:
cluster: the id of the final cluster
dataset-id: the original dataset extended with fields for cluster membership and distance to centroid
anomalies: a list of the anomalous instances
similarities: a list of the similarity coefficients from each step
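The iteration described above can be sketched in plain Python as follows. This is a rough illustration of one reading of the description, not the script itself: a simple Lloyd's k-means with a deterministic start stands in for the BigML cluster, the function names are hypothetical, points are plain numeric tuples, and the anomalies are recomputed from all points in every round. It returns only the centroids, anomalies, and similarities, not an extended dataset.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means, seeded with the first k points (a stand-in)."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        # Recompute each centre as the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def jaccard(a, b):
    """Jaccard index of two sets; two empty sets count as identical."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def nearest_distance(p, centroids):
    """Distance from p to its closest centroid."""
    return min(math.dist(p, c) for c in centroids)


def kmeans_minus_minus(points, k, l, threshold, maximum):
    """Return (centroids, anomalies, similarities) for the sketched loop."""
    anomalies, similarities = set(), []
    centroids = kmeans(points, k)
    for _ in range(maximum):
        # The l points farthest from their nearest centroid are the anomalies.
        order = sorted(
            range(len(points)),
            key=lambda i: nearest_distance(points[i], centroids),
            reverse=True,
        )
        farthest = set(order[:l])
        similarities.append(jaccard(anomalies, farthest))
        anomalies = farthest
        if similarities[-1] >= threshold:
            break
        # Re-cluster without the current anomalies.
        kept = [p for i, p in enumerate(points) if i not in anomalies]
        centroids = kmeans(kept, k)
    return centroids, [points[i] for i in sorted(anomalies)], similarities
```

With threshold set to 1.0, the loop stops as soon as two consecutive rounds flag exactly the same l instances; lower thresholds tolerate partial overlap between rounds.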