This routine implements K-means clustering using the Pham-Dimov-Nguyen algorithm for
choosing the best K.
Reference: 'Selection of K in K-means clustering',
Proc. IMechE, Part C: J. Mechanical Engineering Science, v.219.
Inputs:
* dataset: (string) Dataset ID for the dataset to be clustered
* cluster-args: (map) Cluster arguments for the cluster search operation
* k-min: (number) Minimum value of k
* k-max: (number) Maximum value of k
* bestcluster-args: (map) Cluster arguments for the final best cluster operation
* clean: (boolean) Delete all but the optimal cluster
* logf: (boolean) Enable logging
Output: (batchcentroid) Batchcentroid for best K-means clustering

This routine uses the Pham-Dimov-Nguyen algorithm to create a WhizzML
batchcentroid object and WhizzML dataset annotated with the best K-means
clustering of the supplied dataset.
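
As a rough illustration of the selection criterion in the cited paper, the sketch below computes the evaluation function f(K) from per-K distortions and picks the K that minimizes it. It is written in Python for readability and is not the script's WhizzML source. The names alpha, f_k, and choose_best_k are hypothetical, and how the script itself obtains the distortion for K = k-min - 1 is not stated here, so the sketch simply skips any K whose predecessor is missing.

```python
def alpha(k, n_dims):
    """Weight factor alpha_K from the paper (defined for k >= 2, n_dims > 1)."""
    if k < 2 or n_dims < 2:
        raise ValueError("alpha is defined for k >= 2 and n_dims > 1")
    a = 1.0 - 3.0 / (4.0 * n_dims)       # alpha_2
    for _ in range(k - 2):               # alpha_K = alpha_{K-1} + (1 - alpha_{K-1}) / 6
        a += (1.0 - a) / 6.0
    return a


def f_k(k, s_k, s_prev, n_dims):
    """Evaluation function f(K); s_k and s_prev are the distortions S_K and S_{K-1}."""
    if k == 1 or s_prev == 0:
        return 1.0
    return s_k / (alpha(k, n_dims) * s_prev)


def choose_best_k(distortions, n_dims):
    """distortions maps K -> S_K. Returns (best K, {K: f(K)}).

    Assumption of this sketch: a K > 1 is scored only if S_{K-1} is present.
    """
    scores = {}
    for k, s_k in distortions.items():
        if k == 1:
            scores[k] = 1.0
        elif k - 1 in distortions:
            scores[k] = f_k(k, s_k, distortions[k - 1], n_dims)
    best = min(scores, key=scores.get)   # smaller f(K) is better
    return best, scores


# Made-up distortions for a 2-dimensional dataset.
best, scores = choose_best_k({1: 100.0, 2: 30.0, 3: 25.0, 4: 22.0}, n_dims=2)
print(best)  # 2
```

The distortion drops sharply from K = 1 to K = 2 and only slightly afterwards, so f(2) is the smallest score and K = 2 is selected.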

The cluster-args and bestcluster-args parameters are maps that one can
use to optionally specify any of the parameters for the cluster function
except the dataset, k, and name parameters. (See the 'Clusters Arguments'
table in the BigML 'Clusters' documentation for details.) cluster-args is
used in the search phase for the best K. bestcluster-args allows one to
specify different args for the final stage of clustering with the best K.
In particular, one might cluster samples of the dataset during the search
phase to save time and other resources, then do the best clustering on the
full dataset.
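
The sampling idea above might look like the following. This is a Python sketch, not the script's WhizzML code. The helper build_cluster_request is hypothetical, the dataset ID is a placeholder, and sample_rate is taken to be the relevant BigML cluster argument (check the 'Clusters Arguments' table before relying on it). The sketch also assumes that script-controlled keys override anything the caller supplies, which the text does not confirm.

```python
# Keys the routine sets itself; user-supplied maps cannot specify them.
RESERVED = ("dataset", "k", "name")


def build_cluster_request(dataset_id, k, name, user_args):
    """Combine user-supplied cluster arguments with the script-controlled ones.

    Assumption of this sketch: reserved keys in user_args are silently dropped.
    """
    request = {key: value for key, value in user_args.items()
               if key not in RESERVED}
    request.update({"dataset": dataset_id, "k": k, "name": name})
    return request


# Search phase: cluster a 20% sample to save time and resources.
cluster_args = {"sample_rate": 0.2}
# Final phase: cluster the full dataset with the best K.
bestcluster_args = {"sample_rate": 1.0}

dataset_id = "dataset/000000000000000000000000"  # placeholder ID
search_request = build_cluster_request(dataset_id, 3, "search k=3", cluster_args)
final_request = build_cluster_request(dataset_id, 3, "best k=3", bestcluster_args)
print(search_request)
print(final_request)
```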

If bestcluster-args matches cluster-args, (best-k-means ...) returns the
result for the best K that was already generated with cluster-args during
the search phase. If bestcluster-args differs from cluster-args, the
dataset is re-clustered with the best K using bestcluster-args, and
(best-k-means ...) returns that result instead.
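
The reuse rule can be sketched as follows, again in Python rather than WhizzML. The function final_cluster and the recluster callback are hypothetical names, and "matches" is interpreted here as plain map equality, which is an assumption about the script's behavior.

```python
def final_cluster(best_k, cluster_args, bestcluster_args, search_results, recluster):
    """Return the clustering to report for best_k.

    search_results maps K -> the clustering built with cluster_args during
    the search. recluster(k, args) stands in for a new clustering request.
    """
    if bestcluster_args == cluster_args:
        # Same arguments: the search-phase result is already the answer.
        return search_results[best_k]
    # Different arguments: cluster again with the best K.
    return recluster(best_k, bestcluster_args)


calls = []

def fake_recluster(k, args):
    """Stub that records the call instead of contacting any service."""
    calls.append((k, args))
    return "recluster-k{}".format(k)


search_results = {2: "search-k2", 3: "search-k3"}

# Identical argument maps: no extra clustering is requested.
same = final_cluster(3, {"sample_rate": 0.2}, {"sample_rate": 0.2},
                     search_results, fake_recluster)
# Different argument maps: one extra clustering with the best K.
different = final_cluster(3, {"sample_rate": 0.2}, {"sample_rate": 1.0},
                          search_results, fake_recluster)
print(same, different, len(calls))
```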