Best k-means batchcentroid

Best k-means batchcentroid | BigML.com

FREE

Best K-means Batchcentroid

Details:

Created	Wed, 19 Apr 2017 16:44:22 +0000
Published	Thu, 20 Apr 2017 17:33:19 +0000

Description:

This routine implements K-means clustering using the Pham-Dimov-Nguyen algorithm for choosing the best K.

'Selection of K in K-means clustering' Proc. IMechE, Part C: J. Mechanical Engineering Science, v.219

Inputs: * dataset: (string) Dataset ID for the dataset to be clustered * cluster-args: (map) cluster arguments for the cluster search operation * k-min: (number) minimum value of k * k-max: (number) maximum value of k * bestcluster-args: (map) cluster arguments for the final best cluster operation * clean: (boolean) Delete all but the optimal cluster * logf: (boolean) Enable logging

Output: (batchcentroid) Batchcentroid for best K-means clustering

This routine uses the Pham-Dimov-Nguyen algorithm to create a WhizzML batchcentroid object and WhizzML dataset annotated with the best K-means clustering of the supplied dataset.

The clusters-args and bestcluster-args parameters are maps that one can use to optionally specify all the parameters for the cluster function except the dataset, k, and name parameters. (See the 'Clusters Arguments' table in the BigML 'Clusters' documentation for details.) cluster-args is used in the search phase for the best K. bestcluster-args allows one to specify different args for the final stage of clustering with the best K. In particular, one might do clustering on samples of the dataset during the search phase to save time and other resources, then do the best clustering on the full dataset.

If bestcluster-args matches cluster-args, the result for the best K generated with cluster-args during the search phase is returned by (best-k-means ....). If bestcluster-args differs from cluster-args, the dataset is re-clustered with the best K and that is returned by (best-k-means ....).

Tags:

clustering

URL:

https://bigml.com/user/whizzml/gallery/script/58f793e65188ef04c1000000

Script FREE

Source code

;; best-k
;;
;; An implementation of k-means clustering using the Pham-Dimov-Nguyen
;; algorithm for determining k
;;
;; Selection of K in K-means clustering
;; Proc. IMechE, v.219, Part C: J. Mechanical Engineering Science
;;


;; generate-clusters
;;
;; Generates a set of clusters (BigML objects) using the k-means algorithm
;; for the specified range of k.
;;
;; Inputs:
;;   dataset: (string) Dataset ID for the dataset to be clustered
;;   cluster-args: (map) arguments for the cluster operation
;;   k-min: (number) minimum value of k
;;   k-max: (number) maximum value of k
;;
;; Output: (list) Cluster metadata of created clusters
;;
(define (generate-clusters dataset cluster-args k-min k-max)
  (let (dname ((fetch dataset) "name")
        fargs (lambda (k)
                (assoc cluster-args "dataset" dataset
                                    "k" k
                                    "name" (str dname " - cluster (k=" k ")")))
        clist (map fargs (range k-min (+ 1 k-max)))
        ids (create* "cluster" clist))
    (map fetch (wait* ids))))


;; extract-cluster-ids
;;
;; Helper function to extract list of cluster IDs from list of cluster metadata
;;
;; Inputs:
;;   clusters: (list) Cluster metadata of created clusters
;;
;; Output: (list) Cluster IDs of clusters
;;
(define (extract-cluster-ids clusters)
  (map (lambda (x) (x "resource")) clusters))


;; extract-eval-data
;;
;; Helper function to extract map of cluster data for k evaluation function
;; from full cluster metadata
;;
;; Inputs:
;;   clusters: (string) Cluster ID
;;
;; Output: (map) Cluster data for use in the evaluation function
;;
;; (define id (nth ids i))
;; (define cid (fetch id))
;;
;; (get cid "k")
;; id
;; (get-in cid ["clusters" "within_ss"])
;; (count (get cid "input_fields"))
;;
(define (extract-eval-data cluster)
  (let (id (cluster "resource")
        k (cluster "k" 0)
        n (count (cluster "input_fields" []))
        within_ss (cluster ["clusters" "within_ss"] false)
        total_ss (cluster ["clusters" "total_ss"] false))
    {"id" id "k" k "n" n "within_ss" within_ss "total_ss" total_ss}))


;; alpha-func
;;
;; Given the number of covariates, returns the weighting function alpha(k)
;; Note this function can be re-written as
;;
;;           / 1 - (3/4 * n )                             (for k = 2)
;; alpha_k = |
;;           \ (5/6)^(k-2) * alpha_2 + [1 - 5/6^(k-2)]    (for k > 2)
;;
;; Inputs:
;;   n: (number) number of covariates
;;
;; Outputs: (function) returns weighting function
;;
(define (alpha-func n)
  (let (alpha_2 (- 1 (/ 3 (* 4 n)))
        w (/ 5 6))
    (lambda (k)
      (if (<= k 2)
        alpha_2
        (+ (* (pow w (- k 2)) alpha_2) (- 1 (pow w (- k 2))))))))


;; evaluation-func
;;
;; Given the number of covariates, returns the evaluation function
;;
;; Inputs:
;;   n: (number) number of covariates
;;
;; Outputs: (function) returns the Pham-Dimov-Nguyen evaluation function
;;
(define (evaluation-func n)
  (let (fa (alpha-func n))
    (lambda (k sk skm)
      (if (or (<= k 1) (not skm) (zero? skm))
        1
        (/ sk (* (fa k) skm))))))


;; evaluate-clusters
;;
;; Computes the evalution function value f(k) for each k
;;
;; Inputs:
;;   clusters: (list) Cluster metadata maps implicitly ordered by k
;;
;; Output: (list) Sequence of maps that have the field `fk` with the value f(k) added
;;
(define (evaluate-clusters clusters)
  (let (cmdata (map extract-eval-data clusters)
        n ((cmdata 0) "n")
        fe (evaluation-func n))
    (loop (in cmdata
           out []
           ckz {})
       (if (= [] in)
         out
         (let (ck (head in)
               ckr (tail in)
               k (ck "k" 0)
               within_ss (ck "within_ss" false)
               within_ssz (if (<= k 2)
                              (ck "total_ss" false)
                              (ckz "within_ss" false))
               cko (assoc ck "fk" (fe k within_ss within_ssz)))
           (recur ckr (append out cko) ck))))))


;; clean-clusters
;;
;; Helper function to delete clusters created as intermediate resources
;;
;; Inputs:
;;   evaluations: (list) maps of evaluation results including cluster IDs
;;   cluster-id: (cluster) cluster to save (not delete)
;;   logf: (boolean) Enable logging
;;
;; Output:
;;
(define (clean-clusters evaluations cluster-id logf)
  (for (x evaluations)
    (let (id (x "id"))
      (when logf (log-info "Testing for deletion " id " " cluster-id))
      (when (!= id cluster-id)
        (delete id)
        (when logf (log-info "Deleted " id)))))
  cluster-id)


;; best-cluster
;;
;; Creates cluster with specified k
;;
;; Inputs:
;;   dataset: (string) Dataset ID for the dataset to be clustered
;;   cluster-args: (map) cluster function arguments
;;   k: (number) number of clusters
;;
;; Output: (cluster) Created cluster
;;
(define (best-cluster dataset cluster-args k)
  (let (dname ((fetch dataset) "name")
        ckargs (assoc cluster-args "dataset" dataset
                                   "k" k
                                   "name" (str dname " - cluster (k=" k ")")))
    (create-and-wait-cluster ckargs)))


;; evaluate-k-means
;;
;; Uses the Pham-Dimov-Nguyen algorithm to generate evaluations of k-mean clustering
;;
;; Inputs:
;;   dataset: (string) Dataset ID for the dataset to be clustered
;;   cluster-args: (map) cluster arguments for the cluster search operation
;;   k-min: (number) minimum value of k
;;   k-max: (number) maximum value of k
;;   clean: (boolean) Delete all but optimal cluster
;;   logf: (boolean) Enable logging
;;
;; Output: (list) Evaluations over the range of k
;;
(define (evaluate-k-means dataset cluster-args k-min k-max clean logf)
  (let (clusters (generate-clusters dataset cluster-args k-min k-max)
        evaluations (evaluate-clusters clusters))
    (if clean
      (clean-clusters evaluations "" logf))
    evaluations))


;; best-k-means
;;
;; Uses the Pham-Dimov-Nguyen algorithm to create the best K-means clustering
;;
;; Inputs:
;;   dataset: (string) Dataset ID for the dataset to be clustered
;;   cluster-args: (map) cluster arguments for the cluster search operation
;;   k-min: (number) minimum value of k
;;   k-max: (number) maximum value of k
;;   bestcluster-args: (map) cluster arguments for the best cluster operation
;;   clean: (boolean) Delete all but optimal cluster
;;   logf: (boolean) Enable logging
;;
;; Output: (list) Best K-means cluster
;;
(define (best-k-means dataset cluster-args k-min k-max bestcluster-args clean logf)
  (let (evaluations (evaluate-k-means dataset cluster-args k-min k-max false logf)
        _ (if logf (log-info "Evaluations " evaluations))
        besteval (min-key (lambda (x) (x "fk" 0)) evaluations)
        _ (if logf (log-info "Best " besteval))
        cluster-id (if (= cluster-args bestcluster-args)
                     (besteval "id")
                     (best-cluster dataset bestcluster-args (besteval "k"))))
    (if clean
      (clean-clusters evaluations cluster-id logf))
    cluster-id))


;; best-batchcentroid
;;
;; Uses the Pham-Dimov-Nguyen algorithm to create the batchcentroid annotated
;; with the best K-means clustering
;;
;; Inputs:
;;   dataset: (string) Dataset ID for the dataset to be clustered
;;   cluster-args: (map) cluster arguments for the cluster search operation
;;   k-min: (number) minimum value of k
;;   k-max: (number) maximum value of k
;;   bestcluster-args: (map) cluster arguments for the best cluster operation
;;   clean: (boolean) Delete all but optimal cluster
;;   logf: (boolean) Enable logging
;;
;; Output: (batchcentroid) Created batchcentroid
;;
(define (find-best-batchcentroid dataset cluster-args k-min k-max bestcluster-args clean logf)
  (let (cluster-id (best-k-means dataset cluster-args k-min k-max bestcluster-args clean logf)
        batchcentroid-id (create-and-wait-batchcentroid {"cluster" cluster-id
                                                         "dataset" dataset
                                                         "output_dataset" true
                                                         "all_fields" true}))
    batchcentroid-id))

;; best-kmeans-batchcentroid
;;
;; Basic script to use the best-kmeans library
;;
(define best-batchcentroid (find-best-batchcentroid dataset cluster-args k-min k-max bestcluster-args clean logf))

Description

Inputs

Outputs

License: Creative Commons Zero V1.0
Script FREE

Best k-means batchcentroid | BigML.com

Sending Request...

COMPANY

PRODUCT

BUSINESS

TRAINING

GALLERY

License

Embed this Script in your web site

COMPANY

PRODUCT

BUSINESS

TRAINING

GALLERY