Boruta 1-click feature selection

Boruta 1-click feature selection | BigML.com

FREE

Boruta 1-click Feature Selection

Details:

Created	Thu, 4 Aug 2016 15:21:53 +0000
Published	Thu, 4 Aug 2016 22:12:21 +0000

Description:

This script implements feature selection using a version of the Boruta algorithm to detect important and unimportant fields in your dataset. The algorithm:

Retrieves the dataset information.
Creates a new extended dataset. In the new dataset, each field has a corresponding shadow field which has the same type but contains a random sample of the values contained in the original one.
Creates a random forest from the extended dataset.
Extracts the maximum of the importances for the shadow fields.
Uses this maximum plus (minus) a minimum gain as threshold. Any of the original fields scoring less than the minimal threshold are considered unimportant and fields scoring more than the maximum threshold are considered important.
Fields marked as unimportant are removed from the list of fields to be used as input fields for new datasets.
The procedure is repeated, and a new extended dataset is created with the remaining fields. The process stops when it reaches the user-given number of runs or when all the original fields in the dataset are marked as important or unimportant.
When iteration stops, a new dataset is created where unimportant fields have been removed.

The output of the script is a dataset ID.

Tags:

URL:

https://bigml.com/user/whizzml/gallery/script/57a35d91c528504d920002ca

Script FREE

Source code

;; This will eventually be packed into a library

(define IMPORTANCE_INC 1)
(define IMPORTANCE_DEC -1)

;; choosable-objective-ids
;;
;; List of IDs of the fields in the dataset that can be chosen as objective
;; field.
;;
;; Inputs:
;;  fields: (map) Fields structure
;; Output: (list) list of field IDs
(define (choosable-objective-ids fields)
  (let (field-val (lambda (fid k) (fields [fid k]))
        objective-types ["categorical" "numeric"]
        pref? (lambda (k) (field-val k "preferred"))
        pred? (lambda (k) (member? (field-val k "optype") objective-types)))
    (filter (lambda (x) (and (pref? x) (pred? x))) (keys fields))))


;; validate-objective-name
;;
;; Validates that the argument is a valid objective name in the reference
;; dataset.
;;
;; Inputs:
;;  objective-name: (string) Name of the objective field
;;  dataset: (map) dataset resource information
;;
;; Output: (string) objective-name
(define (validate-objective-name objective-name dataset)
  (let (fields (dataset "fields" {})
        objective-ids (choosable-objective-ids fields)
        objective-names (map (lambda (x) (fields [x "name"] false))
                        objective-ids))
    (if (not (member? objective-name objective-names))
      (raise {"message" (str "Failed to find the objective name in the dataset"
                             " choosable fields.")
              "code" 101})
      objective-name)))


;; boruta
;;
;; Marks fields as important, unimportant or neutral according to their
;; importance compared to shadow fields (copy fields using random values)
;;
;; Inputs:
;;   dataset-id: (dataset-id) ID of the origin dataset
;;   objective: (string) Name of the objective field
;;   max-runs: (integer) Maximum number of iterations
;;   min-gain: (number) Increment to define the interval of minimum gain
;;                      around the maximum importance of shadow fields
;;
;; Output: (map) Map of important fields and unimportant fields

(define (boruta dataset-id objective max-runs min-gain)
  (let (ds (fetch dataset-id)
        ;; original fields are stored for later use
        ori-fields (ds "fields" {})
        field-ids (keys ori-fields)
        f-names (map (lambda (x) (ori-fields [x "name"] false))
                     field-ids)
        init-rank (iterate (r {} id field-ids)
                    (assoc r (ori-fields [id "name"] false) 0))
        objective (if (= objective "")
                      (ori-fields
                       [(dataset-get-objective-id dataset-id) "name"]
                       false)
                      (validate-objective-name objective ds)))
    (loop (f-rank init-rank fields ori-fields ds-id dataset-id runs 0)
      ;; Repeats the procedure until the importance is assigned for all
      ;; the attributes, or the algorithm has reached the previously set
      ;; limit of the random forest runs.
      (if (or (= runs max-runs) (full f-rank))
          ;; as runs go by, the fields structure is modified and unimportant
          ;; fields are removed. Thus we need f-names to contain the original
          ;; list of names to know which are unimportant
          (group-importance f-rank fields f-names objective)
          ;; Loop body: Extends the information system by adding copies of all
          ;; variables (the information system is always extended by at least
          ;; 5 shadow attributes, even if the number of attributes in the
          ;; original set is lower than 5)
          ;; Shuffle the added attributes to remove their correlations with
          ;; the response.
          (let (fields (filter-map (lambda (_ y) (y "preferred" true)) fields)
                sh-ds-info (shadow-ds ds-id fields objective)
                sh-ds-id (sh-ds-info 0)
                sh-ids (sh-ds-info 1)
                fields (sh-ds-info 2)
                to-name (lambda (x) (fields [x "name"] false))
                ;; Run a random forest classifier on the extended
                ;; information system and gather the importances
                rf-id (random-forest sh-ds-id objective runs)
                ;; Find the maximum importance among shadow attributes,
                ;; and then assign a hit to every attribute that scored better
                ;; by more than the min-gain margin or worst by less than the
                ;; min-gain margin.
                imps (get-importances (fetch rf-id))
                max-sh (maximum-shadow-importance imps sh-ids)
                field-ids (filter (lambda (x) (!= objective (to-name x)))
                                  (keys fields))
                f-rank (rank-field-ids imps field-ids max-sh min-gain)
                fields (remove-unimportant fields f-rank))
            ;; logging each run results
            (log-info "Number of runs: " (+ runs 1))
            (log-info "------------------------------------------------")
            (log-info "Max importance for the shadow fields: " max-sh)
            (log-info "Important: " (select-by-importance f-rank
                                                          IMPORTANCE_INC
                                                          to-name))
            (log-info "Unimportant: " (select-by-importance f-rank
                                                            IMPORTANCE_DEC
                                                            to-name))
            (log-info "Undecided: " (select-by-importance f-rank 0 to-name))
            (log-info "remaining fields: " (map to-name (keys fields)))
            (log-info "------------------------------------------------")
            (recur f-rank fields sh-ds-id (+ runs 1)))))))

;; full
;; Checks whether all the fields are marked as important or unimportant
;;
;; Inputs:
;;   f-rank: (map) Map from field ID to imporance flag
;;
;; Output: (boolean) True if all the fields are classified
;;
(define (full f-rank)
  (let (undetermined (filter zero? (values f-rank)))
    (empty? undetermined)))

;; remove-unimportant
;; Removes the fields marked as unimportant
;;
;; Inputs:
;;   fields: (map) Fields information structure
;;   f-rank: (map) Map from field ID to importance flag
;;
;; Output: (map) Fields information structure without unimportant fields
;;
(define (remove-unimportant fields f-rank)
  (iterate (acc fields id (keys f-rank))
    (if (negative? (f-rank id 0))
        (dissoc acc id)
        acc)))

;; select-by-importance
;;
;; Filters the fields according to their importance flag
;;
;; Inputs:
;;   f-map: (map) Map from field ID to importance flag
;;   score: (integer) Importance flag value
;;   to-name-fn: (function) Function to transform IDs to field names
;;
;; Output: (list) List of field names
(define (select-by-importance f-map score to-name-fn)
  (let (selected (filter-map (lambda (x y) (= y score)) f-map))
    (map to-name-fn (keys selected))))

;; group-importance
;;
;; Groups the fields according to their importance
;;
;; Inputs:
;;   f-rank: (map) Map from field ID to importance flag
;;   fields: (map) Fields information structure
;;   f-names: (list) Fields names
;;   objective: (string) objective field name
;;
;; Output:  (map) Map from categories (important, unimportant or undecided) to
;;                list of field names
;;
(define (group-importance f-rank fields f-names objective)
  (let (to-name (lambda (x) (fields [x "name"] false))
        important (iterate (r [] k (keys f-rank))
                    (if (positive? (f-rank k)) (append r (to-name k)) r))
        important (append important objective)
        undecided (iterate (r [] k (keys f-rank))
                    (if (zero? (f-rank k)) (append r (to-name k)) r))
        unimportant (filter (lambda (x)
                              (not (or (member? x important)
                                       (member? x undecided))))
                            f-names))
    {"important" important
     "unimportant" unimportant
     "undecided" undecided}))

;; filter-map
;; Filters a map by applying the corresponding filter function to its keys
;;
;; Inputs:
;;   fn: (function) function to be applied to the map keys and/or values
;;   a-map: (map) Map to be filtered
;;
;; Output: (map) Filtered map
(define (filter-map fn a-map)
    (iterate (out-map {} key (keys a-map))
      (if (fn key (a-map key false))
          (assoc out-map key (a-map key false))
          out-map)))

;; shadow-ds
;; Extends a dataset adding a shadow field for each field used in input_fields.
;; If the number of shadow fields is less than 5, fields are repeated.
;;
;; Inputs
;;  ds-id: (dataset-id) Dataset ID
;;  f-ids: (list) List of field IDs that will be used as input and extended
;;                with shadow fields
;;
;; Output: (list) List of elements [dataset-id shadow-field-ids fields]
;;   dataset-id: (dataset-id) ID of the dataset with shadow fields
;;   shadow-fields-ids: (list) List of IDs for the newly added shadow fields
;;   fields: (map) New fields structure for the fields in the original dataset.
;;                 This structure is needed to know the IDs in the
;;                 shadow dataset for the original fields
;;
(define (shadow-ds ds-id fields obj-name)
  (let (f-id-names (reduce (lambda (x y)
                             (assoc x y (fields [y "name"] false)))
                           {}
                           (keys fields))
        obj-f (filter-map (lambda (x _)
                            (= (fields [x "name"] false) obj-name))
                          f-id-names)
        obj-id (head (keys obj-f))
        f-id-names (filter-map (lambda (x _) (!= x obj-id)) f-id-names)
        f-ids (keys f-id-names)
        input-fields (cons obj-id f-ids)
        f-ids (repeat-fields f-ids 5)
        f-names (map (lambda (x) (str "shadow " (f-id-names x false)))
                     f-ids)
        new-fields (for (id f-ids)
                     {"field" (flatline "(weighted-random-value {{id}})")
                      "name" (str "shadow " (f-id-names id))})
        sh-ds-id (create-and-wait-dataset {"origin_dataset" ds-id
                                           "input_fields" input-fields
                                           "new_fields" new-fields})
        sh-ds (fetch sh-ds-id)
        sh-fields (sh-ds "fields" false)
        sh-f-ids (filter (lambda (x)
                           (member? (sh-fields [x "name"] false) f-names))
                         (keys sh-fields))
        old-fields (filter-map (lambda (x _)
                                 (let (n (sh-fields [x "name"] false))
                                   (not (member? n f-names))))
                               sh-fields))
    [sh-ds-id sh-f-ids old-fields]))

;; repeat-fields
;; Cyclically repeats the list of field IDs until it has the desired
;; number of elements
;;
;; Inputs:
;;   f-ids: (list) List of field IDs
;;   min-count: (integer) Minimum number of field IDs
;;
;; Output: (list) List of field IDs
(define (repeat-fields f-ids min-count)
  (if (< (count f-ids) min-count)
      (repeat-fields (append f-ids (nth f-ids (rem min-count (count f-ids))))
                     min-count)
      f-ids))

;; random-forest
;; Create a random forest from the dataset
;;
;; Inputs:
;;   ds-id: (dataset-id) Dataset ID
;;   obj: (string) Objective field ID
;;   run: (integer) Number of run (to change the seed)
;; Output: (ensemble-id) Random Forest ID
(define (random-forest ds-id obj run)
  (create-and-wait-ensemble {"dataset" ds-id
                             "objective_field" obj
                             "randomize" true
                             "random_candidate_ratio" 0.33
                             "sample_rate" 0.8
                             "seed" (str "BigML Boruta" run)}))

;; assoc+
;;
;; Adds the value in the key-value pair to the one existing in the accumulator
;;
;; Input:
;;   acc: (map) accumulator map
;;   key: (string) key to accumulate
;;   value: (number) value to be added
(define (assoc+ acc key value)
  (assoc acc key (+ value (acc key 0))))

;; get-importances
;;
;; Selects the field importances in an ensemble
;;
;; Inputs:
;;   ensemble: (ensemble) Ensemble information
;;
;; Outputs: list of importances by field name

(define (get-importances ensemble)
  (let (dist (ensemble "distributions" false)
        imps (map (lambda (x) (x "importance" false)) dist)
        m-num (count imps))
    (iterate (acc {} m-imps imps)
      (iterate (m-acc acc imp m-imps)
        (assoc+ m-acc
                (imp 0)
                (/ (imp 1) m-num))))))

;; maximum-shadow-importance
;;
;; Selects the maximum importance for the shadow fields
;;
;; Inputs:
;;   imps: (list) List of field importances
;;   sh-ids: (list) List of field IDs of the shadow fields
;;
;; Output: (number) maximum importance for the fields not in the list
(define (maximum-shadow-importance imps sh-ids)
  (iterate (acc 0 imp-k (keys imps))
    (if (member? imp-k sh-ids)
      (max acc (imps imp-k 0))
      acc)))

;; rank-field-ids
;;
;; ranks each field id as important or unimportant. The min-gain
;; argument is used to set a range around the maximum shadow importance
;; where the field cannot be ranked as important or unimportant.
;;
;; Inputs:
;;   imps: (list) List of field ID - importance pairs
;;   field-ids: (list) List of original IDs to be checked against the random
;;                     forest importance list.
;;   max-sh: (number) Maximum importance of shadow fields
;;   min-gain: (number) Interval of minimum gain over the shadow fields
;;                      importance
;;
;; Output: (map) Field ID - rank map

(define (rank-field-ids imps field-ids max-sh min-gain)
  (iterate (acc {} field-id field-ids)
    (if (> (imps field-id 0) (+ max-sh min-gain))
        (assoc+ acc field-id IMPORTANCE_INC)
        (if (< (imps field-id 0) (- max-sh min-gain))
            (assoc+ acc field-id IMPORTANCE_DEC)
            (assoc+ acc field-id 0)))))


;; Script

;; boruta-feature-select
;;
;; Creates a new dataset from the one given as argument by finding
;; out the unimportant fields using the boruta algorithm and removing
;; them
;;
;; Inputs:
;;   dataset-id: (dataset-id) ID of the origin dataset
;;   objective: (string) Name of the objective field
;;   max-runs: (integer) Maximum number of iterations
;;   min-gain: (number) Increment to define the interval of minimum gain
;;                      around the maximum importance of shadow fields
;;
;; Output: (dataset-id) ID of the filtered dataset

(define (boruta-feature-select dataset-id)
  (let (fields-rank (boruta dataset-id "" 10 0.01)
        ds (fetch dataset-id)
        name (ds "name")
        exclude-fields (fields-rank "unimportant")
        args {"origin_dataset" dataset-id
              "name" (str name " - feature selected")}
        args (if (empty? exclude-fields)
                 args
                 (assoc args "excluded_fields" exclude-fields)))
    (create "dataset" args)))

(define feature-selected-dataset (boruta-feature-select dataset-id))

Description

Inputs

Outputs

License: Creative Commons Zero V1.0
Script FREE

Boruta 1-click feature selection | BigML.com

Sending Request...

COMPANY

PRODUCT

BUSINESS

TRAINING

GALLERY

License

Embed this Script in your web site

COMPANY

PRODUCT

BUSINESS

TRAINING

GALLERY