Class: RandomForest

eclairjs/mllib/tree.RandomForest

The settings for featureSubsetStrategy are based on the following references:
- log2: tested in Breiman (2001)
- sqrt: recommended by Breiman manual for random forests
- The defaults of sqrt (classification) and onethird (regression) match the R randomForest package.

Breiman (2001): http://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
Breiman manual for random forests: http://www.stat.berkeley.edu/~breiman/Using_random_forests_V3.1.pdf

Constructor

new RandomForest(strategy, numTrees, featureSubsetStrategy, seed)

A class that implements a Random Forest (http://en.wikipedia.org/wiki/Random_forest) learning algorithm for classification and regression. It supports both continuous and categorical features.
Parameters:
Name Type Description
strategy module:eclairjs/mllib/tree/configuration.Strategy The configuration parameters for the random forest algorithm which specify the type of algorithm (classification, regression, etc.), feature type (continuous, categorical), depth of the tree, quantile calculation strategy, etc.
numTrees Number Number of trees in the forest. If 1, no bootstrapping is used. If > 1, bootstrapping is used.
featureSubsetStrategy string Number of features to consider for splits at each node. Supported: "auto", "all", "sqrt", "log2", "onethird". If "auto" is set, this parameter is set based on numTrees: if numTrees == 1, set to "all"; if numTrees > 1 (forest), set to "sqrt" for classification and to "onethird" for regression.
seed number Random seed for bootstrapping and choosing feature subsets.
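The "auto" rule described for featureSubsetStrategy can be sketched as a small standalone function. This is only an illustration, not the library's implementation: the function names resolveFeatureSubsetStrategy and featuresPerSplit are hypothetical, and the rounding applied to "sqrt", "log2" and "onethird" is an assumption, since this page names the strategies but does not say how the counts are rounded.

```javascript
// Hypothetical helper: resolves "auto" according to the rule stated above.
// isClassification is an illustrative flag, not a parameter of RandomForest.
function resolveFeatureSubsetStrategy(strategy, numTrees, isClassification) {
  if (strategy !== "auto") {
    return strategy;
  }
  if (numTrees === 1) {
    return "all";
  }
  return isClassification ? "sqrt" : "onethird";
}

// Hypothetical helper: number of features considered at each split.
// Assumption: results are rounded up, and "log2" never drops below 1.
function featuresPerSplit(resolvedStrategy, numFeatures) {
  switch (resolvedStrategy) {
    case "all":
      return numFeatures;
    case "sqrt":
      return Math.ceil(Math.sqrt(numFeatures));
    case "log2":
      // Smallest k with 2^k >= numFeatures, computed with integers
      // to avoid floating-point error in a logarithm.
      var k = 0;
      while ((1 << k) < numFeatures) {
        k++;
      }
      return Math.max(1, k);
    case "onethird":
      return Math.ceil(numFeatures / 3);
    default:
      throw new Error("Unsupported featureSubsetStrategy: " + resolvedStrategy);
  }
}
```

Under these assumptions, a 10-tree classification forest with "auto" resolves to "sqrt", while a single tree resolves to "all" and therefore considers every feature at each split.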

Methods

(static) trainClassifier(input, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed) → {RandomForestModel}

Method to train a random forest model for binary or multiclass classification.
Parameters:
Name Type Description
input module:eclairjs/rdd.RDD Training dataset: RDD of LabeledPoint. Labels should take values {0, 1, ..., numClasses-1}.
numClasses number Number of classes for classification.
categoricalFeaturesInfo Map Map storing arity of categorical features. E.g., an entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1}.
numTrees number Number of trees in the random forest.
featureSubsetStrategy string Number of features to consider for splits at each node. Supported: "auto", "all", "sqrt", "log2", "onethird". If "auto" is set, this parameter is set based on numTrees: if numTrees == 1, set to "all"; if numTrees > 1 (forest) set to "sqrt".
impurity string Criterion used for information gain calculation. Supported values: "gini" (recommended) or "entropy".
maxDepth number Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. (suggested value: 4)
maxBins number Maximum number of bins used for splitting features (suggested value: 100)
seed number Random seed for bootstrapping and choosing feature subsets.
Returns:
a random forest model that can be used for prediction
Type
RandomForestModel
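A minimal sketch of a trainClassifier call, with the arguments in the order of the signature above. The wrapper function trainExampleClassifier is hypothetical. It takes the RandomForest class as an argument so that no module path has to be assumed, and it uses a plain object for categoricalFeaturesInfo on the assumption that this is an acceptable Map. All parameter values are illustrative.

```javascript
// Hypothetical wrapper: RandomForest is the class documented on this page,
// trainingData is an RDD of LabeledPoint with labels in {0, 1}.
function trainExampleClassifier(RandomForest, trainingData) {
  var numClasses = 2;
  // Assumption: a plain object stands in for the Map.
  // The entry (0 -> 3) marks feature 0 as categorical with values {0, 1, 2}.
  var categoricalFeaturesInfo = { 0: 3 };
  var numTrees = 3;
  var featureSubsetStrategy = "auto"; // numTrees > 1, so this is set to "sqrt"
  var impurity = "gini";              // recommended value
  var maxDepth = 4;                   // suggested value
  var maxBins = 100;                  // suggested value
  var seed = 12345;

  return RandomForest.trainClassifier(
    trainingData, numClasses, categoricalFeaturesInfo, numTrees,
    featureSubsetStrategy, impurity, maxDepth, maxBins, seed);
}
```

Passing the real RandomForest class and a training RDD should yield a RandomForestModel, as stated under Returns.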

(static) trainRegressor(input, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed) → {RandomForestModel}

Method to train a random forest model for regression.
Parameters:
Name Type Description
input module:eclairjs/rdd.RDD Training dataset: RDD of LabeledPoint. Labels are real numbers.
categoricalFeaturesInfo Map Map storing arity of categorical features. E.g., an entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1}.
numTrees number Number of trees in the random forest.
featureSubsetStrategy string Number of features to consider for splits at each node. Supported: "auto", "all", "sqrt", "log2", "onethird". If "auto" is set, this parameter is set based on numTrees: if numTrees == 1, set to "all"; if numTrees > 1 (forest) set to "onethird".
impurity string Criterion used for information gain calculation. Supported values: "variance".
maxDepth number Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. (suggested value: 4)
maxBins number Maximum number of bins used for splitting features (suggested value: 100)
seed number Random seed for bootstrapping and choosing feature subsets.
Returns:
a random forest model that can be used for prediction
Type
RandomForestModel
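A comparable sketch for trainRegressor, which takes no numClasses argument and supports only "variance" as impurity. As before, the wrapper trainExampleRegressor is hypothetical, the RandomForest class is passed in rather than loaded from an assumed module path, an empty plain object is assumed to be an acceptable Map meaning "no categorical features", and the values are illustrative.

```javascript
// Hypothetical wrapper: RandomForest is the class documented on this page,
// trainingData is an RDD of LabeledPoint with real-valued labels.
function trainExampleRegressor(RandomForest, trainingData) {
  // Assumption: an empty plain object means all features are continuous.
  var categoricalFeaturesInfo = {};
  var numTrees = 5;
  var featureSubsetStrategy = "auto"; // numTrees > 1, so this is set to "onethird"
  var impurity = "variance";          // the only supported value for regression
  var maxDepth = 4;                   // suggested value
  var maxBins = 100;                  // suggested value
  var seed = 12345;

  return RandomForest.trainRegressor(
    trainingData, categoricalFeaturesInfo, numTrees,
    featureSubsetStrategy, impurity, maxDepth, maxBins, seed);
}
```

Keeping the seed fixed, as here, is what makes the bootstrapping and feature-subset choices repeatable between runs.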