Class: VectorIndexer

eclairjs/ml/feature.VectorIndexer

Class for indexing categorical feature columns in a dataset of Vector.

This has 2 usage modes:
- Automatically identify categorical features (default behavior)
  - This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based upon a maxCategories parameter.
  - Set maxCategories to the maximum number of categories any categorical feature should have.
  - E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 values {1.0, 3.0, 5.0}. If maxCategories = 2, then feature 0 will be declared categorical and use indices {0, 1}, and feature 1 will be declared continuous.
- Index all features, if all features are categorical
  - If maxCategories is set to be very large, then this will build an index of unique values for all features.
  - Warning: This can cause problems if features are continuous since this will collect ALL unique values to the driver.
  - E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 values {1.0, 3.0, 5.0}. If maxCategories >= 3, then both features will be declared categorical.

This returns a model which can transform categorical features to use 0-based indices.

Index stability:
- This is not guaranteed to choose the same category index across multiple runs.
- If a categorical feature includes value 0, then this is guaranteed to map value 0 to index 0. This maintains vector sparsity.
- More stability may be added in the future.

TODO: Future extensions. The following functionality is planned for the future:
- Preserve metadata in transform; if a feature's metadata is already present, do not recompute.
- Specify certain features to not index, either via a parameter or via existing metadata.
- Add warning if a categorical feature has only 1 category.
- Add option for allowing unknown categories.
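The maxCategories rule described above can be illustrated with a small, self-contained sketch in plain JavaScript. This is not the EclairJS or Spark implementation: `indexFeatures` is a hypothetical helper, vectors are assumed to be plain arrays of numbers, and the ordering of non-zero categories (ascending) is an arbitrary choice made here, since the class itself does not guarantee a stable ordering. The sketch only shows how a feature is declared categorical when its number of distinct values does not exceed maxCategories, and how value 0 is kept at index 0.

```javascript
// Hypothetical helper, for illustration only (not part of EclairJS).
// vectors: array of equal-length numeric arrays; maxCategories: number.
// Returns an object mapping feature index -> { value: categoryIndex }
// for every feature judged categorical. Continuous features are omitted.
function indexFeatures(vectors, maxCategories) {
  var numFeatures = vectors.length > 0 ? vectors[0].length : 0;
  var categoryMaps = {};

  for (var j = 0; j < numFeatures; j++) {
    // Collect the distinct values observed for feature j.
    var seen = new Set();
    for (var i = 0; i < vectors.length; i++) {
      seen.add(vectors[i][j]);
    }

    // Too many distinct values: treat the feature as continuous.
    if (seen.size > maxCategories) {
      continue;
    }

    // Assumption: order categories ascending, then move 0 to the front
    // so that value 0 maps to index 0 (this preserves vector sparsity).
    var values = Array.from(seen).sort(function (a, b) { return a - b; });
    var zeroPos = values.indexOf(0);
    if (zeroPos > 0) {
      values.splice(zeroPos, 1);
      values.unshift(0);
    }

    var map = {};
    values.forEach(function (value, index) {
      map[value] = index;
    });
    categoryMaps[j] = map;
  }
  return categoryMaps;
}

// The example from the description: feature 0 takes values {-1.0, 0.0},
// feature 1 takes values {1.0, 3.0, 5.0}.
var exampleVectors = [[-1.0, 1.0], [0.0, 3.0], [0.0, 5.0]];
var withTwo = indexFeatures(exampleVectors, 2);   // only feature 0 is categorical
var withThree = indexFeatures(exampleVectors, 3); // both features are categorical
```

With maxCategories = 2 only feature 0 receives a category map, matching the first example in the description; raising it to 3 indexes both features, matching the second.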

Constructor

new VectorIndexer([uid])

Parameters:
Name Type Attributes Description
uid string <optional>

Extends

Methods

(static) load(path) → {module:eclairjs/ml/feature.VectorIndexer}

Parameters:
Name Type Description
path string
Returns:
Type
module:eclairjs/ml/feature.VectorIndexer

copy(extra) → {module:eclairjs/ml/feature.VectorIndexer}

Parameters:
Name Type Description
extra module:eclairjs/ml/param.ParamMap
Returns:
Type
module:eclairjs/ml/feature.VectorIndexer

fit(dataset) → {module:eclairjs/ml/feature.VectorIndexerModel}

Parameters:
Name Type Description
dataset module:eclairjs/sql.Dataset
Returns:
Type
module:eclairjs/ml/feature.VectorIndexerModel

setInputCol(value) → {module:eclairjs/ml/feature.VectorIndexer}

Parameters:
Name Type Description
value string
Returns:
Type
module:eclairjs/ml/feature.VectorIndexer

setMaxCategories(value) → {module:eclairjs/ml/feature.VectorIndexer}

Parameters:
Name Type Description
value number
Returns:
Type
module:eclairjs/ml/feature.VectorIndexer

setOutputCol(value) → {module:eclairjs/ml/feature.VectorIndexer}

Parameters:
Name Type Description
value string
Returns:
Type
module:eclairjs/ml/feature.VectorIndexer

transformSchema(schema) → {module:eclairjs/sql/types.StructType}

Parameters:
Name Type Description
schema module:eclairjs/sql/types.StructType
Returns:
Type
module:eclairjs/sql/types.StructType

uid() → {Promise.<string>}

An immutable unique ID for the object and its derivatives.
Returns:
Type
Promise.<string>