Class for indexing categorical feature columns in a dataset of Vector.
This has 2 usage modes:
- Automatically identify categorical features (default behavior)
- This helps process a dataset of unknown vectors into a dataset with some continuous
features and some categorical features. The choice between continuous and categorical
is based upon a maxCategories parameter.
- Set maxCategories to the maximum number of categorical any categorical feature should have.
- E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 values {1.0, 3.0, 5.0}.
If maxCategories = 2, then feature 0 will be declared categorical and use indices {0, 1},
and feature 1 will be declared continuous.
- Index all features, if all features are categorical
- If maxCategories is set to be very large, then this will build an index of unique
values for all features.
- Warning: This can cause problems if features are continuous since this will collect ALL
unique values to the driver.
- E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 values {1.0, 3.0, 5.0}.
If maxCategories >= 3, then both features will be declared categorical.
This returns a model which can transform categorical features to use 0-based indices.
Index stability:
- This is not guaranteed to choose the same category index across multiple runs.
- If a categorical feature includes value 0, then this is guaranteed to map value 0 to index 0.
This maintains vector sparsity.
- More stability may be added in the future.
TODO: Future extensions: The following functionality is planned for the future:
- Preserve metadata in transform; if a feature's metadata is already present, do not recompute.
- Specify certain features to not index, either via a parameter or via existing metadata.
- Add warning if a categorical feature has only 1 category.
- Add option for allowing unknown categories.