new LDA() → (nullable) {?}
Constructs an LDA instance with default parameters.
- Source:
Returns:
- Type
- ?
Methods
getAlpha() → {number}
Alias for getDocConcentration
- Source:
Returns:
- Type
- number
getAsymmetricAlpha() → {module:eclairjs/mllib/linalg.Vector}
Alias for getAsymmetricDocConcentration
- Source:
Returns:
- Type
- module:eclairjs/mllib/linalg.Vector
getAsymmetricDocConcentration() → {module:eclairjs/mllib/linalg.Vector}
Concentration parameter (commonly named "alpha") for the prior placed on documents'
distributions over topics ("theta").
This is the parameter to a Dirichlet distribution.
- Source:
Returns:
- Type
- module:eclairjs/mllib/linalg.Vector
getBeta() → {number}
Alias for getTopicConcentration
- Source:
Returns:
- Type
- number
getCheckpointInterval() → {number}
Period (in iterations) between checkpoints.
- Source:
Returns:
- Type
- number
getDocConcentration() → {number}
Concentration parameter (commonly named "alpha") for the prior placed on documents'
distributions over topics ("theta").
This method assumes the Dirichlet distribution is symmetric and can be described by a single
Double parameter. It should fail if docConcentration is asymmetric.
- Source:
Returns:
- Type
- number
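The symmetric/asymmetric contract above can be illustrated with a small sketch. This helper is hypothetical (not part of EclairJS or Spark); it shows how a getter could recover the single Double alpha from a symmetric docConcentration vector and fail when the vector is asymmetric:

```javascript
// Hypothetical helper: recover the single alpha of a symmetric
// docConcentration vector; throw if the vector is asymmetric.
function symmetricAlpha(docConcentration) {
  var first = docConcentration[0];
  for (var i = 1; i < docConcentration.length; i++) {
    if (docConcentration[i] !== first) {
      throw new Error("docConcentration is asymmetric");
    }
  }
  return first; // the single Double describing the symmetric Dirichlet
}

console.log(symmetricAlpha([0.5, 0.5, 0.5])); // 0.5
```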
getK() → {integer}
Number of topics to infer, i.e., the number of soft cluster centers.
- Source:
Returns:
- Type
- integer
getMaxIterations() → {number}
Maximum number of iterations for learning.
- Source:
Returns:
- Type
- number
getSeed() → {number}
Random seed
- Source:
Returns:
- Type
- number
getTopicConcentration() → {number}
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics'
distributions over terms.
This is the parameter to a symmetric Dirichlet distribution.
Note: The topics' distributions over terms are called "beta" in the original LDA paper
by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
- Source:
Returns:
- Type
- number
run(documents) → {LDAModel}
Learn an LDA model using the given dataset.
Parameters:
Name | Type | Description
---|---|---
documents | module:eclairjs.RDD \| PairRDD | RDD of documents, which are term (word) count vectors paired with IDs. The term count vectors are "bags of words" with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and >= 0.
- Source:
Returns:
Inferred LDA model
- Type
- LDAModel
setAlphawithnumber(alpha)
Alias for setDocConcentration()
Parameters:
Name | Type | Description
---|---|---
alpha | number |
- Source:
Returns:
setAlphawithVector(alpha)
Alias for setDocConcentration()
Parameters:
Name | Type | Description
---|---|---
alpha | module:eclairjs/mllib/linalg.Vector |
- Source:
Returns:
setBeta(beta)
Alias for setTopicConcentration()
Parameters:
Name | Type | Description
---|---|---
beta | number |
- Source:
Returns:
setCheckpointInterval(checkpointInterval)
Period (in iterations) between checkpoints (default = 10). Checkpointing helps with recovery
(when nodes fail). It also helps with eliminating temporary shuffle files on disk, which can be
important when LDA is run for many iterations. If the checkpoint directory is not set in
SparkContext, this setting is ignored.
Parameters:
Name | Type | Description
---|---|---
checkpointInterval | number |
- Source:
- See:
- org.apache.spark.SparkContext#setCheckpointDir
Returns:
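The checkpointing rule described above can be sketched as a predicate. This is illustrative only (not the EclairJS implementation), modeling "checkpoint directory set in SparkContext" as a boolean flag:

```javascript
// Checkpoint every `checkpointInterval` iterations, but only when a
// checkpoint directory has been set; otherwise the setting is ignored.
function shouldCheckpoint(iteration, checkpointInterval, checkpointDirSet) {
  if (!checkpointDirSet || checkpointInterval <= 0) {
    return false;
  }
  return iteration % checkpointInterval === 0;
}
```

With the default interval of 10, iterations 10, 20, 30, ... would checkpoint, and nothing checkpoints when no directory is set.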
setDocConcentrationwithnumber(docConcentration)
Replicates a Double docConcentration to create a symmetric prior.
Parameters:
Name | Type | Description
---|---|---
docConcentration | number |
- Source:
Returns:
setDocConcentrationwithVector(docConcentration)
Concentration parameter (commonly named "alpha") for the prior placed on documents'
distributions over topics ("theta").
This is the parameter to a Dirichlet distribution, where larger values mean more smoothing
(more regularization).
If set to a singleton vector Vector(-1), then docConcentration is set automatically. If set to
singleton vector Vector(t) where t != -1, then t is replicated to a vector of length k during
LDAOptimizer.initialize(). Otherwise, the docConcentration vector must have length k.
(default = Vector(-1) = automatic)
Optimizer-specific parameter settings:
- EM
  - Currently only supports symmetric distributions, so all values in the vector should be the same.
  - Values should be > 1.0.
  - Default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and the +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
- Online
  - Values should be >= 0.
  - Default = uniformly (1.0 / k), following the implementation at https://github.com/Blei-Lab/onlineldavb.
Parameters:
Name | Type | Description
---|---|---
docConcentration | module:eclairjs/mllib/linalg.Vector |
- Source:
Returns:
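The resolution rules above (automatic default, singleton replication, explicit length-k vector) can be sketched in plain JavaScript. `resolveDocConcentration` is a hypothetical name, not an EclairJS API, and vectors are modeled as plain arrays:

```javascript
// Resolve a docConcentration vector to a length-k prior:
//   [-1]         -> optimizer-specific default, replicated k times
//   [t], t != -1 -> t replicated k times
//   otherwise    -> must already have length k
function resolveDocConcentration(docConcentration, k, optimizer) {
  if (docConcentration.length === 1) {
    var t = docConcentration[0];
    if (t === -1) {
      // automatic defaults: EM uses (50 / k) + 1, online uses 1.0 / k
      t = optimizer === "em" ? 50 / k + 1 : 1.0 / k;
    }
    var alpha = [];
    for (var i = 0; i < k; i++) {
      alpha.push(t); // replicate to a symmetric prior of length k
    }
    return alpha;
  }
  if (docConcentration.length !== k) {
    throw new Error("docConcentration vector must have length k");
  }
  return docConcentration;
}
```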
setK(k) → {LDA}
Number of topics to infer, i.e., the number of soft cluster centers.
(default = 10)
Parameters:
Name | Type | Description
---|---|---
k | integer |
- Source:
Returns:
- Type
- LDA
setMaxIterations(maxIterations)
Maximum number of iterations for learning.
(default = 20)
Parameters:
Name | Type | Description
---|---|---
maxIterations | number |
- Source:
Returns:
setOptimizer(optimizerName)
Set the LDAOptimizer used to perform the actual calculation by algorithm name.
Currently, "em" and "online" are supported.
Parameters:
Name | Type | Description
---|---|---
optimizerName | string |
- Source:
Returns:
setSeed(seed)
Random seed
Parameters:
Name | Type | Description
---|---|---
seed | number |
- Source:
Returns:
setTopicConcentration(topicConcentration)
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics'
distributions over terms.
This is the parameter to a symmetric Dirichlet distribution.
Note: The topics' distributions over terms are called "beta" in the original LDA paper
by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
If set to -1, then topicConcentration is set automatically.
(default = -1 = automatic)
Optimizer-specific parameter settings:
- EM
  - Value should be > 1.0.
  - Default = 0.1 + 1, where 0.1 gives a small amount of smoothing and the +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
- Online
  - Value should be >= 0.
  - Default = (1.0 / k), following the implementation at https://github.com/Blei-Lab/onlineldavb.
Parameters:
Name | Type | Description
---|---|---
topicConcentration | number |
- Source:
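The automatic-default rule above can be sketched the same way. `resolveTopicConcentration` is a hypothetical illustration of the stated behavior, not the EclairJS implementation:

```javascript
// Resolve topicConcentration: -1 means "automatic", with
// optimizer-specific defaults (EM: 0.1 + 1, online: 1.0 / k).
function resolveTopicConcentration(topicConcentration, k, optimizer) {
  if (topicConcentration !== -1) {
    return topicConcentration; // user-supplied value, used as-is
  }
  return optimizer === "em" ? 0.1 + 1 : 1.0 / k;
}
```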