| spark.lda {SparkR} | R Documentation | 
spark.lda fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call
summary to get a summary of the fitted LDA model, spark.posterior to compute
posterior probabilities on new data, spark.perplexity to compute log perplexity on new
data and write.ml/read.ml to save/load fitted models.
spark.lda(data, ...)
spark.posterior(object, newData)
spark.perplexity(object, data)
## S4 method for signature 'LDAModel,SparkDataFrame'
spark.posterior(object, newData)
## S4 method for signature 'LDAModel'
summary(object, maxTermsPerTopic)
## S4 method for signature 'LDAModel,SparkDataFrame'
spark.perplexity(object, data)
## S4 method for signature 'LDAModel,character'
write.ml(object, path, overwrite = FALSE)
## S4 method for signature 'SparkDataFrame'
spark.lda(data, features = "features", k = 10,
  maxIter = 20, optimizer = c("online", "em"), subsamplingRate = 0.05,
  topicConcentration = -1, docConcentration = -1,
  customizedStopWords = "", maxVocabSize = bitwShiftL(1, 18))
data | 
 A SparkDataFrame for training.  | 
... | 
 additional argument(s) passed to the method.  | 
object | 
 A Latent Dirichlet Allocation model fitted by spark.lda.  | 
newData | 
 A SparkDataFrame for testing.  | 
maxTermsPerTopic | 
 Maximum number of terms to collect for each topic. Default value of 10.  | 
path | 
 The directory where the model is saved.  | 
overwrite | 
 Whether to overwrite if the output path already exists. The default is FALSE, which throws an exception if the output path exists.  | 
features | 
 Features column name. Either a libSVM-format column or a character-format column is valid.  | 
k | 
 Number of topics.  | 
maxIter | 
 Maximum iterations.  | 
optimizer | 
 Optimizer to train an LDA model, "online" or "em", default is "online".  | 
subsamplingRate | 
 (For online optimizer) Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1].  | 
topicConcentration | 
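 concentration parameter (commonly named beta or eta) for the prior placed on topics' distributions over terms. The default of -1 sets it automatically on the Spark side.  | 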
docConcentration | 
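 concentration parameter (commonly named alpha) for the prior placed on documents' distributions over topics (theta). The default of -1 sets it automatically on the Spark side.  | 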
customizedStopWords | 
 stop words to remove from the given corpus. This parameter is ignored if a libSVM-format column is used as the features column.  | 
maxVocabSize | 
 maximum vocabulary size, default 1 << 18  | 
spark.posterior returns a SparkDataFrame containing the posterior probability
vectors in a column named "topicDistribution".
summary returns summary information of the fitted model, which is a list.
The list includes
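docConcentration | 
 concentration parameter commonly named alpha for the prior placed on documents' distributions over topics (theta)  | 
topicConcentration | 
 concentration parameter commonly named beta or eta for the prior placed on topics' distributions over terms  | 
logLikelihood | 
 log likelihood of the entire corpus  | 
logPerplexity | 
 log perplexity  | 
isDistributed | 
 TRUE for distributed model while FALSE for local model  | 
vocabSize | 
 number of terms in the corpus  | 
topics | 
 top 10 terms and their weights of all topics  | 
vocabulary | 
 all terms of the training corpus, NULL if a libsvm-format file is used as the training set  | 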
spark.perplexity returns the log perplexity of the given SparkDataFrame, or the log
perplexity of the training data if the argument "data" is missing.
spark.lda returns a fitted Latent Dirichlet Allocation model.
spark.posterior(LDAModel) since 2.1.0
summary(LDAModel) since 2.1.0
spark.perplexity(LDAModel) since 2.1.0
write.ml(LDAModel, character) since 2.1.0
spark.lda since 2.1.0
topicmodels: https://cran.r-project.org/package=topicmodels
## Not run: 
##D # nolint start
##D # An example "path/to/file" can be
##D # paste0(Sys.getenv("SPARK_HOME"), "/data/mllib/sample_lda_libsvm_data.txt")
##D # nolint end
##D text <- read.df("path/to/file", source = "libsvm")
##D model <- spark.lda(data = text, optimizer = "em")
##D 
##D # get a summary of the model
##D summary(model)
##D 
##D # compute posterior probabilities
##D posterior <- spark.posterior(model, text)
##D showDF(posterior)
##D 
##D # compute perplexity
##D perplexity <- spark.perplexity(model, text)
##D 
##D # save and load the model
##D path <- "path/to/model"
##D write.ml(model, path)
##D savedModel <- read.ml(path)
##D summary(savedModel)
## End(Not run)
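The example above trains on libSVM-format data. As a complementary sketch, the following fits a model on a character-format features column, where customizedStopWords applies, and limits the terms collected by summary with maxTermsPerTopic. The toy corpus, the stop-word choices and the variable names (docs, stats) are assumptions made for illustration only; subsamplingRate = 1 is chosen here so that the online optimizer sees the whole tiny corpus in each iteration. Like the example above, it needs a working Spark installation.

```r
library(SparkR)
sparkR.session()

# Hypothetical toy corpus: one document per row in a character column.
docs <- createDataFrame(data.frame(
  features = c("spark trains topic models quickly",
               "topic models group similar documents",
               "clusters process large datasets",
               "large datasets need distributed clusters"),
  stringsAsFactors = FALSE
))

# "quickly" and "need" are illustrative stop words; they are removed
# from the corpus before the model is trained.
model <- spark.lda(docs, features = "features", k = 2, maxIter = 10,
                   subsamplingRate = 1,
                   customizedStopWords = c("quickly", "need"))

# Collect at most 3 terms per topic instead of the default 10.
stats <- summary(model, maxTermsPerTopic = 3)
showDF(stats$topics)

# For a character-format column the vocabulary is available in the summary.
stats$vocabulary

sparkR.session.stop()
```

With a libSVM-format column the same call would ignore customizedStopWords, and the vocabulary element of the summary would be NULL.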