| Title: | Multilevel Supervised Topic Models with Multiple Outcomes |
|---|---|
| Description: | Fits latent Dirichlet allocation (LDA), supervised topic models, and multilevel supervised topic models for text data with multiple outcome variables. Core estimation routines are implemented in C++ using the 'Rcpp' ecosystem. For topic models, see Blei et al. (2003) <https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf>. For supervised topic models, see Blei and McAuliffe (2007) <https://papers.nips.cc/paper_files/paper/2007/hash/d56b9fc4b0f1be8871f5e1c40c0067e7-Abstract.html>. |
| Authors: | Tomoya Himeno [aut, cre] |
| Maintainer: | Tomoya Himeno <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.6 |
| Built: | 2026-06-03 09:32:01 UTC |
| Source: | https://github.com/thimeno1993/mlstm |
This function performs a single collapsed Gibbs sampling pass over all non-zero document–term entries. Each (d, v, count) triple is treated as 'count' replicated word tokens sharing the same topic assignment.
eLDA_pass_b_fast(mod, count, ndsum, NZ, V, K, alpha, beta)
mod |
List with current sampler state: |
count |
IntegerMatrix of size NZ×3, where each row is a triple
(d, v, c) with 0-based indices: document index |
ndsum |
IntegerVector of length D; total token count per document
(i.e., |
NZ |
Integer, number of non-zero entries (rows in |
V |
Integer, vocabulary size. |
K |
Integer, number of topics. |
alpha |
Scalar Dirichlet prior parameter for document–topic
distributions |
beta |
Scalar Dirichlet prior parameter for topic–word
distributions |
The state is stored in a list 'mod' containing:
Integer vector of length NZ; topic assignment for each (d, v, count) triple.
D×K integer matrix; document–topic counts.
K×V integer matrix; topic–word counts.
Integer vector of length K; total word count per topic.
A list with updated state:
Updated topic assignment vector (length NZ).
Updated D×K document–topic counts.
Updated K×V topic–word counts.
Updated total word counts per topic.
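The pass described above can be illustrated with a minimal base-R sketch. It is not the package's C++ kernel: the function name `gibbs_pass_sketch` is hypothetical, the state names `z`, `nd`, `nw`, and `nwsum` follow the initializer's description, and the sampling weight is the standard collapsed LDA conditional evaluated once per triple, which is an assumption about how the block of `count` tokens is handled.

```r
# Hypothetical sketch of one blocked collapsed Gibbs pass (assumed update rule).
gibbs_pass_sketch <- function(mod, count, V, K, alpha, beta) {
  for (i in seq_len(nrow(count))) {
    d <- count[i, 1] + 1L          # 0-based document index -> 1-based
    v <- count[i, 2] + 1L          # 0-based word index -> 1-based
    c_dv <- count[i, 3]            # tokens sharing one topic assignment
    k_old <- mod$z[i] + 1L

    # Remove the whole block from the count tables.
    mod$nd[d, k_old] <- mod$nd[d, k_old] - c_dv
    mod$nw[k_old, v] <- mod$nw[k_old, v] - c_dv
    mod$nwsum[k_old] <- mod$nwsum[k_old] - c_dv

    # Standard collapsed conditional, used once for the block (assumption).
    p <- (mod$nd[d, ] + alpha) * (mod$nw[, v] + beta) / (mod$nwsum + V * beta)
    k_new <- sample.int(K, 1L, prob = p)

    # Add the block back under the newly sampled topic.
    mod$nd[d, k_new] <- mod$nd[d, k_new] + c_dv
    mod$nw[k_new, v] <- mod$nw[k_new, v] + c_dv
    mod$nwsum[k_new] <- mod$nwsum[k_new] + c_dv
    mod$z[i] <- k_new - 1L         # store the topic 0-based again
  }
  mod
}

# Toy data: 2 documents, 3 words, 2 topics.
set.seed(1)
count <- matrix(as.integer(c(0, 0, 2,
                             0, 2, 1,
                             1, 1, 3,
                             1, 2, 1)), ncol = 3, byrow = TRUE)
D <- 2L; V <- 3L; K <- 2L
z <- sample.int(K, nrow(count), replace = TRUE) - 1L
nd <- matrix(0L, D, K)
nw <- matrix(0L, K, V)
for (i in seq_len(nrow(count))) {
  d <- count[i, 1] + 1L; v <- count[i, 2] + 1L; k <- z[i] + 1L
  nd[d, k] <- nd[d, k] + count[i, 3]
  nw[k, v] <- nw[k, v] + count[i, 3]
}
mod <- list(z = z, nd = nd, nw = nw, nwsum = as.integer(rowSums(nw)))
out <- gibbs_pass_sketch(mod, count, V = V, K = K, alpha = 0.1, beta = 0.01)
```

The pass moves tokens between topics but never creates or destroys them, so the per-document totals in `nd` stay equal to `ndsum` throughout.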
Given a document-term matrix in triplet form (d, v, c) using 0-based indices, this function initializes the LDA state:
- samples initial topic assignments z,
- constructs document-topic counts nd,
- constructs topic-word counts nw,
- computes ndsum, nwsum, and normalized topic proportions X.
init_mod_from_count(count, K = NULL, phi = NULL, seed = NULL)
count |
Integer matrix with 3 columns representing triples (d, v, c), where d and v are 0-based indices. |
K |
Integer, number of topics. Required if 'phi' is NULL. If 'phi' is provided, K is inferred from ncol(phi). |
phi |
Optional numeric matrix of size V x K specifying per-word topic probabilities used only during initialization. |
seed |
Optional integer random seed. |
If a topic-word probability matrix 'phi' is provided (V x K), initial topics are sampled according to phi[v+1, ]. Otherwise, topics are sampled uniformly from K topics.
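The two sampling modes can be sketched in base R as follows. The helper `sample_initial_topics` is hypothetical and only mirrors the rule stated above (row `phi[v + 1, ]` when `phi` is given, uniform otherwise); the package's own initializer may differ in details such as normalization or random number streams.

```r
# Hypothetical helper mirroring the documented initialization rule.
sample_initial_topics <- function(count, K, phi = NULL) {
  vapply(seq_len(nrow(count)), function(i) {
    v <- count[i, 2]                                   # 0-based word index
    prob <- if (is.null(phi)) NULL else phi[v + 1L, ]  # NULL means uniform
    sample.int(K, 1L, prob = prob) - 1L                # return 0-based topic
  }, integer(1))
}

set.seed(42)
V <- 4L; K <- 3L
count <- matrix(as.integer(c(0, 0, 2,
                             0, 3, 1,
                             1, 1, 1,
                             1, 3, 2)), ncol = 3, byrow = TRUE)

# Illustrative V x K matrix of per-word topic probabilities.
phi <- matrix(runif(V * K), nrow = V, ncol = K)
phi <- phi / rowSums(phi)

z_phi  <- sample_initial_topics(count, K, phi)  # guided by phi[v + 1, ]
z_unif <- sample_initial_topics(count, K)       # uniform over K topics
```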
A list with components:
Integer vector (length NZ) of sampled topics, 0-based.
DxK document-topic count matrix.
KxV topic-word count matrix.
Integer vector (length D) with row sums of nd.
Integer vector (length K) with row sums of nw.
DxK matrix of normalized topic proportions nd / ndsum.
Number of documents.
Vocabulary size.
Number of topics.
Number of non-zero entries (rows in count).
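To make the returned components concrete, the following base-R sketch rebuilds them from a triplet matrix and a vector of topic assignments. The function `build_state` is hypothetical; the names `z`, `nd`, `nw`, `ndsum`, `nwsum`, and `X` come from the description above, while the dimension names `D`, `V`, `K`, `NZ` and the rule of inferring `D` and `V` from the largest index are assumptions.

```r
# Hypothetical reconstruction of the LDA state from (d, v, c) triples and topics z.
build_state <- function(count, z, K) {
  D <- max(count[, 1]) + 1L        # assumed: documents are numbered 0..D-1
  V <- max(count[, 2]) + 1L        # assumed: words are numbered 0..V-1
  nd <- matrix(0L, D, K)
  nw <- matrix(0L, K, V)
  for (i in seq_len(nrow(count))) {
    d <- count[i, 1] + 1L; v <- count[i, 2] + 1L; k <- z[i] + 1L
    nd[d, k] <- nd[d, k] + count[i, 3]   # document-topic counts
    nw[k, v] <- nw[k, v] + count[i, 3]   # topic-word counts
  }
  ndsum <- as.integer(rowSums(nd))
  nwsum <- as.integer(rowSums(nw))
  list(z = z, nd = nd, nw = nw, ndsum = ndsum, nwsum = nwsum,
       X = nd / ndsum,                   # row d is divided by ndsum[d]
       D = D, V = V, K = K, NZ = nrow(count))
}

count <- matrix(as.integer(c(0, 0, 2,
                             0, 2, 1,
                             1, 1, 3,
                             1, 2, 1)), ncol = 3, byrow = TRUE)
z <- c(0L, 1L, 1L, 0L)
state <- build_state(count, z, K = 2L)
```

In this sketch a document with no tokens would produce `NaN` rows in `X`, so empty documents would need separate handling.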
This function performs collapsed Gibbs sampling for the standard LDA model using a sparse document-term representation:
- initializes the LDA state via init_mod_from_count(),
- runs n_iter iterations of the C++ Gibbs kernel eLDA_pass_b_fast(),
- returns the final model state, including posterior topic-word and document-topic distributions.
run_lda_gibbs(count, K, alpha, beta, n_iter = 100L, phi = NULL, seed = NULL, verbose = TRUE, progress_every = 10L)
count |
Integer matrix of size NZ x 3 with rows (d, v, c) in 0-based
indexing: document index |
K |
Integer, number of topics. Required unless |
alpha |
Scalar Dirichlet prior parameter for document-topic distributions. |
beta |
Scalar Dirichlet prior parameter for topic-word distributions. |
n_iter |
Integer, number of Gibbs sweeps to run. |
phi |
Optional V x K topic-word probability matrix used only for
initializing topic assignments in |
seed |
Optional integer random seed passed to the initializer. |
verbose |
Logical; if |
progress_every |
Integer; print progress every this many iterations. |
A list mod containing:
Integer vector of length NZ; final topic assignments (0-based).
D x K document-topic count matrix.
K x V topic-word count matrix.
Integer vector of length D; document token counts.
Integer vector of length K; topic token counts.
V x K topic-word posterior mean
computed from nw.
D x K document-topic posterior mean
computed from nd.
Vector of log-likelihoods.
Number of documents.
Vocabulary size.
Number of topics.
Number of non-zero (d, v, c) entries.
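The posterior means "computed from nw" and "computed from nd" are not spelled out here. A common choice for collapsed LDA is the smoothed-count estimate shown below; treat the formulas, the function name `posterior_means`, and the component names `theta` and `phi` as assumptions rather than the package's definition.

```r
# Assumed smoothed-count posterior means for collapsed LDA.
posterior_means <- function(nd, nw, alpha, beta) {
  K <- ncol(nd)
  V <- ncol(nw)
  # D x K: each row is divided by (ndsum[d] + K * alpha).
  theta <- (nd + alpha) / (rowSums(nd) + K * alpha)
  # K x V normalized per topic, then transposed to V x K.
  phi <- t((nw + beta) / (rowSums(nw) + V * beta))
  list(theta = theta, phi = phi)
}

nd <- matrix(c(2L, 1L, 1L, 3L), nrow = 2)          # D = 2, K = 2
nw <- matrix(c(2L, 0L, 0L, 3L, 1L, 1L), nrow = 2)  # K = 2, V = 3
pm <- posterior_means(nd, nw, alpha = 0.1, beta = 0.01)
```

Under this convention each row of `theta` and each column of `phi` sums to one.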
This function fits a multi-output supervised LDA model with a hierarchical prior on regression coefficients.
run_mlstm_vi(count, Y, K, alpha, beta, mu, upsilon, Omega, phi = NULL, seed = NULL, max_iter = 200L, min_iter = 50L, tol_elbo = 1e-04, update_sigma = TRUE, tau = 20L, exact_second_moment = FALSE, show_progress = TRUE, chunk = 5000L, verbose = TRUE, sigma2_init = NULL)
count |
Integer matrix with 3 columns (d, v, c), using 0-based indices.
Each row represents document index |
Y |
Numeric matrix of size D x J containing J response variables
for each of the D documents. NA values are allowed and are ignored
in the initial regression used to seed |
K |
Integer, number of topics. Required if |
alpha |
Dirichlet prior parameter for document-topic distributions. |
beta |
Dirichlet prior parameter for topic-word distributions. |
mu |
Numeric vector of length K; prior mean for each |
upsilon |
Scalar degrees of freedom for the inverse-Wishart prior
on the precision matrix |
Omega |
Numeric K x K positive-definite scale matrix for the inverse-Wishart prior. |
phi |
Optional numeric matrix of size V x K used only to initialize
topic assignments via |
seed |
Optional integer random seed used for initialization. |
max_iter |
Maximum number of variational sweeps. |
min_iter |
Minimum number of sweeps before checking convergence. |
tol_elbo |
Numeric tolerance for the relative ELBO change used in the convergence criterion. |
update_sigma |
Logical; if TRUE, update |
tau |
Log-space cutoff for local topic responsibilities in the C++ routine (controls pruning for stability and speed). |
exact_second_moment |
Logical; reserved flag intended to control whether
the exact second moment |
show_progress |
Logical; forwarded to |
chunk |
Integer; number of documents per parallel block in the C++ E-step. |
verbose |
Logical; if TRUE, print ELBO and its relative change at each sweep. |
sigma2_init |
Optional numeric scalar or length-J vector specifying
the initial noise variances. If |
The latent topic layer is standard LDA, and each response dimension j
follows a Gaussian regression on document-level topic proportions.
Variational inference is performed by repeated calls to the C++ routine
stm_multi_hier_vi_parallel() until convergence or a maximum
number of sweeps is reached.
Convergence is assessed from the relative changes in the evidence lower bound (ELBO) and in the supervised label log-likelihood between consecutive sweeps.
After a minimum number of iterations (min_iter), the algorithm is declared to have converged when both relative changes are non-negative and smaller than the prescribed tolerance (tol_elbo).
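The stopping rule can be sketched as below. The exact definition of the relative change is not given in this text, so the form `(new - old) / abs(old)` is an assumption, and `rel_change` and `has_converged` are hypothetical helper names.

```r
# Assumed definition of the relative change between consecutive sweeps.
rel_change <- function(new, old) (new - old) / abs(old)

# Hypothetical check: both traces must improve by a small non-negative amount.
has_converged <- function(elbo_trace, label_trace, min_iter = 50L, tol = 1e-4) {
  it <- length(elbo_trace)
  if (it < 2L || it < min_iter) return(FALSE)
  r_elbo  <- rel_change(elbo_trace[it], elbo_trace[it - 1L])
  r_label <- rel_change(label_trace[it], label_trace[it - 1L])
  r_elbo >= 0 && r_elbo < tol && r_label >= 0 && r_label < tol
}

elbo  <- c(-1000, -900, -899.99)
label <- c(-50, -40, -39.999)

conv_ok    <- has_converged(elbo, label, min_iter = 3L)   # small positive changes
conv_early <- has_converged(elbo, label, min_iter = 50L)  # too few sweeps so far
conv_drop  <- has_converged(c(-1000, -900, -900.01), label, min_iter = 3L)  # ELBO fell
```

Requiring non-negative changes means a sweep in which either quantity decreases never counts as convergence, however small the decrease.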
A list mod containing (at least):
D x K document-topic counts.
K x V topic-word counts.
Integer vector of length D; document token counts.
Integer vector of length K; topic token counts.
K x J matrix of regression coefficients.
Length-J vector of noise variances.
K x K posterior mean of the precision matrix Lambda (if returned by C++).
Posterior degrees of freedom (if returned by C++).
Posterior scale matrix (if returned by C++).
V x K topic-word posterior mean
computed from nw.
D x K document-topic posterior mean
computed from nd.
Final ELBO value.
Final label log-likelihood term.
Numeric vector of ELBO values over iterations.
Numeric vector of label log-likelihoods.
Number of sweeps actually performed.
Number of documents.
Vocabulary size.
Number of topics.
Number of response dimensions.
Number of non-zero (d, v, c) entries.
This function fits a supervised topic model (STM) using variational inference.
It initializes topic assignments from count (optionally using a
topic-word prior phi), estimates regression parameters, and repeatedly
calls the C++ routine stm_vi_parallel() until convergence.
run_stm_vi(count, y, K, alpha, beta, phi = NULL, seed = NULL, max_iter = 200L, min_iter = 50L, tol_elbo = 1e-04, update_sigma = TRUE, tau = 20L, show_progress = TRUE, chunk = 5000L, verbose = TRUE, sigma2_init = NULL)
count |
Integer matrix with 3 columns (d, v, c) in 0-based indexing.
Each row represents document index |
y |
Numeric vector of length D. Must not contain NA values. |
K |
Integer, number of topics. Required if |
alpha |
Dirichlet prior parameter for document-topic distributions. |
beta |
Dirichlet prior parameter for topic-word distributions. |
phi |
Optional V x K topic-word probability matrix used only for initializing topic assignments. |
seed |
Optional integer random seed used in the initialization step. |
max_iter |
Maximum number of variational sweeps. |
min_iter |
Minimum number of sweeps before checking ELBO convergence. |
tol_elbo |
Numeric tolerance for relative ELBO change. |
update_sigma |
Logical; if TRUE, update |
tau |
Numeric log-space cutoff used in |
show_progress |
Logical; print low-level progress inside C++. |
chunk |
Integer; number of documents per parallel block. |
verbose |
Logical; print ELBO and relative change per sweep. |
sigma2_init |
Optional numeric scalar specifying the initial
noise variance. If NULL, |
Convergence is assessed from the relative changes in the evidence lower bound (ELBO) and in the supervised label log-likelihood between consecutive sweeps.
After a minimum number of iterations (min_iter), the algorithm is declared to have converged when both relative changes are non-negative and smaller than the prescribed tolerance (tol_elbo).
**Important:**
This function assumes that the response vector y contains **no NA**
values. The underlying C++ implementation does not skip missing responses
and requires y[d] to be finite for all documents.
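One way to satisfy this requirement is to drop documents with a missing or non-finite response before fitting and to renumber the remaining documents so the 0-based indices stay contiguous. The helper `drop_na_docs` below is a hypothetical preprocessing sketch, not part of the package.

```r
# Hypothetical preprocessing: keep only documents with a finite response.
drop_na_docs <- function(count, y) {
  keep <- which(is.finite(y))               # 1-based positions of usable documents
  new_id <- rep(NA_integer_, length(y))
  new_id[keep] <- seq_along(keep) - 1L      # new contiguous 0-based document ids
  rows <- !is.na(new_id[count[, 1] + 1L])   # triples belonging to kept documents
  out <- count[rows, , drop = FALSE]
  out[, 1] <- new_id[out[, 1] + 1L]         # renumber the document column
  list(count = out, y = y[keep])
}

y <- c(1.2, NA, 0.7)
count <- matrix(as.integer(c(0, 0, 2,
                             1, 1, 1,
                             2, 0, 3,
                             2, 2, 1)), ncol = 3, byrow = TRUE)
clean <- drop_na_docs(count, y)
```

After this step `clean$y` has one finite entry per remaining document, and `clean$count` refers to documents 0 and 1 only.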
A list containing:
D x K document-topic count matrix.
K x V topic-word count matrix.
Length-D vector of document token counts.
Length-K vector of topic token counts.
K-dimensional regression coefficient vector.
Final noise variance.
V x K topic-word posterior mean.
D x K document-topic posterior mean.
Final ELBO.
Final supervised term.
ELBO values per sweep.
Label log-likelihood per sweep.
Number of iterations actually performed.
Model dimensions.
This helper configures OpenMP/BLAS threads to ensure reproducible and stable performance across the low-level C++ routines used by the package.
set_threads(num_threads = NULL)
num_threads |
Integer number of threads. If NULL, use (cores - 1). |
Invisibly returns an integer giving the number of threads used.
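The default rule (cores - 1) can be sketched as follows. This illustrates only the arithmetic: it assumes the core count comes from `parallel::detectCores()`, which may not be how the package determines it, and the hypothetical `default_threads` does not itself change any OpenMP or BLAS setting.

```r
# Hypothetical illustration of the "cores - 1" default; sets nothing by itself.
default_threads <- function(num_threads = NULL) {
  if (is.null(num_threads)) {
    cores <- parallel::detectCores()
    if (is.na(cores)) cores <- 2L            # fallback when detection fails
    num_threads <- max(1L, cores - 1L)       # never go below one thread
  }
  as.integer(num_threads)
}

n_default  <- default_threads()     # machine dependent
n_explicit <- default_threads(4)    # an explicit request is passed through
```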
The model includes:
- LDA structure: theta_d ~ Dir(alpha), phi_k ~ Dir(beta)
- Gaussian response: y[d,j] ~ N(zbar_d^T eta_j, sigma_j^2)
- Hierarchical prior: eta_j ~ N(mu, Lambda^-1), Lambda ~ inverse-Wishart(upsilon, Omega)
stm_multi_hier_vi_parallel(mod, docs, y, ndsum, NZ, V, K, J, alpha, beta, mu, upsilon, Omega, update_sigma = TRUE, tau = 20L, exact_second_moment = FALSE, show_progress = TRUE, chunk = 5000L)
mod |
List with model state: - nd (D x K) document-topic counts - nw (K x V) topic-word counts - eta (K x J) regression coefficients - sigma2 (J) noise variances |
docs |
IntegerMatrix (NZ x 3) with (doc_id, word_id, count). |
y |
NumericMatrix (D x J) response matrix. |
ndsum |
IntegerVector (D) document token counts. |
NZ, V, K, J |
Model dimensions. |
alpha, beta |
Dirichlet hyperparameters. |
mu |
NumericVector (K) prior mean. |
upsilon |
Degrees of freedom for inverse-Wishart. |
Omega |
Scale matrix for inverse-Wishart. |
update_sigma |
Logical; update sigma2 or not. |
tau |
Numeric cutoff for stability. |
exact_second_moment |
Logical flag (currently not used). |
show_progress |
Logical; print progress. |
chunk |
Integer; documents per parallel block. |
A list with updated variational parameters and diagnostics:
D x K integer matrix of document-topic counts.
K x V integer matrix of topic-word counts.
K x J numeric matrix of regression coefficients.
Length-J numeric vector of noise variances.
K x K numeric matrix, posterior mean of precision matrix Lambda.
Numeric scalar, posterior degrees of freedom.
K x K numeric matrix, posterior scale matrix.
Numeric scalar, evidence lower bound.
Numeric scalar, supervised log-likelihood term.
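To illustrate the response layer of the model this routine targets, the sketch below simulates Y from given document-topic counts. It covers only the Gaussian response and the prior on eta_j; the precision matrix `Lambda` is held at a hypothetical fixed value instead of being drawn from its prior, and all numbers are illustrative.

```r
set.seed(7)
D <- 5L; K <- 3L; J <- 2L

# Illustrative document-topic counts and empirical topic proportions zbar_d.
nd <- matrix(rpois(D * K, lambda = 4) + 1L, D, K)
zbar <- nd / rowSums(nd)

mu <- rep(0, K)
Lambda <- diag(4, K)                 # hypothetical fixed precision matrix
Sigma <- solve(Lambda)               # covariance of eta_j
L <- t(chol(Sigma))                  # lower factor, Sigma = L %*% t(L)

# eta_j ~ N(mu, Lambda^-1), drawn independently for each response j (K x J).
eta <- sapply(seq_len(J), function(j) mu + as.vector(L %*% rnorm(K)))

# y[d, j] ~ N(zbar_d^T eta_j, sigma_j^2)
sigma2 <- c(0.5, 1.0)
noise <- sweep(matrix(rnorm(D * J), D, J), 2, sqrt(sigma2), `*`)
Y <- zbar %*% eta + noise
```

Sharing `mu` and `Lambda` across the J columns of `eta` is what ties the response dimensions together in the hierarchical prior.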
The model combines unsupervised topic modeling (LDA) with a Gaussian response on document-level topic proportions.
stm_vi_parallel(mod, docs, y, ndsum, NZ, V, K, alpha, beta, update_sigma = TRUE, tau = 20L, show_progress = TRUE, chunk = 5000L)
mod |
A list containing the current model state:
|
docs |
IntegerMatrix of size NZ x 3, where each row is a triple (d, v, c) in 0-based indexing: document index d, word index v, and count c = n_dv. Rows with d outside [0, D-1] are ignored. |
y |
NumericVector of length D; response y_d for each document. |
ndsum |
IntegerVector of length D; total token count per document (that is, ndsum[d] = sum_v n_dv). |
NZ |
Integer, number of non-zero entries in docs (rows of docs). |
V |
Integer, vocabulary size. |
K |
Integer, number of topics. |
alpha |
Scalar Dirichlet prior parameter for document-topic distributions theta_d (symmetric prior with parameter alpha). |
beta |
Scalar Dirichlet prior parameter for topic-word distributions phi_k (symmetric prior with parameter beta). |
update_sigma |
Logical; if TRUE, update the noise variance sigma2 from residuals y_d - zbar_d^T eta, otherwise keep sigma2 fixed. |
tau |
Numeric, log-space cutoff used to prune very small topic responsibilities phi[d,i,k] for numerical stability and efficiency. |
show_progress |
Logical; if TRUE, print simple progress output during the E-step over documents. |
chunk |
Integer, number of documents to process per parallel block in the E-step. Larger values reduce overhead but may use more memory. |
This function performs one variational inference sweep with a parallel document-level E-step and simple updates for the regression parameters.
A list with updated variational parameters and diagnostics:
Updated D x K document-topic counts.
Updated K x V topic-word counts.
Updated K-dimensional regression coefficient vector.
Updated scalar noise variance.
Scalar evidence lower bound (approximate).
Gaussian response log-likelihood component.
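The noise-variance update and the Gaussian response term can be sketched from the residuals y_d - zbar_d^T eta. This is a plug-in version under the assumption zbar_d = nd[d, ] / ndsum[d]; the C++ routine may add variational second-moment terms that this hypothetical `update_sigma2` omits.

```r
# Hypothetical plug-in update of sigma2 and the Gaussian log-likelihood.
update_sigma2 <- function(nd, ndsum, y, eta) {
  zbar <- nd / ndsum                        # row d is divided by ndsum[d]
  pred <- as.vector(zbar %*% eta)           # fitted values zbar_d^T eta
  resid <- y - pred
  sigma2 <- mean(resid^2)                   # mean squared residual
  loglik <- sum(dnorm(y, mean = pred, sd = sqrt(sigma2), log = TRUE))
  list(sigma2 = sigma2, loglik = loglik, resid = resid)
}

nd <- matrix(c(2L, 1L, 1L, 3L), nrow = 2)   # rows: (2, 1) and (1, 3)
ndsum <- c(3L, 4L)
eta <- c(1, -1)
y <- c(0.5, -0.25)
fit <- update_sigma2(nd, ndsum, y, eta)
```

With `update_sigma = FALSE`, the analogous computation would keep `sigma2` at its current value and evaluate only the log-likelihood.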