Characterizing Topics with Background Noise • alto

This vignette demonstrates the use of alignment to characterize topics that are contaminated by a kind of background noise. It corresponds to the second simulation in the manuscript accompanying this package. The arguments to this vignette (which can be modified in the original rmarkdown’s YAML) are,

K: The true number of topics underlying the simulated data. In the manuscript, \(K = 5\).
N: The number of samples (i.e., documents) to simulate. In the manuscript, this is set to 250.
V: The number of dimensions (i.e. vocabulary size) per sample. In the manuscript, this is set to 1000.
alpha: To what extent are the simulated data from a true LDA model (as opposed to background noise). Set to 1 for data from an LDA model and 0 for pure background noise.
id: A descriptive ID to associate with any saved results.
method: The alignment strategy to pass to align_topics.
n_models: The total number of models to fit to the simulated data. In the manuscript, this is set to 10.
out_dir: If results are saved, where should they be saved to?
save: Should any results be saved?

We load packages. The sim_gradient function that generates the contaminated LDA model is sourced from the link below.

library(alto)
#> 
#> Attaching package: 'alto'
#> The following object is masked from 'package:stats':
#> 
#>     weights
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.0     ✔ readr     2.1.4
#> ✔ forcats   1.0.0     ✔ stringr   1.5.0
#> ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
#> ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
#> ✔ purrr     1.0.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the 
library(MCMCpack)
#> Loading required package: coda
#> Loading required package: MASS
#> 
#> Attaching package: 'MASS'
#> 
#> The following object is masked from 'package:dplyr':
#> 
#>     select
#> 
#> ##
#> ## Markov Chain Monte Carlo Package (MCMCpack)
#> ## Copyright (C) 2003-2023 Andrew D. Martin, Kevin M. Quinn, and Jong Hee Park
#> ##
#> ## Support provided by the U.S. National Science Foundation
#> ## (Grants SES-0350646 and SES-0350613)
#> ##
source("https://raw.githubusercontent.com/krisrs1128/topic_align/main/simulations/simulation_functions.R")
my_theme()
#> Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
#> ℹ Please use the `linewidth` argument instead.

The block below simulates from the contaminated LDA model. The \(\lambda_{\beta}\) and \(\lambda_{\gamma}\) parameters are as in the LDA simulation in the sim_lda.Rmd vignette. The \(\lambda_{\nu}\) parameter specifies the hyperparameter of the Dirichlet distribution used in the background noise.

attach(params)
lambdas <- list(beta = 0.1, gamma = 0.5, nu = 0.5, count = 1e4)
sim_data_ <- simulate_gradient(2 * N, K, V, lambdas, alpha = params$alpha)
sim_data <- sim_data_
sim_data$x <- sim_data$x[1:N, ]
sim_data$gamma <- sim_data$gamma[1:N, ]

Next, we run a sequence of LDA models and compute an alignment. This code is identical across vignettes.

lda_params <- map(1:n_models, ~ list(k = .))
names(lda_params) <- str_c("K", 1:n_models)
alignment <- sim_data$x %>%
  run_lda_models(lda_params, reset = TRUE, dir = "./fits/background_",  seed = as.integer(id)) %>%
  align_topics(method = params$method)
#> Using default value 'VEM' for 'method' LDA parameter.
#> Using default value 'VEM' for 'method' LDA parameter.
#> Using default value 'VEM' for 'method' LDA parameter.
#> Using default value 'VEM' for 'method' LDA parameter.
#> Using default value 'VEM' for 'method' LDA parameter.

Next, we extract the data that summarize the quality of topics emerging from the alignment.

scores <- topics(alignment)  %>%
  mutate(id = params$id)
key_topics <- compute_number_of_paths(alignment)

By saving these data into different directories, we can gather results across a variety of \(\alpha\)’s.

id_vars <- params[c("out_dir", "method", "alpha", "id", "N", "V", "K")]
if (params$save) {
  dir.create(params$out_dir, recursive = TRUE)
  write_csv(scores, save_str("scores", id_vars))
  write_csv(key_topics, save_str("key_topics", id_vars))
  exper <- list(sim_data, alignment)
  save(exper, file = save_str("exper", id_vars, "rda"))
}

if (params$perplexity && params$save) {
  perplexities <- matrix(nrow = params$n_models - 1, ncol = 2, dimnames = list(NULL, c("train", "test")))
  for (k in seq(2, params$n_models)) {
    load(str_c("fits/background_K", k, ".Rdata"))
    perplexities[k - 1, 1] <- topicmodels::perplexity(tm, sim_data$x)
    perplexities[k - 1, 2] <- topicmodels::perplexity(tm, sim_data_$x[(N + 1):(2 * N), ])
  }

  cbind(K = seq(2, params$n_models), perplexities) %>%
    as_tibble() %>%
    write_csv(save_str("perplexity", id_vars))
}