Threading for Within Chain Parallelization

Code

library(collapsibleTree)
# library(kableExtra)
# library(patchwork)
# library(latex2exp)
# library(bayesplot)
# library(tidybayes)
# library(loo)
# library(posterior)
library(cmdstanr)
library(tidyverse)

theme_set(theme_bw(base_size = 16, base_line_size = 2))
set_cmdstan_path("~/Torsten/cmdstan")

1 Introduction

Bayesian methods are often considered too computationally expensive to implement in practice due to the necessity of sampling the posterior distribution through Markov Chain Monte Carlo (MCMC) methods¹. However, there are methods to speed up the MCMC sampling.

1.1 Completely Sequential MCMC

Historically MCMC is performed in a completely sequential manner (this is an interactive tree. Click on it):

1.2 MCMC with Parallel Chains

Stan (along with many other softwares²) can automatically sample the chains in parallel:

1.3 Within-Chain Parallelization

Within each iteration, we calculate the posterior density (up to a constant): \[ \pi(\mathbf{\theta \, | \, \mathbf{y}}) \propto \mathcal{L}(\mathbf{\theta \, | \, \mathbf{y}})\,p(\mathbf{\theta})\] where $\mathcal{L(\cdot\,|\,\cdot)}$ is the likelihood and $p(\cdot)$ is the prior. For us, we generally have $n_{subj}$ independent subjects, and so we can calculate the likelihood for each individual separately from the others³. Typically, the likelihood for each subject is calculated sequentially, but Stan has multiple methods for parallelization. The reduce_sum function implemented in most of the Stan models on this website⁴ uses multi-threading that allows the individual likelihoods to be calculated in parallel rather than sequentially:

This within-chain parallelization can lead to substantial speed-ups in computation time⁵.

2 Example: One-Compartment IV

Here’s an example of a simple, no parallelization, bare-bones, one-compartment IV model written in Stan:

// IV infusion
// One-compartment PK Model
// IIV on CL and VC (full covariance matrix)
// proportional error - DV = IPRED*(1 + eps_p)
// Matrix-exponential solution using Torsten 
// Deals with BLOQ values by the "CDF trick" (M4)
// Since we have a normal distribution on the error, but the DV must be > 0, it
//   truncates the likelihood below at 0

functions{

  real normal_lb_rng(real mu, real sigma, real lb){
    
    real p_lb = normal_cdf(lb | mu, sigma);
    real u = uniform_rng(p_lb, 1);
    real y = mu + sigma * inv_Phi(u);
    return y; 

  }
}
data{
  
  int n_subjects;
  int n_total;
  int n_obs;
  array[n_obs] int i_obs;
  array[n_total] int ID;
  array[n_total] real amt;
  array[n_total] int cmt;
  array[n_total] int evid;
  array[n_total] real rate;
  array[n_total] real ii;
  array[n_total] int addl;
  array[n_total] int ss;
  array[n_total] real time;
  vector<lower = 0>[n_total] dv;
  array[n_subjects] int subj_start;
  array[n_subjects] int subj_end;
  vector[n_total] lloq;
  array[n_total] int bloq;
  
  real<lower = 0> location_tvcl;  // Prior Location parameter for CL
  real<lower = 0> location_tvvc;  // Prior Location parameter for VC
  
  real<lower = 0> scale_tvcl;     // Prior Scale parameter for CL
  real<lower = 0> scale_tvvc;     // Prior Scale parameter for VC
  
  real<lower = 0> scale_omega_cl; // Prior scale parameter for omega_cl
  real<lower = 0> scale_omega_vc; // Prior scale parameter for omega_vc
  
  real<lower = 0> lkj_df_omega;   // Prior degrees of freedom for omega cor mat
  
  real<lower = 0> scale_sigma_p;  // Prior Scale parameter for proportional error
  
  int<lower = 0, upper = 1> prior_only; // Want to simulate from the prior?
  
}
transformed data{ 
  
  vector<lower = 0>[n_obs] dv_obs = dv[i_obs];
  array[n_obs] int dv_obs_id = ID[i_obs];
  
  vector[n_obs] lloq_obs = lloq[i_obs];
  array[n_obs] int bloq_obs = bloq[i_obs];
  
  int n_random = 2;                    // Number of random effects
  int n_cmt = 1;                       // Number of states in the ODEs
  
  array[n_random] real scale_omega = {scale_omega_cl, scale_omega_vc}; 
  
  array[n_cmt] real bioav = rep_array(1.0, n_cmt); // Hardcoding, but could be data or a parameter in another situation
  array[n_cmt] real tlag = rep_array(0.0, n_cmt);
  
}
parameters{ 
  
  real<lower = 0> TVCL;       
  real<lower = 0> TVVC;
  
  vector<lower = 0>[n_random] omega;
  cholesky_factor_corr[n_random] L;
  
  real<lower = 0> sigma_p;
  
  matrix[n_random, n_subjects] Z;
  
}
transformed parameters{
  
  vector[n_subjects] eta_cl;
  vector[n_subjects] eta_vc;
  vector[n_subjects] CL;
  vector[n_subjects] VC;
  vector[n_subjects] KE;
  
  vector[n_obs] ipred;

  {
  
    row_vector[n_random] typical_values = to_row_vector({TVCL, TVVC});

    matrix[n_subjects, n_random] eta = diag_pre_multiply(omega, L * Z)';

    matrix[n_subjects, n_random] theta =
                          (rep_matrix(typical_values, n_subjects) .* exp(eta));
                          
    vector[n_total] dv_ipred;
    matrix[n_total, n_cmt] x_ipred;                          
    
    eta_cl = col(eta, 1);
    eta_vc = col(eta, 2);
    CL = col(theta, 1);
    VC = col(theta, 2);
    KE = CL ./ VC;
  
    for(j in 1:n_subjects){
        
      matrix[n_cmt, n_cmt] K = rep_matrix(0, n_cmt, n_cmt);
      K[1, 1] = -KE[j];
      
      x_ipred[subj_start[j]:subj_end[j], ] =
        pmx_solve_linode(time[subj_start[j]:subj_end[j]],
                         amt[subj_start[j]:subj_end[j]],
                         rate[subj_start[j]:subj_end[j]],
                         ii[subj_start[j]:subj_end[j]],
                         evid[subj_start[j]:subj_end[j]],
                         cmt[subj_start[j]:subj_end[j]],
                         addl[subj_start[j]:subj_end[j]],
                         ss[subj_start[j]:subj_end[j]],
                         K, bioav, tlag)';
                      
      dv_ipred[subj_start[j]:subj_end[j]] = 
        x_ipred[subj_start[j]:subj_end[j], 1] ./ VC[j];
  
    }
  
    ipred = dv_ipred[i_obs];
  
  }
  
}
model{ 
  
  // Priors
  TVCL ~ lognormal(log(location_tvcl), scale_tvcl);
  TVVC ~ lognormal(log(location_tvvc), scale_tvvc);

  omega ~ normal(0, scale_omega);
  L ~ lkj_corr_cholesky(lkj_df_omega);
  
  sigma_p ~ normal(0, scale_sigma_p);
  
  to_vector(Z) ~ std_normal();
  
   // Likelihood
  if(prior_only == 0){
    for(i in 1:n_obs){
      real sigma_tmp = ipred[i]*sigma_p;
      if(bloq_obs[i] == 1){
        target += log_diff_exp(normal_lcdf(lloq_obs[i] | ipred[i], sigma_tmp),
                               normal_lcdf(0.0 | ipred[i], sigma_tmp)) -
                   normal_lccdf(0.0 | ipred[i], sigma_tmp); 
      }else{
        target += normal_lpdf(dv_obs[i] | ipred[i], sigma_tmp) -
                  normal_lccdf(0.0 | ipred[i], sigma_tmp);
      }
    }
  }
}

2.1 Completely Sequential MCMC

Here is the implementation of the above model in R with completely sequential sampling:

nonmem_data <- read_csv(here::here("data/iv_1cmt_prop.csv"),
                        na = ".",
                        show_col_types = FALSE) %>% 
  rename_all(tolower) %>% 
  rename(ID = "id",
         DV = "dv") %>% 
  mutate(DV = if_else(is.na(DV), 5555555, DV),    # This value can be anything except NA. It'll be indexed away 
         bloq = if_else(is.na(bloq), -999, bloq), # This value can be anything except NA. It'll be indexed away 
         cmt = 1)

n_subjects <- nonmem_data %>%  # number of individuals
  distinct(ID) %>%
  count() %>%
  deframe()

n_total <- nrow(nonmem_data)   # total number of records

i_obs <- nonmem_data %>%
  mutate(row_num = 1:n()) %>%
  filter(evid == 0) %>%
  select(row_num) %>%
  deframe()

n_obs <- length(i_obs)

subj_start <- nonmem_data %>%
  mutate(row_num = 1:n()) %>%
  group_by(ID) %>%
  slice_head(n = 1) %>%
  ungroup() %>%
  select(row_num) %>%
  deframe()

subj_end <- c(subj_start[-1] - 1, n_total)

stan_data <- list(n_subjects = n_subjects,
                  n_total = n_total,
                  n_obs = n_obs,
                  i_obs = i_obs,
                  ID = nonmem_data$ID,
                  amt = nonmem_data$amt,
                  cmt = nonmem_data$cmt,
                  evid = nonmem_data$evid,
                  rate = nonmem_data$rate,
                  ii = nonmem_data$ii,
                  addl = nonmem_data$addl,
                  ss = nonmem_data$ss,
                  time = nonmem_data$time,
                  dv = nonmem_data$DV,
                  subj_start = subj_start,
                  subj_end = subj_end,
                  lloq = nonmem_data$lloq,
                  bloq = nonmem_data$bloq,
                  location_tvcl = 0.25,
                  location_tvvc = 3,
                  scale_tvcl = 1,
                  scale_tvvc = 1,
                  scale_omega_cl = 0.4,
                  scale_omega_vc = 0.4,
                  lkj_df_omega = 2,
                  scale_sigma_p = 0.5,
                  prior_only = 0)

fit_completely_sequential <- suppressMessages(model_no_threading$sample(
  data = stan_data,
  seed = 8675309,
  chains = 4,
  parallel_chains = 1, 
  iter_warmup = 500,
  iter_sampling = 1000,
  adapt_delta = 0.8,
  refresh = 0,
  max_treedepth = 10,
  init = function() list(TVCL = rlnorm(1, log(0.25), 0.3),
                         TVVC = rlnorm(1, log(3), 0.3),
                         omega = rlnorm(2, log(0.3), 0.3),
                         sigma_p = rlnorm(1, log(0.2), 0.3))))

Running MCMC with 4 sequential chains...

Chain 1 finished in 71.2 seconds.
Chain 2 finished in 72.5 seconds.
Chain 3 finished in 64.8 seconds.
Chain 4 finished in 69.0 seconds.

All 4 chains finished successfully.
Mean chain execution time: 69.4 seconds.
Total execution time: 277.9 seconds.

2.2 MCMC with Parallel Chains

Here is the implementation of the above model in R with parallel chains:

fit_parallel_chains <- suppressMessages(model_no_threading$sample(
  data = stan_data,
  seed = 112358,
  chains = 4,
  parallel_chains = 4,
  iter_warmup = 500,
  iter_sampling = 1000,
  adapt_delta = 0.8,
  refresh = 0,
  max_treedepth = 10,
  init = function() list(TVCL = rlnorm(1, log(0.25), 0.3),
                         TVVC = rlnorm(1, log(3), 0.3),
                         omega = rlnorm(2, log(0.3), 0.3),
                         sigma_p = rlnorm(1, log(0.2), 0.3))))

1: Notice parallel_chains = 4, where it was 1 before.

Running MCMC with 4 parallel chains...

Chain 4 finished in 68.9 seconds.
Chain 1 finished in 71.3 seconds.
Chain 3 finished in 72.4 seconds.
Chain 2 finished in 72.8 seconds.

All 4 chains finished successfully.
Mean chain execution time: 71.4 seconds.
Total execution time: 73.0 seconds.

2.3 MCMC with Parallel Chains and Within-Chain Parallelization

Here’s the same one-compartment IV model written in Stan, but now the within- chain parallelization is implemented with the reduce_sum() function:

// IV infusion
// One-compartment PK Model
// IIV on CL and VC (full covariance matrix)
// proportional error - DV = IPRED*(1 + eps_p)
// Matrix-exponential solution using Torsten 
// Implements threading for within-chain parallelization 
// Deals with BLOQ values by the "CDF trick" (M4)
// Since we have a normal distribution on the error, but the DV must be > 0, it
//   truncates the likelihood below at 0

functions{

  array[] int sequence(int start, int end) { 
    array[end - start + 1] int seq;
    for (n in 1:num_elements(seq)) {
      seq[n] = n + start - 1;
    }
    return seq; 
  } 
  
  int num_between(int lb, int ub, array[] int y){
    
    int n = 0;
    for(i in 1:num_elements(y)){
      if(y[i] >= lb && y[i] <= ub)
         n = n + 1;
    }
    return n;
    
  }
  
  array[] int find_between(int lb, int ub, array[] int y) {
    // vector[num_between(lb, ub, y)] result;
    array[num_between(lb, ub, y)] int result;
    int n = 1;
    for (i in 1:num_elements(y)) {
      if (y[i] >= lb && y[i] <= ub) {
        result[n] = y[i];
        n = n + 1;
      }
    }
    return result;
  }
  
  vector find_between_vec(int lb, int ub, array[] int idx, vector y) {
    
    vector[num_between(lb, ub, idx)] result;
    int n = 1;
    if(num_elements(idx) != num_elements(y)) reject("illegal input");
    for (i in 1:rows(y)) {
      if (idx[i] >= lb && idx[i] <= ub) {
        result[n] = y[i];
        n = n + 1;
      }
    }
    return result;
  }
  
  real normal_lb_rng(real mu, real sigma, real lb){
    
    real p_lb = normal_cdf(lb | mu, sigma);
    real u = uniform_rng(p_lb, 1);
    real y = mu + sigma * inv_Phi(u);
    return y; 

  }
  
  real partial_sum_lpmf(array[] int seq_subj, int start, int end,
                        vector dv_obs, array[] int dv_obs_id, array[] int i_obs,
                        array[] real amt, array[] int cmt, array[] int evid, 
                        array[] real time, array[] real rate, array[] real ii, 
                        array[] int addl, array[] int ss,
                        array[] int subj_start, array[] int subj_end, 
                        vector CL, vector VC,
                        real sigma_p, 
                        vector lloq, array[] int bloq,
                        int n_random, int n_subjects, int n_total,
                        array[] real bioav, array[] real tlag, int n_cmt){
                           
    real ptarget = 0;
                              
    int N = end - start + 1;    // number of subjects in this slice  
    vector[n_total] dv_ipred;   
    matrix[n_total, n_cmt] x_ipred;
  
    int n_obs_slice = num_between(subj_start[start], subj_end[end], i_obs);
    array[n_obs_slice] int i_obs_slice = find_between(subj_start[start], 
                                                      subj_end[end], i_obs);
                                                
    vector[n_obs_slice] dv_obs_slice = find_between_vec(start, end, 
                                                        dv_obs_id, dv_obs);
    
    vector[n_obs_slice] ipred_slice;
    
    vector[n_obs_slice] lloq_slice = lloq[i_obs_slice];
    array[n_obs_slice] int bloq_slice = bloq[i_obs_slice];
    
    
    for(n in 1:N){           // loop over subjects in this slice
    
      int j = n + start - 1; // j is the ID of the current subject
        
      real ke = CL[j]/VC[j];
        
      matrix[n_cmt, n_cmt] K = rep_matrix(0, n_cmt, n_cmt);
      K[1, 1] = -ke;
      
      x_ipred[subj_start[j]:subj_end[j], ] =
        pmx_solve_linode(time[subj_start[j]:subj_end[j]],
                         amt[subj_start[j]:subj_end[j]],
                         rate[subj_start[j]:subj_end[j]],
                         ii[subj_start[j]:subj_end[j]],
                         evid[subj_start[j]:subj_end[j]],
                         cmt[subj_start[j]:subj_end[j]],
                         addl[subj_start[j]:subj_end[j]],
                         ss[subj_start[j]:subj_end[j]],
                         K, bioav, tlag)';
                      
      dv_ipred[subj_start[j]:subj_end[j]] = 
        x_ipred[subj_start[j]:subj_end[j], 1] ./ VC[j];
    
    }
  
    ipred_slice = dv_ipred[i_obs_slice];
    
    for(i in 1:n_obs_slice){
      real sigma_tmp = ipred_slice[i]*sigma_p;
      if(bloq_slice[i] == 1){
        ptarget += log_diff_exp(normal_lcdf(lloq_slice[i] | ipred_slice[i], 
                                                            sigma_tmp),
                                normal_lcdf(0.0 | ipred_slice[i], sigma_tmp)) -
                   normal_lccdf(0.0 | ipred_slice[i], sigma_tmp); 
      }else{
        ptarget += normal_lpdf(dv_obs_slice[i] | ipred_slice[i], sigma_tmp) -
                   normal_lccdf(0.0 | ipred_slice[i], sigma_tmp);
      }
    }                                         
                              
    return ptarget;
                           
  }
  
}
data{
  
  int n_subjects;
  int n_total;
  int n_obs;
  array[n_obs] int i_obs;
  array[n_total] int ID;
  array[n_total] real amt;
  array[n_total] int cmt;
  array[n_total] int evid;
  array[n_total] real rate;
  array[n_total] real ii;
  array[n_total] int addl;
  array[n_total] int ss;
  array[n_total] real time;
  vector<lower = 0>[n_total] dv;
  array[n_subjects] int subj_start;
  array[n_subjects] int subj_end;
  vector[n_total] lloq;
  array[n_total] int bloq;
  
  real<lower = 0> location_tvcl;  // Prior Location parameter for CL
  real<lower = 0> location_tvvc;  // Prior Location parameter for VC
  
  real<lower = 0> scale_tvcl;     // Prior Scale parameter for CL
  real<lower = 0> scale_tvvc;     // Prior Scale parameter for VC
  
  real<lower = 0> scale_omega_cl; // Prior scale parameter for omega_cl
  real<lower = 0> scale_omega_vc; // Prior scale parameter for omega_vc
  
  real<lower = 0> lkj_df_omega;   // Prior degrees of freedom for omega cor mat
  
  real<lower = 0> scale_sigma_p;  // Prior Scale parameter for proportional error
  
  int<lower = 0, upper = 1> prior_only; // Want to simulate from the prior?
  
}
transformed data{ 
  
  int grainsize = 1;
  
  vector<lower = 0>[n_obs] dv_obs = dv[i_obs];
  array[n_obs] int dv_obs_id = ID[i_obs];
  
  vector[n_obs] lloq_obs = lloq[i_obs];
  array[n_obs] int bloq_obs = bloq[i_obs];
  
  int n_random = 2;                    // Number of random effects
  int n_cmt = 1;                       // Number of states in the ODEs
  
  array[n_random] real scale_omega = {scale_omega_cl, scale_omega_vc}; 
  
  array[n_subjects] int seq_subj = sequence(1, n_subjects); // reduce_sum over subjects
  
  array[n_cmt] real bioav = rep_array(1.0, n_cmt); // Hardcoding, but could be data or a parameter in another situation
  array[n_cmt] real tlag = rep_array(0.0, n_cmt);
  
}
parameters{ 
  
  real<lower = 0> TVCL;       
  real<lower = 0> TVVC;
  
  vector<lower = 0>[n_random] omega;
  cholesky_factor_corr[n_random] L;
  
  real<lower = 0> sigma_p;
  
  matrix[n_random, n_subjects] Z;
  
}
transformed parameters{
  
  vector[n_subjects] eta_cl;
  vector[n_subjects] eta_vc;
  vector[n_subjects] CL;
  vector[n_subjects] VC;
  vector[n_subjects] KE;

  {
  
    row_vector[n_random] typical_values = to_row_vector({TVCL, TVVC});

    matrix[n_subjects, n_random] eta = diag_pre_multiply(omega, L * Z)';

    matrix[n_subjects, n_random] theta =
                          (rep_matrix(typical_values, n_subjects) .* exp(eta));
    
    eta_cl = col(eta, 1);
    eta_vc = col(eta, 2);
    CL = col(theta, 1);
    VC = col(theta, 2);
    KE = CL ./ VC;
  
  }
  
}
model{ 
  
  // Priors
  TVCL ~ lognormal(log(location_tvcl), scale_tvcl);
  TVVC ~ lognormal(log(location_tvvc), scale_tvvc);

  omega ~ normal(0, scale_omega);
  L ~ lkj_corr_cholesky(lkj_df_omega);
  
  sigma_p ~ normal(0, scale_sigma_p);
  
  to_vector(Z) ~ std_normal();
  
  // Likelihood
  if(prior_only == 0){
    target += reduce_sum(partial_sum_lupmf, seq_subj, grainsize,
                         dv_obs, dv_obs_id, i_obs,
                         amt, cmt, evid, time, 
                         rate, ii, addl, ss, subj_start, subj_end, 
                         CL, VC,
                         sigma_p,
                         lloq, bloq,
                         n_random, n_subjects, n_total,
                         bioav, tlag, n_cmt);
  }
}

Here is the implementation of this model in R with within-chain parallelization:

fit_threaded <- suppressMessages(model_threaded$sample(
  data = stan_data,
  seed = 112358,
  chains = 4,
  parallel_chains = 4,
  # threads_per_chain = parallel::detectCores()/4,
  threads_per_chain = 4,
  iter_warmup = 500,
  iter_sampling = 1000,
  adapt_delta = 0.8,
  refresh = 0,
  max_treedepth = 10,
  init = function() list(TVCL = rlnorm(1, log(0.25), 0.3),
                         TVVC = rlnorm(1, log(3), 0.3),
                         omega = rlnorm(2, log(0.3), 0.3),
                         sigma_p = rlnorm(1, log(0.2), 0.3))))

Running MCMC with 4 parallel chains, with 4 thread(s) per chain...

Chain 1 finished in 25.2 seconds.
Chain 2 finished in 26.7 seconds.
Chain 4 finished in 27.1 seconds.
Chain 3 finished in 29.4 seconds.

All 4 chains finished successfully.
Mean chain execution time: 27.1 seconds.
Total execution time: 29.5 seconds.

You can see that there is a small speed-up. If you have more available cores, then there will be a larger speed-up.

A guide for choosing the chains, parallel_chains, and threads_per_chain arguments is to

Figure out how many cores you have available (parallel::detectCores())
Choose the number of chains you want to sample (we recommend 4)
If chains < parallel::detectCores(), then have parallel_chains = chains (almost all modern machines have at least 4 cores)
threads_per_chain should be anywhere between 1 and $\frac{parallel::detectCores()}{parallel\_chains}$

For example, if your machine has 32 cores, we recommend having 4 chains, 4 parallel_chains, and 8 threads_per_chain. This will make use of all the available cores. Using more threads_per_chain than this won’t be helpful in reducing execution time, since the available cores are already in use.

2.3.1 Explanation for Stan Code Implementation of Within-Chain Parallelization

sequence(), num_between(), find_between(), and find_between_vec() are functions to help make sure correct observations are passed along with the subjects in a given slice.

sequence(start, end) creates a sequence of integers from start to end where start and end are integers, similarly to doing in R seq(start, to, by = 1).
num_between(lb, ub, y) finds the number of elements of the integer array $y$ such that $lb \leq y \leq ub$, similarly to doing in R sum(y >= lb & y <= ub) or sum(dplyr::between(y, lb, ub)).
find_between(lb, ub, y) finds the elements of the integer array $y$ such that $lb \leq y \leq ub$, similarly to doing in R y[y >= lb & y <= ub].
find_between_vec(lb, ub, idx, y) finds the elements of the vector $y$ whose index is $\geq lb$ and whose index is $\leq ub$, similarly to doing in R y[idx >= lb & idx <= ub]

partial_sum_lpmf() is the function that is input into reduce_sum(). The output of this function is the log-likelihood for the subjects in a particular slice. For example, if we have 4 threads per chain and 12 subjects, the likelihood calculations might be sent in slices of 3 subjects to each of the 4 threads. Then the output for this function would be the likelihood for the 3 subjects in the slice. The reduce_sum() function then adds the results from each slice to get the full log-probability.

Footnotes

There are situations where we can get the posterior distribution in closed form. These typically require a conjugate prior (see here) and a relatively simple model which we rarely see in the PK/PD world.↩︎
e.g. Pumas, Turing.jl (Julia), and PyMC all do this with ease. In NONMEM, you need to do this manually↩︎
In Stan we actually calculate the log-posterior density (up to a constant). Taking some liberties with notation, this means the likelihood can be written as \[\begin{align} \mathcal{L}(\mathbf{\theta \, | \, \mathbf{y}}) &= \prod_{i=1}^{n_{subj}} \mathcal{L}(\mathbf{\theta_i \, | \, \mathbf{y}}) \notag \\ \implies log\left(\mathcal{L}(\mathbf{\theta \, | \, \mathbf{y}})\right) &= \ell(\mathbf{\theta \, | \, \mathbf{y}}) \notag \\ &= \sum_{i=1}^{n_{subj}}\ell(\mathbf{\theta_i \, | \, \mathbf{y}}) \end{align}\] Hence reduce-sum instead of reduce-prod.↩︎
Torsten also implements group integrators that support parallelization through Message Passing Interface (MPI). We won’t go into that here↩︎
Be aware that there is overhead to this parallelization. If you use $M$ threads-per-chain, the speedup will be $< Mx$ and could potentially actually be slower, depending on your machine’s architecture and the computational complexity of the likelihood. In general, the more computationally intensive the likelihood calculation is, e.g. solving ODEs, the more speedup the within-chain parallelization will provide.↩︎

--- title: "Threading for Within Chain Parallelization" format: html: code-fold: false code-summary: "Code" code-tools: true number-sections: true toc-depth: 5 editor_options: chunk_output_type: console code-annotations: hover --- ```{css, echo = FALSE} .scrolling_500 { max-height: 500px; overflow-y: auto; } .scrolling_700 { max-height: 700px; overflow-y: auto; } ``` ```{r package-load} #| label: package-load #| code-fold: true #| include: true #| message: false #| warning: false library(collapsibleTree) # library(kableExtra) # library(patchwork) # library(latex2exp) # library(bayesplot) # library(tidybayes) # library(loo) # library(posterior) library(cmdstanr) library(tidyverse) theme_set(theme_bw(base_size = 16, base_line_size = 2)) set_cmdstan_path("~/Torsten/cmdstan") ``` # Introduction Bayesian methods are often considered too computationally expensive to implement in practice due to the necessity of sampling the posterior distribution through [Markov Chain Monte Carlo (MCMC)](Introduction-to-Bayesian-Inference.qmd#mcmc) methods^[There are situations where we can get the posterior distribution in closed form. These typically require a [conjugate prior](https://en.wikipedia.org/wiki/Conjugate_prior) (see [here](Introduction-to-Bayesian-Inference.qmd#appendix-b)) and a relatively simple model which we rarely see in the PK/PD world.]. However, there are methods to speed up the MCMC sampling. ## Completely Sequential MCMC Historically MCMC is performed in a completely sequential manner (this is an interactive tree. Click on it): ```{r collapsible-tree-typical, cache=TRUE, echo=FALSE} typical_mcmc <- tibble(Main = "MCMC", a1 = "Chain 1, Iteration 1", a2 = "Chain 1, Iteration 2", a3 = "...", a4 = "Chain 1, Iteration n", b1 = "Chain 2, Iteration 1", b2 = "Chain 2, Iteration 2", b3 = "...", b4 = "Chain 2, Iteration n", ellipsis = "...", m1 = "Chain m, Iteration 1", m2 = "Chain m, Iteration 2", m3 = "...", m4 = "Chain m, Iteration n") collapsibleTree(typical_mcmc, root = "Typical MCMC", hierarchy = colnames(typical_mcmc)[-1], collapsed = TRUE, linkLength = 120) ``` ## MCMC with Parallel Chains Stan (along with many other softwares^[*e.g.* Pumas, Turing.jl (Julia), and PyMC all do this with ease. In NONMEM, you need to do this manually]) can automatically sample the chains in parallel: ```{r collapsible-tree-parallel, cache=TRUE, echo=FALSE} parallel_mcmc <- tibble(Main = rep("MCMC", 4), Chain = str_c("Chain ", 1:4), a1 = rep("Iteration 1", 4), a2 = rep("Iteration 2", 4), a3 = rep("...", 4), a4 = rep("Iteration n", 4)) collapsibleTree(parallel_mcmc, root = "Parallel MCMC", hierarchy = colnames(parallel_mcmc)[-1], collapsed = TRUE) ``` ## Within-Chain Parallelization Within each iteration, we calculate the posterior density (up to a constant): $$ \pi(\mathbf{\theta \, | \, \mathbf{y}}) \propto \mathcal{L}(\mathbf{\theta \, | \, \mathbf{y}})\,p(\mathbf{\theta})$$ where $\mathcal{L(\cdot\,|\,\cdot)}$ is the likelihood and $p(\cdot)$ is the prior. For us, we generally have $n_{subj}$ independent subjects, and so we can calculate the likelihood for each individual separately from the others[^parallel-1]. Typically, the likelihood for each subject is calculated sequentially, but Stan has [multiple methods](https://mc-stan.org/docs/stan-users-guide/parallelization.html) for parallelization. The `reduce_sum` function implemented in most of the Stan models on this website[^parallel-2] uses [multi-threading](https://en.wikipedia.org/wiki/Multithreading_(computer_architecture)) that allows the individual likelihoods to be calculated in parallel rather than sequentially: ```{r collapsible-tree-within-iteration, cache=TRUE, echo=FALSE} within_iteration <- tibble(Main = rep("Iteration i", 6), l2 = c(rep("Standard MCMC", 2), rep("Within-Chain Parallelization", 4)), l3 = rep("Posterior Density", 6), l4 = c("Likelihood", "Priors", rep("Likelihood", 3), "Priors"), l5 = c("Subject 1", "", str_c("Subject ", 1:2), "Subject 3...", ""), l6 = c("Subject 2", rep("", 5)), l7 = c("Subject 3", rep("", 5)), ellipsis = c("...", rep("", 5))) collapsibleTree(within_iteration, root = "Iteration i", hierarchy = colnames(within_iteration)[-1], collapsed = TRUE, linkLength = 100) ``` [^parallel-1]: In Stan we actually calculate the *log-*posterior density (up to a constant). Taking some liberties with notation, this means the likelihood can be written as \begin{align} \mathcal{L}(\mathbf{\theta \, | \, \mathbf{y}}) &= \prod_{i=1}^{n_{subj}} \mathcal{L}(\mathbf{\theta_i \, | \, \mathbf{y}}) \notag \\ \implies log\left(\mathcal{L}(\mathbf{\theta \, | \, \mathbf{y}})\right) &= \ell(\mathbf{\theta \, | \, \mathbf{y}}) \notag \\ &= \sum_{i=1}^{n_{subj}}\ell(\mathbf{\theta_i \, | \, \mathbf{y}}) \end{align} Hence reduce-*sum* instead of reduce-*prod*. [^parallel-2]: Torsten also implements [group integrators](https://metrumresearchgroup.github.io/Torsten/function/ode-group-integ/) that support parallelization through Message Passing Interface (MPI). We won't go into that here This within-chain parallelization can lead to substantial speed-ups in computation time[^parallel-3]. [^parallel-3]: Be aware that there is overhead to this parallelization. If you use $M$ threads-per-chain, the speedup will be $< Mx$ and could potentially actually be slower, depending on your machine's architecture and the computational complexity of the likelihood. In general, the more computationally intensive the likelihood calculation is, *e.g.* solving ODEs, the more speedup the within-chain parallelization will provide. # Example: One-Compartment IV Here's an example of a simple, no parallelization, bare-bones, one-compartment IV model written in Stan: :::{.scrolling_700} ```{verbatim} // IV infusion // One-compartment PK Model // IIV on CL and VC (full covariance matrix) // proportional error - DV = IPRED*(1 + eps_p) // Matrix-exponential solution using Torsten // Deals with BLOQ values by the "CDF trick" (M4) // Since we have a normal distribution on the error, but the DV must be > 0, it // truncates the likelihood below at 0 functions{ real normal_lb_rng(real mu, real sigma, real lb){ real p_lb = normal_cdf(lb | mu, sigma); real u = uniform_rng(p_lb, 1); real y = mu + sigma * inv_Phi(u); return y; } } data{ int n_subjects; int n_total; int n_obs; array[n_obs] int i_obs; array[n_total] int ID; array[n_total] real amt; array[n_total] int cmt; array[n_total] int evid; array[n_total] real rate; array[n_total] real ii; array[n_total] int addl; array[n_total] int ss; array[n_total] real time; vector<lower = 0>[n_total] dv; array[n_subjects] int subj_start; array[n_subjects] int subj_end; vector[n_total] lloq; array[n_total] int bloq; real<lower = 0> location_tvcl; // Prior Location parameter for CL real<lower = 0> location_tvvc; // Prior Location parameter for VC real<lower = 0> scale_tvcl; // Prior Scale parameter for CL real<lower = 0> scale_tvvc; // Prior Scale parameter for VC real<lower = 0> scale_omega_cl; // Prior scale parameter for omega_cl real<lower = 0> scale_omega_vc; // Prior scale parameter for omega_vc real<lower = 0> lkj_df_omega; // Prior degrees of freedom for omega cor mat real<lower = 0> scale_sigma_p; // Prior Scale parameter for proportional error int<lower = 0, upper = 1> prior_only; // Want to simulate from the prior? } transformed data{ vector<lower = 0>[n_obs] dv_obs = dv[i_obs]; array[n_obs] int dv_obs_id = ID[i_obs]; vector[n_obs] lloq_obs = lloq[i_obs]; array[n_obs] int bloq_obs = bloq[i_obs]; int n_random = 2; // Number of random effects int n_cmt = 1; // Number of states in the ODEs array[n_random] real scale_omega = {scale_omega_cl, scale_omega_vc}; array[n_cmt] real bioav = rep_array(1.0, n_cmt); // Hardcoding, but could be data or a parameter in another situation array[n_cmt] real tlag = rep_array(0.0, n_cmt); } parameters{ real<lower = 0> TVCL; real<lower = 0> TVVC; vector<lower = 0>[n_random] omega; cholesky_factor_corr[n_random] L; real<lower = 0> sigma_p; matrix[n_random, n_subjects] Z; } transformed parameters{ vector[n_subjects] eta_cl; vector[n_subjects] eta_vc; vector[n_subjects] CL; vector[n_subjects] VC; vector[n_subjects] KE; vector[n_obs] ipred; { row_vector[n_random] typical_values = to_row_vector({TVCL, TVVC}); matrix[n_subjects, n_random] eta = diag_pre_multiply(omega, L * Z)'; matrix[n_subjects, n_random] theta = (rep_matrix(typical_values, n_subjects) .* exp(eta)); vector[n_total] dv_ipred; matrix[n_total, n_cmt] x_ipred; eta_cl = col(eta, 1); eta_vc = col(eta, 2); CL = col(theta, 1); VC = col(theta, 2); KE = CL ./ VC; for(j in 1:n_subjects){ matrix[n_cmt, n_cmt] K = rep_matrix(0, n_cmt, n_cmt); K[1, 1] = -KE[j]; x_ipred[subj_start[j]:subj_end[j], ] = pmx_solve_linode(time[subj_start[j]:subj_end[j]], amt[subj_start[j]:subj_end[j]], rate[subj_start[j]:subj_end[j]], ii[subj_start[j]:subj_end[j]], evid[subj_start[j]:subj_end[j]], cmt[subj_start[j]:subj_end[j]], addl[subj_start[j]:subj_end[j]], ss[subj_start[j]:subj_end[j]], K, bioav, tlag)'; dv_ipred[subj_start[j]:subj_end[j]] = x_ipred[subj_start[j]:subj_end[j], 1] ./ VC[j]; } ipred = dv_ipred[i_obs]; } } model{ // Priors TVCL ~ lognormal(log(location_tvcl), scale_tvcl); TVVC ~ lognormal(log(location_tvvc), scale_tvvc); omega ~ normal(0, scale_omega); L ~ lkj_corr_cholesky(lkj_df_omega); sigma_p ~ normal(0, scale_sigma_p); to_vector(Z) ~ std_normal(); // Likelihood if(prior_only == 0){ for(i in 1:n_obs){ real sigma_tmp = ipred[i]*sigma_p; if(bloq_obs[i] == 1){ target += log_diff_exp(normal_lcdf(lloq_obs[i] | ipred[i], sigma_tmp), normal_lcdf(0.0 | ipred[i], sigma_tmp)) - normal_lccdf(0.0 | ipred[i], sigma_tmp); }else{ target += normal_lpdf(dv_obs[i] | ipred[i], sigma_tmp) - normal_lccdf(0.0 | ipred[i], sigma_tmp); } } } } ``` ::: ```{r} #| label: model-no-threading #| cache: true #| echo: false model_no_threading <- cmdstan_model(here::here("models/stan/iv_1cmt_prop_no_threading.stan")) ``` ## Completely Sequential MCMC Here is the implementation of the above model in R with completely sequential sampling: :::{.scrolling_500} ```{r} #| label: fit-completely-sequential #| cache: true #| output: true nonmem_data <- read_csv(here::here("data/iv_1cmt_prop.csv"), na = ".", show_col_types = FALSE) %>% rename_all(tolower) %>% rename(ID = "id", DV = "dv") %>% mutate(DV = if_else(is.na(DV), 5555555, DV), # This value can be anything except NA. It'll be indexed away bloq = if_else(is.na(bloq), -999, bloq), # This value can be anything except NA. It'll be indexed away cmt = 1) n_subjects <- nonmem_data %>% # number of individuals distinct(ID) %>% count() %>% deframe() n_total <- nrow(nonmem_data) # total number of records i_obs <- nonmem_data %>% mutate(row_num = 1:n()) %>% filter(evid == 0) %>% select(row_num) %>% deframe() n_obs <- length(i_obs) subj_start <- nonmem_data %>% mutate(row_num = 1:n()) %>% group_by(ID) %>% slice_head(n = 1) %>% ungroup() %>% select(row_num) %>% deframe() subj_end <- c(subj_start[-1] - 1, n_total) stan_data <- list(n_subjects = n_subjects, n_total = n_total, n_obs = n_obs, i_obs = i_obs, ID = nonmem_data$ID, amt = nonmem_data$amt, cmt = nonmem_data$cmt, evid = nonmem_data$evid, rate = nonmem_data$rate, ii = nonmem_data$ii, addl = nonmem_data$addl, ss = nonmem_data$ss, time = nonmem_data$time, dv = nonmem_data$DV, subj_start = subj_start, subj_end = subj_end, lloq = nonmem_data$lloq, bloq = nonmem_data$bloq, location_tvcl = 0.25, location_tvvc = 3, scale_tvcl = 1, scale_tvvc = 1, scale_omega_cl = 0.4, scale_omega_vc = 0.4, lkj_df_omega = 2, scale_sigma_p = 0.5, prior_only = 0) fit_completely_sequential <- suppressMessages(model_no_threading$sample( data = stan_data, seed = 8675309, chains = 4, parallel_chains = 1, iter_warmup = 500, iter_sampling = 1000, adapt_delta = 0.8, refresh = 0, max_treedepth = 10, init = function() list(TVCL = rlnorm(1, log(0.25), 0.3), TVVC = rlnorm(1, log(3), 0.3), omega = rlnorm(2, log(0.3), 0.3), sigma_p = rlnorm(1, log(0.2), 0.3)))) ``` ::: ## MCMC with Parallel Chains Here is the implementation of the above model in R with parallel chains: :::{.scrolling_700} ```{r} #| label: fit-parallel-chains #| cache: true #| output: true fit_parallel_chains <- suppressMessages(model_no_threading$sample( data = stan_data, seed = 112358, chains = 4, parallel_chains = 4, # <1> iter_warmup = 500, iter_sampling = 1000, adapt_delta = 0.8, refresh = 0, max_treedepth = 10, init = function() list(TVCL = rlnorm(1, log(0.25), 0.3), TVVC = rlnorm(1, log(3), 0.3), omega = rlnorm(2, log(0.3), 0.3), sigma_p = rlnorm(1, log(0.2), 0.3)))) ``` 1. Notice ```parallel_chains = 4```, where it was 1 before. ::: ## MCMC with Parallel Chains and Within-Chain Parallelization Here's the same one-compartment IV model written in Stan, but now the within- chain parallelization is implemented with the [reduce_sum()](https://mc-stan.org/docs/stan-users-guide/reduce-sum.html) function: :::{.scrolling_700} ```{verbatim} // IV infusion // One-compartment PK Model // IIV on CL and VC (full covariance matrix) // proportional error - DV = IPRED*(1 + eps_p) // Matrix-exponential solution using Torsten // Implements threading for within-chain parallelization // Deals with BLOQ values by the "CDF trick" (M4) // Since we have a normal distribution on the error, but the DV must be > 0, it // truncates the likelihood below at 0 functions{ array[] int sequence(int start, int end) { array[end - start + 1] int seq; for (n in 1:num_elements(seq)) { seq[n] = n + start - 1; } return seq; } int num_between(int lb, int ub, array[] int y){ int n = 0; for(i in 1:num_elements(y)){ if(y[i] >= lb && y[i] <= ub) n = n + 1; } return n; } array[] int find_between(int lb, int ub, array[] int y) { // vector[num_between(lb, ub, y)] result; array[num_between(lb, ub, y)] int result; int n = 1; for (i in 1:num_elements(y)) { if (y[i] >= lb && y[i] <= ub) { result[n] = y[i]; n = n + 1; } } return result; } vector find_between_vec(int lb, int ub, array[] int idx, vector y) { vector[num_between(lb, ub, idx)] result; int n = 1; if(num_elements(idx) != num_elements(y)) reject("illegal input"); for (i in 1:rows(y)) { if (idx[i] >= lb && idx[i] <= ub) { result[n] = y[i]; n = n + 1; } } return result; } real normal_lb_rng(real mu, real sigma, real lb){ real p_lb = normal_cdf(lb | mu, sigma); real u = uniform_rng(p_lb, 1); real y = mu + sigma * inv_Phi(u); return y; } real partial_sum_lpmf(array[] int seq_subj, int start, int end, vector dv_obs, array[] int dv_obs_id, array[] int i_obs, array[] real amt, array[] int cmt, array[] int evid, array[] real time, array[] real rate, array[] real ii, array[] int addl, array[] int ss, array[] int subj_start, array[] int subj_end, vector CL, vector VC, real sigma_p, vector lloq, array[] int bloq, int n_random, int n_subjects, int n_total, array[] real bioav, array[] real tlag, int n_cmt){ real ptarget = 0; int N = end - start + 1; // number of subjects in this slice vector[n_total] dv_ipred; matrix[n_total, n_cmt] x_ipred; int n_obs_slice = num_between(subj_start[start], subj_end[end], i_obs); array[n_obs_slice] int i_obs_slice = find_between(subj_start[start], subj_end[end], i_obs); vector[n_obs_slice] dv_obs_slice = find_between_vec(start, end, dv_obs_id, dv_obs); vector[n_obs_slice] ipred_slice; vector[n_obs_slice] lloq_slice = lloq[i_obs_slice]; array[n_obs_slice] int bloq_slice = bloq[i_obs_slice]; for(n in 1:N){ // loop over subjects in this slice int j = n + start - 1; // j is the ID of the current subject real ke = CL[j]/VC[j]; matrix[n_cmt, n_cmt] K = rep_matrix(0, n_cmt, n_cmt); K[1, 1] = -ke; x_ipred[subj_start[j]:subj_end[j], ] = pmx_solve_linode(time[subj_start[j]:subj_end[j]], amt[subj_start[j]:subj_end[j]], rate[subj_start[j]:subj_end[j]], ii[subj_start[j]:subj_end[j]], evid[subj_start[j]:subj_end[j]], cmt[subj_start[j]:subj_end[j]], addl[subj_start[j]:subj_end[j]], ss[subj_start[j]:subj_end[j]], K, bioav, tlag)'; dv_ipred[subj_start[j]:subj_end[j]] = x_ipred[subj_start[j]:subj_end[j], 1] ./ VC[j]; } ipred_slice = dv_ipred[i_obs_slice]; for(i in 1:n_obs_slice){ real sigma_tmp = ipred_slice[i]*sigma_p; if(bloq_slice[i] == 1){ ptarget += log_diff_exp(normal_lcdf(lloq_slice[i] | ipred_slice[i], sigma_tmp), normal_lcdf(0.0 | ipred_slice[i], sigma_tmp)) - normal_lccdf(0.0 | ipred_slice[i], sigma_tmp); }else{ ptarget += normal_lpdf(dv_obs_slice[i] | ipred_slice[i], sigma_tmp) - normal_lccdf(0.0 | ipred_slice[i], sigma_tmp); } } return ptarget; } } data{ int n_subjects; int n_total; int n_obs; array[n_obs] int i_obs; array[n_total] int ID; array[n_total] real amt; array[n_total] int cmt; array[n_total] int evid; array[n_total] real rate; array[n_total] real ii; array[n_total] int addl; array[n_total] int ss; array[n_total] real time; vector<lower = 0>[n_total] dv; array[n_subjects] int subj_start; array[n_subjects] int subj_end; vector[n_total] lloq; array[n_total] int bloq; real<lower = 0> location_tvcl; // Prior Location parameter for CL real<lower = 0> location_tvvc; // Prior Location parameter for VC real<lower = 0> scale_tvcl; // Prior Scale parameter for CL real<lower = 0> scale_tvvc; // Prior Scale parameter for VC real<lower = 0> scale_omega_cl; // Prior scale parameter for omega_cl real<lower = 0> scale_omega_vc; // Prior scale parameter for omega_vc real<lower = 0> lkj_df_omega; // Prior degrees of freedom for omega cor mat real<lower = 0> scale_sigma_p; // Prior Scale parameter for proportional error int<lower = 0, upper = 1> prior_only; // Want to simulate from the prior? } transformed data{ int grainsize = 1; vector<lower = 0>[n_obs] dv_obs = dv[i_obs]; array[n_obs] int dv_obs_id = ID[i_obs]; vector[n_obs] lloq_obs = lloq[i_obs]; array[n_obs] int bloq_obs = bloq[i_obs]; int n_random = 2; // Number of random effects int n_cmt = 1; // Number of states in the ODEs array[n_random] real scale_omega = {scale_omega_cl, scale_omega_vc}; array[n_subjects] int seq_subj = sequence(1, n_subjects); // reduce_sum over subjects array[n_cmt] real bioav = rep_array(1.0, n_cmt); // Hardcoding, but could be data or a parameter in another situation array[n_cmt] real tlag = rep_array(0.0, n_cmt); } parameters{ real<lower = 0> TVCL; real<lower = 0> TVVC; vector<lower = 0>[n_random] omega; cholesky_factor_corr[n_random] L; real<lower = 0> sigma_p; matrix[n_random, n_subjects] Z; } transformed parameters{ vector[n_subjects] eta_cl; vector[n_subjects] eta_vc; vector[n_subjects] CL; vector[n_subjects] VC; vector[n_subjects] KE; { row_vector[n_random] typical_values = to_row_vector({TVCL, TVVC}); matrix[n_subjects, n_random] eta = diag_pre_multiply(omega, L * Z)'; matrix[n_subjects, n_random] theta = (rep_matrix(typical_values, n_subjects) .* exp(eta)); eta_cl = col(eta, 1); eta_vc = col(eta, 2); CL = col(theta, 1); VC = col(theta, 2); KE = CL ./ VC; } } model{ // Priors TVCL ~ lognormal(log(location_tvcl), scale_tvcl); TVVC ~ lognormal(log(location_tvvc), scale_tvvc); omega ~ normal(0, scale_omega); L ~ lkj_corr_cholesky(lkj_df_omega); sigma_p ~ normal(0, scale_sigma_p); to_vector(Z) ~ std_normal(); // Likelihood if(prior_only == 0){ target += reduce_sum(partial_sum_lupmf, seq_subj, grainsize, dv_obs, dv_obs_id, i_obs, amt, cmt, evid, time, rate, ii, addl, ss, subj_start, subj_end, CL, VC, sigma_p, lloq, bloq, n_random, n_subjects, n_total, bioav, tlag, n_cmt); } } ``` ::: ```{r} #| label: model-with-threading #| cache: true #| echo: false #| output: false model_threaded <- cmdstan_model(here::here("models/stan/iv_1cmt_prop_no_gq.stan"), cpp_options = list(stan_threads = TRUE)) ``` Here is the implementation of this model in R with within-chain parallelization: :::{.scrolling_700} ```{r} #| label: fit-threading #| cache: true #| output: true fit_threaded <- suppressMessages(model_threaded$sample( data = stan_data, seed = 112358, chains = 4, parallel_chains = 4, # threads_per_chain = parallel::detectCores()/4, threads_per_chain = 4, iter_warmup = 500, iter_sampling = 1000, adapt_delta = 0.8, refresh = 0, max_treedepth = 10, init = function() list(TVCL = rlnorm(1, log(0.25), 0.3), TVVC = rlnorm(1, log(3), 0.3), omega = rlnorm(2, log(0.3), 0.3), sigma_p = rlnorm(1, log(0.2), 0.3)))) ``` ::: You can see that there is a small speed-up. If you have more available cores, then there will be a larger speed-up. A guide for choosing the `chains, parallel_chains,` and `threads_per_chain` arguments is to 1. Figure out how many cores you have available (`parallel::detectCores()`) 2. Choose the number of `chains` you want to sample (we recommend 4) 3. If `chains < parallel::detectCores()`, then have `parallel_chains = chains` (almost all modern machines have at least 4 cores) 4. `threads_per_chain` should be anywhere between 1 and $\frac{parallel::detectCores()}{parallel\_chains}$ For example, if your machine has 32 cores, we recommend having 4 `chains`, 4 `parallel_chains`, and 8 `threads_per_chain`. This will make use of all the available cores. Using more `threads_per_chain` than this won't be helpful in reducing execution time, since the available cores are already in use. ### Explanation for Stan Code Implementation of Within-Chain Parallelization ```sequence()```, ```num_between()```, ```find_between()```, and ```find_between_vec()``` are functions to help make sure correct observations are passed along with the subjects in a given slice. - ```sequence(start, end)``` creates a sequence of integers from ```start``` to ```end``` where ```start``` and ```end``` are integers, similarly to doing in R ```seq(start, to, by = 1)```. - ```num_between(lb, ub, y)``` finds the number of elements of the integer array $y$ such that $lb \leq y \leq ub$, similarly to doing in R ```sum(y >= lb & y <= ub)``` or ```sum(dplyr::between(y, lb, ub))```. - ```find_between(lb, ub, y)``` finds the elements of the integer array $y$ such that $lb \leq y \leq ub$, similarly to doing in R ```y[y >= lb & y <= ub]```. - ```find_between_vec(lb, ub, idx, y)``` finds the elements of the vector $y$ whose index is $\geq lb$ and whose index is $\leq ub$, similarly to doing in R ```y[idx >= lb & idx <= ub]``` ```partial_sum_lpmf()``` is the function that is input into ```reduce_sum()```. The output of this function is the log-likelihood for the subjects in a particular slice. For example, if we have 4 threads per chain and 12 subjects, the likelihood calculations might be sent in slices of 3 subjects to each of the 4 threads. Then the output for this function would be the likelihood for the 3 subjects in the slice. The ```reduce_sum()``` function then adds the results from each slice to get the full log-probability.