Handling Censored Data

1 Data Below the Limit of Quantification

Oftentimes, we have observations that are below the limit of quantification (BLOQ), meaning that the assay has only been validated down to a certain value (the lower limit of quantification, LLOQ). A paper by Stuart Beal goes through seven methods of handling BLOQ data in NONMEM. While in my experience M1 (drop any BLOQ values completely) and M5 (replace BLOQ values with \(LLOQ/2\)) are the most common in the pharmacometrics world, M3 (treat the BLOQ values as left-censored data) and M4 (treat the BLOQ values as left-censored data and truncated below at 0) are more technically correct and tend to produce better results.

1.1 Treat the BLOQ values as Left-Censored Data (M3)

Instead of tossing out the BLOQ data (M1) or assigning them some arbitrary value (M5-M7), we should keep them in the data set and treat them as left-censored data. This means that the likelihood contribution for observation \(c_{ij}\) is calculated differently for observed values than for BLOQ values:

\[\begin{align} \mbox{observed data} &: \quad f\left(c_{ij} \, | \, \theta_i, \sigma, t_{ij} \right) \notag \\ \mbox{BLOQ data} &: \quad F\left(LLOQ \, | \, \theta_i, \sigma, t_{ij} \right) \notag \end{align}\]

where \(f\left(c_{ij} \, | \, \theta_i, \sigma, t_{ij} \right)\) is the density (pdf) and \(F\left(LLOQ \, | \, \theta_i, \sigma, t_{ij} \right) = P\left(c_{ij} \leq LLOQ\, | \, \theta_i, \sigma, t_{ij} \right)\) is the cumulative distribution function (cdf).
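As a sanity check on the M3 math, here is a minimal Python sketch of the two likelihood contributions (illustrative function names, standard library only), with the normal CDF built from `math.erf`:

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log of the normal density f(x | mu, sigma)."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def normal_logcdf(x, mu, sigma):
    """Log of the normal CDF F(x | mu, sigma), via the error function."""
    return math.log(0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0)))))

def m3_loglik_contribution(obs, lloq, mu, sigma, bloq):
    """M3: log-density for observed values, log-CDF at the LLOQ for BLOQ values."""
    if bloq:
        return normal_logcdf(lloq, mu, sigma)
    return normal_logpdf(obs, mu, sigma)
```

Each BLOQ record contributes \(P(c_{ij} \leq LLOQ)\) on the log scale, so the censored points still inform the fit without being assigned an arbitrary value.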

1.1.1 Stan Code Example

This example shows the code within a function that implements within-chain parallelization. The relevant part for calculating the likelihood is the if/else block in the loop at the end of the function:

real partial_sum_lpmf(array[] int seq_subj, int start, int end,
                      vector dv_obs, array[] int dv_obs_id, array[] int i_obs,
                      array[] real amt, array[] int cmt, array[] int evid, 
                      array[] real time, array[] real rate, array[] real ii, 
                      array[] int addl, array[] int ss,
                      array[] int subj_start, array[] int subj_end, 
                      vector CL, vector VC, vector Q, vector VP, 
                      real sigma_sq_p, real sigma_sq_a, real sigma_p_a, 
                      vector lloq, array[] int bloq,
                      int n_random, int n_subjects, int n_total,
                      array[] real bioav, array[] real tlag, int n_cmt){
                           
  real ptarget = 0;
                              
  int N = end - start + 1;    // number of subjects in this slice  
  vector[n_total] dv_ipred;   
  matrix[n_total, n_cmt] x_ipred;
  
  int n_obs_slice = num_between(subj_start[start], subj_end[end], i_obs);
  array[n_obs_slice] int i_obs_slice = find_between(subj_start[start], 
                                                    subj_end[end], i_obs);
                                              
  vector[n_obs_slice] dv_obs_slice = find_between_vec(start, end, 
                                                      dv_obs_id, dv_obs);
  
  vector[n_obs_slice] ipred_slice;
    
  vector[n_obs_slice] lloq_slice = lloq[i_obs_slice];
  array[n_obs_slice] int bloq_slice = bloq[i_obs_slice];
    
    
  for(n in 1:N){            // loop over subjects in this slice
    
    int j = n + start - 1; // j is the ID of the current subject
        
    real ke = CL[j]/VC[j];
    real k_cp = Q[j]/VC[j];
    real k_pc = Q[j]/VP[j];
        
    matrix[n_cmt, n_cmt] K = rep_matrix(0, n_cmt, n_cmt);
    K[1, 1] = -(ke + k_cp);
    K[1, 2] = k_pc;
    K[2, 1] = k_cp;
    K[2, 2] = -k_pc;
      
    x_ipred[subj_start[j]:subj_end[j], ] =
      pmx_solve_linode(time[subj_start[j]:subj_end[j]],
                       amt[subj_start[j]:subj_end[j]],
                       rate[subj_start[j]:subj_end[j]],
                       ii[subj_start[j]:subj_end[j]],
                       evid[subj_start[j]:subj_end[j]],
                       cmt[subj_start[j]:subj_end[j]],
                       addl[subj_start[j]:subj_end[j]],
                       ss[subj_start[j]:subj_end[j]],
                       K, bioav, tlag)';
                      
    dv_ipred[subj_start[j]:subj_end[j]] = 
      x_ipred[subj_start[j]:subj_end[j], 1] ./ VC[j];
    
  }
  
  ipred_slice = dv_ipred[i_obs_slice];
    
  for(i in 1:n_obs_slice){
    real ipred_tmp = ipred_slice[i];
    real sigma_tmp = sqrt(square(ipred_tmp) * sigma_sq_p + sigma_sq_a + 
                          2*ipred_tmp*sigma_p_a);
      
    if(bloq_slice[i] == 1){
      ptarget += normal_lcdf(lloq_slice[i] | ipred_tmp, sigma_tmp);
    }else{
      ptarget += normal_lpdf(dv_obs_slice[i] | ipred_tmp, sigma_tmp);
    }
      
  }                                         
                              
  return ptarget;
                           
}

These lines implement the math (on the log scale, since Stan calculates the log-posterior).

\[\begin{align} \mathtt{normal\_lcdf(x)} &= \log\left(F(x)\right), \notag \\ \mathtt{normal\_lpdf(x)} &= \log\left(f(x)\right) \notag \end{align}\]

where \(f(\cdot)\) and \(F(\cdot)\) are the normal density and cumulative distribution functions, respectively¹.

1.2 Treat the BLOQ values as Left-Censored Data and Truncated Below at 0 (M4)

We know that drug concentrations cannot be \(< 0\), but the normal distribution has support over \((-\infty, \, \infty)\)², so we will assume a normal distribution truncated below at 0. This limits the support of our assumed distribution to \((0, \, \infty)\). Since we’re assuming a truncated distribution, we need to adjust the likelihood contributions of our data³: \[\begin{align} \mbox{observed data} &: \quad \frac{f\left(c_{ij} \, | \, \theta_i, \sigma, t_{ij} \right)}{1 - F\left(0 \, | \, \theta_i, \sigma, t_{ij} \right)} \notag \\ \mbox{BLOQ data} &: \quad \frac{F\left(LLOQ \, | \, \theta_i, \sigma, t_{ij} \right) - F\left(0 \, | \, \theta_i, \sigma, t_{ij} \right)}{1 - F\left(0 \, | \, \theta_i, \sigma, t_{ij} \right)} \notag \end{align}\]
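The truncated-normal arithmetic can be sketched the same way. This is a minimal Python illustration of the two M4 contributions (hypothetical names, not the model code); for simplicity it works on the probability scale rather than with the log-scale tricks the Stan code uses:

```python
import math

def normal_cdf(x, mu, sigma):
    """Normal CDF F(x | mu, sigma) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def m4_loglik_contribution(obs, lloq, mu, sigma, bloq):
    """M4: normal likelihood truncated below at 0.
    Both branches divide (subtract, on the log scale) by 1 - F(0)."""
    log_trunc = math.log(1.0 - normal_cdf(0.0, mu, sigma))  # log(1 - F(0))
    if bloq:
        # log(F(LLOQ) - F(0)) - log(1 - F(0))
        return (math.log(normal_cdf(lloq, mu, sigma) - normal_cdf(0.0, mu, sigma))
                - log_trunc)
    # log f(obs) - log(1 - F(0))
    z = (obs - mu) / sigma
    logpdf = -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))
    return logpdf - log_trunc
```

When the predicted concentration is well above 0 relative to \(\sigma\), \(F(0) \approx 0\) and these contributions collapse to the M3 ones, which is why M3 and M4 often give similar fits in practice.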

1.2.1 Stan Code Example

This example shows the code within a function that implements within-chain parallelization. The relevant part for calculating the likelihood is the if/else block in the loop at the end of the function:

real partial_sum_lpmf(array[] int seq_subj, int start, int end,
                      vector dv_obs, array[] int dv_obs_id, array[] int i_obs,
                      array[] real amt, array[] int cmt, array[] int evid, 
                      array[] real time, array[] real rate, array[] real ii, 
                      array[] int addl, array[] int ss,
                      array[] int subj_start, array[] int subj_end, 
                      vector CL, vector VC, vector Q, vector VP, 
                      real sigma_sq_p, real sigma_sq_a, real sigma_p_a, 
                      vector lloq, array[] int bloq,
                      int n_random, int n_subjects, int n_total,
                      array[] real bioav, array[] real tlag, int n_cmt){
                           
  real ptarget = 0;
                              
  int N = end - start + 1;    // number of subjects in this slice  
  vector[n_total] dv_ipred;   
  matrix[n_total, n_cmt] x_ipred;
  
  int n_obs_slice = num_between(subj_start[start], subj_end[end], i_obs);
  array[n_obs_slice] int i_obs_slice = find_between(subj_start[start], 
                                                    subj_end[end], i_obs);
                                              
  vector[n_obs_slice] dv_obs_slice = find_between_vec(start, end, 
                                                      dv_obs_id, dv_obs);
  
  vector[n_obs_slice] ipred_slice;
    
  vector[n_obs_slice] lloq_slice = lloq[i_obs_slice];
  array[n_obs_slice] int bloq_slice = bloq[i_obs_slice];
    
    
  for(n in 1:N){            // loop over subjects in this slice
    
    int j = n + start - 1; // j is the ID of the current subject
        
    real ke = CL[j]/VC[j];
    real k_cp = Q[j]/VC[j];
    real k_pc = Q[j]/VP[j];
        
    matrix[n_cmt, n_cmt] K = rep_matrix(0, n_cmt, n_cmt);
    K[1, 1] = -(ke + k_cp);
    K[1, 2] = k_pc;
    K[2, 1] = k_cp;
    K[2, 2] = -k_pc;
      
    x_ipred[subj_start[j]:subj_end[j], ] =
      pmx_solve_linode(time[subj_start[j]:subj_end[j]],
                       amt[subj_start[j]:subj_end[j]],
                       rate[subj_start[j]:subj_end[j]],
                       ii[subj_start[j]:subj_end[j]],
                       evid[subj_start[j]:subj_end[j]],
                       cmt[subj_start[j]:subj_end[j]],
                       addl[subj_start[j]:subj_end[j]],
                       ss[subj_start[j]:subj_end[j]],
                       K, bioav, tlag)';
                      
    dv_ipred[subj_start[j]:subj_end[j]] = 
      x_ipred[subj_start[j]:subj_end[j], 1] ./ VC[j];
    
  }
  
  ipred_slice = dv_ipred[i_obs_slice];
    
  for(i in 1:n_obs_slice){
    real ipred_tmp = ipred_slice[i];
    real sigma_tmp = sqrt(square(ipred_tmp) * sigma_sq_p + sigma_sq_a + 
                          2*ipred_tmp*sigma_p_a);
      
    if(bloq_slice[i] == 1){
      ptarget += log_diff_exp(normal_lcdf(lloq_slice[i] | ipred_tmp, 
                                                          sigma_tmp),
                              normal_lcdf(0.0 | ipred_tmp, sigma_tmp)) -
                 normal_lccdf(0.0 | ipred_tmp, sigma_tmp); 
    }else{
      ptarget += normal_lpdf(dv_obs_slice[i] | ipred_tmp, sigma_tmp) -
                 normal_lccdf(0.0 | ipred_tmp, sigma_tmp);
    }
  }                                         
                              
  return ptarget;
                           
}

These lines implement the math (on the log scale, since Stan calculates the log-posterior).

\[\begin{align} \mathtt{log\_diff\_exp(normal\_lcdf(lloq), normal\_lcdf(0))} &= \log\left(F(lloq) - F(0)\right), \notag \\ \mathtt{normal\_lcdf(x)} &= \log\left(F(x)\right), \notag \\ \mathtt{normal\_lccdf(x)} &= \log\left(1 - F(x)\right), \notag \\ \mathtt{normal\_lpdf(x)} &= \log\left(f(x)\right) \notag \end{align}\]

where \(f(\cdot)\) and \(F(\cdot)\) are the normal density and cumulative distribution functions, respectively⁴.
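Stan’s log_diff_exp(a, b) computes \(\log(e^a - e^b)\) without leaving the log scale, which matters here because \(F(LLOQ)\) and \(F(0)\) can underflow to 0 in floating point. A one-line Python sketch of the standard identity behind it:

```python
import math

def log_diff_exp(a, b):
    """Stable log(exp(a) - exp(b)) for a > b, mirroring Stan's log_diff_exp."""
    return a + math.log1p(-math.exp(b - a))
```

The naive form `math.log(math.exp(a) - math.exp(b))` fails once `exp(a)` underflows (e.g., a = -1000), while the identity above stays finite.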

Footnotes

  1. For more information, see the Stan documentation for normal_lcdf(), and normal_lpdf().↩︎

  2. A model that assumes log-normal error has support over \((0, \, \infty)\), so truncation is not an issue. In that case, M3 and M4 are equivalent. Mathematically, you can see this by noting that \(F\left(0 \, | \, \theta_i, \sigma, t_{ij} \right) = 0\).↩︎

  3. For observed data with this truncated distribution, we need to “correct” the density so it integrates to 1. Division by \(1 - F(\cdot \, | \, \cdot)\) has this effect. For the censored data, the numerator is similar to the M3 method, but we must also account for the fact that it must be \(>0\), hence \(P(0 \leq c_{ij} \leq LLOQ) = F(LLOQ) - F(0)\). The denominator is corrected in the same manner as for the observed data.↩︎

  4. For more information, see the Stan documentation for log_diff_exp(), normal_lcdf(), normal_lccdf(), and normal_lpdf().↩︎