Bayesian Normal Mixture Models with RSTAN

Bayesian methods are cool. R is cool. That’s pretty much the only convincing reason I feel anyone has for using either. While many statisticians are quick to group themselves under the banner of “Bayesian” or “Frequentist”, complete with hardline views on the correctness of one viewpoint or the other, I don’t have the energy for that. I view statistical inference much the way an engineer views their work: there are many routes to a correct answer, and each is valid. That is not to say that debates over the respective “holes” in either theory are not interesting.

So for my first post I present some work that I submitted as my final project in Stat 430 (Applied Multivariate Methods) at the University of Illinois. I started running some analyses in RSTAN (http://mc-stan.org/rstan.html), which is a wrapper around the STAN environment for R. It’s super easy to use. Stat 430 was a multivariate methods course, and while digging around in the STAN documentation (https://github.com/stan-dev/stan/releases/download/v2.1.0/stan-reference-2.1.0.pdf), I found that while it did cover mixture models, it did not cover multivariate mixture models. With a pretty quick edit to the code given in the STAN reference manual, I got exactly what I was looking for.

I started out by simulating 30 observations each from 3 multivariate normal distributions, with mean vectors of 0, 2, and 4 and identity covariance matrices. The typical setup for mixture models is as follows,

p(y_i \mid \theta, \mu) = \sum_{k=1}^{K} \theta_k \, f(y_i \mid \mu_k),

with y_i being a vector of response values, \theta = (\theta_1, \ldots, \theta_K) representing the mixture probabilities (\theta_k \geq 0, \sum_k \theta_k = 1), and f being a density for y.
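To make this concrete, here is a minimal R sketch (not from the original analysis; the log_mix_density helper is mine) that evaluates the mixture log-density for a single observation using the same log-sum-exp construction that appears in the Stan models further down.

library(mvtnorm)

log_mix_density <- function(y, theta, mu, sigma) {
  # y: one observation (numeric vector); theta: mixing probabilities;
  # mu: list of mean vectors; sigma: list of covariance matrices
  lps <- sapply(seq_along(theta), function(k) {
    log(theta[k]) + dmvnorm(y, mean = mu[[k]], sigma = sigma[[k]], log = TRUE)
  })
  m <- max(lps)
  m + log(sum(exp(lps - m)))   # numerically stable log-sum-exp
}

# Example: an equal-weight two-component mixture in 4 dimensions
log_mix_density(rep(0, 4), theta = c(.5, .5),
                mu = list(rep(0, 4), rep(2, 4)),
                sigma = list(diag(4), diag(4)))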

I chose the RSTAN modelling language for implementing the Bayesian mixture models for several reasons. Firstly, RSTAN uses the Rcpp package for R, allowing resource-heavy computations (MCMC sampling) to be carried out in C++. This speeds up computation and avoids many of the pitfalls seen with other MCMC packages in R (R Core Team, 2013), including non-convergence of results (Gelman, Carlin, Stern, & Rubin, 2003). Secondly, RSTAN is tailored for complex Bayesian hierarchical models, and allows more complexity than is typically possible in the BUGS framework. Lastly, RSTAN has an extensive set of documentation and support, which is helpful when fitting complex models with many random variables. In fact, this project is mostly inspired by an example given in the STAN documentation for implementing a technique known as “Soft K-Means” (Stan Development Team, 2013).

For posterior sampling, RSTAN uses Hamiltonian Monte Carlo (MacKay, 2003; Neal, 2011). Hamiltonian Monte Carlo belongs to the Metropolis family of sampling algorithms, and it improves on simple random-walk Metropolis methods because, for many distributions, the gradient of the log posterior density is readily available. This reduces the variability of the random walk, as the gradient points in the direction of higher probability. Specifically, RSTAN uses the No-U-Turn Sampler, an adaptive form of Hamiltonian Monte Carlo. While not covered in this project, there are several excellent texts that cover sampling techniques extensively, such as Kruschke (2010), or, for a more theoretical treatment, Gelman, Carlin, Stern, & Rubin (2003).
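As a toy illustration (not part of the original post; the grad_log_p and hmc_step names are hypothetical), here is a single Hamiltonian Monte Carlo update in R for a one-dimensional standard normal target. The leapfrog steps show how the gradient of the log density steers the proposal, rather than relying on a blind random walk.

grad_log_p <- function(x) -x   # d/dx of log N(0,1) density (up to a constant)

hmc_step <- function(x, eps = 0.1, L = 20) {
  p <- rnorm(1)                # draw a fresh momentum
  x_new <- x; p_new <- p
  for (l in 1:L) {             # leapfrog integration guided by the gradient
    p_new <- p_new + 0.5 * eps * grad_log_p(x_new)
    x_new <- x_new + eps * p_new
    p_new <- p_new + 0.5 * eps * grad_log_p(x_new)
  }
  # accept or reject based on the change in total "energy"
  log_accept <- (-x_new^2 / 2 - p_new^2 / 2) - (-x^2 / 2 - p^2 / 2)
  if (log(runif(1)) < log_accept) x_new else x
}

# Example: a short chain started at x = 3
x <- 3
for (i in 1:100) x <- hmc_step(x)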

Figure 1: Multivariate Normal Clusters on the First and Second Principal Components

Next, using RSTAN I fit a Bayesian mixture model with an empirical prior distribution for the mean of each of the K normal distributions. Empirical Bayesian methods are similar to regular Bayesian methods, except that the prior distribution hyperparameters are derived from sample statistics. Casella (1985) shows that, for some prior distributions, an empirical Bayesian estimate of a parameter θ has a lower MSE than a parametrically derived Bayesian estimate, and almost certainly a lower MSE than Bayesian estimates using flat priors. Secondly, I fit a model using so-called “weakly informative” priors. Both models are essentially the same as the model in Figure 2, in which θ is the vector of mixing parameters, μ is the vector of means for the component distributions, α is the prior mean, and β is the prior variance. One way of deriving empirical prior means from the data is sketched below.
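As an illustration of deriving prior hyperparameters from sample statistics (a sketch, not the exact procedure used below, where the empirical prior means are written into the Stan code directly), the centers of a preliminary k-means run on the simulated data matrix norms could serve as empirical prior means:

# assumes the data matrix norms built in the R code further down
set.seed(430)
km <- kmeans(norms, centers = 3)
km$centers   # a K x D matrix of candidate empirical prior means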

Figure 2: Bayesian Graphical Model

Simulation Study One

The results for the first simulation study are shown in Table 1. The sampling procedure ran for 10000 iterations over 4 parallel Markov chains. Across all model parameters R̂ = 1, which indicates good model convergence (Gelman, Carlin, Stern, & Rubin, 2003). The three mixing parameters θ1 = .3, θ2 = .4, θ3 = .3 stayed invariant across permutations of their labels during the sampling procedure, most likely because of the empirical prior distributions specified for each μ. Each model mean μi,j represents the mean of the ith component distribution on the jth dimension. Thus, for the simulated multivariate normals, it is clear that the estimation procedure recovered reasonable values for each of the multi-dimensional mean vectors. The variance parameters are also estimated reasonably; for each distribution, the variance was specified as the identity matrix (mostly to make the clusters easy to find). Table 1 also presents measures of dispersion for the posterior distribution of each parameter, including the standard deviation of each parameter across its posterior distribution, as well as the .025, .25, .50, .75, and .975 quantiles. n_eff is a crude measure of model convergence, roughly equivalent to the number of posterior samples needed to ensure convergence. All of these values are larger than the number of posterior samples, implying that there may be some problems with the efficiency of the model.
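For reference (this snippet is not part of the original write-up), once the empirical-prior fit fit.2 from the R code below has been run, the R̂ and n_eff diagnostics reported in Table 1 can be read off the rstan fit object directly:

print(fit.2, pars = c("theta", "mu", "sigma"))
round(summary(fit.2)$summary[, c("n_eff", "Rhat")], 2)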

Simulation Study Two

The second simulation study sampled from the posterior distribution using the same multivariate normal data as the first, but used a “weakly informative” prior distribution for the mean of each component distribution, specifically μ ∼ N(0, 10). The results are presented in Table 2. The effect on the recovery of the distribution parameters is clear. This model ran for the same number of iterations as the first, 10000, yet could not reach convergence for the mixing parameters, so those estimates should be considered unreliable. The model did, however, reach convergence for the rest of the parameters, but only the component distribution with mean 0 had reasonable mean estimates; the other mean estimates are quite far from their respective true values. Also, the variance estimates are roughly double their true values. This indicates that weakly informative priors may only be useful in cases where there is large separation between the clusters, or where prior information indicates a large variance in the multivariate data.
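As a quick visual check (not included in the original post), trace plots of the mixing parameters from the weakly-informative-prior fit (the second fit.2 in the R code below) make the label switching and non-convergence described above easy to see:

traceplot(fit.2, pars = "theta", inc_warmup = FALSE)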

R Code

#Simulate Normal Distributions
library(mvtnorm)
library(rstan)   # also needed for the stan() calls below

### first cluster
mu1 <- c(0, 0, 0, 0)
sigma1 <- matrix(c(1, 0, 0, 0,
                   0, 1, 0, 0,
                   0, 0, 1, 0,
                   0, 0, 0, 1), ncol = 4, nrow = 4, byrow = TRUE)

norm1 <- rmvnorm(30, mean = mu1, sigma = sigma1)

### second cluster
mu2 <- c(2, 2, 2, 2)
sigma2 <- matrix(c(1, 0, 0, 0,
                   0, 1, 0, 0,
                   0, 0, 1, 0,
                   0, 0, 0, 1) + rnorm(1, 0, .1),  # adds a small common jitter to every entry
                 ncol = 4, nrow = 4, byrow = TRUE)

norm2 <- rmvnorm(30, mean = mu2, sigma = sigma2)

### third cluster
mu3 <- c(5, 3.4, 2, 0)
sigma3 <- matrix(c(5, 0, 0, 0,
                   0, 2.5, 0, 0,
                   0, 0, 1.25, 0,
                   0, 0, 0, 3) + rnorm(1, 0, .2),  # same style of jitter
                 ncol = 4, nrow = 4, byrow = TRUE)

norm3 <- rmvnorm(30, mean = mu3, sigma = sigma3)

### stack the three clusters into one 90 x 4 data matrix
norms <- rbind(norm1, norm2, norm3)

N <- 90
Dim <- 4
y <- array(as.vector(norms), dim = c(N, Dim))

#Fit Mixture Models

mixture_dat <- list(
  N = N,
  D = 4,
  K = 3,
  y = y
)

mixture_model <- '
data {
  int<lower=1> D;
  int<lower=1> K;
  int<lower=1> N;
  vector[D] y[N]; // observations
}
parameters {
  simplex[K] theta;                 // mixing proportions
  vector[D] mu[K];                  // component means
  real<lower=0,upper=10> sigma[K];  // scales of mixture components
}
model {
  real ps[K];
  // empirical priors, centered at the simulation means
  // (swap in mu[k] ~ normal(0,10) for the weakly informative version)
  mu[1] ~ normal(0,1);
  mu[2] ~ normal(2,1);
  mu[3] ~ normal(4,1);

  for (n in 1:N) {
    for (k in 1:K) {
      ps[k] <- log(theta[k])
               + normal_log(y[n], mu[k], sigma[k]);
    }
    increment_log_prob(log_sum_exp(ps));
  }
}
'

fit.1 <- stan(model_code = mixture_model, model_name = "Mixture", data = mixture_dat)
fit.2 <- stan(fit = fit.1, model_name = "Mixture 2", data = mixture_dat, iter = 10000)

mixture_model_prior.2 <- '
data {
  int<lower=1> D;
  int<lower=1> K;
  int<lower=1> N;
  vector[D] y[N]; // observations
}
parameters {
  simplex[K] theta;                 // mixing proportions
  vector[D] mu[K];                  // component means
  real<lower=0,upper=10> sigma[K];  // scales of mixture components
}
model {
  real ps[K];
  // weakly informative priors on the component means
  for (k in 1:K) {
    mu[k] ~ normal(0,10);
  }

  for (n in 1:N) {
    for (k in 1:K) {
      ps[k] <- log(theta[k])
               + normal_log(y[n], mu[k], sigma[k]);
    }
    increment_log_prob(log_sum_exp(ps));
  }
}
'

fit.1 <- stan(model_code = mixture_model_prior.2, model_name = "Mixture",
              data = mixture_dat, iter = 1000, chains = 4)

fit.2 <- stan(fit = fit.1, model_name = "Mixture.2", data = mixture_dat,
              iter = 10000, chains = 4)
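As a small follow-up sketch (not in the original code), the posterior draws of mu can be extracted and their posterior means compared against the simulation means, keeping possible label switching in mind:

mu_draws <- rstan::extract(fit.2, pars = "mu")$mu   # iterations x K x D array
apply(mu_draws, c(2, 3), mean)                      # K x D matrix of posterior means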

References

Casella, G. (1985). An Introduction to Empirical Bayes Data Analysis. The American Statistician, Vol. 39, No. 2, pp. 83-87.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian Data Analysis (2nd ed., pp. 285-296). Boca Raton, FL: Chapman & Hall / CRC Press.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed., pp. 214-215). Springer.

Kruschke, J. K. (2010). Doing Bayesian Data Analysis: A Tutorial with R and BUGS.

MacKay, D. (2003). Information Theory, Inference, and Learning Algorithms (1st ed., pp. 389-393). Cambridge, UK: Cambridge University Press. Retrieved from http://www.inference.phy.cam.ac.uk/mackay/itila/

Neal, R. (2011). Handbook of Markov Chain Monte Carlo (1st ed., pp. 1-4). Chapman & Hall / CRC Press.

Nobile, A. (2005). Bayesian finite mixtures: a note on prior specification and posterior computation. Unpublished manuscript, Department of Statistics, University of Glasgow, Glasgow, Scotland.

Wagstaff, K., Cardie, C., & Rogers, S. (2001). Constrained k-means clustering with background knowledge. Department of Computer Science, Cornell University, Ithaca, NY.

R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Shalizi, C. R. (2012). Advanced Data Analysis from an Elementary Point of View (1st ed., pp. 390-395). Retrieved from http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch20.pdf

The Stan Development Team (2013). Stan Modeling Language User’s Guide and Reference Manual.

Wasserman, L. (2012, August 4). Mixture models: The twilight zone of statistics. Retrieved from http://normaldeviate.wordpress.com/2012/08/04/mixture-models-the-twilight-zone-of-statistics/
