Multivariate Correlation Model
Educational Statistics and Research Methods (ESRM) Program*
University of Arkansas
2024-10-09
In “classical” multivariate textbooks and classes, multivariate linear models fall under the names of Multivariate ANOVA (MANOVA) and multivariate regression
These methods rely upon least squares estimation, which:
The classical methods have been subsumed into the modern (likelihood- or Bayes-based) multivariate methods
We will discuss three large classes of multivariate linear modeling methods:
The theory behind each is identical—the main difference is software
Some software does a lot (Mplus software is likely the most complete), but none (as of 2024) does it all
We will start with path analysis (via the lavaan package) as the modeling method is more direct, and then move to linear mixed-models software (via the nlme and lme4 packages)
Bayesian networks will be discussed in the Bayes section of the course and will use entirely different software
Having lots of parameters creates a number of problems
Curse of dimensionality: for multivariate normal data, there is a quadratic increase in the number of parameters as the number of outcomes increases linearly
To be used as an analysis model, however, a model-implied covariance structure must “fit” as well as the saturated/unstructured covariance matrix
\[ S \rightarrow \Sigma \]
Data are simulated based on the results reported in:
Sample of 350 undergraduates (229 women, 121 men)
Make sure you install lavaan package
Dictionary
id hsl cc use msc mas mse perf female
1 1 NA 9 44 55 39 NA 14 1
2 2 3 2 77 70 42 71 12 0
3 3 NA 12 64 52 31 NA NA 1
4 4 6 20 71 65 39 84 19 0
5 5 2 15 48 19 2 60 12 0
6 6 NA 15 61 62 42 87 18 0
Desc() in package DescTools allows you to quick screen data
[1] 350 9
──────────────────────────────────────────────────────────────────────────────
Describe dataMath[, 2:8] (data.frame):
data frame: 350 obs. of 7 variables
145 complete cases (41.4%)
Nr Class ColName NAs Levels
1 int hsl 36 (10.3%)
2 int cc 37 (10.6%)
3 int use 24 (6.9%)
4 int msc 39 (11.1%)
5 int mas 46 (13.1%)
6 int mse 34 (9.7%)
7 int perf 60 (17.1%)
──────────────────────────────────────────────────────────────────────────────
1 - hsl (integer)
length n NAs unique 0s mean meanCI'
350 314 36 8 0 4.92 4.92
89.7% 10.3% 0.0% 4.92
.05 .10 .25 median .75 .90 .95
3.00 3.00 4.00 5.00 6.00 6.00 7.00
range sd vcoef mad IQR skew kurt
7.00 1.32 0.27 1.48 2.00 -0.26 -0.41
value freq perc cumfreq cumperc
1 1 1 0.3% 1 0.3%
2 2 11 3.5% 12 3.8%
3 3 34 10.8% 46 14.6%
4 4 71 22.6% 117 37.3%
5 5 79 25.2% 196 62.4%
6 6 88 28.0% 284 90.4%
7 7 27 8.6% 311 99.0%
8 8 3 1.0% 314 100.0%
' 0%-CI (classic)
──────────────────────────────────────────────────────────────────────────────
2 - cc (integer)
length n NAs unique 0s mean meanCI'
350 313 37 28 21 10.31 10.31
89.4% 10.6% 6.0% 10.31
.05 .10 .25 median .75 .90 .95
0.00 2.00 6.00 10.00 14.00 19.00 20.00
range sd vcoef mad IQR skew kurt
27.00 5.89 0.57 5.93 8.00 0.17 -0.43
lowest : 0 (21), 1 (5), 2 (9), 3 (9), 4 (12)
highest: 23, 24, 25, 26, 27
heap(?): remarkable frequency (8.6%) for the mode(s) (= 10, 13)
' 0%-CI (classic)
──────────────────────────────────────────────────────────────────────────────
3 - use (integer)
length n NAs unique 0s mean meanCI'
350 326 24 71 1 52.50 52.50
93.1% 6.9% 0.3% 52.50
.05 .10 .25 median .75 .90 .95
25.50 31.00 42.25 52.00 64.00 71.50 77.75
range sd vcoef mad IQR skew kurt
100.00 15.81 0.30 16.31 21.75 -0.11 -0.08
lowest : 0, 13, 15, 16, 19 (2)
highest: 84 (2), 86, 92, 95, 100
' 0%-CI (classic)
──────────────────────────────────────────────────────────────────────────────
4 - msc (integer)
length n NAs unique 0s mean meanCI'
350 311 39 76 0 49.79 49.79
88.9% 11.1% 0.0% 49.79
.05 .10 .25 median .75 .90 .95
24.00 29.00 38.00 49.00 61.00 72.00 80.50
range sd vcoef mad IQR skew kurt
98.00 16.96 0.34 16.31 23.00 0.23 -0.09
lowest : 5, 9, 12, 15, 16 (2)
highest: 87 (3), 89, 91, 94, 103
' 0%-CI (classic)
──────────────────────────────────────────────────────────────────────────────
5 - mas (integer)
length n NAs unique 0s mean meanCI'
350 304 46 57 1 31.50 31.50
86.9% 13.1% 0.3% 31.50
.05 .10 .25 median .75 .90 .95
13.00 17.00 24.75 32.00 39.00 45.70 50.00
range sd vcoef mad IQR skew kurt
62.00 11.32 0.36 10.38 14.25 -0.13 -0.17
lowest : 0, 2, 3 (2), 5, 6
highest: 54 (2), 55 (2), 56, 58, 62
' 0%-CI (classic)
──────────────────────────────────────────────────────────────────────────────
6 - mse (integer)
length n NAs unique 0s mean meanCI'
350 316 34 56 0 73.41 73.41
90.3% 9.7% 0.0% 73.41
.05 .10 .25 median .75 .90 .95
54.00 58.00 65.00 73.00 81.00 88.50 92.25
range sd vcoef mad IQR skew kurt
60.00 11.89 0.16 11.86 16.00 0.06 -0.26
lowest : 45 (2), 46, 47, 49 (2), 50 (4)
highest: 98, 100, 101 (2), 103, 105 (2)
' 0%-CI (classic)
──────────────────────────────────────────────────────────────────────────────
7 - perf (integer)
length n NAs unique 0s mean meanCI'
350 290 60 19 0 13.97 13.97
82.9% 17.1% 0.0% 13.97
.05 .10 .25 median .75 .90 .95
9.45 10.00 12.00 14.00 16.00 18.00 19.00
range sd vcoef mad IQR skew kurt
18.00 2.96 0.21 2.97 4.00 0.16 0.14
lowest : 5, 6, 7 (2), 8 (3), 9 (8)
highest: 19 (10), 20 (4), 21 (3), 22 (2), 23
heap(?): remarkable frequency (13.8%) for the mode(s) (= 13)
' 0%-CI (classic)
lavaanThe lavaan package is developed by Dr. Yves Rosseel to provide useRs, researchers and teachers a free open-source, but commercial-quality package for latent variable modeling. You can use lavaan to estimate a large variety of multivariate statistical models, including path analysis, confirmatory factor analysis, structural equation modeling and growth curve models.
lavaan’s default model is a linear (mixed) model that uses ML with the multivariate normal distribution
ML is sometimes called a full-information method (FIML)
The contrast is limited information (not all observations are used; typically summary statistics are used)
\[ \text{PERF}_i = \beta_{0,\color{tomato}{PERF}} + e_{i, \color{tomato}{PERF}} \]
lavaan Syntax: Generallavaan contains three steps:
model01.syntax)model01.fit <- cfa(model01.syntax, data))summary()lavaan package works by taking the typical R model syntax (as from lm()) and putting it into a quoted character variable
lavaan model syntax also includes other commands used to access other parts of the model (really, parts of the MVN distribution)lavaan model syntax in Model building step:
~~ indicates variance or covariance between variables on either side (perf ~~ perf estimates variance)~1 indicates intercept for one variable (perf~1)~ indicates LHS regresses on RHS (perf~use)cfa() function to run the model based on the syntax. Use ?lavOptions to find the interpretation of arguments.
lavaan 0.6-19 ended normally after 8 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 2
Used Total
Number of observations 290 350
Number of missing patterns 1
Model Test User Model:
Standard Scaled
Test Statistic 0.000 0.000
Degrees of freedom 0 0
Parameter Estimates:
Standard errors Sandwich
Information bread Observed
Observed information based on Hessian
Intercepts:
Estimate Std.Err z-value P(>|z|)
perf 13.966 0.174 80.397 0.000
Variances:
Estimate Std.Err z-value P(>|z|)
perf 8.751 0.756 11.581 0.000
\[ \text{PERF}_i \sim N(13.966, 8.751) \]
Perf follows a normal distribution with mean as 13.966, variance as 8.751.Variables: We already know how to use lavaan to estimate a univariate model; let’s move on to model two variables as outcomes simultaneously:
Assumption: We will assume both to be continuous variables (conditionally MVN)
Research Question: Are students’ performance and their perceived usefulness of mathematics significantly related?
Initially, we will only look at an empty model with these two variables:
The multivariate model for perf and use is given by two regression models, which are estimated simultaneously.
\[ \text{PERF}_i = \beta_{0,\text{PERF}}+e_{i,\text{PERF}} \]
\[ \text{USE}_i = \beta_{0,\text{USE}}+e_{i,\text{USE}} \]
As there are two variables, the error terms have a joint distribution that will be multivariate normal
\[ \begin{bmatrix}e_{i,\text{PERF}}\\e_{i,\text{USE}}\end{bmatrix} \sim \mathbf{MVN} (\mathbf{0}=\begin{bmatrix}0\\0\end{bmatrix}, \mathbf{R}=\begin{bmatrix}\sigma^2_{e,\text{PERF}} & \sigma_{e,\text{PERF},\text{USE}}\\ \sigma_{e,\text{PERF},\text{USE}}&\sigma^2_{e,\text{USE}}\end{bmatrix}) \]
Before showing the syntax and the results, we must first describe how the multivariate empty model implies how our data should look
Multivariate model with matrices: \[ \boxed{\begin{bmatrix}\text{PERF}_i\\\text{USE}_i\end{bmatrix} =\begin{bmatrix}1&0\\0&1\end{bmatrix}\begin{bmatrix}\beta_{0,\text{PERF}}\\\beta_{0,\text{USE}}\end{bmatrix} + \begin{bmatrix}e_{i,\text{PERF}}\\e_{i,\text{USE}}\end{bmatrix}} \]
\[ \boxed{\mathbf{Y_i = X_iB + e_i}} \]
\[ \boxed{\begin{bmatrix}\text{PERF}_i\\\text{USE}_i\end{bmatrix} \sim \mathbf{MVN}_2 (\boldsymbol{\mu}_i=\begin{bmatrix}\beta_{0,\text{PERF}}\\\beta_{0,\text{USE}}\end{bmatrix}, \mathbf{V}=\begin{bmatrix}\sigma^2_{e,\text{PERF}} & \sigma_{e,\text{PERF},\text{USE}}\\ \sigma_{e,\text{PERF},\text{USE}}&\sigma^2_{e,\text{USE}}\end{bmatrix})} \]
\[ \boxed{\mathbf{Y}_i \sim MVN_2(\mathbf{X}_i\mathbf{B},\mathbf{V}_i)} \]
Aim: To untangle the simple correlation between perf and use. Note that no predictor is used to explain the variances of the two variables.
Parameters: The number of parameters is 5, including the two variables’ means and three variance-and-covariance components.
This model is said to be saturated: all possible parameters are estimated. It is also called an unstructured covariance matrix.
No other structure for the covariance matrix can fit better (only as well as)
The model for the means are two simple regression models without any predictors
lavaanlavaan 0.6-19 ended normally after 29 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 5
Used Total
Number of observations 348 350
Number of missing patterns 3
Model Test User Model:
Standard Scaled
Test Statistic 0.000 0.000
Degrees of freedom 0 0
Parameter Estimates:
Standard errors Sandwich
Information bread Observed
Observed information based on Hessian
Covariances:
Estimate Std.Err z-value P(>|z|)
perf ~~
use 6.847 2.850 2.403 0.016
Intercepts:
Estimate Std.Err z-value P(>|z|)
perf 13.959 0.174 80.442 0.000
use 52.440 0.872 60.140 0.000
Variances:
Estimate Std.Err z-value P(>|z|)
perf 8.742 0.754 11.596 0.000
use 249.245 19.212 12.973 0.000
Model 1:
Used Total
Number of observations 290 350
Number of missing patterns 1
Model 2:
Used Total
Number of observations 348 350
Number of missing patterns 3
PERF and USECovariance Estimates:
lhs op rhs est se z pvalue ci.lower ci.upper
1 perf ~~ perf 8.74 0.754 11.6 0.000 7.26 10.2
2 use ~~ use 249.25 19.212 13.0 0.000 211.59 286.9
3 perf ~~ use 6.85 2.850 2.4 0.016 1.26 12.4
Standardized Parameters (correlation):
lhs op rhs est.std se z pvalue ci.lower ci.upper
1 perf ~~ perf 1.000 0.00 NA NA 1.000 1.000
2 use ~~ use 1.000 0.00 NA NA 1.000 1.000
3 perf ~~ use 0.147 0.06 2.44 0.015 0.029 0.265
We can verify that: Correlation matrix can be calculated with S (covariance matrix) and variance components.
\[ R = D^{-1}SD \]
[,1] [,2]
[1,] 1.000 0.147
[2,] 0.147 1.000
library(mvtnorm)
x = expand.grid(
perf = seq(0, 25, .1),
use = seq(0, 120, .1)
)
sim_dat <- cbind(
x,
density = dmvnorm(x, mean = c(13.959, 52.440), sigma = S)
)
mod2_density <- ggplot() +
geom_contour_filled(aes(x = perf, y = use, z = density), data = sim_dat) +
geom_point(aes(x = perf, y = use), data = dataMath, colour = "white", alpha = .5) +
labs( x= "PERF", y="USE") +
theme_classic() +
theme(text = element_text(size = 20))
mod2_densityperf and use, which can be compared to the observed data for perf and use.\[ \begin{bmatrix}\text{PERF}_i\\\text{USE}_i\end{bmatrix} \sim \mathbf{N}_2 (\boldsymbol{\mu}_i=\begin{bmatrix}13.959\\52.440\end{bmatrix}, \mathbf{V}_i =\mathbf{R}=\begin{bmatrix}8.742 & 6.847\\ 6.847 & 249.245 \end{bmatrix}) \]