Lecture 04: Distribution and Estimation

Author

Affiliation

Jihong Zhang*, Ph.D

Educational Statistics and Research Methods (ESRM) Program*

University of Arkansas

Published

September 16, 2024

0.1 Today’s Class

Review Homework 1
The building blocks: The basics of mathematical statistics:
- Random variables: Definition & Types
- Univariate distribution
  - General terminology (e.g., sufficient statistics)
  - Univariate normal (aka Gaussian)
  - Other widely used univariate distributions
- Types of distributions: Marginal | Conditional | Joint
- Expected values: means, variances, and the algebra of expectations
- Linear combinations of random variables
The finished product: How the GLM fits within statistics
- The GLM with the normal distribution
- The statistical assumptions of the GLM
- How to assess these assumptions

1 ESRM 64503: Homework 1

1.1 Question 2

Copy and paste your R syntax and R output that calculates the group Senior-Old’s standard error of group mean (use the data and model we’ve used in class).
- Aim: Test whether the group mean of senior-old significantly higher than the baseline (here, 0 score)

\mathbf{Test = \beta_0 +\beta_1Senior + \beta_2New +\beta_3Senior*New}

The group mean of Senior-Old is \beta_0 + \beta_1.
Based on algebra for variance, we know hat

Var(\beta_0 + \beta_1) = Var(\beta_0)+Var(\beta_1) +2*Cov(\beta_0, \beta_1)

vcor function provides the variances and covariances for \beta_0, \beta_1, \beta_2, and \beta_3

# R syntax
library(ESRM64503)
library(tidyverse)
model1 <- lm(Test ~ Senior + New + Senior * New, data = dataTestExperiment)
data("dataTestExperiment")
Var_beta_0 <- vcov(model1)[1,1]
Var_beta_1 <- vcov(model1)[2,2]
Cov_beta_0_1 <- vcov(model1)[1,2]
Var_Test_SeniorOld <- Var_beta_0 + Var_beta_1 + 2*Cov_beta_0_1
(SE_Test_SeniorOld <- sqrt(Var_Test_SeniorOld))

[1] 0.5364078

1.2 Question 3

Copy and paste your R syntax and R output that calculates the standard error of conditional main effect of New (New vs. Old) when Senior = 1.
- Aim: Test whether the conditional main effect of New when Senior significantly different from 0
  - In other words, when individuals are senior, whether new or old instruction method has significantly differences in their test scores

\beta_{2|Senior =1} = \beta_2 + \beta_3 * 1

Based on algebra for variance, we know that

Var(\beta_2 + \beta_3) = Var(\beta_2)+Var(\beta_3) +2*Cov(\beta_2, \beta_3)

vcor function provides the variances and covariances for \beta_0, \beta_1, \beta_2, and \beta_3

# R syntax
Var_beta_2 <- vcov(model1)[3,3]
Var_beta_3 <- vcov(model1)[4,4]
Cov_beta_2_3 <- vcov(model1)[3,4]
Var_betaofNew_Senior1 <- Var_beta_2 + Var_beta_3 + 2*Cov_beta_2_3
(SE_betaofNew_Senior1 <- sqrt(Var_betaofNew_Senior1))

[1] 0.7585952

2 Unit 1: Random Variables & Statistical Distribution

2.1 Definition of Random Variables

Random: situations in which the certainty of the outcome is unknown and is at least in part due to chance
Variable: a value that may change give the scope of a given problem or set of operations
Random Variable: a variable whose outcome depends on chance (possible values might represent the possible outcomes of a yet-to-be performed experiment)

Today, we will denote a random variable with a lower-cased: x

Question: which one in the following options is random variable:

any company’s revenue in 2024
one specific company’s monthly revenue in 2024
companies whose revenue over than $30 billions

My answer: only (a)

2.2 Types of Random Variables

Continuous
- Examples of continuous random variables:
  - x represent the height of a person, draw at random
  - Y (the outcome/DV in a GLM)
  - Some variables like exam score or motivation scores are not “true” continuous variables, but it is convenient to consider them as “continuous”
Discrete (also called categorical, generally)
- Example of discrete random variables:
  - x represents the gender of a person, drawn at random
  - Y (outcomes like yes/no; pass/not pass; master / not master a skill; survive / die)
Mixture of Continuous and Discrete:
- Example of mixture: \begin{equation} x = \begin{cases} RT & \text{between 0 and 45 seconds} \\ 0 & \text{otherwise} \end{cases} \end{equation}

2.3 Key Features of Random Variable

Random variables each are described by a probability density / mass function (PDF) – f(x)
- PDF indicates relative frequency of occurrence
- A PDF is a math function that gives rough picture of the distribution from which a random variable is draw
The type of random variable dictates the name and nature of these functions:
- Continuous random variables:
  - f(x) is called a probability density function
  - Area under curve must equal to 1 (found by calculus integration)
  - Height of curve (the function value f(x)):
    - Can be any positive number
    - Reflects relative likelihood of an observation occurring
- Discrete random variables:
  - f(x) is called a probability mass function
  - Sum across all values must equal 1

Note

Both max and min values of temperature can be considered as continuous random variables.

Question 1: what are the probabilities of integrating all values of max and min temperatures ({1, 1} or {0.5, 0.5})
Answer: it is {1, 1} for max and min temperatures. Because they are two separated random variables.

⌘+C

library(tidyverse)
temp <- read.csv("data/temp_fayetteville.csv")
temp$value_F <- (temp$value / 10 * 1.8) + 32
temp |> 
  ggplot() +
  geom_density(aes(x = value_F, fill = datatype), col = "white", alpha = .8) +
  labs(x = "Max/Min Temperature (F) at Fayetteville, AR (Sep-2023)", 
       caption = "Source: National Oceanic and Atmospheric Administration (https://www.noaa.gov/)") +
  scale_x_continuous(breaks = seq(min(temp$value_F), max(temp$value_F), by = 5)) +
  scale_fill_manual(values = c("tomato", "turquoise")) +
  theme_classic(base_size = 13)

2.4 Other key Terms

The sample space is the set of all values that a random variable x can take:
- Example 1: The sample space for a random variable x from a normal distribution x \sim N(\mu_x, \sigma^2_x) is (-\infty, +\infty).
- Example 2: The sample space for a random variable x representing the outcome of a coin flip is {H, T}
- Example 3: The sample space for a random variable x representing the outcome of a roll of a die is {1, 2, 3, 4, 5, 6}
When using generalized models, the trick is to pick a distribution with a sample space that matches the range of values obtainable by data
- Logistic regression - Match Bernoulli distribution (Example 2)
- Poisson regression - Match Poisson distribution (Example 3)

2.5 Uses of Distributions in Data Analysis

Statistical models make distributional assumptions on various parameters and / or parts of data
These assumptions govern:
- How models are estimated
- How inferences are made
- How missing data may be imputed
If data do not follow an assumed distribution, inferences may be inaccurate
- Sometimes a very big problem, other times not so much
Therefore, it can be helpful to check distributional assumptions prior to running statistical analysis

2.6 Continuous Univariate distributions

To demonstrate how continuous distributions work and look, we will discuss three:
- Uniform distribution
- Normal distribution
- Chi-square distribution
Each are described a set of parameters, which we will later see are what give us our inferences when we analysis data
What we then do is put constraints on those parameters based on hypothesized effects in data

2.7 Uniform distribution

The uniform distribution is how to help set up how continuous distributions work
- Typically, used for simulation studies that parameters are randomly generated
For a continuous random variable x that ranges from (a, b), the uniform probability density function is:

f(x) = \frac{1}{b-a}

The uniform distribution has two parameters

x = seq(0, 3, .1)
y = dunif(x, min = 0, max = 3)
ggplot() +
  geom_point(aes(x = x, y = y)) +
  geom_path(aes(x = x, y = y)) +
  theme_bw()

#| standalone: true
#| viewerHeight: 800
library(shiny)
library(bslib)
library(ggplot2)
library(dplyr)
library(tidyr)
set.seed(1234)
# Define UI for application that draws a histogram
ui <- fluidPage(
  
  # Application title
  titlePanel("Uniform distribution"),
  
  # Sidebar with a slider input for number of bins 
  sidebarLayout(
    sidebarPanel(
      sliderInput("a",
                  "Lower Bound (a):",
                  min = 1,
                  max = 20,
                  value = 1,
                  animate = animationOptions(interval = 5000, loop = TRUE)),
      uiOutput("b_slider")
    ),
    
    # Show a plot of the generated distribution
    mainPanel(
      plotOutput("distPlot")
    )
  )
)

# Define server logic required to draw a histogram
x <- seq(0, 40, .02)
server <- function(input, output) {
  observeEvent(input$a, {
    output$b_slider <<- renderUI({
      sliderInput("b",
                  "Upper Bound (b):",
                  min = as.numeric(input$a),
                  max = 40,
                  value = as.numeric(input$a) + 1)
    })
    # browser()
    y <<- reactive({dunif(x, min = as.numeric(input$a), max = as.numeric(input$b))})
  })
  # browser()
  observe({
    output$distPlot <- renderPlot({
      # generate bins based on input$bins from ui.R
      ggplot() +
        aes(x = x, y = y())+
        geom_point() +
        geom_path() +
        labs(x = "x", y = "probability") +
        theme_bw() +
        theme(text = element_text(size = 20))
    })
  })
}

# Run the application 
shinyApp(ui = ui, server = server)

2.8 More on the Uniform Distribution

To demonstrate how PDFs work, we will try a few values:

conditions <- tribble(
   ~x, ~a, ~b,
   .5,  0,  1,
  .75,  0,  1,
   15,  0, 20,
   15, 10, 20
) |> 
  mutate(y = dunif(x, min = a, max = b))
conditions

# A tibble: 4 × 4
      x     a     b     y
  <dbl> <dbl> <dbl> <dbl>
1  0.5      0     1  1   
2  0.75     0     1  1   
3 15        0    20  0.05
4 15       10    20  0.1

The uniform PDF has the feature that all values of x are equally likely across the sample space of the distribution
- Therefore, you do not see x in the PDF f(x)
The mean of the uniform distribution is \frac{1}{2}(a+b)
The variance of the uniform distribution is \frac{1}{12}(b-a)^2

2.9 Univariate Normal Distribution

For a continuous random variable x (ranging from -\infty to \infty), the univariate normal distribution function is:

f(x) = \frac1{\sqrt{2\pi\sigma^2_x}}\exp(-\frac{(x-\mu_x)^2}{2\sigma^2_x})

The shape of the distribution is governed by two parameters:
- The mean \mu_x
- The variance \sigma^2_x
- These parameters are called sufficient statistics (they contain all the information about the distribution)
The skewness (lean) and kurtosis (peakedness) are fixed
Standard notation for normal distributions is x\sim N(\mu_x, \sigma^2_x)
- Read as: “x follows a normal distribution with a mean \mu_x and a variance \sigma^2_x”
Linear combinations of random variables following normal distributions result in a random variable that is normally distributed

2.10 Univariate Normal Distribution in R: pnorm

Density (dnorm), distribution function (pnorm), quantile function (qnorm) and random generation (rnorm) for the normal distribution with mean equal to mean and standard deviation equal to sd.

Z = seq(-5, 5, .1) # Z-score
ggplot() +
  aes(x = Z, y = pnorm(q = Z, lower.tail = TRUE)) +
  geom_point() +
  geom_path() +
  labs(x = "Z-score", y = "Cumulative probability",
       title = "`pnorm()` gives the cumulative probability function P(X < T)")

2.11 Univariate Normal Distribution in R: dnorm

x <- seq(-5, 5, .1)

params <- list(
  y_set1 = c(mu =  0, sigma2 = 0.2),
  y_set2 = c(mu =  0, sigma2 = 1.0),
  y_set3 = c(mu =  0, sigma2 = 5.0),
  y_set4 = c(mu = -2, sigma2 = 0.5)
)

y <- sapply(params, function(param) dnorm(x, mean = param['mu'], sd = param['sigma2']))

dt <- cbind(x, y) |> 
  as.data.frame() |> 
  pivot_longer(starts_with("y_"))

ggplot(dt) +
  geom_path(aes(x = x, y = value, color = name, group = name), linewidth = 1.3) +
  scale_color_manual(values = 1:4,
                     name = "",
                     labels = c('y_set1' = expression(mu*"=0, "*sigma^2*"=.02"),
                                'y_set2' = expression(mu*"=0, "*sigma^2*"=1.0"),
                                'y_set3' = expression(mu*"=0, "*sigma^2*"=5.0"),
                                'y_set4' = expression(mu*"=-2, "*sigma^2*"=0.5"))
                     ) +
  theme_bw() +
  theme(legend.position = "top", text = element_text(size = 13))

2.12 Chi-Square Distribution

Another frequently used univariate distribution is the Chi-square distribution
- Sampling distribution of the variance follows a chi-square distribution
- Likelihood ratios follow a chi-square distribution
For a continuous random variable x (ranging from 0 to \infty), the chi-square distribution is given by:

f(x) =\frac1{2^{\frac{\upsilon}{2}} \Gamma(\frac{\upsilon}{2})} x^{\frac{\upsilon}{2}-1} \exp(-\frac{x}2)
\Gamma(\cdot) is called the gamma function
The chi-square distribution is govern by one parameter: \upsilon (the degrees of freedom)
- The mean is equal to \upsilon; the variance is equal to 2\upsilon

x <- seq(0.01, 15, .01)
df <- c(1, 2, 3, 5, 10)
dt2 <- as.data.frame(sapply(df, \(df) dchisq(x, df = df)))
dt2_with_x <- cbind(x = x, dt2)
dt2_with_x |> 
  pivot_longer(starts_with("V")) |> 
  ggplot() +
  geom_path(aes(x = x, y = value, color = name), linewidth = 1.2) +
  scale_y_continuous(limits = c(0, 1)) +
  scale_color_discrete(name = "", labels = paste("df =", df)) +
  labs(y = "f(x)") +
  theme_bw() +
  theme(legend.position = "top", text = element_text(size = 13))

#| standalone: true
#| viewerHeight: 800
library(shiny)
library(bslib)
library(ggplot2)
library(dplyr)
library(tidyr)
set.seed(1234)
# Define UI for application that draws a histogram
ui <- fluidPage(

    # Application title
    titlePanel("Chi-square distribution"),

    # Sidebar with a slider input for number of bins 
    sidebarLayout(
        sidebarPanel(
            sliderInput("Df",
                        "Degree of freedom:",
                        min = 1,
                        max = 20,
                        value = 1,
                        animate = animationOptions(interval = 5000, loop = TRUE)),
           verbatimTextOutput(outputId = "Df_display")
        ),

        # Show a plot of the generated distribution
        mainPanel(
           plotOutput("distPlot")
        )
    )
)

# Define server logic required to draw a histogram
server <- function(input, output) {
    x <- seq(0.01, 15, .01)
    df <- reactive({input$Df}) 
    
    output$Df_display <- renderText({
        paste0("DF = ", df())
    })
    output$distPlot <- renderPlot({
        # generate bins based on input$bins from ui.R
        
        
        dt2 <- as.data.frame(sapply(df(), \(df) dchisq(x, df = df)))
        dt2_with_x <- cbind(x = x, dt2)
        dt2_with_x |> 
            pivot_longer(starts_with("V")) |> 
            ggplot() +
            geom_path(aes(x = x, y = value), linewidth = 1.2) +
            scale_y_continuous(limits = c(0, 1)) +
            labs(y = "f(x)") +
            theme_bw() +
            theme(legend.position = "top", text = element_text(size = 13))
    })
}

# Run the application 
shinyApp(ui = ui, server = server)

3 Marginal, Joint, And Conditional Distribution

3.1 Moving from One to Multiple Random Variables

When more than one random variable is present, there are several different types of statistical distributions:
We will first consider two discrete random variables：
- x is the outcome of the flip of a penny {H_p, T_p}
  - f(x=H_p) = .5; f(x =T_p) = .5
- z is the outcome of the flip of a dime {H_d, T_d}
  - f(z=H_p) = .5; f(z =T_p) = .5
We will consider the following distributions:
- Marginal distribution
  - The distribution of one variable only (either f(x) or f(z))
- Joint distribution
  - f(x, z): the distribution of both variables (both x and z)
- Conditional distribution
  - The distribution of one variable, conditional on values of the other:
    - f(x|z): the distribution of x given z
    - f(z|x): the distribution of z given x

3.2 Marginal Distribution

Marginal distributions are what we have worked with exclusively up to this point: they represent the distribution by itself
- Continuous univariate distributions
- Categorical distributions
  - The flip of a penny
  - The flip of a dime

3.3 Joint Distribution

Joint distributions describe the distribution of more than one variable, simultaneously
- Representations of multiple variables collected
Commonly, the joint distribution function is denoted with all random variables separated by commas
- In our example, f(x,z) is the joint distribution of the outcome of flipping both a penny and a dime
  - As both are discrete, the joint distribution has four possible values:
    
    (1) f(x = H_p,z=H_d) (2) f(x = H_p,z=T_d) (3) f(x = T_p,z=H_d) (4) f(x = T_p,z=T_d)
Joint distributions are multivariate distributions
We will use joint distributions to introduce two topics
- Joint distributions of independent variables
- Joint distributions – used in maximum likelihood estimation

3.4 Joint Distributions of Independent Random Variables

Random variables are said to be independent if the occurrence of one event makes it neither more nor less probable of another event
- For joint distributions, this means: f(x,z)=f(x)f(z)

In our example, flipping a penny and flipping a dime are independent – so we can complete the following table of their joint distribution:

	z = H_d	z = T_d	Marginal (Penny)
x = H_p	\color{tomato}{f(x=H_p, z=H_d)}	\color{tomato}{f(x=H_p, z=T_d)}	\color{turquoise}{f(z=H_p)}
x = T_p	\color{tomato}{f(x= T_p, z=H_d)}	\color{tomato}{f(x=T_p, z=T_d)}	\color{turquoise}{f(z=T_d)}
Marginal (Dime)	\color{turquoise}{f(z=H_d)}	\color{turquoise}{f(z=T_d)}

3.5 Joint Distributions of Independent Random Variables

Because the coin flips are independent, this because:

	z = H_d	z = T_d	Marginal (Penny)
x = H_p	\color{tomato}{f(x=H_p)f( z=H_d)}	\color{tomato}{f(x=H_p)f( z=T_d)}	\color{turquoise}{f(z=H_p)}
x = T_p	\color{tomato}{f(x= T_p)f( z=H_d)}	\color{tomato}{f(x=T_p)f( z=T_d)}	\color{turquoise}{f(z=T_d)}
Marginal (Dime)	\color{turquoise}{f(z=H_d)}	\color{turquoise}{f(z=T_d)}

Then, with numbers:

	z = H_d	z = T_d	Marginal (Penny)
x = H_p	\color{tomato}{.25}	\color{tomato}{.25}	\color{turquoise}{.5}
x = T_p	\color{tomato}{.25}	\color{tomato}{.25}	\color{turquoise}{.5}
Marginal (Dime)	\color{turquoise}{.5}	\color{turquoise}{.5}

3.6 Marginalizing Across a Joint Distribution

If you had a joint distribution, \color{orchid}{f(x, z)}, but wanted the marginal distribution of either variable (f(x) or f(z)) you would have to marginalize across one dimension of the joint distribution.
For categorical random variables, marginalize = sum across every value of z

f(x) = \sum_zf(x, z)

For example, f(x = H_p) = f(x = H_p, z=H_d) +f(x = H_p, z=T_d)=.5
For continuous random variables, marginalize = integrate across z
- The integral:
  
  f(x) = \int_zf(x,z)dz

3.7 Conditional Distributions

For two random variables x and z, a conditional distribution is written as: f(z|x)
- The distribution of z given x
The conditional distribution is equal to the joint distribution divided by the marginal distribution of the conditioning random variable

f(z|x) = \frac{f(z,x)}{f(x)}
Conditional distributions are found everywhere in statistics
- The general linear model uses the conditional distribution variable
  
  Y \sim N(\beta_0+\beta_1X, \sigma^2_e)

3.8 Example: Conditional Distribution

For two discrete random variables with {0, 1} values, the conditional distribution can be shown in a contingency table:

	z = H_d	z = T_d	Marginal (Penny)
x = H_p	\color{tomato}{.25}	\color{tomato}{.25}	\color{turquoise}{.5}
x = T_p	\color{tomato}{.25}	\color{tomato}{.25}	\color{turquoise}{.5}
Marginal (Dime)	\color{turquoise}{.5}	\color{turquoise}{.5}

Conditional: f(z | x= H_p):

f(z=H_d|x =H_p) = \frac{f(z=H_d, x=H_p}{f(x = H_p)} = \frac{.25}{.5}=.5

f(z = T_d | x = H_p)= \frac{f(z=T_d, x=H_p}{f(x=H_p)} = \frac{.25}{.5} = .5

4 Expected Values and The Algebra of Expectation

4.1 Expected Values

Expected values are statistics taken the sample space of a random variable: they are essentially weighted averages

set.seed(1234)
x = rnorm(100, mean = 0, sd = 1)
weights = dnorm(x, mean = 0, sd = 1)
mean(weights * x)

[1] -0.05567835

The weights used in computing this average correspond to the probabilities (for a discrete random variable) or to the densities (for a continuous random variable)

Note

The expected value is represented by E(x)

The actual statistic that is being weighted by the PDF is put into the parenthesis where x is now

Expected values allow us to understand what a statistical model implies about data, for instance:
- How a GLM specifies the (conditional) mean and variance of a DV

4.2 Expected Value Calculation

For discrete random variables, the expected value is found by:

E(x) = \sum_x xP(X=x)
For example, the expected value of a roll of a die is:

E(x) = (1)\frac16+ (2)\frac16+(3)\frac16+(4)\frac16+(5)\frac16+6\frac16
For continuous random variables, the expected value is found by

E(x) = \int_x xf(x)dx
We won’t be calculating theoretical expected values with calculus… we use them only to see how models imply things about out data

4.3 Variance and Covariance… As Expected Values

A distribution’s theoretical variance can also be written as an expected value:

V(x) = E(x-E(x))^2 = E(x -\mu_x)^2
- This formula helps us understand predictions made by GLMs and how that corresponds to statistical parameters we interpret
For a roll of a die, the theoretical variance is:

V(x) = E(x - 3.5)^2 = \frac16(1-3.5)^2 + \frac16(2-3.5)^2 + \frac16(3-3.5)^2 + \frac16(4-3.5)^2 + \frac16(5-3.5)^2 + \frac16(6-3.5)^2 = 2.92
Likewise, for a pair of random variable x and z, the covariance can be found from their joint distributions:

\text{Cov}(x,z)=E(xz)-E(x)E(z) = E(xz)-\mu_x\mu_z

set.seed(1234)
N = 100
x = rnorm(N, 2, 1)
z = rnorm(N, 3, 1)
xz = x*z
(Cov_xz = mean(xz)-mean(x)*mean(z)) / ((N-1)/N) # unbiased covariance
cov(x, z)

[1] -0.02631528
[1] -0.02631528

5 Linear Combination of Random Variables

5.1 Linear Combinations of Random Variables

A linear combination is an expression constructed from a set of terms by multiplying each term by a constant and then adding the results x = c + a_1z_1+a_2z_2+a_3z_3 +\cdots+a_nz_n
More generally, linear combinations of random variables have specific implications for the mean, variance, and possibly covariance of the new random variable
As such, there are predicable ways in which the means, variances, and covariances change

5.2 Algebra of Expectations

Sums of Constants

E(x+c) = E(x)+c \\ \text{Var}(x+c) = \text{Var}(x) \\ \text{Cov}(x+c, z) = \text{Cov}(x, z)

5.2.1 Example

Imagine for weight variable, each individual increases 3 lbs:

set.seed(1234)
library(ESRM64503)
x = dataSexHeightWeight$weightLB
c_ = 3
z = dataSexHeightWeight$heightIN
mean(x + c_); mean(x)+c_

[1] 186.4

[1] 186.4

var(x + c_); var(x)

[1] 3179.095

[1] 3179.095

cov(x, z); cov(x+c_, z) # decimal place issue, near() accepts two values' diff less than .00001

[1] 334.8316

[1] 334.8316

Products of Constants:

E(cx) = cE(x) \\ \text{Var}(cx) = c^2\text{Var}(x) \\ \text{Cov}(cx, dz) = c*d*\text{Cov}(x, z) Imagine you wanted to convert weight from pounds to kilograms (where 1 pound = .453 kg) and convert height from inches to cm (where 1 inch = 2.54 cm)

c_ = .453
mean(x*c_); mean(x)*c_

[1] 83.0802

[1] 83.0802

var(x*c_); var(x)*c_^2

[1] 652.3789

[1] 652.3789

d_ = 2.54
cov(c_*x, d_*z);c_*d_*cov(x, z)

[1] 385.2639

[1] 385.2639

Sums of Multiple Random Variables:

E(cx+dz) = cE(x) + dE(z) \\ \text{Var}(cx+dz) = c^2\text{Var}(x) + d^2\text{Var}(z) + 2c*d*\text{Cov}(x,z)\\

mean(x*c_+z*d_); mean(x)*c_+mean(z)*d_

[1] 255.5462

[1] 255.5462

var(x*c_+z*d_); c_^2*var(x) + d_^2*var(z) + 2*c_*d_*cov(x, z)

[1] 1780.054

[1] 1780.054

5.3 Where We Use This Algebra

Remember how we calculated the standard error of conditional main effect from simple main effect and interaction effect

\mathbf{Test = 82.20 + 2.16*Senior + 7.76*New - 3.04*Senior*New}

Conditional Main Effects of Senior When New is 1 is (2.16 - 3.04) = -0.88
- We know that Var(\beta_{Senior}) is 0.575, Var(\beta_{Senior*New}) is 1.151, and Cov(\beta_{Senior}, \beta_{Senior*New}) is -0.575.

model1 <- lm(Test ~ Senior + New + Senior * New, data = dataTestExperiment)
vcov(model1)

            (Intercept)     Senior        New Senior:New
(Intercept)   0.2877333 -0.2877333 -0.2877333  0.2877333
Senior       -0.2877333  0.5754667  0.2877333 -0.5754667
New          -0.2877333  0.2877333  0.5754667 -0.5754667
Senior:New    0.2877333 -0.5754667 -0.5754667  1.1509333

Then,

\mathbf{Var(\beta_{Senior}+\beta_{Senior*New}) = Var(\beta_{Senior}) + Var(\beta_{Senior*New}) + 2Cov(\beta_{Senior},\beta_{Senior*New}) \\ = 0.575 + 1.151 - 2* 0.575 = 0.576} Thus, SE(\beta_{Senior}+\beta_{Senior*New})=0.759

5.4 Combining the GLM with Expections

Using the algebra of expectations predicting Y from X and Z:

\hat{Y}_p=E(Y_p)=E(\beta_0+\beta_1X_p+\beta_2Z_p+\beta_3X_pZ_p+e_p) \\ = \beta_0+\beta_1X_p+\beta_2Z_p+\beta_3X_pZ_p+E(e_p) \\ = \beta_0+\beta_1X_p+\beta_2Z_p+\beta_3X_pZ_p

The variance of f(Y_p|X_p, Z_p):

V(Y_P)=V(\beta_0+\beta_1X_p+\beta_2Z_p+\beta_3X_pZ_p+e_p)\\ = V(e_p)=\sigma_e^2

5.5 Examining What This Means in the Context of Data

If you recall from the regression analysis of the height/weight data, the final model we decided to interpret: Model 5

W_p = \beta_0+\beta_1 (H_p-\bar{H}) +\beta_2F_p+\beta_3 (H_p-\bar{H}) F_p + e_p

where e_p \sim N(0, \sigma_e^2)

dat <- dataSexHeightWeight
dat$heightIN_MC <- dat$heightIN - mean(dat$heightIN)
dat$female <- dat$sex == 'F'
mod5 <- lm(weightLB ~ heightIN_MC + female + female*heightIN_MC, data = dat)
summary(mod5)


Call:
lm(formula = weightLB ~ heightIN_MC + female + female * heightIN_MC, 
    data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8312 -1.7797  0.4958  1.3575  3.3585 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            222.1842     0.8381  265.11  < 2e-16 ***
heightIN_MC              3.1897     0.1114   28.65 3.55e-15 ***
femaleTRUE             -82.2719     1.2111  -67.93  < 2e-16 ***
heightIN_MC:femaleTRUE  -1.0939     0.1678   -6.52 7.07e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.175 on 16 degrees of freedom
Multiple R-squared:  0.9987,    Adjusted R-squared:  0.9985 
F-statistic:  4250 on 3 and 16 DF,  p-value: < 2.2e-16

5.6 Picturing the GLM with Distributions

The distributional assumptions of the GLM are the reason why we do not need to worry if our dependent variable is normally distributed
Our dependent variable should be conditionally normal
We can check this assumption by checking our assumption about the residuals, e_p \sim N(0, \sigma^2_e)

5.7 Assessing Distributional Assumptions Graphically

plot(mod5)

5.8 Hypothesis Tests for Normality

If a given test is significant, then it is saying that your data do not come from a normal distribution

In practice, test will give diverging information quite frequently: the best way to evaluate normality is to consider both plots and tests (approximate = good)

shapiro.test(mod5$residuals)


    Shapiro-Wilk normality test

data:  mod5$residuals
W = 0.95055, p-value = 0.3756

5.9 Wrapping Up

Today was an introduction to mathematical statistics as a way to understand the implications statistical models make about data
Although many of these topics do not seem directly relevant, they help provide insights that untrained analysts may not easily attain

--- title: "Lecture 04: Distribution and Estimation" author: "Jihong Zhang*, Ph.D" institute: | Educational Statistics and Research Methods (ESRM) Program* University of Arkansas date: "2024-09-16" sidebar: false execute: echo: true warning: false output-location: column format: html: page-layout: full toc: true toc-depth: 2 toc-expand: true lightbox: true code-fold: false uark-revealjs: scrollable: true chalkboard: true embed-resources: false code-fold: false number-sections: false footer: "ESRM 64503: Lecture 04: Distribution and Estimation" slide-number: c/t tbl-colwidths: auto output-file: slides-index.html # filters: # - shinylive --- ## Today's Class - Review Homework 1 - [The building blocks]{style="color: tomato"}: **The basics of mathematical statistics:** - Random variables: Definition & Types - Univariate distribution - General terminology (e.g., sufficient statistics) - Univariate normal (aka Gaussian) - Other widely used univariate distributions - Types of distributions: Marginal \| Conditional \| Joint - Expected values: means, variances, and the algebra of expectations - Linear combinations of random variables - [The finished product]{style="color: violet"}: **How the GLM fits within statistics** - The GLM with the normal distribution - The statistical assumptions of the GLM - How to assess these assumptions # ESRM 64503: Homework 1 ## Question 2 <details> - Copy and paste your R syntax and R output that calculates the group **Senior-Old's standard error of group mean** (use the data and model we've used in class). - **Aim**: Test whether the group mean of senior-old significantly higher than the baseline (here, 0 score) $$ \mathbf{Test = \beta_0 +\beta_1Senior + \beta_2New +\beta_3Senior*New} $$ - The group mean of Senior-Old is $\beta_0 + \beta_1$. - Based on algebra for variance, we know hat $$ Var(\beta_0 + \beta_1) = Var(\beta_0)+Var(\beta_1) +2*Cov(\beta_0, \beta_1) $$ - `vcor` function provides the variances and covariances for $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$ ```{r} #| output-location: default # R syntax library(ESRM64503) library(tidyverse) model1 <- lm(Test ~ Senior + New + Senior * New, data = dataTestExperiment) data("dataTestExperiment") Var_beta_0 <- vcov(model1)[1,1] Var_beta_1 <- vcov(model1)[2,2] Cov_beta_0_1 <- vcov(model1)[1,2] Var_Test_SeniorOld <- Var_beta_0 + Var_beta_1 + 2*Cov_beta_0_1 (SE_Test_SeniorOld <- sqrt(Var_Test_SeniorOld)) ``` </details> ## Question 3 <details> - Copy and paste your R syntax and R output that calculates the standard error of conditional main effect of New (New vs. Old) when Senior = 1. - **Aim**: Test whether the conditional main effect of New when Senior significantly different from 0 - In other words, when individuals are senior, whether new or old instruction method has significantly differences in their test scores $$ \beta_{2|Senior =1} = \beta_2 + \beta_3 * 1 $$ - Based on algebra for variance, we know that $$ Var(\beta_2 + \beta_3) = Var(\beta_2)+Var(\beta_3) +2*Cov(\beta_2, \beta_3) $$ - `vcor` function provides the variances and covariances for $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$ ```{r} #| output-location: default # R syntax Var_beta_2 <- vcov(model1)[3,3] Var_beta_3 <- vcov(model1)[4,4] Cov_beta_2_3 <- vcov(model1)[3,4] Var_betaofNew_Senior1 <- Var_beta_2 + Var_beta_3 + 2*Cov_beta_2_3 (SE_betaofNew_Senior1 <- sqrt(Var_betaofNew_Senior1)) ``` </details> # Unit 1: Random Variables & Statistical Distribution ## Definition of Random Variables - [**Random:**]{.underline} situations in which the certainty of the outcome is unknown and is at least in part due to chance - [**Variable:**]{.underline} a value that may change give the scope of a given problem or set of operations - [**Random Variable**]{.underline}: a variable whose outcome depends on chance (possible values might represent the possible outcomes of a yet-to-be performed experiment) Today, we will denote a random variable with a lower-cased: ***x*** **Question**: which one in the following options is random variable: a. any company's revenue in 2024 b. one specific company's monthly revenue in 2024 c. companies whose revenue over than $30 billions **My answer**: only (a) ## Types of Random Variables 1. **Continuous** - Examples of continuous random variables: - ***x*** represent the height of a person, draw at random - *Y* (the outcome/DV in a GLM) - Some variables like **exam score or motivation scores** are not "true" continuous variables, but it is convenient to consider them as "continuous" 2. **Discrete** (also called categorical, generally) - Example of discrete random variables: - ***x*** represents the gender of a person, drawn at random - *Y* (outcomes like yes/no; pass/not pass; master / not master a skill; survive / die) 3. **Mixture of Continuous and Discrete**: - Example of mixture: \begin{equation} x = \begin{cases} RT & \text{between 0 and 45 seconds} \\ 0 & \text{otherwise} \end{cases} \end{equation} ## Key Features of Random Variable 1. Random variables each are described by a **probability density / mass function (PDF) –** $f(x)$ - PDF indicates relative frequency of occurrence - A PDF is a math function that gives rough picture of the distribution from which a random variable is draw 2. The type of random variable dictates the name and nature of these functions: - [Continuous random variables]{style="color: tomato"}: - $f(x)$ is called a probability density function - Area under curve must equal to 1 (found by calculus integration) - Height of curve (the function value $f(x)$): - Can be any positive number - Reflects relative likelihood of an observation occurring - [Discrete random variables]{style="color: violet"}: - $f(x)$ is called a probability mass function - Sum across all values must equal 1 ------------------------------------------------------------------------ ::: {.callout-note style="font-size: 1.4em;"} Both max and min values of temperature can be considered as continuous random variables. ::: - Question 1: what are the probabilities of integrating all values of max and min temperatures ({1, 1} or {0.5, 0.5}) - **Answer**: it is {1, 1} for max and min temperatures. Because they are two separated random variables. ```{r} #| code-fold: true #| output-location: default #| fig-align: center #| fig-asp: .5 library(tidyverse) temp <- read.csv("data/temp_fayetteville.csv") temp$value_F <- (temp$value / 10 * 1.8) + 32 temp |> ggplot() + geom_density(aes(x = value_F, fill = datatype), col = "white", alpha = .8) + labs(x = "Max/Min Temperature (F) at Fayetteville, AR (Sep-2023)", caption = "Source: National Oceanic and Atmospheric Administration (https://www.noaa.gov/)") + scale_x_continuous(breaks = seq(min(temp$value_F), max(temp$value_F), by = 5)) + scale_fill_manual(values = c("tomato", "turquoise")) + theme_classic(base_size = 13) ``` ## Other key Terms - The sample space is the set of all values that a random variable x can take: - **Example 1:** The sample space for a random variable x from a normal distribution $x \sim N(\mu_x, \sigma^2_x)$ is $(-\infty, +\infty)$. - **Example 2:** The sample space for a random variable x representing the outcome of a coin flip is {H, T} - **Example 3:** The sample space for a random variable x representing the outcome of a roll of a die is {1, 2, 3, 4, 5, 6} - When using generalized models, the trick is to pick a distribution with a sample space that matches the range of values obtainable by data - Logistic regression - Match Bernoulli distribution (Example 2) - Poisson regression - Match Poisson distribution (Example 3) ## Uses of Distributions in Data Analysis - Statistical models make distributional assumptions on various parameters and / or parts of data - These assumptions govern: - How models are estimated - How inferences are made - How missing data may be imputed - If data do not follow an assumed distribution, inferences may be inaccurate - Sometimes a very big problem, other times not so much - Therefore, it can be helpful to check distributional assumptions prior to running statistical analysis ## Continuous Univariate distributions - To demonstrate how continuous distributions work and look, we will discuss three: - Uniform distribution - Normal distribution - Chi-square distribution - Each are described a set of parameters, which we will later see are what give us our inferences when we analysis data - What we then do is put constraints on those parameters based on hypothesized effects in data ## Uniform distribution - The uniform distribution is how to help set up how continuous distributions work - Typically, used for simulation studies that parameters are randomly generated - For a continuous random variable x that ranges from (a, b), the uniform probability density function is: $f(x) = \frac{1}{b-a}$ - The uniform distribution has two parameters ```{r} x = seq(0, 3, .1) y = dunif(x, min = 0, max = 3) ggplot() + geom_point(aes(x = x, y = y)) + geom_path(aes(x = x, y = y)) + theme_bw() ``` ------------------------------------------------------------------------ ```{shinylive-r} #| standalone: true #| viewerHeight: 800 library(shiny) library(bslib) library(ggplot2) library(dplyr) library(tidyr) set.seed(1234) # Define UI for application that draws a histogram ui <- fluidPage( # Application title titlePanel("Uniform distribution"), # Sidebar with a slider input for number of bins sidebarLayout( sidebarPanel( sliderInput("a", "Lower Bound (a):", min = 1, max = 20, value = 1, animate = animationOptions(interval = 5000, loop = TRUE)), uiOutput("b_slider") ), # Show a plot of the generated distribution mainPanel( plotOutput("distPlot") ) ) ) # Define server logic required to draw a histogram x <- seq(0, 40, .02) server <- function(input, output) { observeEvent(input$a, { output$b_slider <<- renderUI({ sliderInput("b", "Upper Bound (b):", min = as.numeric(input$a), max = 40, value = as.numeric(input$a) + 1) }) # browser() y <<- reactive({dunif(x, min = as.numeric(input$a), max = as.numeric(input$b))}) }) # browser() observe({ output$distPlot <- renderPlot({ # generate bins based on input$bins from ui.R ggplot() + aes(x = x, y = y())+ geom_point() + geom_path() + labs(x = "x", y = "probability") + theme_bw() + theme(text = element_text(size = 20)) }) }) } # Run the application shinyApp(ui = ui, server = server) ``` ## More on the Uniform Distribution - To demonstrate how PDFs work, we will try a few values: ```{r} conditions <- tribble( ~x, ~a, ~b, .5, 0, 1, .75, 0, 1, 15, 0, 20, 15, 10, 20 ) |> mutate(y = dunif(x, min = a, max = b)) conditions ``` - The uniform PDF has the feature that all values of ***x*** are equally likely across the sample space of the distribution - Therefore, you do not see ***x*** in the PDF $f(x)$ - The mean of the uniform distribution is $\frac{1}{2}(a+b)$ - The variance of the uniform distribution is $\frac{1}{12}(b-a)^2$ ## Univariate Normal Distribution - For a continuous random variable ***x*** (ranging from $-\infty$ to $\infty$), the univariate normal distribution function is: $$ f(x) = \frac1{\sqrt{2\pi\sigma^2_x}}\exp(-\frac{(x-\mu_x)^2}{2\sigma^2_x}) $$ - The shape of the distribution is governed by two parameters: - The mean $\mu_x$ - The variance $\sigma^2_x$ - These parameters are called **sufficient statistics** (they contain all the information about the distribution) - The skewness (lean) and kurtosis (peakedness) are fixed - Standard notation for normal distributions is $x\sim N(\mu_x, \sigma^2_x)$ - Read as: "*x* follows a normal distribution with a mean $\mu_x$ and a variance $\sigma^2_x$" - Linear combinations of random variables following normal distributions result in a random variable that is normally distributed ## Univariate Normal Distribution in R: pnorm Density (`dnorm`), distribution function (`pnorm`), quantile function (`qnorm`) and random generation (`rnorm`) for the normal distribution with mean equal to `mean` and standard deviation equal to `sd`. ```{r} Z = seq(-5, 5, .1) # Z-score ggplot() + aes(x = Z, y = pnorm(q = Z, lower.tail = TRUE)) + geom_point() + geom_path() + labs(x = "Z-score", y = "Cumulative probability", title = "`pnorm()` gives the cumulative probability function P(X < T)") ``` ## Univariate Normal Distribution in R: dnorm ```{r} x <- seq(-5, 5, .1) params <- list( y_set1 = c(mu = 0, sigma2 = 0.2), y_set2 = c(mu = 0, sigma2 = 1.0), y_set3 = c(mu = 0, sigma2 = 5.0), y_set4 = c(mu = -2, sigma2 = 0.5) ) y <- sapply(params, function(param) dnorm(x, mean = param['mu'], sd = param['sigma2'])) dt <- cbind(x, y) |> as.data.frame() |> pivot_longer(starts_with("y_")) ggplot(dt) + geom_path(aes(x = x, y = value, color = name, group = name), linewidth = 1.3) + scale_color_manual(values = 1:4, name = "", labels = c('y_set1' = expression(mu*"=0, "*sigma^2*"=.02"), 'y_set2' = expression(mu*"=0, "*sigma^2*"=1.0"), 'y_set3' = expression(mu*"=0, "*sigma^2*"=5.0"), 'y_set4' = expression(mu*"=-2, "*sigma^2*"=0.5")) ) + theme_bw() + theme(legend.position = "top", text = element_text(size = 13)) ``` ## Chi-Square Distribution - Another frequently used univariate distribution is the Chi-square distribution - Sampling distribution of the variance follows a chi-square distribution - Likelihood ratios follow a chi-square distribution - For a continuous random variable *x* (ranging from 0 to $\infty$), the chi-square distribution is given by: $$ f(x) =\frac1{2^{\frac{\upsilon}{2}} \Gamma(\frac{\upsilon}{2})} x^{\frac{\upsilon}{2}-1} \exp(-\frac{x}2) $$ - $\Gamma(\cdot)$ is called the gamma function - The chi-square distribution is govern by one parameter: $\upsilon$ (the degrees of freedom) - The mean is equal to $\upsilon$; the variance is equal to 2$\upsilon$ ------------------------------------------------------------------------ ```{r} x <- seq(0.01, 15, .01) df <- c(1, 2, 3, 5, 10) dt2 <- as.data.frame(sapply(df, \(df) dchisq(x, df = df))) dt2_with_x <- cbind(x = x, dt2) dt2_with_x |> pivot_longer(starts_with("V")) |> ggplot() + geom_path(aes(x = x, y = value, color = name), linewidth = 1.2) + scale_y_continuous(limits = c(0, 1)) + scale_color_discrete(name = "", labels = paste("df =", df)) + labs(y = "f(x)") + theme_bw() + theme(legend.position = "top", text = element_text(size = 13)) ``` ------------------------------------------------------------------------ ```{shinylive-r} #| standalone: true #| viewerHeight: 800 library(shiny) library(bslib) library(ggplot2) library(dplyr) library(tidyr) set.seed(1234) # Define UI for application that draws a histogram ui <- fluidPage( # Application title titlePanel("Chi-square distribution"), # Sidebar with a slider input for number of bins sidebarLayout( sidebarPanel( sliderInput("Df", "Degree of freedom:", min = 1, max = 20, value = 1, animate = animationOptions(interval = 5000, loop = TRUE)), verbatimTextOutput(outputId = "Df_display") ), # Show a plot of the generated distribution mainPanel( plotOutput("distPlot") ) ) ) # Define server logic required to draw a histogram server <- function(input, output) { x <- seq(0.01, 15, .01) df <- reactive({input$Df}) output$Df_display <- renderText({ paste0("DF = ", df()) }) output$distPlot <- renderPlot({ # generate bins based on input$bins from ui.R dt2 <- as.data.frame(sapply(df(), \(df) dchisq(x, df = df))) dt2_with_x <- cbind(x = x, dt2) dt2_with_x |> pivot_longer(starts_with("V")) |> ggplot() + geom_path(aes(x = x, y = value), linewidth = 1.2) + scale_y_continuous(limits = c(0, 1)) + labs(y = "f(x)") + theme_bw() + theme(legend.position = "top", text = element_text(size = 13)) }) } # Run the application shinyApp(ui = ui, server = server) ``` # Marginal, Joint, And Conditional Distribution ## Moving from One to Multiple Random Variables - When more than one random variable is present, there are several different types of statistical distributions: - We will first consider two discrete random variables： - *x* is the outcome of the flip of a penny {$H_p$, $T_p$} - $f(x=H_p) = .5$; $f(x =T_p) = .5$ - *z* is the outcome of the flip of a dime {$H_d$, $T_d$} - $f(z=H_p) = .5$; $f(z =T_p) = .5$ - We will consider the following distributions: - Marginal distribution - The distribution of one variable only (either $f(x)$ or $f(z)$) - Joint distribution - $f(x, z)$: the distribution of both variables (both *x* and *z*) - **Conditional distribution** - The distribution of one variable, conditional on values of the other: - $f(x|z)$: the distribution of *x* given *z* - $f(z|x)$: the distribution of *z* given x ## Marginal Distribution - Marginal distributions are what we have worked with exclusively up to this point: they represent the distribution by itself - Continuous univariate distributions - Categorical distributions - The flip of a penny - The flip of a dime ## Joint Distribution - Joint distributions describe the distribution of more than one variable, simultaneously - Representations of multiple variables collected - Commonly, the joint distribution function is denoted with all random variables separated by commas - In our example, $f(x,z)$ is the joint distribution of the outcome of flipping both a penny and a dime - As both are discrete, the joint distribution has four possible values: $1$ $f(x = H_p,z=H_d)$ (2) $f(x = H_p,z=T_d)$ (3) $f(x = T_p,z=H_d)$ (4) $f(x = T_p,z=T_d)$ - Joint distributions are **multivariate distributions** - We will use joint distributions to introduce two topics - Joint distributions of independent variables - Joint distributions – used in maximum likelihood estimation ## Joint Distributions of Independent Random Variables - Random variables are said to be independent if the occurrence of one event makes it neither more nor less probable of another event - For joint distributions, this means: $f(x,z)=f(x)f(z)$ - In our example, flipping a penny and flipping a dime are independent – so we can complete the following table of their joint distribution: | | z = $H_d$ | z = $T_d$ | Marginal (Penny) | |------------------|-------------------|------------------|------------------| | $x = H_p$ | $\color{tomato}{f(x=H_p, z=H_d)}$ | $\color{tomato}{f(x=H_p, z=T_d)}$ | $\color{turquoise}{f(z=H_p)}$ | | $x = T_p$ | $\color{tomato}{f(x= T_p, z=H_d)}$ | $\color{tomato}{f(x=T_p, z=T_d)}$ | $\color{turquoise}{f(z=T_d)}$ | | Marginal (Dime) | $\color{turquoise}{f(z=H_d)}$ | $\color{turquoise}{f(z=T_d)}$ | | ## Joint Distributions of Independent Random Variables - Because the coin flips are independent, this because: | | z = $H_d$ | z = $T_d$ | Marginal (Penny) | |-----------------|-------------------|-------------------|-----------------| | $x = H_p$ | $\color{tomato}{f(x=H_p)f( z=H_d)}$ | $\color{tomato}{f(x=H_p)f( z=T_d)}$ | $\color{turquoise}{f(z=H_p)}$ | | $x = T_p$ | $\color{tomato}{f(x= T_p)f( z=H_d)}$ | $\color{tomato}{f(x=T_p)f( z=T_d)}$ | $\color{turquoise}{f(z=T_d)}$ | | Marginal (Dime) | $\color{turquoise}{f(z=H_d)}$ | $\color{turquoise}{f(z=T_d)}$ | | - Then, with numbers: | | z = $H_d$ | z = $T_d$ | Marginal (Penny) | |------------------|------------------|------------------|------------------| | $x = H_p$ | $\color{tomato}{.25}$ | $\color{tomato}{.25}$ | $\color{turquoise}{.5}$ | | $x = T_p$ | $\color{tomato}{.25}$ | $\color{tomato}{.25}$ | $\color{turquoise}{.5}$ | | Marginal (Dime) | $\color{turquoise}{.5}$ | $\color{turquoise}{.5}$ | | ## Marginalizing Across a Joint Distribution - If you had a joint distribution, $\color{orchid}{f(x, z)}$, but wanted the marginal distribution of either variable ($f(x)$ or $f(z)$) you would have to **marginalize** across one dimension of the joint distribution. - For [**categorical random variables**]{.underline}, marginalize = sum across every value of z $$ f(x) = \sum_zf(x, z) $$ - For example, $f(x = H_p) = f(x = H_p, z=H_d) +f(x = H_p, z=T_d)=.5$ - For [continuous random variables,]{.underline} marginalize = integrate across z - The integral: $$ f(x) = \int_zf(x,z)dz $$ ## Conditional Distributions - For two random variables x and z, a conditional distribution is written as: $f(z|x)$ - The distribution of z given x - The conditional distribution is equal to the joint distribution divided by the marginal distribution of the conditioning random variable $$ f(z|x) = \frac{f(z,x)}{f(x)} $$ - Conditional distributions are found everywhere in statistics - The general linear model uses the conditional distribution variable $$ Y \sim N(\beta_0+\beta_1X, \sigma^2_e) $$ ## Example: Conditional Distribution - For two discrete random variables with {0, 1} values, the conditional distribution can be shown in a contingency table: | | z = $H_d$ | z = $T_d$ | Marginal (Penny) | |------------------|------------------|------------------|------------------| | $x = H_p$ | $\color{tomato}{.25}$ | $\color{tomato}{.25}$ | $\color{turquoise}{.5}$ | | $x = T_p$ | $\color{tomato}{.25}$ | $\color{tomato}{.25}$ | $\color{turquoise}{.5}$ | | Marginal (Dime) | $\color{turquoise}{.5}$ | $\color{turquoise}{.5}$ | | Conditional: $f(z | x= H_p)$: $f(z=H_d|x =H_p) = \frac{f(z=H_d, x=H_p}{f(x = H_p)} = \frac{.25}{.5}=.5$ $f(z = T_d | x = H_p)= \frac{f(z=T_d, x=H_p}{f(x=H_p)} = \frac{.25}{.5} = .5$ # Expected Values and The Algebra of Expectation ## Expected Values - Expected values are statistics taken the sample space of a random variable: they are essentially weighted averages ```{r} #| output-location: default set.seed(1234) x = rnorm(100, mean = 0, sd = 1) weights = dnorm(x, mean = 0, sd = 1) mean(weights * x) ``` - The weights used in computing this average correspond to the probabilities (for a discrete random variable) or to the densities (for a continuous random variable) ::: {.callout-note font-size="1.2em"} The expected value is represented by $E(x)$ The actual statistic that is being weighted by the PDF is put into the parenthesis where x is now ::: - Expected values allow us to understand what a statistical model implies about data, for instance: - How a GLM specifies the (conditional) mean and variance of a DV ## Expected Value Calculation - For discrete random variables, the expected value is found by: $$ E(x) = \sum_x xP(X=x) $$ - For example, the expected value of a roll of a die is: $$ E(x) = (1)\frac16+ (2)\frac16+(3)\frac16+(4)\frac16+(5)\frac16+6\frac16 $$ - For continuous random variables, the expected value is found by $$ E(x) = \int_x xf(x)dx $$ - We won't be calculating theoretical expected values with calculus... we use them only to see how models imply things about out data ## Variance and Covariance... As Expected Values - A distribution's theoretical variance can also be written as an expected value: $$ V(x) = E(x-E(x))^2 = E(x -\mu_x)^2 $$ - This formula helps us understand predictions made by GLMs and how that corresponds to statistical parameters we interpret - For a roll of a die, the theoretical variance is: $$ V(x) = E(x - 3.5)^2 = \frac16(1-3.5)^2 + \frac16(2-3.5)^2 + \frac16(3-3.5)^2 + \frac16(4-3.5)^2 + \frac16(5-3.5)^2 + \frac16(6-3.5)^2 = 2.92 $$ - Likewise, for a pair of random variable *x* and *z*, the covariance can be found from their joint distributions: $$ \text{Cov}(x,z)=E(xz)-E(x)E(z) = E(xz)-\mu_x\mu_z $$ ```{r} #| results: hold set.seed(1234) N = 100 x = rnorm(N, 2, 1) z = rnorm(N, 3, 1) xz = x*z (Cov_xz = mean(xz)-mean(x)*mean(z)) / ((N-1)/N) # unbiased covariance cov(x, z) ``` # Linear Combination of Random Variables ## Linear Combinations of Random Variables - A linear combination is an expression constructed from a set of terms by multiplying each term by a constant and then adding the results $$ x = c + a_1z_1+a_2z_2+a_3z_3 +\cdots+a_nz_n $$ - More generally, linear combinations of random variables have specific implications for the mean, variance, and possibly covariance of the new random variable - As such, there are predicable ways in which the means, variances, and covariances change ## Algebra of Expectations - Sums of Constants $$ E(x+c) = E(x)+c \\ \text{Var}(x+c) = \text{Var}(x) \\ \text{Cov}(x+c, z) = \text{Cov}(x, z) $$ ### Example Imagine for weight variable, each individual increases 3 lbs: ```{r} #| output-location: default set.seed(1234) library(ESRM64503) x = dataSexHeightWeight$weightLB c_ = 3 z = dataSexHeightWeight$heightIN mean(x + c_); mean(x)+c_ var(x + c_); var(x) cov(x, z); cov(x+c_, z) # decimal place issue, near() accepts two values' diff less than .00001 ``` ------------------------------------------------------------------------ - Products of Constants: $$ E(cx) = cE(x) \\ \text{Var}(cx) = c^2\text{Var}(x) \\ \text{Cov}(cx, dz) = c*d*\text{Cov}(x, z) $$ Imagine you wanted to convert weight from pounds to kilograms (where 1 pound = .453 kg) and convert height from inches to cm (where 1 inch = 2.54 cm) ```{r} #| output-location: default c_ = .453 mean(x*c_); mean(x)*c_ var(x*c_); var(x)*c_^2 d_ = 2.54 cov(c_*x, d_*z);c_*d_*cov(x, z) ``` ------------------------------------------------------------------------ - Sums of Multiple Random Variables: $$ E(cx+dz) = cE(x) + dE(z) \\ \text{Var}(cx+dz) = c^2\text{Var}(x) + d^2\text{Var}(z) + 2c*d*\text{Cov}(x,z)\\ $$ ```{r} #| output-location: default mean(x*c_+z*d_); mean(x)*c_+mean(z)*d_ var(x*c_+z*d_); c_^2*var(x) + d_^2*var(z) + 2*c_*d_*cov(x, z) ``` ## Where We Use This Algebra - Remember how we calculated the standard error of conditional main effect from simple main effect and interaction effect $$ \mathbf{Test = 82.20 + 2.16*Senior + 7.76*New - 3.04*Senior*New} $$ - Conditional Main Effects of Senior When New is 1 is (2.16 - 3.04) = -0.88 - We know that $Var(\beta_{Senior})$ is 0.575, $Var(\beta_{Senior*New})$ is 1.151, and $Cov(\beta_{Senior}, \beta_{Senior*New})$ is -0.575. ```{r} #| output-location: default model1 <- lm(Test ~ Senior + New + Senior * New, data = dataTestExperiment) vcov(model1) ``` Then, $$ \mathbf{Var(\beta_{Senior}+\beta_{Senior*New}) = Var(\beta_{Senior}) + Var(\beta_{Senior*New}) + 2Cov(\beta_{Senior},\beta_{Senior*New}) \\ = 0.575 + 1.151 - 2* 0.575 = 0.576} $$ Thus, $SE(\beta_{Senior}+\beta_{Senior*New})=0.759$ ## Combining the GLM with Expections - Using the algebra of expectations predicting Y from X and Z: $$ \hat{Y}_p=E(Y_p)=E(\beta_0+\beta_1X_p+\beta_2Z_p+\beta_3X_pZ_p+e_p) \\ = \beta_0+\beta_1X_p+\beta_2Z_p+\beta_3X_pZ_p+E(e_p) \\ = \beta_0+\beta_1X_p+\beta_2Z_p+\beta_3X_pZ_p $$ - The variance of $f(Y_p|X_p, Z_p)$: $$ V(Y_P)=V(\beta_0+\beta_1X_p+\beta_2Z_p+\beta_3X_pZ_p+e_p)\\ = V(e_p)=\sigma_e^2 $$ ## Examining What This Means in the Context of Data - If you recall from the regression analysis of the height/weight data, the final model we decided to interpret: Model 5 $$ W_p = \beta_0+\beta_1 (H_p-\bar{H}) +\beta_2F_p+\beta_3 (H_p-\bar{H}) F_p + e_p $$ where $e_p \sim N(0, \sigma_e^2)$ ```{r} #| output-location: default dat <- dataSexHeightWeight dat$heightIN_MC <- dat$heightIN - mean(dat$heightIN) dat$female <- dat$sex == 'F' mod5 <- lm(weightLB ~ heightIN_MC + female + female*heightIN_MC, data = dat) summary(mod5) ``` ## Picturing the GLM with Distributions - The distributional assumptions of the GLM are the reason why we do not need to worry if our dependent variable is normally distributed - Our dependent variable should be conditionally normal - We can check this assumption by checking our assumption about the residuals, $e_p \sim N(0, \sigma^2_e)$ ```{r} #| echo: false #| output-location: default library(ggridges) library(ggplot2) library(dplyr) library(tidyr) set.seed(1234) pred_y <- predict(mod5, newdata = dat) pred_y_dat <- data.frame(Point = names(pred_y), weightLB = pred_y, heightIN_MC = dat$heightIN_MC) sigma <- sd(residuals(mod5)) residual_dat <- as.data.frame(sapply(pred_y, \(x) rnorm(100, mean = x, sd = sigma))) residual_dat_long <- residual_dat |> pivot_longer(everything()) colnames(residual_dat_long) <- c("Point", "weightLBVaribant") residual_dat_long <- residual_dat_long |> left_join(pred_y_dat) |> arrange(Point) ggplot() + geom_density_ridges(aes(x = weightLBVaribant, y = as.factor(round(heightIN_MC, 3))), data = residual_dat_long, rel_min_height = 0.005, alpha = .5, scale = .7) + geom_point(aes(col = female, x = weightLB, y = as.factor(round(heightIN_MC, 3))), data= dat) + geom_smooth(aes(col = female, group = female, x = weightLB, y = as.factor(round(heightIN_MC, 3))), method = 'lm', se = FALSE, data= dat) + labs(y = "HeightLB", x = "WeightLB") + theme_classic() + coord_flip() ``` ## Assessing Distributional Assumptions Graphically ```{r} plot(mod5) ``` ## Hypothesis Tests for Normality If a given test is **significant**, then it is saying that your data do not come from a normal distribution In practice, test will give diverging information quite frequently: the best way to evaluate normality is to consider both plots and tests (approximate = good) ```{r} shapiro.test(mod5$residuals) ``` ## Wrapping Up - Today was an introduction to mathematical statistics as a way to understand the implications statistical models make about data - Although many of these topics do not seem directly relevant, they help provide insights that untrained analysts may not easily attain