Educational Statistics and Research Methods (ESRM) Program
University of Arkansas
2024-09-16
Today’s Class
Review Homework 1
The building blocks: The basics of mathematical statistics:
Random variables: Definition & Types
Univariate distribution
General terminology (e.g., sufficient statistics)
Univariate normal (aka Gaussian)
Other widely used univariate distributions
Types of distributions: Marginal | Conditional | Joint
Expected values: means, variances, and the algebra of expectations
Linear combinations of random variables
The finished product: How the GLM fits within statistics
The GLM with the normal distribution
The statistical assumptions of the GLM
How to assess these assumptions
ESRM 64503: Homework 1
Question 2
Copy and paste your R syntax and R output that calculate the Senior-Old group's standard error of the group mean (use the data and model we've used in class).
Aim: Test whether the group mean of Senior-Old is significantly higher than the baseline (here, a score of 0)
Unit 1: Random Variables & Statistical Distribution
Definition of Random Variables
Random: situations in which the certainty of the outcome is unknown and is at least in part due to chance
Variable: a value that may change given the scope of a given problem or set of operations
Random Variable: a variable whose outcome depends on chance (possible values might represent the possible outcomes of a yet-to-be-performed experiment)
Today, we will denote a random variable with a lowercase letter: x
Question: which one of the following options is a random variable?
(a) any company’s revenue in 2024
(b) one specific company’s monthly revenue in 2024
(c) companies whose revenue exceeds $30 billion
My answer: only (a)
Types of Random Variables
Continuous
Examples of continuous random variables:
x represents the height of a person, drawn at random
Y (the outcome/DV in a GLM)
Some variables, like exam scores or motivation scores, are not “true” continuous variables, but it is convenient to treat them as continuous
Discrete (also called categorical, generally)
Examples of discrete random variables:
x represents the gender of a person, drawn at random
Y (outcomes like yes/no; pass/not pass; master / not master a skill; survive / die)
Mixture of Continuous and Discrete:
Example of mixture (RT = response time): \[\begin{equation}
x =
\begin{cases}
RT & \text{between 0 and 45 seconds} \\
0 & \text{otherwise}
\end{cases}
\end{equation}\]
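To make this concrete, here is a minimal simulation sketch (the distribution and cutoff values are illustrative assumptions, not class data): response times are continuous, but any response outside 0–45 seconds is recorded as 0, creating a point mass at zero.
Code
# Minimal sketch: a continuous/discrete mixture (illustrative values)
set.seed(1234)
rt <- rlnorm(1000, meanlog = 2, sdlog = 1)  # continuous response times, in seconds
x  <- ifelse(rt <= 45, rt, 0)               # responses past 45 s collapse to a point mass at 0
mean(x == 0)                                # proportion of observations at the discrete point
hist(x, breaks = 50, main = "Mixture of continuous and discrete")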
Key Features of Random Variables
Each random variable is described by a probability density / mass function (PDF), \(f(x)\)
PDF indicates relative frequency of occurrence
A PDF is a mathematical function that gives a rough picture of the distribution from which a random variable is drawn
The type of random variable dictates the name and nature of these functions:
Continuous random variables:
\(f(x)\) is called a probability density function
Area under the curve must equal 1 (found by integration; see the sketch after this list)
Height of curve (the function value \(f(x)\)):
Can be any positive number
Reflects relative likelihood of an observation occurring
Discrete random variables:
\(f(x)\) is called a probability mass function
Sum across all values must equal 1
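A quick way to verify both properties in R (a minimal sketch, using the standard normal as the continuous case and a binomial as the discrete case):
Code
# Continuous: the normal density integrates to 1 over its sample space
integrate(dnorm, lower = -Inf, upper = Inf)
# Discrete: the binomial mass function sums to 1 across all possible values
sum(dbinom(0:10, size = 10, prob = 0.5))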
Note
Both max and min values of temperature can be considered as continuous random variables.
Question 1: if we integrate across all values of the max temperature and, separately, across all values of the min temperature, do we get {1, 1} or {0.5, 0.5}?
Answer: {1, 1}, because max and min temperature are two separate random variables, each with its own density.
Code
library(tidyverse)
temp <- read.csv("data/temp_fayetteville.csv")
temp$value_F <- (temp$value / 10 * 1.8) + 32
temp |>
  ggplot() +
  geom_density(aes(x = value_F, fill = datatype), col = "white", alpha = .8) +
  labs(x = "Max/Min Temperature (F) at Fayetteville, AR (Sep-2023)",
       caption = "Source: National Oceanic and Atmospheric Administration (https://www.noaa.gov/)") +
  scale_x_continuous(breaks = seq(min(temp$value_F), max(temp$value_F), by = 5)) +
  scale_fill_manual(values = c("tomato", "turquoise")) +
  theme_classic(base_size = 13)
Other Key Terms
The sample space is the set of all values that a random variable x can take:
Example 1: The sample space for a random variable x from a normal distribution \(x \sim N(\mu_x, \sigma^2_x)\) is \((-\infty, +\infty)\).
Example 2: The sample space for a random variable x representing the outcome of a coin flip is {H, T}
Example 3: The sample space for a random variable x representing the outcome of a roll of a die is {1, 2, 3, 4, 5, 6}
When using generalized models, the trick is to pick a distribution with a sample space that matches the range of values obtainable by data
Logistic regression - Match Bernoulli distribution (Example 2)
Poisson regression - Match Poisson distribution (Example 3)
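A quick illustration of those sample spaces in R (a minimal sketch; the parameter values are arbitrary):
Code
# Bernoulli: sample space {0, 1} -- the match for binary outcomes (logistic regression)
rbinom(10, size = 1, prob = 0.5)
# Poisson: sample space {0, 1, 2, ...} -- the match for count outcomes (Poisson regression)
rpois(10, lambda = 3)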
Uses of Distributions in Data Analysis
Statistical models make distributional assumptions on various parameters and / or parts of data
These assumptions govern:
How models are estimated
How inferences are made
How missing data may be imputed
If data do not follow an assumed distribution, inferences may be inaccurate
Sometimes a very big problem, other times not so much
Therefore, it can be helpful to check distributional assumptions prior to running statistical analysis
Continuous Univariate Distributions
To demonstrate how continuous distributions work and look, we will discuss three:
Uniform distribution
Normal distribution
Chi-square distribution
Each is described by a set of parameters, which we will later see are what give us our inferences when we analyze data
What we then do is put constraints on those parameters based on hypothesized effects in data
Uniform distribution
The uniform distribution is a useful starting point for seeing how continuous distributions work
Typically used in simulation studies in which parameters are randomly generated
For a continuous random variable x that ranges from (a, b), the uniform probability density function is:
\(f(x) = \frac{1}{b-a}\) for \(a \le x \le b\) (and 0 otherwise)
The uniform distribution has two parameters: the lower bound a and the upper bound b
x = seq(0, 3, .1)
y = dunif(x, min = 0, max = 3)
ggplot() +
  geom_point(aes(x = x, y = y)) +
  geom_path(aes(x = x, y = y)) +
  theme_bw()
#| standalone: true
#| viewerHeight: 800
library(shiny)
library(bslib)
library(ggplot2)
library(dplyr)
library(tidyr)
set.seed(1234)

# Define UI for the application that draws the uniform density
ui <- fluidPage(
  # Application title
  titlePanel("Uniform distribution"),
  # Sidebar with slider inputs for the two bounds
  sidebarLayout(
    sidebarPanel(
      sliderInput("a", "Lower Bound (a):", min = 1, max = 20, value = 1,
                  animate = animationOptions(interval = 5000, loop = TRUE)),
      uiOutput("b_slider")
    ),
    # Show a plot of the resulting density
    mainPanel(
      plotOutput("distPlot")
    )
  )
)

# Define server logic required to draw the density plot
x <- seq(0, 40, .02)
server <- function(input, output) {
  observeEvent(input$a, {
    # keep the upper bound slider above the chosen lower bound
    output$b_slider <<- renderUI({
      sliderInput("b", "Upper Bound (b):", min = as.numeric(input$a), max = 40,
                  value = as.numeric(input$a) + 1)
    })
    y <<- reactive({ dunif(x, min = as.numeric(input$a), max = as.numeric(input$b)) })
  })
  observe({
    output$distPlot <- renderPlot({
      ggplot() +
        aes(x = x, y = y()) +
        geom_point() +
        geom_path() +
        labs(x = "x", y = "probability") +
        theme_bw() +
        theme(text = element_text(size = 20))
    })
  })
}

# Run the application
shinyApp(ui = ui, server = server)
More on the Uniform Distribution
To demonstrate how PDFs work, we will try a few values:
conditions <- tribble(
  ~x,  ~a, ~b,
  .5,   0,  1,
  .75,  0,  1,
  15,   0, 20,
  15,  10, 20
) |>
  mutate(y = dunif(x, min = a, max = b))
conditions
# A tibble: 4 × 4
x a b y
<dbl> <dbl> <dbl> <dbl>
1 0.5 0 1 1
2 0.75 0 1 1
3 15 0 20 0.05
4 15 10 20 0.1
The uniform PDF has the feature that all values of x are equally likely across the sample space of the distribution
Therefore, you do not see x in the PDF \(f(x)\)
The mean of the uniform distribution is \(\frac{1}{2}(a+b)\)
The variance of the uniform distribution is \(\frac{1}{12}(b-a)^2\)
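These formulas are easy to check by simulation (a minimal sketch with a = 10 and b = 20):
Code
set.seed(1234)
a <- 10; b <- 20
u <- runif(100000, min = a, max = b)
mean(u); (a + b) / 2      # both approximately 15
var(u); (b - a)^2 / 12    # both approximately 8.33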
Univariate Normal Distribution
For a continuous random variable x (ranging from \(-\infty\) to \(\infty\)), the univariate normal probability density function is:
\[
f(x) = \frac{1}{\sqrt{2\pi\sigma^2_x}}\exp\left(-\frac{(x-\mu_x)^2}{2\sigma^2_x}\right)
\]
The shape of the distribution is governed by two parameters:
The mean \(\mu_x\)
The variance \(\sigma^2_x\)
These parameters are called sufficient statistics (they contain all the information about the distribution)
The skewness (lean) and kurtosis (peakedness) are fixed
Standard notation for normal distributions is \(x\sim N(\mu_x, \sigma^2_x)\)
Read as: “x follows a normal distribution with a mean \(\mu_x\) and a variance \(\sigma^2_x\)”
Linear combinations of random variables following normal distributions result in a random variable that is normally distributed
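A minimal simulation sketch of this closure property (the coefficients and parameters are arbitrary):
Code
set.seed(1234)
x <- rnorm(100000, mean = 1, sd = 2)
z <- rnorm(100000, mean = 3, sd = 1)
w <- 2 * x - 3 * z             # a linear combination of two normals
w_sub <- sample(w, 1000)
qqnorm(w_sub); qqline(w_sub)   # points fall on the line: w is normal
mean(w)                        # approximately 2*1 - 3*3 = -7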
Univariate Normal Distribution in R: pnorm
Density (dnorm), distribution function (pnorm), quantile function (qnorm) and random generation (rnorm) for the normal distribution with mean equal to mean and standard deviation equal to sd.
Z = seq(-5, 5, .1)  # Z-score
ggplot() +
  aes(x = Z, y = pnorm(q = Z, lower.tail = TRUE)) +
  geom_point() +
  geom_path() +
  labs(x = "Z-score", y = "Cumulative probability",
       title = "`pnorm()` gives the cumulative probability function P(X < T)")
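The four functions are closely related; a brief sketch of how they connect:
Code
pnorm(1.96)    # P(Z < 1.96), approximately 0.975 (cumulative probability)
qnorm(0.975)   # approximately 1.96: qnorm() is the inverse of pnorm()
dnorm(0)       # approximately 0.399, the height of the standard normal at its mean
rnorm(5)       # five random draws from N(0, 1)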
Joint distributions are multivariate distributions
We will use joint distributions to introduce two topics
Joint distributions of independent variables
Joint distributions – used in maximum likelihood estimation
Joint Distributions of Independent Random Variables
Random variables are said to be independent if the occurrence of one event makes the occurrence of another event neither more nor less probable
For joint distributions, this means: \(f(x,z)=f(x)f(z)\)
In our example, flipping a penny and flipping a dime are independent – so we can complete the following table of their joint distribution:
| | z = \(H_d\) | z = \(T_d\) | Marginal (Penny) |
|---|---|---|---|
| \(x = H_p\) | \(\color{tomato}{f(x=H_p, z=H_d)}\) | \(\color{tomato}{f(x=H_p, z=T_d)}\) | \(\color{turquoise}{f(x=H_p)}\) |
| \(x = T_p\) | \(\color{tomato}{f(x=T_p, z=H_d)}\) | \(\color{tomato}{f(x=T_p, z=T_d)}\) | \(\color{turquoise}{f(x=T_p)}\) |
| Marginal (Dime) | \(\color{turquoise}{f(z=H_d)}\) | \(\color{turquoise}{f(z=T_d)}\) | |
Joint Distributions of Independent Random Variables
Because the coin flips are independent, this becomes:
| | z = \(H_d\) | z = \(T_d\) | Marginal (Penny) |
|---|---|---|---|
| \(x = H_p\) | \(\color{tomato}{f(x=H_p)f(z=H_d)}\) | \(\color{tomato}{f(x=H_p)f(z=T_d)}\) | \(\color{turquoise}{f(x=H_p)}\) |
| \(x = T_p\) | \(\color{tomato}{f(x=T_p)f(z=H_d)}\) | \(\color{tomato}{f(x=T_p)f(z=T_d)}\) | \(\color{turquoise}{f(x=T_p)}\) |
| Marginal (Dime) | \(\color{turquoise}{f(z=H_d)}\) | \(\color{turquoise}{f(z=T_d)}\) | |
Then, with numbers:

| | z = \(H_d\) | z = \(T_d\) | Marginal (Penny) |
|---|---|---|---|
| \(x = H_p\) | \(\color{tomato}{.25}\) | \(\color{tomato}{.25}\) | \(\color{turquoise}{.5}\) |
| \(x = T_p\) | \(\color{tomato}{.25}\) | \(\color{tomato}{.25}\) | \(\color{turquoise}{.5}\) |
| Marginal (Dime) | \(\color{turquoise}{.5}\) | \(\color{turquoise}{.5}\) | |
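The same table can be built in R directly from the marginals (a minimal sketch):
Code
penny <- c(H = 0.5, T = 0.5)   # marginal distribution of the penny
dime  <- c(H = 0.5, T = 0.5)   # marginal distribution of the dime
joint <- outer(penny, dime)    # independence: f(x, z) = f(x)f(z)
joint                          # every cell is .25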
Marginalizing Across a Joint Distribution
If you had a joint distribution, \(\color{orchid}{f(x, z)}\), but wanted the marginal distribution of either variable (\(f(x)\) or \(f(z)\)) you would have to marginalize across one dimension of the joint distribution.
For categorical random variables, marginalize = sum across every value of z
\[
f(x) = \sum_zf(x, z)
\]
For example, \(f(x = H_p) = f(x = H_p, z=H_d) +f(x = H_p, z=T_d)=.5\)
For continuous random variables, marginalize = integrate across z
The integral:
\[
f(x) = \int_zf(x,z)dz
\]
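Continuing the coin example in R, marginalizing a joint table is just summing across one of its dimensions:
Code
# joint table of two independent fair coins (as built above)
joint <- outer(c(H = 0.5, T = 0.5), c(H = 0.5, T = 0.5))
rowSums(joint)   # marginal f(x) for the penny: H = .5, T = .5
colSums(joint)   # marginal f(z) for the dime:  H = .5, T = .5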
Conditional Distributions
For two random variables x and z, a conditional distribution is written as: \(f(z|x)\)
The distribution of z given x
The conditional distribution is equal to the joint distribution divided by the marginal distribution of the conditioning random variable
\[
f(z|x) = \frac{f(z,x)}{f(x)}
\]
Conditional distributions are found everywhere in statistics
The general linear model uses the conditional distribution of the outcome given the predictors:
\[
Y \sim N(\beta_0+\beta_1X, \sigma^2_e)
\]
Example: Conditional Distribution
For two discrete random variables with {0, 1} values, the conditional distribution can be shown in a contingency table:
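As a sketch (the cell probabilities below are hypothetical, not class data), the conditional distribution comes from dividing each row of the joint table by its row total:
Code
# Hypothetical joint distribution of two binary random variables x and z
joint <- matrix(c(0.30, 0.20,
                  0.10, 0.40),
                nrow = 2, byrow = TRUE,
                dimnames = list(x = c("0", "1"), z = c("0", "1")))
f_x  <- rowSums(joint)               # marginal distribution of x
cond <- sweep(joint, 1, f_x, "/")    # f(z | x) = f(z, x) / f(x); each row now sums to 1
cond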
Expected Values
Expected values are statistics taken over the sample space of a random variable: they are essentially weighted averages
set.seed(1234)
x = rnorm(100, mean = 0, sd = 1)
weights = dnorm(x, mean = 0, sd = 1)   # densities serve as the weights
mean(weights * x)
[1] -0.05567835
The weights used in computing this average correspond to the probabilities (for a discrete random variable) or to the densities (for a continuous random variable)
Note
The expected value is represented by \(E(x)\)
The actual statistic that is being weighted by the PDF is put into the parentheses where x is now
Expected values allow us to understand what a statistical model implies about data, for instance:
How a GLM specifies the (conditional) mean and variance of a DV
Expected Value Calculation
For discrete random variables, the expected value is found by:
\[
E(x) = \sum_x xP(X=x)
\]
For example, the expected value of a roll of a (fair) die is:
\[
E(x) = \sum_{x=1}^{6} x \cdot \frac{1}{6} = \frac{21}{6} = 3.5
\]
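The same computation in R:
Code
faces <- 1:6
prob  <- rep(1/6, 6)   # a fair die: each face equally likely
sum(faces * prob)      # expected value = 3.5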
Linear Combinations of Random Variables
A linear combination is an expression constructed from a set of terms by multiplying each term by a constant and then adding the results \[
x = c + a_1z_1+a_2z_2+a_3z_3 +\cdots+a_nz_n
\]
More generally, linear combinations of random variables have specific implications for the mean, variance, and possibly covariance of the new random variable
As such, there are predictable ways in which the means, variances, and covariances change
cov(x, z); cov(x + c_, z)   # adding the constant c_ leaves the covariance unchanged
# (tiny floating-point differences can be checked with dplyr::near(), which treats
# two values as equal when their difference is below a small tolerance)
[1] 334.8316
[1] 334.8316
Products of Constants:
\[
E(cx) = cE(x) \\
\text{Var}(cx) = c^2\text{Var}(x) \\
\text{Cov}(cx, dz) = cd\,\text{Cov}(x, z)
\]
Imagine you wanted to convert weight from pounds to kilograms (where 1 pound = .453 kg) and convert height from inches to centimeters (where 1 inch = 2.54 cm); a sketch of these rules is below.
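A minimal simulation sketch of these rules with the conversion factors above (the simulated weights and heights are illustrative, not the class data):
Code
set.seed(1234)
weightLB <- rnorm(1000, mean = 180, sd = 20)   # illustrative weights in pounds
heightIN <- rnorm(1000, mean = 67, sd = 3)     # illustrative heights in inches
# E(cx) = cE(x): converting pounds to kilograms rescales the mean
mean(0.453 * weightLB); 0.453 * mean(weightLB)
# Var(cx) = c^2 Var(x): the variance rescales by the squared constant
var(0.453 * weightLB); 0.453^2 * var(weightLB)
# Cov(cx, dz) = cd Cov(x, z): converting both variables rescales the covariance
cov(0.453 * weightLB, 2.54 * heightIN); 0.453 * 2.54 * cov(weightLB, heightIN)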
dat <- dataSexHeightWeight
dat$heightIN_MC <- dat$heightIN - mean(dat$heightIN)
dat$female <- dat$sex == 'F'
mod5 <- lm(weightLB ~ heightIN_MC + female + female*heightIN_MC, data = dat)
summary(mod5)
Call:
lm(formula = weightLB ~ heightIN_MC + female + female * heightIN_MC,
data = dat)
Residuals:
Min 1Q Median 3Q Max
-3.8312 -1.7797 0.4958 1.3575 3.3585
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 222.1842 0.8381 265.11 < 2e-16 ***
heightIN_MC 3.1897 0.1114 28.65 3.55e-15 ***
femaleTRUE -82.2719 1.2111 -67.93 < 2e-16 ***
heightIN_MC:femaleTRUE -1.0939 0.1678 -6.52 7.07e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.175 on 16 degrees of freedom
Multiple R-squared: 0.9987, Adjusted R-squared: 0.9985
F-statistic: 4250 on 3 and 16 DF, p-value: < 2.2e-16
Picturing the GLM with Distributions
The distributional assumptions of the GLM are the reason we do not need to worry about whether our dependent variable is (marginally) normally distributed
Our dependent variable should be conditionally normal
We can check this assumption by checking our assumption about the residuals, \(e_p \sim N(0, \sigma^2_e)\)
Assessing Distributional Assumptions Graphically
plot(mod5)   # diagnostic plots: residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
Hypothesis Tests for Normality
If a given test is significant, it is saying that your data do not come from a normal distribution
In practice, tests will give diverging information quite frequently: the best way to evaluate normality is to consider both plots and tests (approximately normal = good)
shapiro.test(mod5$residuals)
Shapiro-Wilk normality test
data: mod5$residuals
W = 0.95055, p-value = 0.3756
Wrapping Up
Today was an introduction to mathematical statistics as a way to understand the implications statistical models make about data
Although many of these topics do not seem directly relevant, they help provide insights that untrained analysts may not easily attain