Experimental Design in Education
Educational Statistics and Research Methods (ESRM) Program
University of Arkansas
2025-02-24
Class Outline
There are three basic types of experimental research designs:
Pre-experimental designs: no control group
True experimental designs: control group with random assignment
Quasi-experimental designs: control group but no random assignment; assignment is often based on pre-existing criteria.
A true experimental design also has several sub-types: post-test only, pre-post, Solomon four-group, factorial, and randomized block designs.
True experimental designs are characterized by random assignment and, where feasible, random selection.
These features help control for extraneous variables.
| Group | Treatment | Post-test |
|---|---|---|
| 1 (treatment) | X | O |
| 2 (control) | | O |
Example
A researcher wants to determine if a new reading intervention program improves reading comprehension in second-grade students.
By comparing the pre-test and post-test scores between the two groups, the researcher can determine if the intervention program caused a significant improvement in reading comprehension compared to the standard curriculum.
Example
Imagine a study evaluating the effectiveness of a new diversity and inclusion training program in a company. Researchers are concerned that a pre-test measuring employees’ attitudes might make them more aware of the issues and thus influence their responses on the post-test, regardless of the quality of the training.
By comparing the post-test results across these four groups, the researchers can determine the true effect of the training program, the effect of being pre-tested, and whether the pre-test made employees more or less receptive to the training.
| Group | Pre-test | Treatment | Post-test |
|---|---|---|---|
| 1 | O | X | O |
| 2 | O | | O |
| 3 | | X | O |
| 4 | | | O |
The researcher manipulates two or more independent variables (factors) simultaneously to observe their effects on the dependent variable.
Features:
Example
A study investigating the factors that cause workplace stress examines the combined effects of background noise (three levels) and interruptions (two levels) on employee stress.
This is a 3x2 factorial design, which creates 3 × 2 = 6 different experimental conditions or groups:
Participants would be randomly assigned to one of these six conditions. This design allows researchers to answer three key questions:
What is the main effect of background noise on stress? (i.e., does noise level, in general, affect stress?)
What is the main effect of interruptions on stress? (i.e., does the interruption rate, in general, affect stress?)
What is the interaction effect between noise and interruptions? (i.e., does the effect of interruptions on stress depend on the level of background noise? For example, perhaps high interruptions are only stressful when combined with high noise.)
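These three questions can be illustrated with a small simulation. Below is a minimal sketch in Python (standard library only); the cell means, the sample size of 20 per cell, and the noise standard deviation are all hypothetical values chosen for illustration:

```python
import random
import statistics

random.seed(42)

noise_levels = ["none", "low", "high"]    # 3 levels of factor A
interruption_levels = ["low", "high"]     # 2 levels of factor B

# Hypothetical true cell means: interruptions add stress mainly under high noise
true_mean = {
    ("none", "low"): 30, ("none", "high"): 32,
    ("low",  "low"): 34, ("low",  "high"): 38,
    ("high", "low"): 38, ("high", "high"): 50,  # interaction: big jump here
}

# Simulate 20 randomly assigned participants per cell
data = {cell: [random.gauss(mu, 5) for _ in range(20)]
        for cell, mu in true_mean.items()}

# Cell means for each of the six conditions
cell_means = {cell: statistics.mean(scores) for cell, scores in data.items()}
for a in noise_levels:
    row = "  ".join(f"{b}: {cell_means[(a, b)]:5.1f}" for b in interruption_levels)
    print(f"noise={a:<5} {row}")

# Simple interaction check: the interruption effect at each noise level
for a in noise_levels:
    effect = cell_means[(a, "high")] - cell_means[(a, "low")]
    print(f"interruption effect under {a} noise: {effect:+.1f}")
```

With the hypothetical means above, the interruption effect grows with the noise level, which is exactly the pattern an interaction describes.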
This design is a technique for dealing with nuisance factors — variables that are not of primary interest but may nonetheless influence the outcome variable.
Features:
Purpose: To minimize the effect of a single, known nuisance variable on the outcome.
Blocking: Participants are first divided into homogeneous groups or “blocks” based on the nuisance variable (e.g., age, gender, IQ).
Randomization: Within each block, participants are randomly assigned to the treatment or control conditions.
Benefit: This design reduces variability within each block, making it easier to detect the true effect of the treatment.
Suppose we want to conduct a post-test-only design and recognize that our sample contains several homogeneous subgroups.
Example
In a study of college students, we might expect students within the same academic year to be relatively homogeneous.
| Block | Group | Treatment | Post-test |
|---|---|---|---|
| Freshman | 1 | X | O |
| Freshman | 2 | | O |
| Sophomore | 1 | X | O |
| Sophomore | 2 | | O |
| Junior | 1 | X | O |
| Junior | 2 | | O |
| Senior | 1 | X | O |
| Senior | 2 | | O |
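The block-then-randomize procedure in the table above can be sketched in a few lines of Python. The roster, the block size of 8 students per year, and the seed are hypothetical:

```python
import random

random.seed(0)

# Hypothetical roster: (student, academic year); year is the blocking variable
students = [(f"S{i:02d}", year)
            for i, year in enumerate(
                ["Freshman"] * 8 + ["Sophomore"] * 8
                + ["Junior"] * 8 + ["Senior"] * 8)]

# Step 1: form homogeneous blocks by academic year
blocks = {}
for name, year in students:
    blocks.setdefault(year, []).append(name)

# Step 2: randomize to treatment/control WITHIN each block
assignment = {}
for year, members in blocks.items():
    random.shuffle(members)
    half = len(members) // 2
    for name in members[:half]:
        assignment[name] = ("treatment", year)
    for name in members[half:]:
        assignment[name] = ("control", year)

# Each block contributes equally to both conditions
for year in ["Freshman", "Sophomore", "Junior", "Senior"]:
    n_t = sum(1 for g, y in assignment.values() if y == year and g == "treatment")
    print(f"{year}: {n_t} treatment, {8 - n_t} control")
```

Shuffling within each block guarantees that every academic year is equally represented in both conditions, so year cannot confound the treatment comparison.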
Example
A study aims to test the effectiveness of a new medication for lowering blood pressure. Instead of using different groups for treatment and control, researchers recruit one group of patients.
Because the same participants are measured at multiple points in time, this is a repeated measures design. The key advantage is that it controls for individual differences between participants, making it a very powerful way to detect the effect of the treatment.
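A minimal sketch of the corresponding analysis, assuming hypothetical blood-pressure readings for 15 patients: because each participant serves as their own control, the paired t statistic is computed on within-person differences rather than on the two raw means:

```python
import random
import math
import statistics

random.seed(1)

# Hypothetical data: 15 patients, baseline BP and post-treatment BP,
# with a true drop of about 8 mmHg plus individual variation
baseline = [random.gauss(150, 12) for _ in range(15)]
post = [bp - 8 + random.gauss(0, 5) for bp in baseline]

# Repeated measures: analyzing within-person differences removes
# stable individual differences from the error term
diffs = [b - p for b, p in zip(baseline, post)]
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)
n = len(diffs)

t = mean_d / (sd_d / math.sqrt(n))   # paired t statistic, df = n - 1
print(f"mean drop = {mean_d:.1f} mmHg, paired t({n - 1}) = {t:.2f}")
```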
Choosing the right experimental design depends on your research question, available resources, and the threats to validity you most need to control.
| Design | Pros | Cons | Use When | Example |
|---|---|---|---|---|
| Post-test Only | Simple; no pre-test sensitization; low cost | No baseline; relies entirely on randomization for equivalence | Random assignment is ensured and pre-testing would sensitize participants | Randomly assigning classrooms to a new teaching method and comparing end-of-unit scores |
| Pre-Post-test | Establishes baseline; documents individual change | Pre-test may sensitize participants; requires more time and resources | Baseline measurement is essential and sensitization is not a major concern | Measuring student anxiety before and after an 8-week mindfulness program |
| Solomon Four-Group | Controls for both pre-test sensitization and treatment effects | Requires four groups; expensive; complex to implement and analyze | Rigorous control of testing threats is needed and sufficient participants are available | Testing a bias-awareness training where the pre-test survey itself could trigger attitude change |
| Factorial | Tests multiple IVs and their interactions efficiently in one study | Larger samples needed; complex analysis; interactions can be hard to interpret | Two or more IVs may interact; interested in how the effect of one variable depends on another | Studying how teaching method (lecture vs. inquiry) and class size (small vs. large) jointly affect achievement |
| Randomized Block | Reduces error variance by controlling a known nuisance variable; increases power | Nuisance variable must be identified before data collection | A known covariate (e.g., SES, grade level, gender) is expected to substantially influence the outcome | Blocking by prior GPA before randomly assigning students to a study-skills intervention |
| Repeated Measures | Each participant is their own control; fewer participants needed; high statistical power | Carryover, order, and fatigue effects; attrition over time | Tracking within-person change over time, or when sample size is limited and individual differences are large | Measuring the same students’ reading fluency at four time points across a school year |
Quick Decision Checklist
Before finalizing your design, ask yourself:
For each scenario below, select the experimental design that best fits the research purpose. Be prepared to explain your reasoning.
Question 1
A researcher wants to evaluate whether a new phonics-based reading program improves literacy in first graders. She randomly assigns 40 students to either the new program or the standard curriculum. Because she worries that administering a reading pre-test might prime students to pay special attention to phonics — independently of the program — she only collects reading scores at the end of the 6-week period.
Which experimental design is she using?
Question 2
A school counselor evaluates whether a mindfulness-based stress reduction program decreases test anxiety in 9th graders. She measures each student’s anxiety at the start of the semester, delivers the 8-week program, and then measures anxiety again at the end. Her primary goal is to document how much each individual student’s anxiety changed over the course of the program.
Which design best fits her study?
Question 3
A researcher is testing whether a new anti-bullying curriculum changes middle schoolers’ attitudes toward bullying. She is concerned that completing a pre-test survey about bullying might itself make students more sensitive to the issue — independent of the curriculum. She has 200 participants available and wants to isolate the true effect of the curriculum from any pre-test sensitization.
Which design should she use?
Question 4
An educational psychologist hypothesizes that the effectiveness of a growth mindset intervention depends on both the student’s grade level (middle school vs. high school) and the delivery format (in-person vs. online). She wants to examine not only whether each factor independently affects outcomes, but also whether the two factors interact with each other.
Which design is most appropriate?
Question 5
A developmental psychologist wants to track how children’s executive function skills develop across one academic year. She measures the same 25 children on an executive function task at the beginning, middle, and end of the year. Her sample is small because the assessment is expensive and time-consuming, and she wants to maximize statistical power by using each child as their own control.
Which design does her study employ?
When we review experiments with a critical view, one question to ask is “Is this study valid?”
Validity is the foundation of trustworthy research. It ensures that the conclusions we draw are accurate and meaningful. Without it, we might be measuring the wrong thing, mistaking correlation for causation, or finding results that don’t apply to the real world.
Why It Matters: Mini Examples
Randomized experiments are often called the “gold standard” of research design, particularly for establishing internal validity, because they are the most effective way to establish a cause-and-effect relationship between a treatment and an outcome.
By randomly assigning participants to groups, the experimenter creates two or more groups that are statistically equivalent, on average, before the treatment is applied. This process minimizes selection bias and ensures that other potential causes (e.g., age, motivation, prior knowledge) are distributed equally across the groups.
Therefore, if a difference is observed between the groups after the treatment, the researcher can be much more confident that the difference was caused by the treatment and not by some other pre-existing factor.
Randomized experiments allow researchers to scientifically measure the impact of an intervention on a particular outcome of interest (e.g., the effect of intervention methods on performance).
The key to a randomized experimental research design is the random assignment of study subjects:
Randomization has a very specific meaning in this context: care is taken to ensure that no pattern exists between the assignment of subjects to groups and any characteristics of those subjects.
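This definition can be demonstrated with a quick simulation: shuffle participants, split them into two groups, and check that a pre-existing characteristic (here, a hypothetical "prior knowledge" score) ends up balanced across groups. All values and the seed are illustrative:

```python
import random
import statistics

random.seed(2025)

# Hypothetical participants with a pre-existing characteristic (prior knowledge)
participants = [{"id": i, "prior": random.gauss(100, 15)} for i in range(200)]

# Random assignment: shuffle, then split; by construction, assignment
# is unrelated to the 'prior' characteristic
random.shuffle(participants)
treatment, control = participants[:100], participants[100:]

m_t = statistics.mean(p["prior"] for p in treatment)
m_c = statistics.mean(p["prior"] for p in control)
print(f"treatment mean prior = {m_t:.1f}, control mean prior = {m_c:.1f}")
```

The two group means come out nearly equal, illustrating how randomization distributes pre-existing differences evenly on average.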
Concern:
Threats:
Causes of Threats:
Examples of Statistical Validity Threats in Education & Psychology
Example 1 — Underpowered study (Low Power)
A school counselor wants to test whether a 4-week mindfulness program reduces math anxiety in middle school students. She recruits only 10 students per group (treatment vs. control) and runs an independent-samples t-test. The program actually has a small-to-medium effect, but with only 20 total participants the study has roughly 30% power — far below the recommended 80%. The result is non-significant, and she incorrectly concludes that mindfulness has no effect. This is a Type II error driven by low statistical power.
Fix: A power analysis before data collection would have revealed that approximately 64 students per group are needed to detect a medium effect (\(d = 0.5\)) at 80% power.
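A sketch of that power calculation using the normal approximation (standard library only; the exact t-based answer is one or two participants larger per group):

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided two-sample t-test,
    via the normal approximation to the power function."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_b = NormalDist().inv_cdf(power)           # quantile for desired power
    return math.ceil(2 * ((z_a + z_b) / d) ** 2)

print(n_per_group(0.5))   # medium effect: about 63-64 per group
print(n_per_group(0.2))   # a small effect needs far more participants
```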
Example 2 — Inflated Type I error from multiple comparisons
A researcher studies the effect of a new reading intervention on five outcome measures: reading fluency, comprehension, vocabulary, writing quality, and attendance. She runs a separate t-test for each outcome without adjusting the alpha level. With five tests at \(\alpha = .05\), the probability of at least one false positive exceeds 22%. She reports a significant effect on vocabulary and claims the intervention works — but this finding may be a chance result.
Fix: Apply a Bonferroni correction (use \(\alpha = .05/5 = .01\) per test) or use a multivariate test (MANOVA) to test all outcomes simultaneously.
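The family-wise error arithmetic is easy to verify directly (assuming independent tests):

```python
# Family-wise error rate for k independent tests at a given per-test alpha
def fwer(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

print(f"5 tests at alpha=.05: FWER = {fwer(5):.3f}")            # about .226
print(f"Bonferroni (alpha=.01): FWER = {fwer(5, 0.05 / 5):.3f}")  # about .049
```

The Bonferroni-corrected per-test alpha brings the family-wise rate back under the nominal .05 level.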
Example 3 — Violated statistical assumption
A psychology graduate student examines whether students’ self-reported stress levels differ across three academic programs (Education, Psychology, Engineering) using a one-way ANOVA. However, she collects data from only 5 students per group and never checks for normality. With such small samples and an ordinal Likert scale, the normality assumption is likely violated, making the F-test results untrustworthy.
Fix: Use a non-parametric alternative (e.g., Kruskal-Wallis test) when sample sizes are small and distributional assumptions cannot be verified.
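For reference, the Kruskal-Wallis H statistic is simple enough to compute by hand: it replaces raw scores with ranks, so no normality assumption is needed. A minimal sketch (no tie correction; the Likert ratings below are hypothetical):

```python
from itertools import chain

def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (no tie correction), compared against
    a chi-square distribution with k - 1 degrees of freedom."""
    pooled = sorted(chain.from_iterable(groups))
    # Average rank for each distinct value (ties share their mean rank)
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        i = j
    n = len(pooled)
    h = sum(sum(ranks[x] for x in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * h - 3 * (n + 1)

# Hypothetical 5-point Likert stress ratings, 5 students per program
education   = [3, 4, 4, 5, 3]
psychology  = [2, 3, 3, 4, 2]
engineering = [4, 5, 5, 4, 3]
print(f"H = {kruskal_wallis_h(education, psychology, engineering):.2f}")
```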
Bonus: Resentful Demoralization
Resentful demoralization is a threat to internal validity that occurs when participants in a control group become discouraged or resentful because they are not receiving the treatment.
As a result, they may put in less effort, perform worse than they otherwise would, or drop out of the study entirely.
This artificially inflates the difference between the treatment and control groups — making the treatment look more effective than it actually is.
Internal validity is about whether the treatment truly caused the outcome.
With resentful demoralization, the control group's scores drop for reasons unrelated to the treatment, so the observed group difference no longer reflects the treatment's true effect.
Imagine a study testing a new study-skills program:
If control participants realize they are missing out, they may feel:
“Why bother trying? They’re getting the help anyway.”
Their performance drops — not because the treatment works so well, but because the control group became demoralized.
Example of a Threat to Internal Validity
Imagine a study measures the effectiveness of a 3-month public health campaign designed to increase recycling.
The conclusion seems to be that the campaign worked. However, during that same 3-month period, a very popular celebrity independently launched their own high-profile “Go Green” initiative. Now, it’s impossible to know if the increase in recycling was due to the health campaign or the celebrity’s influence. This external event is a history threat that compromises the study’s internal validity.
More Examples of Internal Validity Threats in Education & Psychology
Example 1 — Maturation threat
A researcher implements a 9-month phonics intervention for first-grade students and measures reading fluency at the start and end of the school year. Reading scores improve significantly. However, first-graders naturally develop reading skills over the course of a year through brain maturation and general classroom instruction. Without a control group, it is impossible to determine how much of the gain is due to the intervention versus normal developmental growth.
Example 2 — Selection bias
A school district offers an optional after-school tutoring program and later compares the final exam scores of students who attended with those who did not. Attendees score higher on average. However, students who voluntarily joined the program were likely already more motivated, more engaged, or had more parental support. The observed difference may reflect pre-existing differences between groups rather than the effect of tutoring itself.
Example 3 — Regression to the mean
A district identifies the 50 students who scored in the bottom 10% on a standardized math test and enrolls them in an intensive remediation program. At the end of the semester, their average score rises noticeably. But scores at the extreme low end naturally tend to move upward on retesting due to statistical regression to the mean — even without any intervention. Attributing all the improvement to the program would be an internal validity error.
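Regression to the mean is easy to demonstrate by simulation: give each student a stable true ability, add independent measurement noise to two test administrations, and select the bottom scorers on the first test. Their mean rises on retest with no intervention at all (all numbers below are hypothetical):

```python
import random
import statistics

random.seed(7)

# Each student has a stable true ability; each test adds independent noise
true_ability = [random.gauss(500, 50) for _ in range(500)]
test1 = [a + random.gauss(0, 40) for a in true_ability]
test2 = [a + random.gauss(0, 40) for a in true_ability]   # NO intervention

# Select the bottom 10% on test 1, then look at their retest scores
cutoff = sorted(test1)[len(test1) // 10]
selected = [i for i, s in enumerate(test1) if s < cutoff]

m1 = statistics.mean(test1[i] for i in selected)
m2 = statistics.mean(test2[i] for i in selected)
print(f"selected group: test 1 mean = {m1:.0f}, test 2 mean = {m2:.0f}")
```

The selected group's mean increases on the second test purely because extreme scores contain extreme noise, which does not repeat.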
Example 4 — Instrumentation threat
A study measures student anxiety using a validated 20-item scale at pre-test, but the research team switches to a shorter 10-item scale at post-test to save time. Any observed change in scores may partly or entirely reflect the change in measurement instrument rather than a true change in anxiety levels.
Completing a math test in a swimsuit
Consider, for example, an experiment in which researcher Barbara Fredrickson and her colleagues had undergraduate students come to a laboratory on campus and complete a math test while wearing a swimsuit (Fredrickson et al. 1998). At first, this manipulation might seem silly. When will undergraduate students ever have to complete math tests in their swimsuits outside of this experiment?
Assumption: “This self-objectification is hypothesized to (a) produce body shame, which in turn leads to restrained eating, and (b) consume attention resources, which is manifested in diminished mental performance.”
“Self-objectification increased body shame, which in turn predicted restrained eating.”
Example of Cialdini et al. (2005)
In one such experiment, Robert Cialdini and his colleagues studied whether hotel guests chose to reuse their towels for a second day as opposed to having them washed as a way of conserving water and energy (Cialdini 2005).
Threats to External Validity:
As a general rule, studies are higher in external validity when the participants and situations studied resemble those the researchers want to generalize to, i.e., those that people encounter in everyday life. This quality is often described as mundane realism.
The best approach to minimizing threats to external validity is to study a heterogeneous group of settings, people, and times.
Examples of External Validity Threats in Education & Psychology
Example 1 — Interaction of selection and treatment (WEIRD samples)
A researcher tests a growth mindset intervention and finds it significantly raises academic persistence among undergraduate psychology students at a large research university. She concludes the intervention should be adopted nationwide. However, her sample is almost entirely 18–22-year-old, college-educated, and self-selected into a psychology course. Whether the same intervention works for elementary school students, adult learners, or students in under-resourced rural schools remains unknown. The finding may not generalize beyond the original sample.
Example 2 — Interaction of setting and treatment
A counseling psychologist develops a trauma-informed social-emotional learning (SEL) curriculum and tests it in a well-funded suburban school with small class sizes, trained counselors on staff, and strong administrator support. Results are impressive. However, when the same curriculum is adopted by an urban district with large class sizes, few support staff, and limited professional development time, outcomes are far weaker. The original results did not generalize to a different setting.
Example 3 — Interaction of history and treatment
A study evaluating the effect of online collaborative learning on student engagement was conducted in spring 2020, when schools abruptly shifted to remote instruction during the COVID-19 pandemic. Students and teachers were highly motivated to make technology-based collaboration work. A researcher who concludes “online collaboration improves engagement” and tries to replicate the finding in 2025 under normal in-person conditions may get very different results — the extraordinary historical context was a key driver of the original outcome.
Example 4 — Interaction of testing and treatment
Researchers administer a detailed pre-test on attitudes toward diversity before implementing a multicultural education program. Post-test scores show significantly more positive attitudes. However, the pre-test itself likely sensitized students to diversity issues and primed them to think about the topic before the program began — an effect that would not occur if the program were implemented in a school without a pre-test. The findings may not generalize to real-world program delivery without the pre-test.
Examples of Construct Validity in Education & Psychology
Example 1 — Measuring “critical thinking” or measuring reading speed?
A school district develops a new “critical thinking” assessment for 4th graders that consists of complex, lengthy reading passages followed by inferential questions. Students who score low are labeled as weak critical thinkers and placed in an intervention. However, a follow-up analysis reveals that scores correlate almost perfectly with reading fluency but show little relationship with tasks that require logical reasoning or argument evaluation. The test is actually measuring reading speed and decoding ability — not critical thinking. This is a construct validity failure: the operationalization does not match the theoretical construct.
Example 2 — Measuring “intrinsic motivation” or social desirability?
A researcher develops a self-report questionnaire to measure students’ intrinsic motivation for learning. Items include: “I study because I genuinely enjoy learning new things.” Students who want to appear to be good students may give socially desirable answers regardless of their true feelings. If the scale correlates strongly with a social desirability scale but weakly with actual time-on-task or voluntary reading behavior, the instrument may be capturing social desirability rather than intrinsic motivation itself.
Example 3 — Convergent and discriminant validity of a “self-efficacy” scale
A psychologist creates a new academic self-efficacy scale for college students. To validate it, she examines, for example, whether scores correlate strongly with an established measure of academic confidence (convergent evidence) and only weakly with a conceptually unrelated trait such as extraversion (discriminant evidence).
Both checks together provide evidence that the instrument measures what it claims to measure.
Example 4 — Mono-method bias in measuring student well-being
A researcher assesses student psychological well-being using only a single self-report survey administered once at the end of the semester. Because students are tired and may under-report positive emotions at that time of year, the measure under-represents the true construct. Using multiple methods — self-report, teacher ratings, behavioral observation, and physiological measures — across multiple time points would yield a more construct-valid picture of well-being.


ESRM 64503