Example 01: Make Friends with R

Author

Jihong Zhang

1 How to use this file

  1. You can review all R code on this webpage.

  2. To test a particular chunk of code, click the “copy” icon in the upper right-hand corner of the chunk block (see screenshot below).

  3. To review the whole file, click “</> Code” next to the title of this page, then find “View Source” and click the button. You can then copy the source and paste it into a newly created Quarto document.

2 R Syntax and Naming Conventions

# MAKE FRIENDS WITH R
# BASED ON MAKE FRIENDS WITH R BY JONATHAN TEMPLIN
# CREATED BY JIHONG ZHANG

# R comments begin with a # -- there are no multiline comments

# RStudio helps you build syntax
#   GREEN: Comments and character values in single or double quotes

# You can use the tab key to complete object names, functions, and arguments

# R is case sensitive. That means R and r are two different things.
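
A small sketch of what case sensitivity means in practice. The object names `score` and `Score` are hypothetical and are not part of the course files:

```r
# Two hypothetical objects whose names differ only by case
score <- 10
Score <- 20

# R treats them as two separate objects
identical(score, Score)

# Function names are case sensitive too: print() exists in base R, but
# calling PRINT() raises an error, which is caught here so the script
# keeps running
result <- tryCatch(
  PRINT(score),
  error = function(e) "error: could not find function"
)
result
```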

3 R Functions

# In R, every statement is a function call

# The print function prints the contents of what is inside to the console
print(x = 10)
[1] 10
# The terms inside the function are called the arguments; here print takes x
#   To find help with what the arguments are use:
?print

# Each function returns an object
print(x = 10)
[1] 10
# You can determine what type of object is returned by using the class function
class(print(x = 10))
[1] 10
[1] "numeric"
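
As a further illustration of arguments and return values, the sketch below uses the base R function `round()`, which is not used elsewhere on this page. It shows that arguments can be matched by name or by position and that the returned object can be stored and inspected:

```r
# args() lists the arguments a function accepts
args(round)

# Arguments can be matched by name or by position
by_name <- round(x = 3.14159, digits = 2)
by_position <- round(3.14159, 2)
identical(by_name, by_position)

# The returned object can be stored and inspected like any other
class(by_name)
```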

4 R Objects

# Each object can be saved into the R environment (the workspace here)
#   You can save the results of a function call to a variable of any name
MyObject = print(x = 10)
[1] 10
class(MyObject)
[1] "numeric"
# You can view the objects you have saved in the Environment tab in RStudio
# Or type their name
MyObject
[1] 10
# There are literally thousands of types of objects in R (you can create them),
#   but for our course we will mostly be working with data frames (more later)

# The process of saving the results of a function to a variable is called 
#   assignment. There are several ways you can assign function results to 
#   variables:

# The equals sign takes the result from the right-hand side and assigns it to
#   the variable name on the left-hand side:
MyObject = print(x = 10)
[1] 10
# The <- operator (Alt + "-" in RStudio; Option + "-" on macOS) works like
#   the equals sign (right to left)
MyObject2 <- print(x = 10)
[1] 10
identical(MyObject, MyObject2)
[1] TRUE
# The -> assigns from left to right:
print(x = 10) -> MyObject3
[1] 10
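# identical() compares only two objects at a time; a third positional
#   argument would be read as the num.eq option, not as another object
identical(MyObject, MyObject3)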
[1] TRUE
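
Because `identical()` compares exactly two objects, checking three or more takes a little extra work. The following is one possible sketch, using hypothetical object names `a`, `b`, and `c_obj`, that compares each object to the first:

```r
# Three assignment styles, all storing the same value
a = 10
b <- 10
10 -> c_obj

# Compare every other object to the first one, then require all to match
all_same <- all(vapply(
  list(b, c_obj),
  function(obj) identical(obj, a),
  logical(1)
))
all_same
```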

5 Importing and Exporting Data

  • The data frame is an R object that is a rectangular array of data. The variables in the data frame can be any class (e.g., numeric, character) and go across the columns. The observations are across the rows.

  • We will start by importing data from a comma-separated values (csv) file.

  • We will use the read.csv() function. Here, the argument stringsAsFactors = FALSE keeps R from converting character strings into factors.

  • We will use the here::here() function to quickly point to the target data file.
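
To see what stringsAsFactors does, here is a minimal sketch that writes a tiny made-up data set to a temporary file and reads it back both ways. The file name `example_people.csv` and its contents are invented for illustration and are not part of the course data:

```r
# Write a tiny, made-up data set to a temporary CSV file
csv_path <- file.path(tempdir(), "example_people.csv")
example_people <- data.frame(ID = 1:3, Name = c("Ann", "Bob", "Cy"))
write.csv(example_people, file = csv_path, row.names = FALSE)

# stringsAsFactors = FALSE keeps text columns as character vectors
as_character <- read.csv(file = csv_path, stringsAsFactors = FALSE)
class(as_character$Name)

# stringsAsFactors = TRUE converts text columns to factors
as_factor <- read.csv(file = csv_path, stringsAsFactors = TRUE)
class(as_factor$Name)
```

Since R 4.0.0 the default is stringsAsFactors = FALSE, so spelling it out mainly matters for older R versions and for readability.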

# You can also set the directory using setwd(). Here, I set my directory to 
#   my home folder (~):
setwd("~")

getwd()
dir()
# If I tried to load the data from this folder, I would get an error
#   because heights.csv is not there:
HeightsData = read.csv(file = "heights.csv", stringsAsFactors = FALSE)
# Method 2: I can use the full path to the file:
# HeightsData = 
#   read.csv(
#     file = "/Users/jihong/Documents/website-jihong/posts/Lectures/2024-07-21-applied-multivariate-statistics-esrm64503/Lecture01/data/heights.csv", 
#     stringsAsFactors = FALSE)

# Or, I can reset the current directory and use the previous syntax:
# setwd("/Users/jihong/Documents/website-jihong/posts/Lectures/2024-07-21-applied-multivariate-statistics-esrm64503/Lecture01/data/")

HeightsData = read.csv(file = "heights.csv", stringsAsFactors = FALSE)
HeightsData
   ID HeightIN
1   1 72.76783
2   2 69.45293
3   3 69.70142
4   4 69.36786
5   5 70.55350
6   6 69.76497
7   7 70.55302
8   8 69.02545
9   9 69.48786
10 10 69.29473
# Note: Windows users will have to either use forward slashes (/) or
#   double the backslashes (\\) between folder levels.

# To show my data in RStudio, I can either double click it in the 
#   Environment tab or use the View() function
View(HeightsData)

# You can access a variable by name and see its contents by using the $:
HeightsData$ID
 [1]  1  2  3  4  5  6  7  8  9 10
# To read in SPSS files, we will need the foreign package. The foreign
#   package comes installed with R (so no need to use install.packages()).
library(foreign)

# The read.spss() function imports the SPSS file to an R data frame if the 
#   argument to.data.frame is TRUE
WideData = read.spss(file = "wide.sav", to.data.frame = TRUE)
WideData
   ID Gender DVTime1 DVTime2 DVTime3 DVTime4 AgeTime1 AgeTime2 AgeTime3
1   1      1    21.0    20.0    21.5    23.0      8.0     10.0     12.0
2   2      1    21.0    21.5    24.0    25.5      8.1     10.1     12.1
3   3      1    20.5    24.0    24.5    26.0      8.5     10.4     12.2
4   4      1    23.5    24.5    25.0    26.5      8.7     10.6     12.5
5   5      1    21.5    23.0    22.5    23.5      7.9     10.0     12.1
6   6      1    20.0    21.0    21.0    22.5      8.0     10.0     11.9
7   7      1    21.5    22.5    23.0    25.0      8.2     10.2     12.0
8   8      1    23.0    23.0    23.5    24.0      7.9      9.9     12.1
9   9      1    20.0    21.0    22.0    21.5      8.0     10.1     12.4
10 10      1    16.5    19.0    19.0    19.5      8.3     10.2     12.1
   AgeTime4
1      14.0
2      14.2
3      14.1
4      14.4
5      13.9
6      13.8
7      14.1
8      14.0
9      14.3
10     14.2

6 Merging R data frame objects

# The WideData and HeightsData have the same set of ID numbers. We can use
#   the merge() function to merge them into a single data frame. Here, x is
#   the name of the left-side data frame and y is the name of the right-side
#   data frame. The arguments by.x and by.y are the name of the variable(s)
#   by which we will merge:
AllData = merge(x = WideData, y = HeightsData, by.x = "ID", by.y = "ID")
AllData
   ID Gender DVTime1 DVTime2 DVTime3 DVTime4 AgeTime1 AgeTime2 AgeTime3
1   1      1    21.0    20.0    21.5    23.0      8.0     10.0     12.0
2   2      1    21.0    21.5    24.0    25.5      8.1     10.1     12.1
3   3      1    20.5    24.0    24.5    26.0      8.5     10.4     12.2
4   4      1    23.5    24.5    25.0    26.5      8.7     10.6     12.5
5   5      1    21.5    23.0    22.5    23.5      7.9     10.0     12.1
6   6      1    20.0    21.0    21.0    22.5      8.0     10.0     11.9
7   7      1    21.5    22.5    23.0    25.0      8.2     10.2     12.0
8   8      1    23.0    23.0    23.5    24.0      7.9      9.9     12.1
9   9      1    20.0    21.0    22.0    21.5      8.0     10.1     12.4
10 10      1    16.5    19.0    19.0    19.5      8.3     10.2     12.1
   AgeTime4 HeightIN
1      14.0 72.76783
2      14.2 69.45293
3      14.1 69.70142
4      14.4 69.36786
5      13.9 70.55350
6      13.8 69.76497
7      14.1 70.55302
8      14.0 69.02545
9      14.3 69.48786
10     14.2 69.29473
## Method 2: Use dplyr's left_join(). In RStudio, `command + shift + M`
##   inserts a pipe; it produces |> when the native pipe option is enabled
##   (otherwise %>%)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
WideData |> 
  left_join(HeightsData, by = "ID")
   ID Gender DVTime1 DVTime2 DVTime3 DVTime4 AgeTime1 AgeTime2 AgeTime3
1   1      1    21.0    20.0    21.5    23.0      8.0     10.0     12.0
2   2      1    21.0    21.5    24.0    25.5      8.1     10.1     12.1
3   3      1    20.5    24.0    24.5    26.0      8.5     10.4     12.2
4   4      1    23.5    24.5    25.0    26.5      8.7     10.6     12.5
5   5      1    21.5    23.0    22.5    23.5      7.9     10.0     12.1
6   6      1    20.0    21.0    21.0    22.5      8.0     10.0     11.9
7   7      1    21.5    22.5    23.0    25.0      8.2     10.2     12.0
8   8      1    23.0    23.0    23.5    24.0      7.9      9.9     12.1
9   9      1    20.0    21.0    22.0    21.5      8.0     10.1     12.4
10 10      1    16.5    19.0    19.0    19.5      8.3     10.2     12.1
   AgeTime4 HeightIN
1      14.0 72.76783
2      14.2 69.45293
3      14.1 69.70142
4      14.4 69.36786
5      13.9 70.55350
6      13.8 69.76497
7      14.1 70.55302
8      14.0 69.02545
9      14.3 69.48786
10     14.2 69.29473
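
In the example above both data frames contain exactly the same IDs, so the two methods give the same rows. The sketch below uses two small made-up data frames (`left` and `right`, not part of the course data) to show what happens when the IDs do not fully overlap. It uses only base R's `merge()`:

```r
# Two hypothetical data frames with partly different IDs
left  <- data.frame(ID = c(1, 2, 3), Score = c(21.0, 20.5, 23.5))
right <- data.frame(ID = c(1, 2, 4), HeightIN = c(72.8, 69.5, 70.6))

# By default merge() keeps only IDs found in both data frames
inner <- merge(x = left, y = right, by = "ID")
inner

# all.x = TRUE keeps every row of x and fills unmatched y columns with NA,
# which mirrors the behavior of dplyr::left_join()
left_kept <- merge(x = left, y = right, by = "ID", all.x = TRUE)
left_kept
```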

7 Transforming Wide to Long

# Sometimes, certain packages require repeated measures data to be in a long
# format. 

library(dplyr) # provides selection helpers such as starts_with()

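## Wrong way: pivoting the DV and Age columns in two separate steps pairs
##   every DV time with every Age time (10 IDs x 4 x 4 = 160 rows)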
AllData |> 
  tidyr::pivot_longer(starts_with("DVTime"), names_to = "DV", values_to = "DV_Value") |> 
  tidyr::pivot_longer(starts_with("AgeTime"), names_to = "Age", values_to = "Age_Value") 
# A tibble: 160 × 7
      ID Gender HeightIN DV      DV_Value Age      Age_Value
   <dbl>  <dbl>    <dbl> <chr>      <dbl> <chr>        <dbl>
 1     1      1     72.8 DVTime1     21   AgeTime1         8
 2     1      1     72.8 DVTime1     21   AgeTime2        10
 3     1      1     72.8 DVTime1     21   AgeTime3        12
 4     1      1     72.8 DVTime1     21   AgeTime4        14
 5     1      1     72.8 DVTime2     20   AgeTime1         8
 6     1      1     72.8 DVTime2     20   AgeTime2        10
 7     1      1     72.8 DVTime2     20   AgeTime3        12
 8     1      1     72.8 DVTime2     20   AgeTime4        14
 9     1      1     72.8 DVTime3     21.5 AgeTime1         8
10     1      1     72.8 DVTime3     21.5 AgeTime2        10
# ℹ 150 more rows
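## Correct way: pivot both sets of columns together, split each name at
##   "Time" into a variable name and a time point, then spread DV and Age
##   back into their own columns (10 IDs x 4 time points = 40 rows)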
AllData |> 
  tidyr::pivot_longer(c(starts_with("DVTime"), starts_with("AgeTime"))) |> 
  tidyr::separate(name, into = c("Variable", "Time"), sep = "Time") |> 
  tidyr::pivot_wider(names_from = "Variable", values_from = "value") -> AllDataLong
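
For comparison, the same wide-to-long idea can be sketched with base R's `reshape()` instead of tidyr. This is an alternative technique, not the method used above. The data frame `wide` below copies the first two participants and the first two time points from WideData, so that the result is easy to check by eye:

```r
# First two participants and first two time points from WideData
wide <- data.frame(
  ID = c(1, 2),
  DVTime1 = c(21.0, 21.0), DVTime2 = c(20.0, 21.5),
  AgeTime1 = c(8.0, 8.1),  AgeTime2 = c(10.0, 10.1)
)

# Each element of `varying` lists the wide columns that become one long
# column; `v.names` gives the names of those long columns
long <- reshape(
  wide,
  direction = "long",
  varying = list(c("DVTime1", "DVTime2"), c("AgeTime1", "AgeTime2")),
  v.names = c("DV", "Age"),
  timevar = "Time",
  times = 1:2,
  idvar = "ID"
)

# Sort so that each participant's rows sit together
long <- long[order(long$ID, long$Time), ]
long
```

Each participant now has one row per time point, with DV and Age measured at that time point side by side.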

8 Gathering Descriptive Statistics

# The psych package makes getting descriptive statistics very easy.
## install.packages("psych")
library(psych)

Attaching package: 'psych'
The following object is masked from 'package:rstan':

    lookup
# We can use describe() to get descriptive statistics across all cases:
DescriptivesWide = describe(AllData)
DescriptivesWide
         vars  n  mean   sd median trimmed  mad   min   max range  skew
ID          1 10  5.50 3.03   5.50    5.50 3.71  1.00 10.00  9.00  0.00
Gender      2 10  1.00 0.00   1.00    1.00 0.00  1.00  1.00  0.00   NaN
DVTime1     3 10 20.85 1.92  21.00   21.06 1.11 16.50 23.50  7.00 -0.78
DVTime2     4 10 21.95 1.76  22.00   22.00 1.48 19.00 24.50  5.50 -0.13
DVTime3     5 10 22.60 1.81  22.75   22.75 1.85 19.00 25.00  6.00 -0.48
DVTime4     6 10 23.70 2.18  23.75   23.88 2.22 19.50 26.50  7.00 -0.43
AgeTime1    7 10  8.16 0.27   8.05    8.12 0.22  7.90  8.70  0.80  0.79
AgeTime2    8 10 10.15 0.21  10.10   10.12 0.15  9.90 10.60  0.70  0.85
AgeTime3    9 10 12.14 0.18  12.10   12.12 0.15 11.90 12.50  0.60  0.72
AgeTime4   10 10 14.10 0.18  14.10   14.10 0.15 13.80 14.40  0.60  0.00
HeightIN   11 10 70.00 1.10  69.59   69.77 0.39 69.03 72.77  3.74  1.50
         kurtosis   se
ID          -1.56 0.96
Gender        NaN 0.00
DVTime1      0.19 0.61
DVTime2     -1.37 0.56
DVTime3     -0.87 0.57
DVTime4     -1.04 0.69
AgeTime1    -0.85 0.08
AgeTime2    -0.51 0.07
AgeTime3    -0.77 0.06
AgeTime4    -1.22 0.06
HeightIN     1.19 0.35
DescriptivesLong = describe(AllDataLong)
DescriptivesLong
         vars  n  mean   sd median trimmed  mad   min   max range  skew
ID          1 40  5.50 2.91   5.50    5.50 3.71  1.00 10.00  9.00  0.00
Gender      2 40  1.00 0.00   1.00    1.00 0.00  1.00  1.00  0.00   NaN
HeightIN    3 40 70.00 1.05  69.59   69.77 0.39 69.03 72.77  3.74  1.69
Time*       4 40  2.50 1.13   2.50    2.50 1.48  1.00  4.00  3.00  0.00
DV          5 40 22.27 2.12  22.50   22.31 2.22 16.50 26.50 10.00 -0.24
Age         6 40 11.14 2.25  11.25   11.14 2.89  7.90 14.40  6.50 -0.01
         kurtosis   se
ID          -1.31 0.46
Gender        NaN 0.00
HeightIN     1.92 0.17
Time*       -1.44 0.18
DV          -0.13 0.34
Age         -1.42 0.36
# We can use describeBy() to get descriptive statistics by groups:
DescriptivesLongID = describeBy(AllDataLong, group = AllDataLong$ID)
DescriptivesLongID

 Descriptive statistics by group 
group: 1
         vars n  mean   sd median trimmed  mad   min   max range skew kurtosis
ID          1 4  1.00 0.00   1.00    1.00 0.00  1.00  1.00     0  NaN      NaN
Gender      2 4  1.00 0.00   1.00    1.00 0.00  1.00  1.00     0  NaN      NaN
HeightIN    3 4 72.77 0.00  72.77   72.77 0.00 72.77 72.77     0  NaN      NaN
Time        4 4  2.50 1.29   2.50    2.50 1.48  1.00  4.00     3 0.00    -2.08
DV          5 4 21.38 1.25  21.25   21.38 1.11 20.00 23.00     3 0.21    -1.92
Age         6 4 11.00 2.58  11.00   11.00 2.97  8.00 14.00     6 0.00    -2.08
           se
ID       0.00
Gender   0.00
HeightIN 0.00
Time     0.65
DV       0.62
Age      1.29
------------------------------------------------------------ 
group: 2
         vars n  mean   sd median trimmed  mad   min   max range skew kurtosis
ID          1 4  2.00 0.00   2.00    2.00 0.00  2.00  2.00   0.0  NaN      NaN
Gender      2 4  1.00 0.00   1.00    1.00 0.00  1.00  1.00   0.0  NaN      NaN
HeightIN    3 4 69.45 0.00  69.45   69.45 0.00 69.45 69.45   0.0  NaN      NaN
Time        4 4  2.50 1.29   2.50    2.50 1.48  1.00  4.00   3.0 0.00    -2.08
DV          5 4 23.00 2.12  22.75   23.00 2.22 21.00 25.50   4.5 0.14    -2.25
Age         6 4 11.12 2.62  11.10   11.12 2.97  8.10 14.20   6.1 0.02    -2.07
           se
ID       0.00
Gender   0.00
HeightIN 0.00
Time     0.65
DV       1.06
Age      1.31
------------------------------------------------------------ 
group: 3
         vars n  mean   sd median trimmed  mad  min  max range  skew kurtosis
ID          1 4  3.00 0.00   3.00    3.00 0.00  3.0  3.0   0.0   NaN      NaN
Gender      2 4  1.00 0.00   1.00    1.00 0.00  1.0  1.0   0.0   NaN      NaN
HeightIN    3 4 69.70 0.00  69.70   69.70 0.00 69.7 69.7   0.0   NaN      NaN
Time        4 4  2.50 1.29   2.50    2.50 1.48  1.0  4.0   3.0  0.00    -2.08
DV          5 4 23.75 2.33  24.25   23.75 1.48 20.5 26.0   5.5 -0.45    -1.83
Age         6 4 11.30 2.40  11.30   11.30 2.74  8.5 14.1   5.6  0.00    -2.07
           se
ID       0.00
Gender   0.00
HeightIN 0.00
Time     0.65
DV       1.16
Age      1.20
------------------------------------------------------------ 
group: 4
         vars n  mean   sd median trimmed  mad   min   max range skew kurtosis
ID          1 4  4.00 0.00   4.00    4.00 0.00  4.00  4.00   0.0  NaN      NaN
Gender      2 4  1.00 0.00   1.00    1.00 0.00  1.00  1.00   0.0  NaN      NaN
HeightIN    3 4 69.37 0.00  69.37   69.37 0.00 69.37 69.37   0.0  NaN      NaN
Time        4 4  2.50 1.29   2.50    2.50 1.48  1.00  4.00   3.0 0.00    -2.08
DV          5 4 24.88 1.25  24.75   24.88 1.11 23.50 26.50   3.0 0.21    -1.92
Age         6 4 11.55 2.45  11.55   11.55 2.82  8.70 14.40   5.7 0.00    -2.08
           se
ID       0.00
Gender   0.00
HeightIN 0.00
Time     0.65
DV       0.62
Age      1.23
------------------------------------------------------------ 
group: 5
         vars n  mean   sd median trimmed  mad   min   max range  skew kurtosis
ID          1 4  5.00 0.00   5.00    5.00 0.00  5.00  5.00     0   NaN      NaN
Gender      2 4  1.00 0.00   1.00    1.00 0.00  1.00  1.00     0   NaN      NaN
HeightIN    3 4 70.55 0.00  70.55   70.55 0.00 70.55 70.55     0   NaN      NaN
Time        4 4  2.50 1.29   2.50    2.50 1.48  1.00  4.00     3  0.00    -2.08
DV          5 4 22.62 0.85  22.75   22.62 0.74 21.50 23.50     2 -0.28    -1.96
Age         6 4 10.97 2.60  11.05   10.97 2.89  7.90 13.90     6 -0.05    -2.09
           se
ID       0.00
Gender   0.00
HeightIN 0.00
Time     0.65
DV       0.43
Age      1.30
------------------------------------------------------------ 
group: 6
         vars n  mean   sd median trimmed  mad   min   max range  skew kurtosis
ID          1 4  6.00 0.00   6.00    6.00 0.00  6.00  6.00   0.0   NaN      NaN
Gender      2 4  1.00 0.00   1.00    1.00 0.00  1.00  1.00   0.0   NaN      NaN
HeightIN    3 4 69.76 0.00  69.76   69.76 0.00 69.76 69.76   0.0   NaN      NaN
Time        4 4  2.50 1.29   2.50    2.50 1.48  1.00  4.00   3.0  0.00    -2.08
DV          5 4 21.12 1.03  21.00   21.12 0.74 20.00 22.50   2.5  0.27    -1.85
Age         6 4 10.93 2.49  10.95   10.93 2.82  8.00 13.80   5.8 -0.02    -2.07
           se
ID       0.00
Gender   0.00
HeightIN 0.00
Time     0.65
DV       0.52
Age      1.25
------------------------------------------------------------ 
group: 7
         vars n  mean   sd median trimmed  mad   min   max range skew kurtosis
ID          1 4  7.00 0.00   7.00    7.00 0.00  7.00  7.00   0.0  NaN      NaN
Gender      2 4  1.00 0.00   1.00    1.00 0.00  1.00  1.00   0.0  NaN      NaN
HeightIN    3 4 70.55 0.00  70.55   70.55 0.00 70.55 70.55   0.0  NaN      NaN
Time        4 4  2.50 1.29   2.50    2.50 1.48  1.00  4.00   3.0 0.00    -2.08
DV          5 4 23.00 1.47  22.75   23.00 1.11 21.50 25.00   3.5 0.35    -1.87
Age         6 4 11.12 2.52  11.10   11.12 2.82  8.20 14.10   5.9 0.02    -2.05
           se
ID       0.00
Gender   0.00
HeightIN 0.00
Time     0.65
DV       0.74
Age      1.26
------------------------------------------------------------ 
group: 8
         vars n  mean   sd median trimmed  mad   min   max range  skew kurtosis
ID          1 4  8.00 0.00   8.00    8.00 0.00  8.00  8.00   0.0   NaN      NaN
Gender      2 4  1.00 0.00   1.00    1.00 0.00  1.00  1.00   0.0   NaN      NaN
HeightIN    3 4 69.03 0.00  69.03   69.03 0.00 69.03 69.03   0.0   NaN      NaN
Time        4 4  2.50 1.29   2.50    2.50 1.48  1.00  4.00   3.0  0.00    -2.08
DV          5 4 23.38 0.48  23.25   23.38 0.37 23.00 24.00   1.0  0.32    -2.08
Age         6 4 10.97 2.65  11.00   10.97 3.04  7.90 14.00   6.1 -0.02    -2.10
           se
ID       0.00
Gender   0.00
HeightIN 0.00
Time     0.65
DV       0.24
Age      1.32
------------------------------------------------------------ 
group: 9
         vars n  mean   sd median trimmed  mad   min   max range  skew kurtosis
ID          1 4  9.00 0.00   9.00    9.00 0.00  9.00  9.00   0.0   NaN      NaN
Gender      2 4  1.00 0.00   1.00    1.00 0.00  1.00  1.00   0.0   NaN      NaN
HeightIN    3 4 69.49 0.00  69.49   69.49 0.00 69.49 69.49   0.0   NaN      NaN
Time        4 4  2.50 1.29   2.50    2.50 1.48  1.00  4.00   3.0  0.00    -2.08
DV          5 4 21.12 0.85  21.25   21.12 0.74 20.00 22.00   2.0 -0.28    -1.96
Age         6 4 11.20 2.74  11.25   11.20 3.11  8.00 14.30   6.3 -0.03    -2.11
           se
ID       0.00
Gender   0.00
HeightIN 0.00
Time     0.65
DV       0.43
Age      1.37
------------------------------------------------------------ 
group: 10
         vars n  mean   sd median trimmed  mad   min   max range  skew kurtosis
ID          1 4 10.00 0.00  10.00   10.00 0.00 10.00 10.00   0.0   NaN      NaN
Gender      2 4  1.00 0.00   1.00    1.00 0.00  1.00  1.00   0.0   NaN      NaN
HeightIN    3 4 69.29 0.00  69.29   69.29 0.00 69.29 69.29   0.0   NaN      NaN
Time        4 4  2.50 1.29   2.50    2.50 1.48  1.00  4.00   3.0  0.00    -2.08
DV          5 4 18.50 1.35  19.00   18.50 0.37 16.50 19.50   3.0 -0.68    -1.73
Age         6 4 11.20 2.53  11.15   11.20 2.82  8.30 14.20   5.9  0.04    -2.07
           se
ID       0.00
Gender   0.00
HeightIN 0.00
Time     0.65
DV       0.68
Age      1.27
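
If the psych package is not available, a few of these group statistics can be approximated with base R. The sketch below uses `tapply()`, a different technique from `describeBy()`, on the DV values of the first two participants copied from the output above:

```r
# DV values for participants 1 and 2 at the four time points
long <- data.frame(
  ID = rep(c(1, 2), each = 4),
  DV = c(21.0, 20.0, 21.5, 23.0, 21.0, 21.5, 24.0, 25.5)
)

# tapply() applies a function to DV separately within each ID
group_means <- tapply(long$DV, long$ID, mean)
group_sds <- tapply(long$DV, long$ID, sd)

round(group_means, 2)
round(group_sds, 2)
```

The rounded results should agree with the mean and sd columns of the DV rows for groups 1 and 2.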

9 Transforming Data

# Transforming data is accomplished by the creation of new variables. 
#   Here we center Age at its overall mean:
AllDataLong$AgeC = AllDataLong$Age - mean(AllDataLong$Age)

# You can also use functions to create new variables. Here we create new
#   variables using signif(), which rounds to a given number of significant
#   digits:
AllDataLong$AgeYear = signif(x = AllDataLong$Age, digits = 2)
AllDataLong$AgeDecade = signif(x = AllDataLong$Age, digits = 1)
head(AllDataLong)
# A tibble: 6 × 9
     ID Gender HeightIN Time     DV   Age   AgeC AgeYear AgeDecade
  <dbl>  <dbl>    <dbl> <chr> <dbl> <dbl>  <dbl>   <dbl>     <dbl>
1     1      1     72.8 1      21     8   -3.14      8           8
2     1      1     72.8 2      20    10   -1.14     10          10
3     1      1     72.8 3      21.5  12    0.863    12          10
4     1      1     72.8 4      23    14    2.86     14          10
5     2      1     69.5 1      21     8.1 -3.04      8.1         8
6     2      1     69.5 2      21.5  10.1 -1.04     10          10
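
A quick way to check these transformations is to try them on a short vector. The sketch below uses the six Age values shown in the output above; note that the mean of these six values differs from the mean of the full data set, so the centered values will not match the AgeC column:

```r
# The six Age values from the first rows of AllDataLong
age <- c(8.0, 10.0, 12.0, 14.0, 8.1, 10.1)

# Centering subtracts the mean, so the centered values average to zero
age_centered <- age - mean(age)
mean(age_centered)

# signif() keeps the requested number of significant digits
signif(age, digits = 2)
signif(age, digits = 1)
```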