Agenda

  • Introduce R
  • Very basics of R (object assignment)
  • R packages
  • Running multiple regression models
  • Visualizing multiple regression models

Who am I?

Daniel Anderson

  • Research Associate: Behavioral Research and Teaching
  • Dad (two daughters: 5 and 3)
  • Quantitative educational researcher who loves R
  • Primary areas of interest
    • R and computational educational research
    • Open data, open science, and reproducible workflows
    • Growth modeling (primarily through multilevel models)

Part of why I love R

All of this great stuff and it just happens to also be free.

Free Books!

The book for the course I teach

R for Data Science

The book for the next course I would like to teach (if there is one)

Advanced R


Other books

Freely available at http://socviz.co


What is R?

  • A programming language
  • Tremendously powerful and flexible statistical software that happens to be free
  • No point-and-click interface
  • Incredible array of external “packages” available for specialized analyses, data visualizations, or to automate much of the data “munging” process

Code-based interface

Moving to code/programming

Advantages

  • Flexibility
    • Only limited by your own creativity (and current level of programming skills, which are ever-evolving)
  • Transparency
    • Documented record of every step taken in your data preparation and analysis
  • Efficiency
    • Many (most?) tasks can be automated and/or applied to multiple datasets/variables simultaneously or essentially simultaneously

Disadvantages

  • Steep learning curve
    • Absolutely requires a significant time investment, both to learn initially and build fluency
    • Equivalent to learning a new language
  • You will lose patience with point-and-click interfaces
  • Likely to become “one of the converted”

The R Learning Curve

plot of chunk unnamed-chunk-1


How to learn R?

  • Three most important ingredients: time, time, and more time
  • A sprinkling of dedication and determination help.
  • Be patient and forgiving with yourself. It will feel slow at first. Most people have not trained themselves to think in this way.

impost-r


R as a big calculator

3 + 2
## [1] 5
(1/-(3/2)^2) / 2^-1/9
## [1] -0.09876543

Object Assignment

a <- 3
b <- 2
a + b
## [1] 5
a / (a + b)
## [1] 0.6

Object re-assignment

a <- 3
a
## [1] 3
a <- 7
a
## [1] 7

Object Assignment (continued)

Objects can be of a variety of types.

string <- "Hello world!"
logical <- TRUE
double <- 3.2587021
Integer <- 6L

In this case, we can’t exactly do arithmetic with all of these. For example

string + double
## Error in string + double: non-numeric argument to binary operator

But, these objects can be extremely useful in programming.

Functions and getting help

R functions

  • Anything that carries out an operation in R is a function, even +.
  • Functions (outside of primitive functions) are preceded by ()
    • e.g., sum(), lm()

Getting help

R packages

R ships with considerable functionality. It also comes with a set of pre-loaded packages

  • e.g.
    • “base”
    • “graphics”
    • “stats”

R also comes with a set of packages installed, but not loaded on launch

  • e.g.
    • “boot”
    • “MASS”
    • “Matrix”

Pre-loaded packages operate “out of the box”. For example, plot is part of the graphics package, which ships with R.

plot(x = 1:10, y = 1:10)

plot of chunk first_plot


On CRAN

  • Any of these can be installed with install.packages("pkg_name"). You will then have access to all the functionality of the package.
  • Notice this plot only goes to mid-2014. As of this writing (11/22/17), there are 11,892 packages available on CRAN! See https://cran.r-project.org/web/packages/

Other packages

On github


Installing from github

First, install the devtools package from CRAN

install.packages("devtools")

Next, load the devtools library to access the install_github function. For example, to install my esvis package

library(devtools)
install_github("DJAnderson07/esvis")

You then have access to all the functionality of that package once you load it. Let’s look at these data:

sid cohort sped ethnicity frl ell season reading math
332347 1 Non-Sped Native Am. Non-FRL Non-ELL Winter 208 205
400047 1 Non-Sped Native Am. FRL Non-ELL Spring 212 218
402107 1 Non-Sped White Non-FRL Non-ELL Winter 201 212
402547 1 Non-Sped White Non-FRL Non-ELL Fall 185 177
403047 1 Sped Hispanic FRL Active Winter 179 192
403307 1 Sped Hispanic Non-FRL Non-ELL Winter 189 188

PP-Plot

library(esvis)
pp_plot(reading ~ ell, benchmarks)

plot of chunk load_esvis


Binned quantile effect sizes

binned_plot(math ~ ethnicity, benchmarks, 
            qtiles = seq(0, 1, .2), 
            theme = "dark")

plot of chunk esvis_ptile


ES Calculation

hedg_g(math ~ ethnicity, benchmarks, ref_group = "White")
##   ref_group  foc_group   estimate
## 1     White      Asian -0.1811177
## 2     White   Hispanic -0.6226720
## 3     White      Black -0.6547893
## 4     White Am. Indian -0.6685548
## 5     White Native Am. -0.8248879
auc(math ~ ethnicity, benchmarks)
##     ref_group  foc_group  estimate
## 1       White      Asian 0.5623478
## 2       White   Hispanic 0.6689338
## 3       White      Black 0.6805925
## 4       White Am. Indian 0.6888028
## 5       White Native Am. 0.7352343
## 6       Asian   Hispanic 0.6030755
## 7       Asian      Black 0.6155365
## 8       Asian Am. Indian 0.6231116
## 9       Asian Native Am. 0.6681564
## 10   Hispanic      Black 0.5135554
## 11   Hispanic Am. Indian 0.5195881
## 12   Hispanic Native Am. 0.5683233
## 13      Black Am. Indian 0.5071228
## 14      Black Native Am. 0.5542968
## 15 Am. Indian Native Am. 0.5462639
## 16 Native Am. Am. Indian 0.4516412
## 17 Native Am.      Black 0.4455142
## 18 Native Am.   Hispanic 0.4311700
## 19 Native Am.      Asian 0.3304590
## 20 Native Am.      White 0.2644428
## 21 Am. Indian      Black 0.4915051
## 22 Am. Indian   Hispanic 0.4787148
## 23 Am. Indian      Asian 0.3743474
## 24 Am. Indian      White 0.3091228
## 25      Black   Hispanic 0.4863905
## 26      Black      Asian 0.3842752
## 27      Black      White 0.3194840
## 28   Hispanic      Asian 0.3963723
## 29   Hispanic      White 0.3311781
## 30      Asian      White 0.4368362

Is this exciting!?! YES!!!

Why is this such a big deal?

  • With just a basic knowledge of R you have access to literally thousands of packages
    • Expanding on a daily basis
    • Provides access to cutting edge and specialized functionality for analysis, data visualization, and data munging
    • Some of the most modern thinking on data analysis topics are often represented in these packages

Fitting multiple regression models


Want to follow along?

  • Copy and paste the following code in your R console
install.packages(c("tidyverse", "rio", "devtools", "arm", "lm.beta", "visreg", "lme4"))

Step 0

  • Before fitting model, you’ll generally need to import some data, let’s do so now
  • Make sure your data file is stored in the same place that your script is
  • The setclass argument above is actually not required, but makes it a bit easier to work with.
library(rio)
d <- import("synthetic_data.csv", setclass = "tbl_df")
d
## # A tibble: 11,218 x 6
##         SID grade clock cohort  LD33    SS
##       <int> <int> <int>  <int> <chr> <int>
##  1  1243667     7     1      5 Never   238
##  2 12961647     6     0      7 Never   221
##  3  5477581     7     1      5 Never   224
##  4  4177568     8     2      5 Never   248
##  5  9368752     7     1      6 Never   239
##  6  7736290     7     1      7 Never   239
##  7  9486143     6     0      5 Never   220
##  8  6181953     7     1      5 Never   237
##  9  7966652     7     1      7 Never   234
## 10  7776640     7     1      6 Never   234
## # ... with 11,208 more rows

Research Questions

  1. What is the average growth from Grades 6-8 in math (SS)
  2. Does the averge initial achievement or rate of growth depend upon cohort?
  3. Does the averge initial achievement or rate of growth depend upon LD33, the students’ pattern of SLD classification?
    • I don’t remember why the variable has the name it does

NOTE: Multiple regression is NOT the best way to approach this. A multilevel model would be preferable. But, at the end, I’ll show you how simple it is to extend what we do here to the multilevel modeling approach.


Step 1: Look at your data!

Always best to visualize your data first. Let’s produce plots addressing each of our research questions.

  1. What does the average growth look like? (plot on next slide)
library(tidyverse)
theme_set(theme_light()) # Not neccessary, but I like it

ggplot(d, aes(x = grade, y = SS)) +
  geom_point() +
  geom_smooth(method = "lm") 

plot of chunk visualize1_eval


Does initial achievement or average growth depend upon cohort?

ggplot(d, aes(grade, SS)) +
  geom_point() +
  geom_smooth(method = "lm",
              aes(color = factor(cohort)))

plot of chunk cohort_growth1


Does initial achievement or average growth depend upon LD status?

ggplot(d, aes(grade, SS)) +
  geom_point() +
  geom_smooth(method = "lm",
              aes(color = factor(LD33)))

plot of chunk cohort_growth2


What about both?

ggplot(d, aes(grade, SS)) +
  geom_point() +
  geom_smooth(method = "lm",
              aes(color = factor(LD33))) +
  facet_wrap(~cohort)

plot of chunk by_cohort_ld33


Quick aside

  • geom_jitter might be slightly better in this case
ggplot(d, aes(grade, SS)) +
  geom_jitter(height = 0, width = 0.2, color = "gray80", alpha = 0.6) +
  geom_smooth(method = "lm",
              aes(color = factor(LD33))) +
  facet_wrap(~cohort)

plot of chunk jittered


Step 2: Fit the model

  • Use the lm function
    • Part of base R
    • Takes the following general form
mod <- lm(outcome ~ predictor1 + predictor2 + predictorN,
          data = d)

Fit simple linear regression model first

library(arm)
time_mod <- lm(SS ~ clock, data = d)
display(time_mod, detail = TRUE)
## lm(formula = SS ~ clock, data = d)
##             coef.est coef.se t value Pr(>|t|)
## (Intercept)  227.70     0.15 1485.71    0.00 
## clock          5.31     0.12   44.85    0.00 
## ---
## n = 11218, k = 2
## residual sd = 10.24, R-Squared = 0.15

Include a categorical predictor

  • In R, categorical predictors need to be defined as factors
  • Factors can have any underlying contrast matrix, but by default dummy-coding is used, with the reference level being the first level

Change cohort to be a factor

d$cohort <- as.factor(d$cohort)

Inspect the contrast matrix with

contrasts(d$cohort)
##   6 7
## 5 0 0
## 6 1 0
## 7 0 1

Change the reference level with

d$cohort <- relevel(d$cohort, ref = "7")
contrasts(d$cohort)
##   5 6
## 7 0 0
## 5 1 0
## 6 0 1

Fit a second model

cohort_mod <- lm(SS ~ clock + cohort, data = d)
display(cohort_mod, detail = TRUE)
## lm(formula = SS ~ clock + cohort, data = d)
##             coef.est coef.se t value Pr(>|t|)
## (Intercept)  228.40     0.21 1104.77    0.00 
## clock          5.31     0.12   44.87    0.00 
## cohort5       -1.50     0.24   -6.33    0.00 
## cohort6       -0.57     0.24   -2.43    0.02 
## ---
## n = 11218, k = 4
## residual sd = 10.22, R-Squared = 0.16

Compare models

anova(time_mod, cohort_mod)
## Analysis of Variance Table
## 
## Model 1: SS ~ clock
## Model 2: SS ~ clock + cohort
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1  11216 1175695                                  
## 2  11214 1171426  2    4269.3 20.435 1.385e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Add cohort as predictor of the slope

cohort_mod2 <- lm(SS ~ clock + cohort + clock:cohort, data = d)
display(cohort_mod2, detail = TRUE)
## lm(formula = SS ~ clock + cohort + clock:cohort, data = d)
##               coef.est coef.se t value Pr(>|t|)
## (Intercept)   228.06     0.27  847.84    0.00  
## clock           5.64     0.21   27.34    0.00  
## cohort5        -0.79     0.38   -2.09    0.04  
## cohort6        -0.27     0.38   -0.72    0.47  
## clock:cohort5  -0.71     0.29   -2.44    0.01  
## clock:cohort6  -0.30     0.29   -1.02    0.31  
## ---
## n = 11218, k = 6
## residual sd = 10.22, R-Squared = 0.16

Quick note on syntax

The following two lines of code are equivalent

cohort_mod2 <- lm(SS ~ clock + cohort + clock:cohort, data = d)
cohort_mod2 <- lm(SS ~ clock*cohort, data = d)

Need standardized coefficients?

# install.packages("lm.beta")
library(lm.beta)
lm.beta(cohort_mod2)
## 
## Call:
## lm(formula = SS ~ clock + cohort + clock:cohort, data = d)
## 
## Standardized Coefficients::
##   (Intercept)         clock       cohort5       cohort6 clock:cohort5 
##    0.00000000    0.41424809   -0.03344220   -0.01152243   -0.04243530 
## clock:cohort6 
##   -0.01779824

Let’s skip ahead and fit the full model

  • We want clock, cohort, and ld status all entered in the model, as well as the interaction between clock and cohort, and clock and ld status

Try to write the code on your own


Model

full_mod <- lm(SS ~ clock + cohort + LD33 +
                   clock:cohort + clock:LD33, 
                 data = d)
display(full_mod, detail = TRUE)
## lm(formula = SS ~ clock + cohort + LD33 + clock:cohort + clock:LD33, 
##     data = d)
##                     coef.est coef.se t value Pr(>|t|)
## (Intercept)         218.47     0.69  318.29    0.00  
## clock                 5.47     0.54   10.20    0.00  
## cohort5              -0.79     0.36   -2.19    0.03  
## cohort6              -0.27     0.36   -0.75    0.45  
## LD33Never            10.44     0.67   15.64    0.00  
## LD33Sometimes        -1.48     1.14   -1.30    0.19  
## clock:cohort5        -0.64     0.28   -2.30    0.02  
## clock:cohort6        -0.28     0.28   -0.99    0.32  
## clock:LD33Never       0.07     0.52    0.14    0.89  
## clock:LD33Sometimes   0.78     0.94    0.84    0.40  
## ---
## n = 11218, k = 10
## residual sd = 9.83, R-Squared = 0.22

Compare our last model to prior models

anova(cohort_mod2, full_mod)
## Analysis of Variance Table
## 
## Model 1: SS ~ clock + cohort + clock:cohort
## Model 2: SS ~ clock + cohort + LD33 + clock:cohort + clock:LD33
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1  11212 1170801                                  
## 2  11208 1084086  4     86714 224.13 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Visualizing the fitted models

(I’m guessing I’m almost out of time, but quickly…) The visreg package is amazing, and I highly recommend it

library(visreg)
visreg(full_mod, "clock", by = "LD33")

plot of chunk visreg1


visreg(full_mod, "clock", by = "cohort")

plot of chunk visreg2


visreg(full_mod, "LD33", by = "clock", gg = TRUE)

plot of chunk visreg3


Kinda hard but maybe helpful

visreg2d(full_mod, "clock", "LD33", plot.type = "persp")

plot of chunk visreg4


See the following links for more info on the visreg package


Last note - fitting the right model!

  • The model we’ve fit should be multilevel. So let’s do it!
#install.package("lme4")
library(lme4)
mlm <- lmer(SS ~ clock + cohort + LD33 +
                   clock:cohort + clock:LD33 +
              (1 + clock|SID), 
                 data = d)

display(mlm, detail = TRUE)
## lmer(formula = SS ~ clock + cohort + LD33 + clock:cohort + clock:LD33 + 
##     (1 + clock | SID), data = d)
##                     coef.est coef.se t value
## (Intercept)         218.45     0.68  319.73 
## clock                 5.44     0.54   10.12 
## cohort5              -0.76     0.37   -2.08 
## cohort6              -0.27     0.36   -0.75 
## LD33Never            10.44     0.66   15.74 
## LD33Sometimes        -1.48     1.13   -1.31 
## clock:cohort5        -0.61     0.28   -2.17 
## clock:cohort6        -0.26     0.28   -0.92 
## clock:LD33Never       0.10     0.52    0.20 
## clock:LD33Sometimes   0.76     0.94    0.81 
## 
## Error terms:
##  Groups   Name        Std.Dev. Corr  
##  SID      (Intercept) 2.08           
##           clock       1.45     -0.35 
##  Residual             9.54           
## ---
## number of obs: 11218, groups: SID, 3580
## AIC = 83111.4, DIC = 83070.7
## deviance = 83077.0

How much variability?

library(lattice)
qqmath(ranef(mlm, condVar = TRUE))
## $SID

plot of chunk ranef


Parting thoughts

Today’s lecture is mostly about exposure. R takes a lot of time and effort to learn, but it is really worth it.

Questions?


Thanks!

thanks