Agenda
 Introduce R
 Very basics of R (object assignment)
 R packages
 Running multiple regression models
 Visualizing multiple regression models
Who am I?
Daniel Anderson
 Research Associate: Behavioral Research and Teaching
 Dad (two daughters: 5 and 3)
 Quantitative educational researcher who loves R
 Primary areas of interest
 R and computational educational research
 Open data, open science, and reproducible workflows
 Growth modeling (primarily through multilevel models)
Part of why I love R
 Computer programming language = you can do ANYTHING!
 Develop websites: http://www.dandersondata.com
 Develop new algorithms/methods: https://github.com/DJAnderson07/esvis
 Create huge gains in efficiency: https://github.com/DJAnderson07/r2Winsteps
 Transparency and open and reproducible research
 The R community!
All of this great stuff and it just happens to also be free.
Free Books!
The book for the course I teach
The book for the next course I would like to teach (if there is one)
Other books
Freely available at http://socviz.co
What is R?
 A programming language
 Tremendously powerful and flexible statistical software that happens to be free
 No pointandclick interface
 Incredible array of external “packages” available for specialized analyses, data visualizations, or to automate much of the data “munging” process
Codebased interface
Moving to code/programming
Advantages
 Flexibility
 Only limited by your own creativity (and current level of programming skills, which are everevolving)
 Transparency
 Documented record of every step taken in your data preparation and analysis
 Efficiency
 Many (most?) tasks can be automated and/or applied to multiple datasets/variables simultaneously or essentially simultaneously
Disadvantages
 Steep learning curve
 Absolutely requires a significant time investment, both to learn initially and build fluency
 Equivalent to learning a new language
 You will lose patience with pointandclick interfaces
 Likely to become “one of the converted”
The R Learning Curve
How to learn R?
 Three most important ingredients: time, time, and more time
 A sprinkling of dedication and determination help.
 Be patient and forgiving with yourself. It will feel slow at first. Most people have not trained themselves to think in this way.
R as a big calculator
3 + 2
## [1] 5
(1/(3/2)^2) / 2^1/9
## [1] 0.09876543

Object Assignment
a < 3
b < 2
a + b
## [1] 5
a / (a + b)
## [1] 0.6
Object reassignment
a < 3
a
## [1] 3
a < 7
a
## [1] 7

Object Assignment (continued)
Objects can be of a variety of types.
string < "Hello world!"
logical < TRUE
double < 3.2587021
Integer < 6L
In this case, we can’t exactly do arithmetic with all of these. For example
string + double
## Error in string + double: nonnumeric argument to binary operator
But, these objects can be extremely useful in programming.
Functions and getting help
R functions
 Anything that carries out an operation in R is a function, even
+
.  Functions (outside of primitive functions) are preceded by
()
 e.g.,
sum()
,lm()
 e.g.,
Getting help
?
can be helpful, but often too advanced early on Helpful for understanding the formal arguments of a function
 Scroll down to the examples first
 Google is your best friend
 Other good websites
R packages
R ships with considerable functionality. It also comes with a set of preloaded packages
 e.g.
 “base”
 “graphics”
 “stats”
R also comes with a set of packages installed, but not loaded on launch
 e.g.
 “boot”
 “MASS”
 “Matrix”
Preloaded packages operate “out of the box”. For example, plot
is part of the graphics package, which ships with R.
plot(x = 1:10, y = 1:10)
On CRAN
 Any of these can be installed with
install.packages("pkg_name")
. You will then have access to all the functionality of the package.  Notice this plot only goes to mid2014. As of this writing (11/22/17), there are 11,892 packages available on CRAN! See https://cran.rproject.org/web/packages/
Other packages
On github
Installing from github
First, install the devtools package from CRAN
install.packages("devtools")
Next, load the devtools library to access the install_github
function. For example, to install my esvis package
library(devtools)
install_github("DJAnderson07/esvis")
You then have access to all the functionality of that package once you load it. Let’s look at these data:
sid  cohort  sped  ethnicity  frl  ell  season  reading  math 

332347  1  NonSped  Native Am.  NonFRL  NonELL  Winter  208  205 
400047  1  NonSped  Native Am.  FRL  NonELL  Spring  212  218 
402107  1  NonSped  White  NonFRL  NonELL  Winter  201  212 
402547  1  NonSped  White  NonFRL  NonELL  Fall  185  177 
403047  1  Sped  Hispanic  FRL  Active  Winter  179  192 
403307  1  Sped  Hispanic  NonFRL  NonELL  Winter  189  188 
PPPlot
library(esvis)
pp_plot(reading ~ ell, benchmarks)
Binned quantile effect sizes
binned_plot(math ~ ethnicity, benchmarks,
qtiles = seq(0, 1, .2),
theme = "dark")
ES Calculation
hedg_g(math ~ ethnicity, benchmarks, ref_group = "White")
## ref_group foc_group estimate
## 1 White Asian 0.1811177
## 2 White Hispanic 0.6226720
## 3 White Black 0.6547893
## 4 White Am. Indian 0.6685548
## 5 White Native Am. 0.8248879
auc(math ~ ethnicity, benchmarks)
## ref_group foc_group estimate
## 1 White Asian 0.5623478
## 2 White Hispanic 0.6689338
## 3 White Black 0.6805925
## 4 White Am. Indian 0.6888028
## 5 White Native Am. 0.7352343
## 6 Asian Hispanic 0.6030755
## 7 Asian Black 0.6155365
## 8 Asian Am. Indian 0.6231116
## 9 Asian Native Am. 0.6681564
## 10 Hispanic Black 0.5135554
## 11 Hispanic Am. Indian 0.5195881
## 12 Hispanic Native Am. 0.5683233
## 13 Black Am. Indian 0.5071228
## 14 Black Native Am. 0.5542968
## 15 Am. Indian Native Am. 0.5462639
## 16 Native Am. Am. Indian 0.4516412
## 17 Native Am. Black 0.4455142
## 18 Native Am. Hispanic 0.4311700
## 19 Native Am. Asian 0.3304590
## 20 Native Am. White 0.2644428
## 21 Am. Indian Black 0.4915051
## 22 Am. Indian Hispanic 0.4787148
## 23 Am. Indian Asian 0.3743474
## 24 Am. Indian White 0.3091228
## 25 Black Hispanic 0.4863905
## 26 Black Asian 0.3842752
## 27 Black White 0.3194840
## 28 Hispanic Asian 0.3963723
## 29 Hispanic White 0.3311781
## 30 Asian White 0.4368362
Is this exciting!?! YES!!!
Why is this such a big deal? * With just a basic knowledge of R you have access to literally thousands of packages + Expanding on a daily basis + Provides access to cutting edge and specialized functionality for analysis, data visualization, and data munging + Some of the most modern thinking on data analysis topics are often represented in these packages
Fitting multiple regression models
Want to follow along?
 Copy and paste the following code in your R console
install.packages(c("tidyverse", "rio", "devtools", "arm", "lm.beta", "visreg", "lme4"))
Step 0
 Before fitting model, you’ll generally need to import some data, let’s do so now
 Make sure your data file is stored in the same place that your script is
 The
setclass
argument above is actually not required, but makes it a bit easier to work with.
library(rio)
d < import("synthetic_data.csv", setclass = "tbl_df")
d
## # A tibble: 11,218 x 6
## SID grade clock cohort LD33 SS
## <int> <int> <int> <int> <chr> <int>
## 1 1243667 7 1 5 Never 238
## 2 12961647 6 0 7 Never 221
## 3 5477581 7 1 5 Never 224
## 4 4177568 8 2 5 Never 248
## 5 9368752 7 1 6 Never 239
## 6 7736290 7 1 7 Never 239
## 7 9486143 6 0 5 Never 220
## 8 6181953 7 1 5 Never 237
## 9 7966652 7 1 7 Never 234
## 10 7776640 7 1 6 Never 234
## # ... with 11,208 more rows
Research Questions
 What is the average growth from Grades 68 in math (
SS
)  Does the averge initial achievement or rate of growth depend upon
cohort
?  Does the averge initial achievement or rate of growth depend upon
LD33
, the students’ pattern of SLD classification? I don’t remember why the variable has the name it does
 I don’t remember why the variable has the name it does
NOTE: Multiple regression is NOT the best way to approach this. A multilevel model would be preferable. But, at the end, I’ll show you how simple it is to extend what we do here to the multilevel modeling approach.
Step 1: Look at your data!
Always best to visualize your data first. Let’s produce plots addressing each of our research questions.
1) What does the average growth look like? (plot on next slide)
library(tidyverse)
theme_set(theme_light()) # Not neccessary, but I like it
ggplot(d, aes(x = grade, y = SS)) +
geom_point() +
geom_smooth(method = "lm")
Does initial achievement or average growth depend upon cohort?
ggplot(d, aes(grade, SS)) +
geom_point() +
geom_smooth(method = "lm",
aes(color = factor(cohort)))
Does initial achievement or average growth depend upon LD status?
ggplot(d, aes(grade, SS)) +
geom_point() +
geom_smooth(method = "lm",
aes(color = factor(LD33)))
What about both?
ggplot(d, aes(grade, SS)) +
geom_point() +
geom_smooth(method = "lm",
aes(color = factor(LD33))) +
facet_wrap(~cohort)
Quick aside
geom_jitter
might be slightly better in this case
ggplot(d, aes(grade, SS)) +
geom_jitter(height = 0, width = 0.2, color = "gray80", alpha = 0.6) +
geom_smooth(method = "lm",
aes(color = factor(LD33))) +
facet_wrap(~cohort)
Step 2: Fit the model
 Use the
lm
function Part of base R
 Takes the following general form
mod < lm(outcome ~ predictor1 + predictor2 + predictorN,
data = d)
Fit simple linear regression model first
library(arm)
time_mod < lm(SS ~ clock, data = d)
display(time_mod, detail = TRUE)
## lm(formula = SS ~ clock, data = d)
## coef.est coef.se t value Pr(>t)
## (Intercept) 227.70 0.15 1485.71 0.00
## clock 5.31 0.12 44.85 0.00
## 
## n = 11218, k = 2
## residual sd = 10.24, RSquared = 0.15
Include a categorical predictor
 In R, categorical predictors need to be defined as factors
 Factors can have any underlying contrast matrix, but by default dummycoding is used, with the reference level being the first level
Change cohort to be a factor
d$cohort < as.factor(d$cohort)
Inspect the contrast matrix with
contrasts(d$cohort)
## 6 7
## 5 0 0
## 6 1 0
## 7 0 1
Change the reference level with
d$cohort < relevel(d$cohort, ref = "7")
contrasts(d$cohort)
## 5 6
## 7 0 0
## 5 1 0
## 6 0 1
Fit a second model
cohort_mod < lm(SS ~ clock + cohort, data = d)
display(cohort_mod, detail = TRUE)
## lm(formula = SS ~ clock + cohort, data = d)
## coef.est coef.se t value Pr(>t)
## (Intercept) 228.40 0.21 1104.77 0.00
## clock 5.31 0.12 44.87 0.00
## cohort5 1.50 0.24 6.33 0.00
## cohort6 0.57 0.24 2.43 0.02
## 
## n = 11218, k = 4
## residual sd = 10.22, RSquared = 0.16
Compare models
anova(time_mod, cohort_mod)
## Analysis of Variance Table
##
## Model 1: SS ~ clock
## Model 2: SS ~ clock + cohort
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 11216 1175695
## 2 11214 1171426 2 4269.3 20.435 1.385e09 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Add cohort as predictor of the slope
cohort_mod2 < lm(SS ~ clock + cohort + clock:cohort, data = d)
display(cohort_mod2, detail = TRUE)
## lm(formula = SS ~ clock + cohort + clock:cohort, data = d)
## coef.est coef.se t value Pr(>t)
## (Intercept) 228.06 0.27 847.84 0.00
## clock 5.64 0.21 27.34 0.00
## cohort5 0.79 0.38 2.09 0.04
## cohort6 0.27 0.38 0.72 0.47
## clock:cohort5 0.71 0.29 2.44 0.01
## clock:cohort6 0.30 0.29 1.02 0.31
## 
## n = 11218, k = 6
## residual sd = 10.22, RSquared = 0.16
Quick note on syntax
The following two lines of code are equivalent
cohort_mod2 < lm(SS ~ clock + cohort + clock:cohort, data = d)
cohort_mod2 < lm(SS ~ clock*cohort, data = d)
Need standardized coefficients?
# install.packages("lm.beta")
library(lm.beta)
lm.beta(cohort_mod2)
##
## Call:
## lm(formula = SS ~ clock + cohort + clock:cohort, data = d)
##
## Standardized Coefficients::
## (Intercept) clock cohort5 cohort6 clock:cohort5
## 0.00000000 0.41424809 0.03344220 0.01152243 0.04243530
## clock:cohort6
## 0.01779824
Let’s skip ahead and fit the full model
 We want clock, cohort, and ld status all entered in the model, as well as the interaction between clock and cohort, and clock and ld status
Try to write the code on your own
Model
full_mod < lm(SS ~ clock + cohort + LD33 +
clock:cohort + clock:LD33,
data = d)
display(full_mod, detail = TRUE)
## lm(formula = SS ~ clock + cohort + LD33 + clock:cohort + clock:LD33,
## data = d)
## coef.est coef.se t value Pr(>t)
## (Intercept) 218.47 0.69 318.29 0.00
## clock 5.47 0.54 10.20 0.00
## cohort5 0.79 0.36 2.19 0.03
## cohort6 0.27 0.36 0.75 0.45
## LD33Never 10.44 0.67 15.64 0.00
## LD33Sometimes 1.48 1.14 1.30 0.19
## clock:cohort5 0.64 0.28 2.30 0.02
## clock:cohort6 0.28 0.28 0.99 0.32
## clock:LD33Never 0.07 0.52 0.14 0.89
## clock:LD33Sometimes 0.78 0.94 0.84 0.40
## 
## n = 11218, k = 10
## residual sd = 9.83, RSquared = 0.22
Compare our last model to prior models
anova(cohort_mod2, full_mod)
## Analysis of Variance Table
##
## Model 1: SS ~ clock + cohort + clock:cohort
## Model 2: SS ~ clock + cohort + LD33 + clock:cohort + clock:LD33
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 11212 1170801
## 2 11208 1084086 4 86714 224.13 < 2.2e16 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Visualizing the fitted models
(I’m guessing I’m almost out of time, but quickly…) The visreg package is amazing, and I highly recommend it
library(visreg)
visreg(full_mod, "clock", by = "LD33")
visreg(full_mod, "clock", by = "cohort")
visreg(full_mod, "LD33", by = "clock", gg = TRUE)
Kinda hard but maybe helpful
visreg2d(full_mod, "clock", "LD33", plot.type = "persp")
See the following links for more info on the visreg package
Last note  fitting the right model!
 The model we’ve fit should be multilevel. So let’s do it!
#install.package("lme4")
library(lme4)
mlm < lmer(SS ~ clock + cohort + LD33 +
clock:cohort + clock:LD33 +
(1 + clockSID),
data = d)
display(mlm, detail = TRUE)
## lmer(formula = SS ~ clock + cohort + LD33 + clock:cohort + clock:LD33 +
## (1 + clock  SID), data = d)
## coef.est coef.se t value
## (Intercept) 218.45 0.68 319.73
## clock 5.44 0.54 10.12
## cohort5 0.76 0.37 2.08
## cohort6 0.27 0.36 0.75
## LD33Never 10.44 0.66 15.74
## LD33Sometimes 1.48 1.13 1.31
## clock:cohort5 0.61 0.28 2.17
## clock:cohort6 0.26 0.28 0.92
## clock:LD33Never 0.10 0.52 0.20
## clock:LD33Sometimes 0.76 0.94 0.81
##
## Error terms:
## Groups Name Std.Dev. Corr
## SID (Intercept) 2.08
## clock 1.45 0.35
## Residual 9.54
## 
## number of obs: 11218, groups: SID, 3580
## AIC = 83111.4, DIC = 83070.7
## deviance = 83077.0
How much variability?
library(lattice)
qqmath(ranef(mlm, condVar = TRUE))
## $SID
Parting thoughts
Today’s lecture is mostly about exposure. R takes a lot of time and effort to learn, but it is really worth it.
Questions?