Agenda
- Introduce R
- Very basics of R (object assignment)
- R packages
- Running multiple regression models
- Visualizing multiple regression models
Who am I?
Daniel Anderson
- Research Associate: Behavioral Research and Teaching
- Dad (two daughters: 5 and 3)
- Quantitative educational researcher who loves R
- Primary areas of interest
- R and computational educational research
- Open data, open science, and reproducible workflows
- Growth modeling (primarily through multilevel models)
Part of why I love R
- Computer programming language = you can do ANYTHING!
- Develop websites: http://www.dandersondata.com
- Develop new algorithms/methods: https://github.com/DJAnderson07/esvis
- Create huge gains in efficiency: https://github.com/DJAnderson07/r2Winsteps
- Transparency and open and reproducible research
- The R community!
All of this great stuff and it just happens to also be free.
Free Books!
The book for the course I teach
The book for the next course I would like to teach (if there is one)
Other books
Freely available at http://socviz.co
What is R?
- A programming language
- Tremendously powerful and flexible statistical software that happens to be free
- No point-and-click interface
- Incredible array of external “packages” available for specialized analyses, data visualizations, or to automate much of the data “munging” process
Code-based interface
Moving to code/programming
Advantages
- Flexibility
- Only limited by your own creativity (and current level of programming skills, which are ever-evolving)
- Transparency
- Documented record of every step taken in your data preparation and analysis
- Efficiency
- Many (most?) tasks can be automated and/or applied to multiple datasets/variables simultaneously or essentially simultaneously
Disadvantages
- Steep learning curve
- Absolutely requires a significant time investment, both to learn initially and build fluency
- Equivalent to learning a new language
- You will lose patience with point-and-click interfaces
- Likely to become “one of the converted”
The R Learning Curve
How to learn R?
- Three most important ingredients: time, time, and more time
- A sprinkling of dedication and determination help.
- Be patient and forgiving with yourself. It will feel slow at first. Most people have not trained themselves to think in this way.
R as a big calculator
3 + 2
## [1] 5
(1/-(3/2)^2) / 2^-1/9
## [1] -0.09876543
Object Assignment
a <- 3
b <- 2
a + b
## [1] 5
a / (a + b)
## [1] 0.6
Object re-assignment
a <- 3
a
## [1] 3
a <- 7
a
## [1] 7
Object Assignment (continued)
Objects can be of a variety of types.
string <- "Hello world!"
logical <- TRUE
double <- 3.2587021
Integer <- 6L
In this case, we can’t exactly do arithmetic with all of these. For example
string + double
## Error in string + double: non-numeric argument to binary operator
But, these objects can be extremely useful in programming.
Functions and getting help
R functions
- Anything that carries out an operation in R is a function, even
+
. - Functions (outside of primitive functions) are preceded by
()
- e.g.,
sum()
,lm()
- e.g.,
Getting help
?
can be helpful, but often too advanced early on- Helpful for understanding the formal arguments of a function
- Scroll down to the examples first
- Google is your best friend
- Other good websites
- http://stackoverflow.com
- Mailing lists: - https://stat.ethz.ch/mailman/listinfo/r-help
R packages
R ships with considerable functionality. It also comes with a set of pre-loaded packages
- e.g.
- “base”
- “graphics”
- “stats”
R also comes with a set of packages installed, but not loaded on launch
- e.g.
- “boot”
- “MASS”
- “Matrix”
Pre-loaded packages operate “out of the box”. For example, plot
is part of the graphics package, which ships with R.
plot(x = 1:10, y = 1:10)
On CRAN
- Any of these can be installed with
install.packages("pkg_name")
. You will then have access to all the functionality of the package. - Notice this plot only goes to mid-2014. As of this writing (11/22/17), there are 11,892 packages available on CRAN! See https://cran.r-project.org/web/packages/
Other packages
On github
Installing from github
First, install the devtools package from CRAN
install.packages("devtools")
Next, load the devtools library to access the install_github
function. For example, to install my esvis package
library(devtools)
install_github("DJAnderson07/esvis")
You then have access to all the functionality of that package once you load it. Let’s look at these data:
sid | cohort | sped | ethnicity | frl | ell | season | reading | math |
---|---|---|---|---|---|---|---|---|
332347 | 1 | Non-Sped | Native Am. | Non-FRL | Non-ELL | Winter | 208 | 205 |
400047 | 1 | Non-Sped | Native Am. | FRL | Non-ELL | Spring | 212 | 218 |
402107 | 1 | Non-Sped | White | Non-FRL | Non-ELL | Winter | 201 | 212 |
402547 | 1 | Non-Sped | White | Non-FRL | Non-ELL | Fall | 185 | 177 |
403047 | 1 | Sped | Hispanic | FRL | Active | Winter | 179 | 192 |
403307 | 1 | Sped | Hispanic | Non-FRL | Non-ELL | Winter | 189 | 188 |
PP-Plot
library(esvis)
pp_plot(reading ~ ell, benchmarks)
Binned quantile effect sizes
binned_plot(math ~ ethnicity, benchmarks,
qtiles = seq(0, 1, .2),
theme = "dark")
ES Calculation
hedg_g(math ~ ethnicity, benchmarks, ref_group = "White")
## ref_group foc_group estimate
## 1 White Asian -0.1811177
## 2 White Hispanic -0.6226720
## 3 White Black -0.6547893
## 4 White Am. Indian -0.6685548
## 5 White Native Am. -0.8248879
auc(math ~ ethnicity, benchmarks)
## ref_group foc_group estimate
## 1 White Asian 0.5623478
## 2 White Hispanic 0.6689338
## 3 White Black 0.6805925
## 4 White Am. Indian 0.6888028
## 5 White Native Am. 0.7352343
## 6 Asian Hispanic 0.6030755
## 7 Asian Black 0.6155365
## 8 Asian Am. Indian 0.6231116
## 9 Asian Native Am. 0.6681564
## 10 Hispanic Black 0.5135554
## 11 Hispanic Am. Indian 0.5195881
## 12 Hispanic Native Am. 0.5683233
## 13 Black Am. Indian 0.5071228
## 14 Black Native Am. 0.5542968
## 15 Am. Indian Native Am. 0.5462639
## 16 Native Am. Am. Indian 0.4516412
## 17 Native Am. Black 0.4455142
## 18 Native Am. Hispanic 0.4311700
## 19 Native Am. Asian 0.3304590
## 20 Native Am. White 0.2644428
## 21 Am. Indian Black 0.4915051
## 22 Am. Indian Hispanic 0.4787148
## 23 Am. Indian Asian 0.3743474
## 24 Am. Indian White 0.3091228
## 25 Black Hispanic 0.4863905
## 26 Black Asian 0.3842752
## 27 Black White 0.3194840
## 28 Hispanic Asian 0.3963723
## 29 Hispanic White 0.3311781
## 30 Asian White 0.4368362
Is this exciting!?! YES!!!
Why is this such a big deal?
- With just a basic knowledge of R you have access to literally thousands of packages
- Expanding on a daily basis
- Provides access to cutting edge and specialized functionality for analysis, data visualization, and data munging
- Some of the most modern thinking on data analysis topics are often represented in these packages
Fitting multiple regression models
Want to follow along?
- Copy and paste the following code in your R console
install.packages(c("tidyverse", "rio", "devtools", "arm", "lm.beta", "visreg", "lme4"))
Step 0
- Before fitting model, you’ll generally need to import some data, let’s do so now
- Make sure your data file is stored in the same place that your script is
- The
setclass
argument above is actually not required, but makes it a bit easier to work with.
library(rio)
d <- import("synthetic_data.csv", setclass = "tbl_df")
d
## # A tibble: 11,218 x 6
## SID grade clock cohort LD33 SS
## <int> <int> <int> <int> <chr> <int>
## 1 1243667 7 1 5 Never 238
## 2 12961647 6 0 7 Never 221
## 3 5477581 7 1 5 Never 224
## 4 4177568 8 2 5 Never 248
## 5 9368752 7 1 6 Never 239
## 6 7736290 7 1 7 Never 239
## 7 9486143 6 0 5 Never 220
## 8 6181953 7 1 5 Never 237
## 9 7966652 7 1 7 Never 234
## 10 7776640 7 1 6 Never 234
## # ... with 11,208 more rows
Research Questions
- What is the average growth from Grades 6-8 in math (
SS
) - Does the averge initial achievement or rate of growth depend upon
cohort
? - Does the averge initial achievement or rate of growth depend upon
LD33
, the students’ pattern of SLD classification?- I don’t remember why the variable has the name it does
NOTE: Multiple regression is NOT the best way to approach this. A multilevel model would be preferable. But, at the end, I’ll show you how simple it is to extend what we do here to the multilevel modeling approach.
Step 1: Look at your data!
Always best to visualize your data first. Let’s produce plots addressing each of our research questions.
- What does the average growth look like? (plot on next slide)
library(tidyverse)
theme_set(theme_light()) # Not neccessary, but I like it
ggplot(d, aes(x = grade, y = SS)) +
geom_point() +
geom_smooth(method = "lm")
Does initial achievement or average growth depend upon cohort?
ggplot(d, aes(grade, SS)) +
geom_point() +
geom_smooth(method = "lm",
aes(color = factor(cohort)))
Does initial achievement or average growth depend upon LD status?
ggplot(d, aes(grade, SS)) +
geom_point() +
geom_smooth(method = "lm",
aes(color = factor(LD33)))
What about both?
ggplot(d, aes(grade, SS)) +
geom_point() +
geom_smooth(method = "lm",
aes(color = factor(LD33))) +
facet_wrap(~cohort)
Quick aside
geom_jitter
might be slightly better in this case
ggplot(d, aes(grade, SS)) +
geom_jitter(height = 0, width = 0.2, color = "gray80", alpha = 0.6) +
geom_smooth(method = "lm",
aes(color = factor(LD33))) +
facet_wrap(~cohort)
Step 2: Fit the model
- Use the
lm
function- Part of base R
- Takes the following general form
mod <- lm(outcome ~ predictor1 + predictor2 + predictorN,
data = d)
Fit simple linear regression model first
library(arm)
time_mod <- lm(SS ~ clock, data = d)
display(time_mod, detail = TRUE)
## lm(formula = SS ~ clock, data = d)
## coef.est coef.se t value Pr(>|t|)
## (Intercept) 227.70 0.15 1485.71 0.00
## clock 5.31 0.12 44.85 0.00
## ---
## n = 11218, k = 2
## residual sd = 10.24, R-Squared = 0.15
Include a categorical predictor
- In R, categorical predictors need to be defined as factors
- Factors can have any underlying contrast matrix, but by default dummy-coding is used, with the reference level being the first level
Change cohort to be a factor
d$cohort <- as.factor(d$cohort)
Inspect the contrast matrix with
contrasts(d$cohort)
## 6 7
## 5 0 0
## 6 1 0
## 7 0 1
Change the reference level with
d$cohort <- relevel(d$cohort, ref = "7")
contrasts(d$cohort)
## 5 6
## 7 0 0
## 5 1 0
## 6 0 1
Fit a second model
cohort_mod <- lm(SS ~ clock + cohort, data = d)
display(cohort_mod, detail = TRUE)
## lm(formula = SS ~ clock + cohort, data = d)
## coef.est coef.se t value Pr(>|t|)
## (Intercept) 228.40 0.21 1104.77 0.00
## clock 5.31 0.12 44.87 0.00
## cohort5 -1.50 0.24 -6.33 0.00
## cohort6 -0.57 0.24 -2.43 0.02
## ---
## n = 11218, k = 4
## residual sd = 10.22, R-Squared = 0.16
Compare models
anova(time_mod, cohort_mod)
## Analysis of Variance Table
##
## Model 1: SS ~ clock
## Model 2: SS ~ clock + cohort
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 11216 1175695
## 2 11214 1171426 2 4269.3 20.435 1.385e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Add cohort as predictor of the slope
cohort_mod2 <- lm(SS ~ clock + cohort + clock:cohort, data = d)
display(cohort_mod2, detail = TRUE)
## lm(formula = SS ~ clock + cohort + clock:cohort, data = d)
## coef.est coef.se t value Pr(>|t|)
## (Intercept) 228.06 0.27 847.84 0.00
## clock 5.64 0.21 27.34 0.00
## cohort5 -0.79 0.38 -2.09 0.04
## cohort6 -0.27 0.38 -0.72 0.47
## clock:cohort5 -0.71 0.29 -2.44 0.01
## clock:cohort6 -0.30 0.29 -1.02 0.31
## ---
## n = 11218, k = 6
## residual sd = 10.22, R-Squared = 0.16
Quick note on syntax
The following two lines of code are equivalent
cohort_mod2 <- lm(SS ~ clock + cohort + clock:cohort, data = d)
cohort_mod2 <- lm(SS ~ clock*cohort, data = d)
Need standardized coefficients?
# install.packages("lm.beta")
library(lm.beta)
lm.beta(cohort_mod2)
##
## Call:
## lm(formula = SS ~ clock + cohort + clock:cohort, data = d)
##
## Standardized Coefficients::
## (Intercept) clock cohort5 cohort6 clock:cohort5
## 0.00000000 0.41424809 -0.03344220 -0.01152243 -0.04243530
## clock:cohort6
## -0.01779824
Let’s skip ahead and fit the full model
- We want clock, cohort, and ld status all entered in the model, as well as the interaction between clock and cohort, and clock and ld status
Try to write the code on your own
Model
full_mod <- lm(SS ~ clock + cohort + LD33 +
clock:cohort + clock:LD33,
data = d)
display(full_mod, detail = TRUE)
## lm(formula = SS ~ clock + cohort + LD33 + clock:cohort + clock:LD33,
## data = d)
## coef.est coef.se t value Pr(>|t|)
## (Intercept) 218.47 0.69 318.29 0.00
## clock 5.47 0.54 10.20 0.00
## cohort5 -0.79 0.36 -2.19 0.03
## cohort6 -0.27 0.36 -0.75 0.45
## LD33Never 10.44 0.67 15.64 0.00
## LD33Sometimes -1.48 1.14 -1.30 0.19
## clock:cohort5 -0.64 0.28 -2.30 0.02
## clock:cohort6 -0.28 0.28 -0.99 0.32
## clock:LD33Never 0.07 0.52 0.14 0.89
## clock:LD33Sometimes 0.78 0.94 0.84 0.40
## ---
## n = 11218, k = 10
## residual sd = 9.83, R-Squared = 0.22
Compare our last model to prior models
anova(cohort_mod2, full_mod)
## Analysis of Variance Table
##
## Model 1: SS ~ clock + cohort + clock:cohort
## Model 2: SS ~ clock + cohort + LD33 + clock:cohort + clock:LD33
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 11212 1170801
## 2 11208 1084086 4 86714 224.13 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Visualizing the fitted models
(I’m guessing I’m almost out of time, but quickly…) The visreg package is amazing, and I highly recommend it
library(visreg)
visreg(full_mod, "clock", by = "LD33")
visreg(full_mod, "clock", by = "cohort")
visreg(full_mod, "LD33", by = "clock", gg = TRUE)
Kinda hard but maybe helpful
visreg2d(full_mod, "clock", "LD33", plot.type = "persp")
See the following links for more info on the visreg package
Last note - fitting the right model!
- The model we’ve fit should be multilevel. So let’s do it!
#install.package("lme4")
library(lme4)
mlm <- lmer(SS ~ clock + cohort + LD33 +
clock:cohort + clock:LD33 +
(1 + clock|SID),
data = d)
display(mlm, detail = TRUE)
## lmer(formula = SS ~ clock + cohort + LD33 + clock:cohort + clock:LD33 +
## (1 + clock | SID), data = d)
## coef.est coef.se t value
## (Intercept) 218.45 0.68 319.73
## clock 5.44 0.54 10.12
## cohort5 -0.76 0.37 -2.08
## cohort6 -0.27 0.36 -0.75
## LD33Never 10.44 0.66 15.74
## LD33Sometimes -1.48 1.13 -1.31
## clock:cohort5 -0.61 0.28 -2.17
## clock:cohort6 -0.26 0.28 -0.92
## clock:LD33Never 0.10 0.52 0.20
## clock:LD33Sometimes 0.76 0.94 0.81
##
## Error terms:
## Groups Name Std.Dev. Corr
## SID (Intercept) 2.08
## clock 1.45 -0.35
## Residual 9.54
## ---
## number of obs: 11218, groups: SID, 3580
## AIC = 83111.4, DIC = 83070.7
## deviance = 83077.0
How much variability?
library(lattice)
qqmath(ranef(mlm, condVar = TRUE))
## $SID
Parting thoughts
Today’s lecture is mostly about exposure. R takes a lot of time and effort to learn, but it is really worth it.
Questions?