This is the third post in a series about estimating, and now exploring, achievement gap variation between schools. The first post described the method, while the second applied it to estimate achievement gaps for all schools in California reporting data on students coded as Hispanic and White. That post included some preliminary explorations of the data, but this post takes things further by looking at the variation geographically.

The Data

We’ll use the same data we used in the last post, which represents school-level Hispanic-White achievement gap effect sizes for every school reporting data on both groups in California. To avoid going through all the gymnastics of creating the dataset, though, I’ve posted a file to my website repo. The code below should allow you to read the data directly from the repo.

library(tidyverse)
tmpfile <- tempfile()
download.file("https://github.com/datalorax/site/raw/master/content/post/v.rds",
              tmpfile,
              mode = "wb") # binary mode, so this also works on Windows

v <- read_rds(tmpfile)
v
## # A tibble: 36,345 x 7
##    county_code district_code school_code grade test_id       auc      v
##    <chr>       <chr>         <chr>       <dbl> <chr>       <dbl>  <dbl>
##  1 01          10017         0125567         3 ELA         0.252 -0.945
##  2 01          10017         0125567         3 Mathematics 0.308 -0.709
##  3 01          10017         0125567        13 ELA         0.313 -0.690
##  4 01          10017         0125567        13 Mathematics 0.304 -0.724
##  5 01          31617         0131763        11 ELA         0.347 -0.556
##  6 01          31617         0131763        11 Mathematics 0.454 -0.163
##  7 01          31617         0131763        13 ELA         0.429 -0.253
##  8 01          31617         0131763        13 Mathematics 0.458 -0.150
##  9 01          61119         0111765         3 ELA         0.417 -0.297
## 10 01          61119         0111765         3 Mathematics 0.264 -0.893
## # … with 36,335 more rows

As before, we have v, which is the achievement gap effect size, for each grade, in English/Language Arts (ELA) and Mathematics, for every school. Let’s calculate an average achievement gap across these variables so we have a single estimate for each school (we could, of course, explore them separately, but we’ll go with the average for now).

v <- v %>% 
  group_by(county_code, district_code, school_code) %>% 
  summarize(v = mean(v, na.rm = TRUE))

If we want to look at these data geographically, we need to geocode the schools. That is, we need to know the longitude and latitude of each school. Luckily for us, California supplies this information publicly here. The code below will load the text file into R. The data are tab delimited, which is why we use read_delim with the delimiter set to tabs, "\t".

schools <- read_delim("https://www.cde.ca.gov/schooldirectory/report?rid=dl1&tp=txt",
                      delim = "\t")
schools
## # A tibble: 18,061 x 51
##    CDSCode NCESDist NCESSchool StatusType County District School Street
##    <chr>   <chr>    <chr>      <chr>      <chr>  <chr>    <chr>  <chr> 
##  1 011001… 0691051  No Data    Active     Alame… Alameda… No Da… 313 W…
##  2 011001… 0691051  10546      Closed     Alame… Alameda… FAME … 39899…
##  3 011001… 0691051  10947      Active     Alame… Alameda… Envis… 1515 …
##  4 011001… 0691051  12283      Closed     Alame… Alameda… Aspir… 2125 …
##  5 011001… 0691051  12844      Active     Alame… Alameda… Commu… 2111 …
##  6 011001… 0691051  12901      Active     Alame… Alameda… Yu Mi… 1086 …
##  7 011001… 0691051  13008      Active     Alame… Alameda… Urban… 5328 …
##  8 011001… 0691051  13772      Active     Alame… Alameda… Epic … 1045 …
##  9 011001… 0691051  No Data    Closed     Alame… Alameda… Techn… 597 C…
## 10 011001… 0691051  09264      Active     Alame… Alameda… Alame… 2500 …
## # … with 18,051 more rows, and 43 more variables: StreetAbr <chr>,
## #   City <chr>, Zip <chr>, State <chr>, MailStreet <chr>,
## #   MailStrAbr <chr>, MailCity <chr>, MailZip <chr>, MailState <chr>,
## #   Phone <chr>, Ext <chr>, WebSite <chr>, OpenDate <chr>,
## #   ClosedDate <chr>, Charter <chr>, CharterNum <chr>, FundingType <chr>,
## #   DOC <chr>, DOCType <chr>, SOC <chr>, SOCType <chr>, EdOpsCode <chr>,
## #   EdOpsName <chr>, EILCode <chr>, EILName <chr>, GSoffered <chr>,
## #   GSserved <chr>, Virtual <chr>, Magnet <chr>, YearRoundYN <chr>,
## #   FederalDFCDistrictID <chr>, Latitude <chr>, Longitude <chr>,
## #   AdmFName1 <chr>, AdmLName1 <chr>, AdmEmail1 <chr>, AdmFName2 <chr>,
## #   AdmLName2 <chr>, AdmEmail2 <chr>, AdmFName3 <chr>, AdmLName3 <chr>,
## #   AdmEmail3 <chr>, LastUpDate <chr>

This file has the longitude and latitude we want, so we’ll need to join it with our achievement gap data. If we look at the documentation for the data, we’ll see that CDSCode is the California school ID. Let’s check that it uniquely identifies each school.

schools %>% 
  count(CDSCode) %>% 
  filter(n > 1)
## # A tibble: 0 x 2
## # … with 2 variables: CDSCode <chr>, n <int>

And it does! A little sleuthing reveals that this ID is the combination of county_code, district_code, and school_code in our v file. Let’s recreate this ID variable in v, then join the files. We could use tidyr::unite to create the ID variable (a quick sketch of that version follows the chunk below), but I want to keep the individual variables here for reasons that will become apparent later on, so I’ll do it through a call to dplyr::mutate instead.

v <- v %>% 
  mutate(CDSCode = paste0(county_code, district_code, school_code))
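
For reference, the tidyr::unite version would have looked something like this (just a sketch; note remove = FALSE, which keeps the individual columns around rather than dropping them, as unite does by default):

v %>% 
  unite(CDSCode, county_code, district_code, school_code,
        sep = "", remove = FALSE)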

# Limit `schools` to just the variables we really want
schools <- schools %>% 
  select(CDSCode, Longitude, Latitude)

d <- left_join(v, schools)
d
## # A tibble: 6,451 x 7
## # Groups:   county_code, district_code [849]
##    county_code district_code school_code      v CDSCode  Longitude Latitude
##    <chr>       <chr>         <chr>        <dbl> <chr>    <chr>     <chr>   
##  1 01          10017         0125567     -0.767 0110017… -122.189… 37.7783…
##  2 01          31617         0131763     -0.280 0131617… -121.965… 37.5589…
##  3 01          61119         0111765     -0.399 0161119… -122.286… 37.7818…
##  4 01          61119         0119222     -0.769 0161119… -122.287… 37.7790…
##  5 01          61119         0122085     -0.575 0161119… -122.285… 37.7764…
##  6 01          61119         0126656     -0.789 0161119… -122.271… 37.7727…
##  7 01          61119         0130229     -0.417 0161119… -122.245… 37.7649…
##  8 01          61119         0130609     -0.579 0161119… -122.287… 37.7790…
##  9 01          61119         0132142     -0.357 0161119… -122.289… 37.7727…
## 10 01          61119         0134304     -0.193 0161119… -122.281… 37.7757…
## # … with 6,441 more rows

One last thing: longitude and latitude are currently stored as characters, and we want them to be numeric.

d <- d %>% 
  mutate(Longitude = as.numeric(Longitude),
         Latitude = as.numeric(Latitude))

And there we are!

A basic map

Let’s start by at least trying to plot all of our data. First, we’ll get a map of California.

ca <- map_data("state") %>% 
  filter(region == "california")
head(ca)
##        long      lat group order     region subregion
## 1 -120.0060 42.00927     4   667 california      <NA>
## 2 -120.0060 41.20139     4   668 california      <NA>
## 3 -120.0060 39.70024     4   669 california      <NA>
## 4 -119.9946 39.44241     4   670 california      <NA>
## 5 -120.0060 39.31636     4   671 california      <NA>
## 6 -120.0060 39.16166     4   672 california      <NA>

Now we can produce the map with a basic call to {ggplot2}, using geom_polygon.

ggplot(ca, aes(long, lat, group = group)) +
  geom_polygon() +
  coord_fixed(ratio = 1.3)

Let’s switch up the theme some. This is a slightly adapted version of the theme Timo Grossenbacher outlines here.

theme_map <- function(...) {
  theme_minimal(base_size = 15) +
  theme(
    text = element_text(color = "#E7F0F2"),
    axis.line = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.major = element_line(color = "#E7F0F2", size = 0.2),
    panel.grid.minor = element_blank(),
    plot.background = element_rect(fill = "gray20", color = NA), 
    panel.background = element_rect(fill = "gray20", color = NA), 
    legend.background = element_rect(fill = "gray20", color = NA),
    panel.border = element_blank(),
    ...
  )
}
theme_set(theme_map())

Now we just expand that to include points for our schools, and color them by their achievement gap effect size.

ggplot(ca, aes(long, lat, group = group)) +
  geom_polygon(fill = "gray60",
               color = "white") +
  geom_point(aes(Longitude, Latitude, color = v), 
             data = d,
             inherit.aes = FALSE) +
  coord_fixed(ratio = 1.3) 

This is not terrifically helpful, in part because higher effect sizes are darker, and we’ve overlaid the points on an already dark background (the state). Instead, this is probably a good place to use a diverging palette, with the middle of the color scale centered on 0 (no achievement gap).

#install.packages("colorspace")
library(colorspace)
ggplot(ca, aes(long, lat, group = group)) +
  geom_polygon(fill = "gray60",
               color = "white") +
  geom_point(aes(Longitude, Latitude, color = v), 
             data = d,
             inherit.aes = FALSE,
             size = 2.5) +
  scale_color_continuous_diverging("Blue-Red3",
                                   rev = TRUE) +
  coord_fixed(ratio = 1.3) 

Now we’re starting to get somewhere. We definitely have more gradation, but it’s still pretty tough to see much of the variation. Maybe we can limit our scale a bit to highlight the variation better. Let’s look at our histogram of v again to see what might be reasonable values.

ggplot(d, aes(v)) +
  geom_histogram(fill = "cornflowerblue") +
  theme(axis.text.x = element_text(color = "#E7F0F2")) # add x-axis back in

From this plot, it looks like almost all of the data are greater than -2, but our color scale extends all the way out to -3. Let’s cap it at -2 and see what happens.

ggplot(ca, aes(long, lat, group = group)) +
  geom_polygon(fill = "gray60",
               color = "gray20") +
  geom_point(aes(Longitude, Latitude, color = v), 
             data = d,
             inherit.aes = FALSE,
             size = 2.5) +
  scale_color_continuous_diverging("Blue-Red3",
                                   rev = TRUE,
                                   limits = c(-2, 1)) +
  coord_fixed(ratio = 1.3)

Much better! What if we limit it even further? If we limit it to values greater than -1.5 and less than 0.5, we will still represent more than 99% of the data.
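
If you want to verify that percentage yourself, a quick check along these lines will do it (the mean of a logical vector is the proportion of TRUEs):

# Proportion of schools with an effect size between -1.5 and 0.5
mean(d$v > -1.5 & d$v < 0.5, na.rm = TRUE)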

ggplot(ca, aes(long, lat, group = group)) +
  geom_polygon(fill = "gray60",
               color = "gray20") +
  geom_point(aes(Longitude, Latitude, color = v), 
             data = d,
             inherit.aes = FALSE,
             size = 2.5) +
  scale_color_continuous_diverging("Blue-Red3",
                                   rev = TRUE,
                                   limits = c(-1.5, 0.5)) +
  coord_fixed(ratio = 1.3)

And now we’re really getting somewhere. We can see the variation even better, and it’s pretty clear that the largest achievement gaps are occurring in the Bay Area. Let’s investigate further.

Interactive maps

I started by limiting the above maps to Alameda County, which is part of the Bay Area, but instead I decided to go with an interactive map that starts around Alameda County and lets you move wherever you’d like from there. When you’re dealing with larger-scale data like this, interactive maps can be really beneficial. One of my favorite examples is this map of the educational attainment of America by Kyle Walker, whose work I really admire: he does a lot of things I find fascinating, beautiful, and often really helpful (in the case of tools he creates, such as {tidycensus}). The educational attainment map I’ve linked to above was built with Mapbox, which is really intriguing to me, but it’s still on my “to learn” list and I haven’t dug into it much yet (maybe a later post!). So for now, let’s create a quick interactive map of California using {leaflet}.

Let’s first make a map that starts in the Bay Area. It’s amazing how easy this is.

library(leaflet)

# Figure out the center (ish) of Alameda County.
# Note the ungroup() below: d is still grouped by county and district
# from the earlier summarize(), so without it we'd get one row per
# district rather than a single overall mean.
mean_geo <- d %>% 
  ungroup() %>% 
  filter(county_code == "01") %>% 
  summarize(mean_long = mean(Longitude, na.rm = TRUE),
            mean_lat = mean(Latitude, na.rm = TRUE))

# Make the map
leaflet() %>%
  setView(lng = mean_geo$mean_long,
          lat = mean_geo$mean_lat,
          zoom = 8) %>% 
  addTiles()

Now we’ll add circles for each school.

leaflet() %>%
  setView(lng = mean_geo$mean_long,
          lat = mean_geo$mean_lat,
          zoom = 8) %>% 
  addTiles() %>% 
  addCircleMarkers(data = d, lng = ~Longitude, lat = ~Latitude)

Notice that after I’ve declared the data source, I can refer to variables within the data using ~.

Obviously, the above isn’t very helpful, but we’re getting there. Let’s change the color of the circles according to the achievement gap effect size, and again limit the data a bit so there’s higher contrast. I’m going to call the data subset object d_99 to remind myself that it has (actually more than) 99% of the overall data.

d_99 <- filter(d, v > -1.5, v < 0.5)

# Create a continuous color palette of reds and blues
pal <- colorNumeric(palette = "RdBu",
                    domain = seq(1.5, -1.5, -0.01))

leaflet() %>%
  setView(lng = mean_geo$mean_long,
          lat = mean_geo$mean_lat,
          zoom = 8) %>% 
  addTiles() %>% 
  addCircleMarkers(data = d_99, lng = ~Longitude, lat = ~Latitude,
                   color = ~pal(v),
                   stroke = FALSE,
                   radius = 5,
                   fillOpacity = 0.8)

Notice I’ve set radius = 5 to make the circles representing schools smaller and easier to see, and fillOpacity = 0.8 to make the circles 80% opaque. We could keep messing with this, of course, but we have a fairly nice interactive map already!

Last bit: let’s add a legend. This turns out to be trickier than I would have hoped. My initial attempt looked like this:

leaflet() %>%
  setView(lng = mean_geo$mean_long,
          lat = mean_geo$mean_lat,
          zoom = 8) %>% 
  addTiles() %>% 
  addCircleMarkers(data = d_99, lng = ~Longitude, lat = ~Latitude,
                   color = ~pal(v),
                   stroke = FALSE,
                   radius = 5,
                   fillOpacity = 0.8) %>% 
  addLegend("bottomleft",
            data = d_99,
            values = ~v,
            pal = pal, 
            title = "Hispanic-White <br/> Achievement Gap ES",
            opacity = 0.7)

This is actually correct, but everything is kind of backwards: we have negative values going up vertically rather than down, which would feel a lot more natural. What I ended up doing was reversing the color palette, restricting the range to just the values I wanted, and then (to correct for the reversed palette) multiplying each legend value by -1.

Because I wanted to have the schools with no real achievement gap to sort of fade into the background, I also changed the default tiles to some that are mostly white, which again just sort of helps with the overall contrast.

# Create reversed color palette
pal_rev <- colorNumeric(palette = "RdBu",
                        domain = seq(-1.5, 1.5, 0.01),
                        reverse = TRUE)

leaflet() %>%
  setView(lng = mean_geo$mean_long,
          lat = mean_geo$mean_lat,
          zoom = 8) %>% 
  addProviderTiles("CartoDB.Positron") %>% # Changed from `addTiles()`
  addCircleMarkers(data = d_99, lng = ~Longitude, lat = ~Latitude,
                   color = ~pal(v),
                   stroke = FALSE,
                   radius = 5,
                   fillOpacity = 0.8) %>% 
  addLegend("bottomleft",
            values = seq(-0.5, 1.5, 0.25), # Restrict range
            pal = pal_rev,                 # Use reversed color palette
            labFormat = labelFormat(       #
              transform = function(x) x*-1 # Multiply by -1
              ),                           #
            title = "Hispanic-White <br/> Achievement Gap ES",
            opacity = 0.7)

One final edit

My good friend and colleague, Brock Rowley, was reviewing this post and said something along the lines of “It would really be nice if I could know the exact values for some of these schools.” So let’s make that happen! What we’ll do is just add a label to the circles, as below.

leaflet() %>%
  setView(lng = mean_geo$mean_long,
          lat = mean_geo$mean_lat,
          zoom = 8) %>% 
  addProviderTiles("CartoDB.Positron") %>% 
  addCircleMarkers(data = d_99, lng = ~Longitude, lat = ~Latitude,
                   color = ~pal(v),
                   stroke = FALSE,
                   radius = 5,
                   fillOpacity = 0.8,
                   label = ~as.character(round(v, 2))) %>% # add label
  addLegend("bottomleft",
            values = seq(-0.5, 1.5, 0.25),
            pal = pal_rev,
            labFormat = labelFormat(
              transform = function(x) x*-1
              ),
            title = "Hispanic-White <br/> Achievement Gap ES",
            opacity = 0.7)

And that’s it! Notice that because it’s a label, we had to convert the value to a character first. We could make it fancier, of course (see the quick sketch below), but that’s the basic approach.
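
For example, here’s a minimal sketch of a slightly fancier label, pasting a short prefix onto the rounded value (paste0 returns a character vector, so it drops straight into label; I’ve omitted the legend here for brevity):

leaflet() %>%
  setView(lng = mean_geo$mean_long,
          lat = mean_geo$mean_lat,
          zoom = 8) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(data = d_99, lng = ~Longitude, lat = ~Latitude,
                   color = ~pal(v),
                   stroke = FALSE,
                   radius = 5,
                   fillOpacity = 0.8,
                   label = ~paste0("Gap ES: ", round(v, 2))) # prefixed label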

Conclusions

There’s a fair amount of achievement gap variance between schools in California, and there’s some evidence of geographic clustering too. This is certainly not unique to California, but understanding what’s driving this geographic variation seems like a good place to start if we want to reduce these inequities.