In this post, I’d like to share some work related to geospatial mapping, statewide school ratings, and US Census Bureau data at the census-tract level. Specifically, I wanted to investigate whether there was a relation between the median home value in an area and the statewide achievement ratings for schools in the corresponding area. There is a strong relation between socio-economic status and student achievement, but less is known about how statewide ratings for schools relate to the demographics of the corresponding area. Fair warning - this is a rather long post, so you’ll need a bit of endurance to get through it all (I considered splitting it into two posts, but it felt less cohesive). We’ll be using R to produce the following map.

This post uses the tidyverse, Thomas Leeper‘s rio package for data import, and Sam Firke‘s janitor package for quick cleanup of column names. The leaflet package is used for mapping, along with the terrific tidycensus package by Kyle Walker, and the ggmap package is used for pulling latitude and longitude data from physical addresses. Finally, this post builds off a wonderful post by Julia Silge.

The data

For this project, I wanted to use only publicly available data. I meandered around the Oregon Department of Education (ODE) website for a while and eventually came to the file I wanted: Report Card Ratings for each school. So let’s import the data directly from the ODE website, clean it up a bit, and take a look.

# load packages for data import/prep
library(tidyverse)
library(rio)
library(janitor)

ratings <- import("http://www.oregon.gov/ode/schools-and-districts/reportcards/reportcards/Documents/rcdetails_1617.xlsx",
                  setclass = "tbl_df") %>% 
            clean_names()

# Clean up the file to what we need
achievement <- ratings %>% 
  filter(district_name == "Portland SD 1J") %>% 
  spread(indicator, rating) %>% 
  clean_names() %>% 
  select(school_name, achievement) %>% 
  mutate(achievement = parse_number(achievement)) %>% 
  filter(!is.na(achievement))
	
achievement
## # A tibble: 84 x 2
##    school_name                   achievement
##    <chr>                               <dbl>
##  1 Abernethy Elementary School          5.00
##  2 Ainsworth Elementary School          5.00
##  3 Alameda Elementary School            5.00
##  4 Arleta Elementary School             3.00
##  5 Astor Elementary School              3.00
##  6 Atkinson Elementary School           4.00
##  7 Rosa Parks Elementary School         2.00
##  8 Beach Elementary School              3.00
##  9 Beaumont Middle School               4.00
## 10 Boise-Eliot Elementary School        2.00
## # ... with 74 more rows

In the above code, we import the ratings, filter for only schools in Portland School District, spread the data out so we have different columns for each type of rating (achievement, growth, and student subgroup growth), and then select only the two variables we really need - the school name and the achievement rating.
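
If you’d like to confirm which indicator categories get spread into separate columns, a quick check (assuming the ratings object created above) is to count them directly:

# which rating indicators exist in the raw file?
ratings %>% 
  count(indicator)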

This is a good start, but we don’t yet have the geographical coordinates of the schools. To get those, we’ll pull another dataset from ODE that has the physical address of each school. We’ll then transform those addresses to latitude and longitude with a bit of help from the ggmap package. First, the addresses:

addresses <- import("http://www.ode.state.or.us/pubs/labels/SchoolMail.xls",
                    setclass = "tbl_df") %>% 
  clean_names() %>% 
  filter(city == "PORTLAND") %>% 
  mutate(name = str_to_title(name)) %>% 
  unite(address, c(address, city, state, zip), sep = " ") %>% 
  select(name, address)
              
addresses
## # A tibble: 145 x 2
##    name                        address                                    
##    <chr>                       <chr>                                      
##  1 Abernethy Elementary School 2421 SE ORANGE AVE PORTLAND OR 97214-5398  
##  2 Ainsworth Elementary School 2425 SW VISTA AVE PORTLAND OR 97201-2399   
##  3 Alameda Elementary School   2732 NE FREMONT ST PORTLAND OR 97212-2542  
##  4 Alder Elementary School     17200 SE ALDER ST PORTLAND OR 97233-4260   
##  5 Alice Ott Middle School     12500 SE RAMONA ST PORTLAND OR 97236-4699  
##  6 Alliance High School        4039 NE ALBERTA CT PORTLAND OR 97211-8199  
##  7 Arleta Elementary School    5109 SE 66TH AVE PORTLAND OR 97206-4599    
##  8 Arthur Academy              13717 SE DIVISION ST PORTLAND OR 97236-2841
##  9 Astor Elementary School     5601 N YALE ST PORTLAND OR 97203-5299      
## 10 Atkinson Elementary School  5800 SE DIVISION ST PORTLAND OR 97206-1499 
## # ... with 135 more rows

After import, we limit to only schools in Portland because we’re going to be joining this dataset with our ratings dataset based on the school name, and we want to ensure we don’t join the data for schools with the same name but different districts. Unfortunately, the addresses file doesn’t tell us the district, or we could include that as one of the keyed variables in our join.
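
Because the school name is our only join key, it’s worth a quick sanity check (a minimal sketch, assuming the addresses object above) that no name appears more than once within Portland before we join:

# any duplicated school names within Portland?
addresses %>% 
  count(name, sort = TRUE) %>% 
  filter(n > 1)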

We’ll now join the addresses dataset with our ratings dataset.

d <- inner_join(achievement, addresses, by = c("school_name" = "name"))

Now we can find the longitude and latitude of each school from its physical address, with the help of the ggmap package and, specifically, the geocode function. This takes a bit of time. Notice I’ve changed the source from the default, which is Google, to "dsk", which uses the Data Science Toolkit instead. I like this better because, at least in my experience, it tends to provide better results, unless their servers are being overwhelmed (which happens fairly often), in which case it fails catastrophically.

library(ggmap)
lat_long <- geocode(d$address, source = "dsk")
d <- bind_cols(d, lat_long)
d[ ,c(1, 4:5)]
## # A tibble: 81 x 3
##    school_name                     lon   lat
##    <chr>                         <dbl> <dbl>
##  1 Abernethy Elementary School    -123  45.5
##  2 Ainsworth Elementary School    -123  45.5
##  3 Alameda Elementary School      -123  45.5
##  4 Arleta Elementary School       -123  45.5
##  5 Astor Elementary School        -123  45.6
##  6 Atkinson Elementary School     -123  45.5
##  7 Rosa Parks Elementary School   -123  45.6
##  8 Beach Elementary School        -123  45.6
##  9 Beaumont Middle School         -123  45.5
## 10 Boise-Eliot Elementary School  -123  45.5
## # ... with 71 more rows

Note - I was getting some missing values (~19%) when using Google (the default) and no missing data with source = "dsk", which is why I specify it here. However, the Data Science Toolkit servers seem to be fairly frequently overwhelmed. The last time I tried, the call took forever and then failed, so the best option may depend on the availability of the servers at the time you run the function.
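
Whichever source you use, it’s worth checking for schools that failed to geocode before mapping. A minimal sketch, assuming the d object from above:

# how many schools came back without coordinates?
sum(is.na(d$lon) | is.na(d$lat))

# drop any schools we couldn't geocode so the markers plot cleanly
d <- d %>% 
  filter(!is.na(lon), !is.na(lat))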

Mapping the data

To map the data, we’ll use the leaflet and sf packages. We’ll first write a quick function that tells leaflet what color each school’s marker should be, depending on its statewide achievement rating. We’ll make a simple diverging palette, where low values are red, middle values are white, and high values are green. These colors must correspond with the possible values for markerColor from leaflet::awesomeIcons.

library(leaflet)

get_col <- function(df, indicator) {
  sapply(df[[as.character(indicator)]], function(rating) {
    if(rating == 5) {
      "green"
    } else if(rating == 4) {
      "lightgreen"
    } else if(rating == 3) {
      "white"
    } else if(rating == 2) {
      "lightred"
    } else {
      "red"
    } })
}
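
As a quick check (assuming the d object from above), get_col returns one marker color per school, in row order. An equivalent version written with dplyr::case_when is sketched below as well, though get_col is what the rest of the post uses.

# one color per school, in the same order as the rows of d
head(get_col(d, "achievement"))

# an equivalent lookup using dplyr::case_when (just an alternative sketch)
get_col2 <- function(df, indicator) {
  dplyr::case_when(
    df[[indicator]] == 5 ~ "green",
    df[[indicator]] == 4 ~ "lightgreen",
    df[[indicator]] == 3 ~ "white",
    df[[indicator]] == 2 ~ "lightred",
    TRUE                 ~ "red"
  )
}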

Next we’ll map the Portland area using a combination of the leaflet and tidycensus packages, not only to produce the map, but to color census tracts according to the median home value for the tract. Note that you have to have a US Census API key for these functions to work (see here). Below, I use the get_acs function from the tidycensus package to pull census tract data for Multnomah County. The variables argument tells the function which variable to pull (see the load_variables function for other variables). I also filter to keep only positive values (one tract was returning a negative value; I haven’t taken the time to track down why yet).
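
If you don’t already have a Census API key stored, tidycensus can install one for you, and load_variables lets you browse the available variables. A quick sketch (the key string is a placeholder, and the 2016 5-year ACS is an assumption about the vintage get_acs pulls by default):

library(tidycensus)

# store your Census API key so tidycensus can find it in future sessions
# (replace the placeholder with your own key)
census_api_key("YOUR_KEY_HERE", install = TRUE)

# browse the available ACS variables; B25077_001 is the median home value estimate
acs_vars <- load_variables(2016, "acs5", cache = TRUE)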

library(tidycensus)
pdx_acs <- get_acs(geography = "tract", 
                     variables = "B25077_001", 
                     state = "OR",
                     county = "Multnomah County",
                     geometry = TRUE) %>% 
  filter(estimate > 0) 

The pdx_acs object now has the median home value for every census tract in Multnomah County, as well as the geometry of each tract. We can use this information to produce the map. First, however, we need to create a color palette for the tracts. We’ll use the viridis palette, which is not only beautiful but also works well for people with color blindness and for printing in black and white. We’ll use the colorNumeric function from leaflet to map each tract’s median home value estimate, which we just pulled with tidycensus, to a fill color.

pal <- colorNumeric(palette = "viridis", 
                    domain = pdx_acs$estimate)
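
Note that colorNumeric returns a function, so pal() can be applied to any numeric vector within the domain to get hex color strings. A quick check using the first few tract estimates:

# pal() maps home-value estimates onto hex color strings from the viridis palette
pal(head(pdx_acs$estimate, 3))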

Next we’ll get the actual icons we want to use, coloring them according to the achievement ratings. The last line in this code gets a hexadecimal approximation for these colors, which we’ll use in a legend later.

pins <- awesomeIcons(
  icon = 'ios-close',
  iconColor = 'black',
  library = 'ion',
  markerColor = get_col(d, "achievement")
)

cols <- gplots::col2hex(c("red", "pink", "white", "lightgreen", "darkgreen"))

Finally, we’ll produce the map using the code below. Although it appears somewhat complicated, it’s really just rather long(ish). Essentially, we first separate NAME into three distinct variables for the census tract, county, and state; then we (a) create a blank map, (b) set the style in which we want the map to render, (c) add the tract information, (d) add the school information, and (e) add the legends.

pdx_acs %>%
  separate(NAME, c("tract", "county", "state"), sep = ",") %>% 
  leaflet() %>%
  addProviderTiles(provider = "Stamen.TerrainBackground") %>%
  addPolygons(popup = ~ tract,
              stroke = FALSE,
              smoothFactor = 0,
              fillOpacity = 0.7,
              color = ~ pal(estimate)) %>%
  addAwesomeMarkers(data = d, 
                    ~lon, 
                    ~lat, 
                    icon = pins, 
                    popup= ~school_name) %>% 
  addLegend("bottomright", 
            pal = pal, 
            values = ~ estimate,
            title = "Median Home Value",
            labFormat = labelFormat(prefix = "$"),
            opacity = 1) %>% 
  addLegend("bottomright", 
            colors = rev(cols),
            labels = 5:1,
            title = "Achievement Ratings",
            opacity = 1)

One of the nice things about this map is that it’s fully interactive, which is particularly helpful here because to really see the relation you’ll need to zoom in quite a bit. Because we added the optional popup argument to addPolygons and addAwesomeMarkers, we can click to identify a specific census tract or the name of a specific school. If we wanted the map to render differently, we could simply change the provider (the terrain style above comes from provider = "Stamen.TerrainBackground"). All the different options are available here, where you can preview a basemap before applying it to yours.
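
As a sketch of what swapping the basemap looks like (CartoDB.Positron is one of the bundled provider names; the pipeline below is trimmed to just the tract polygons), only the addProviderTiles call changes:

# same tract layer as above, but rendered on a light gray basemap
pdx_acs %>%
  separate(NAME, c("tract", "county", "state"), sep = ",") %>% 
  leaflet() %>%
  addProviderTiles(provider = "CartoDB.Positron") %>%
  addPolygons(popup = ~ tract,
              stroke = FALSE,
              smoothFactor = 0,
              fillOpacity = 0.7,
              color = ~ pal(estimate))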

Note that we have a very clear and very strong relation here - essentially all the red, pink, and white schools are in the more purple tracts, corresponding to areas with lower median home values. This is powerful information, but it’s important not to confuse it with meaning that the lowest-quality schools are located in the more impoverished areas. This information tells us very little about the quality of the school and, indeed, if we look at the growth ratings (shown below, without all the code), the correlation is much weaker. That is, the ratings schools receive for the amount of growth students make are less correlated with the median home value in the area than is absolute achievement, although there does still appear to be some correlation.