library(tidyverse) #data manipulation
library(gt) #formatting tables for display
library(survey) #survey data
library(rgdal) #shapefiles for maps
library(sf) #maps
<- read_csv("https://raw.github.com/natekratzer/raw_data/master/ky_high_speed_internet.csv") df
How to Use Census Microdata to Analyze High Speed Internet in Kentucky
Introduction
This post is a start to finish descriptive analysis of high speed internet access in Kentucky, including tables, graphs, and maps. All of the detail of cleaning the data and iterating while exploring the data is included. This makes for a rather lengthy post, but it also makes it relatively unique in including all of those steps. We go through five attempts at making a table of high speed internet before finally getting it right! There’s quite a bit of cleaning work and then also a detour into calculating standard errors via bootstrap so we can correctly display uncertainty in our visuals.
Census microdata is individual level data provided by the census. Most references to census data refer to tables the census bureau has already made out of their surveys, but the mostly raw survey data is available at the individual level, and that’s what we’ll use for this analysis. While the focus is on the census data, I do show the code used for analysis. I did decide to hide the code used to make all the tables (it gets rather lengthy), but you can see that code on Github if interested.
Getting the Data
The easiest way to get census microdata is through the Integrated Public Use Microdata Series (IPUMS) hosted by the University of Minnesota. While you can get the data directly from the Census Bureau, IPUMS has made it much easier to compare across multiple years and to select the variables you want. IPUMS also provides a codebook that is easy to refer to and notes any important changes from year to year.
I’ve put the data for just Kentucky up on GitHub, so I’ll read it in from there.
Cleaning the Data
When downloading the data it’s all numeric, even for variables that are categorial - they’ve been coded and our first step in the analysis will be using the code book to translate them. I won’t show all the codebooks, but for this first variable let’s take a look at what IPUMS has to say. For all years NA is coded as 00 and No high speed internet is coded as 20. Prior to 2016 there are detailed codes for the type of internet access, while for 2016 and after the code is collapsed.
I’ll use a case_when()
statement to recode high speed interent access into a categorical variable.
# High Speed Internet
<- df %>%
df mutate(
hspd_int = case_when(
== 00 ~ NA_character_,
CIHISPEED == 20 ~ "No",
CIHISPEED >= 10 & CIHISPEED < 20 ~ "Yes",
CIHISPEED TRUE ~ NA_character_
) )
Getting wrong answers by not knowing the data
Now that we have a high speed internet category we can group the data and count up how many responses are in each group. I’ll also pivot the dataframe to make it easy to calculate percent with high speed internet.
# Count numbers with and without high speed internet
<- df %>%
df_group group_by(hspd_int, YEAR) %>%
summarize(count = n(), .groups = "drop")
# Pivot for easier percent calculations
<- df_group %>%
df_wide pivot_wider(id_cols = YEAR,
names_from = hspd_int,
values_from = count) %>%
mutate(
percent_hspd = (Yes / (Yes + No)),
percent_no = 1 - percent_hspd,
percent_NA = (`NA` / (Yes + No + `NA`))
)
Table 1: Quick Analysis | ||||||
These results are wrong! | ||||||
Year | Percent | Number of People | ||||
---|---|---|---|---|---|---|
Yes | No | NA | Yes | No | NA | |
2013 | 87% | 13% | 28% | 28.3K | 4.27K | 12.4K |
2014 | 86% | 14% | 27% | 28.1K | 4.70K | 12.1K |
2015 | 86% | 14% | 26% | 28.7K | 4.52K | 11.5K |
2016 | 82% | 18% | 20% | 29.2K | 6.49K | 9.02K |
2017 | 81% | 19% | 19% | 29.4K | 7.08K | 8.77K |
2018 | 80% | 20% | 17% | 30.3K | 7.35K | 7.86K |
Yes and No percentages are calculated out of those who answered to represent the results an analyst might get if they ignored NA values. NA is calculated separately based on all the data. | ||||||
Source: Author's incorrect analysis of IPUMS data. Used as an example of a mistake to avoid. |
This table is actually wrong for multiple reasons, but the first one we’ll take care of is the failure to use weights. Survey data is often weighted to make it representative of the population. The census bureau provides a PERWT
variable that should be used as the weight for each person in the file. There’s also a HHWT
variable for household level analysis. We’ll stick with weighting the data by the number of people. One of the really nice features of the PERWT
variable is that it sums to the population. That means our tables can show both an overall number of people and the percentage of people.
# Count numbers with and without high speed internet
<- df %>%
df_group group_by(hspd_int, YEAR) %>%
summarize(count = sum(PERWT), .groups = "drop")
# Pivot for easier percent calculations
<- df_group %>%
df_wide pivot_wider(id_cols = YEAR,
names_from = hspd_int,
values_from = count) %>%
mutate(
percent_hspd = (Yes / (Yes + No)),
percent_no = 1 - percent_hspd,
percent_NA = (`NA` / (Yes + No + `NA`))
)
Table 2: Quick Analysis with Weights | ||||||
These results are still wrong! | ||||||
Year | Percent | Number of People | ||||
---|---|---|---|---|---|---|
Yes | No | NA | Yes | No | NA | |
2013 | 86% | 14% | 27% | 2.76M | 454K | 1.18M |
2014 | 84% | 16% | 26% | 2.74M | 509K | 1.17M |
2015 | 86% | 14% | 25% | 2.85M | 479K | 1.10M |
2016 | 80% | 20% | 19% | 2.87M | 708K | 855K |
2017 | 80% | 20% | 18% | 2.90M | 735K | 817K |
2018 | 79% | 21% | 16% | 2.97M | 790K | 709K |
Yes and No percentages are calculated out of those who answered to represent the results an analyst might get if they ignored NA values. NA is calculated separately based on all the data. | ||||||
Source: Author's incorrect analysis of IPUMS data. Used as an example of a mistake to avoid. |
This is better. The second problem is harder to spot. There are 3 hints in the data:
- There is a very high percentage of NA responses. There are more NA answers than there are people who say they don’t have high speed access.
- Percent of of people with high speed access is going down over time, which isn’t what I’d expect to see. That doesn’t mean it’s wrong - sometimes the data shows high level trends we don’t expect. However, it’s always worth a second look when you get a counterintuitive result.
- These numbers look very high for Kentucky.
A sensible guess is that people who say they don’t have internet access at all aren’t then asked about high speed internet and show up as an NA value when we want to code them as not having high speed interent.
So let’s get to know the data a bit better by adding in internet access. We’ll do the same analysis, but I’ll add internet as another id variable just like year. We can see right away that the answers we have above are only including cases where individuals have internet.
<- df %>%
df mutate(
int = case_when(
== 0 ~ NA_character_,
CINETHH == 1 | CINETHH == 2 ~ "Yes",
CINETHH == 3 ~ "No",
CINETHH TRUE ~ NA_character_
)
)
<- df %>%
df_group group_by(hspd_int, int, YEAR) %>%
summarize(count = sum(PERWT), .groups = "drop")
# Pivot for easier percent calculations
<- df_group %>%
df_wide pivot_wider(
id_cols = c(YEAR, int),
names_from = hspd_int,
values_from = count
%>%
) mutate(
percent_hspd = (Yes / (Yes + No)),
percent_no = 1 - percent_hspd,
percent_NA = (`NA` / (Yes + No + `NA`))
)
Table 3: Exploring Internet Data | ||||||
These results are exploratory. But not wrong! | ||||||
Year | Percent | Number of People | ||||
---|---|---|---|---|---|---|
Yes | No | NA | Yes | No | NA | |
Has Internet Access | ||||||
2013 | 86% | 14% | 5% | 2.76M | 454K | 185K |
2014 | 84% | 16% | 6% | 2.74M | 509K | 193K |
2015 | 86% | 14% | 6% | 2.85M | 479K | 217K |
2016 | 80% | 20% | 3% | 2.87M | 708K | 129K |
2017 | 80% | 20% | 3% | 2.90M | 735K | 124K |
2018 | 79% | 21% | 3% | 2.97M | 790K | 125K |
No Internet Access | ||||||
2013 | NA | NA | NA | NA | NA | 866K |
2014 | NA | NA | NA | NA | NA | 842K |
2015 | NA | NA | NA | NA | NA | 753K |
2016 | NA | NA | NA | NA | NA | 596K |
2017 | NA | NA | NA | NA | NA | 560K |
2018 | NA | NA | NA | NA | NA | 452K |
Internet Access is NA | ||||||
2013 | NA | NA | NA | NA | NA | 126K |
2014 | NA | NA | NA | NA | NA | 130K |
2015 | NA | NA | NA | NA | NA | 129K |
2016 | NA | NA | NA | NA | NA | 131K |
2017 | NA | NA | NA | NA | NA | 132K |
2018 | NA | NA | NA | NA | NA | 132K |
Yes and No percentages are calculated out of those who answered to represent the results an analyst might get if they ignored NA values. NA is calculated separately based on all the data. | ||||||
Source: Author's analysis of IPUMS data. |
When we split it out this way we can see that the groups without any internet access are all NA values. So what we were looking at was the percentage of people with internet who have high speed internet. What we want is the percentage of all people who have high speed internet. We can fix the way we create our categories by saying that anyone who has no internet also has no high speed internet.
<- df %>%
df mutate(
#create a categorical variable for having any internet
int = case_when(
== 0 ~ NA_character_,
CINETHH == 1 | CINETHH == 2 ~ "Yes",
CINETHH == 3 ~ "No",
CINETHH TRUE ~ NA_character_
),# change how high speed internet is defined so that houses w/o any interent are counted as 'No' instead of NA
hspd_int = case_when(
== 00 & int != "No" ~ NA_character_,
CIHISPEED == 20 | int == "No" ~ "No",
CIHISPEED >= 10 & CIHISPEED < 20 ~ "Yes",
CIHISPEED TRUE ~ NA_character_
)
)
# Count numbers with and without high speed internet
<- df %>%
df_group group_by(hspd_int, YEAR) %>%
summarize(count = sum(PERWT), .groups = "drop")
# Pivot for easier percent calculations
<- df_group %>%
df_wide pivot_wider(
id_cols = c(YEAR),
names_from = hspd_int,
values_from = count
%>%
) mutate(
percent_hspd = (Yes / (Yes + No)),
percent_no = 1 - percent_hspd,
percent_NA = (`NA` / (Yes + No + `NA`))
)
Table 4: Almost right! | ||||||
These results are still a little wrong! | ||||||
Year | Percent | Number of People | ||||
---|---|---|---|---|---|---|
Yes | No | NA | Yes | No | NA | |
2013 | 68% | 32% | 7% | 2.76M | 1.32M | 311K |
2014 | 67% | 33% | 7% | 2.74M | 1.35M | 324K |
2015 | 70% | 30% | 8% | 2.85M | 1.23M | 346K |
2016 | 69% | 31% | 6% | 2.87M | 1.30M | 260K |
2017 | 69% | 31% | 6% | 2.90M | 1.29M | 256K |
2018 | 71% | 29% | 6% | 2.97M | 1.24M | 257K |
Yes and No percentages are calculated out of those who answered to represent the results an analyst might get if they ignored NA values. NA is calculated separately based on all the data. | ||||||
Source: Author's incorrect analysis of IPUMS data. Used as an example of a mistake to avoid. |
These results look much better, although there’s still one more thing to do about those NA results.
Group Quarters in the Census
The census data includes individuals living in group quarters (mostly prisons, senior living centers, and dorms, but includes any sort of communal living arrangement). However, all census questions about appliances and utilities (the category that internet access falls under) are NA for group quarters. So we’ll add one more line to filter out individuals living in group quarters (a common practice when working with Census microdata). The code below adds a filter for Group Quarters. Since this table is showing correct results I’ll also add a little additional formatting to make it stand out from the others.
I’ll also note that the way the Census Bureau constructs weights is very convenient for getting totals. While I’m focusing on the percent of people who have internet access, the Yes and No columns are accurate estimates of the population with and without access.
# Count numbers with and without high speed internet
<- df %>%
df_group filter(GQ == 1 | GQ ==2 | GQ == 5) %>%
group_by(hspd_int, YEAR) %>%
summarize(count = sum(PERWT), .groups = "drop")
# Pivot for easier percent calculations
<- df_group %>%
df_wide pivot_wider(id_cols = c(YEAR), names_from = hspd_int, values_from = count) %>%
mutate(percent_hspd = (Yes / (Yes + No)),
percent_no = 1 - percent_hspd,
percent_NA = (`NA` / (Yes + No + `NA`)))
Table 5: High Speed Internet in Kentucky | ||||||
Year | Percent | Number of People | ||||
---|---|---|---|---|---|---|
Yes | No | NA | Yes | No | NA | |
2013 | 68% | 32% | 4% | 2.76M | 1.32M | 185K |
2014 | 67% | 33% | 5% | 2.74M | 1.35M | 193K |
2015 | 70% | 30% | 5% | 2.85M | 1.23M | 217K |
2016 | 69% | 31% | 3% | 2.87M | 1.30M | 129K |
2017 | 69% | 31% | 3% | 2.90M | 1.29M | 124K |
2018 | 71% | 29% | 3% | 2.97M | 1.24M | 125K |
Yes and No percentages are calculated out of those who answered. NA is reported out of all the data to provide context on how much data is missing. | ||||||
Source: Author's analysis of IPUMS data. |
That removed about half of our NA values. It might be nice to know a bit more about the missing data, but at around 3 percent of observations it’s unlikely to change our substantive conclusions. I suspect these are cases where there wasn’t an answer for that question. We’ll keep an eye on NA values as we do the analysis, because as we get into questions like how internet access varies by race, income, age, and education we’ll want to know if NA answers are more or less likely in any of those categories.
Checking against data.census.gov
To do a quick check against the way the census bureau itself analyzes the data I looked at data.census.gov for 2018 in Kentucky. An important note is that their data is for households, and so their numeric counts look quite different because I’m counting number of people. They also have a breakdown where cellular is included in broadband, which I do not want, as a cell phone is not really an adequate work or study device. So to get to what I have we need to add “Broadband such as cable, fiber optic or DSL” and “Satellite Internet service”, which gets us to 70.8% compared to the 70.5% in this analysis. The difference is small and most likely the result of their analysis being weighted to the household level rather than the person level. (Internet is measured at the household level and the same for every person in the household, but by choosing to weight it at the person level I am a) letting us talk in terms of people, b) giving more weight to larger households, c) making it possible to break down internet access by categories that do vary within households, like age).
Analysis
Going forward we’re going to want to filter by group quarters, so let’s apply that filter to our main dataframe.
<- df %>%
df filter(GQ == 1 | GQ ==2 | GQ == 5)
Standard Errors
Know that we know the data we’d also like to know how uncertain our sample is so that we know if movements over time are real or just a result of noisy data. There are a few ways to do this. The survey
package does an excellent job with complex survey designs, but does require learning a new syntax to use. The alternative I’ll use here is a method known as bootstrap. IPUMS suggests using bootstrap might be the best way to get standard errors on census microdata. The basic idea of the bootstrap is to resample the existing data and use the sampling error from that as an estimate for sampling error in the overall population. Let’s do an example with high speed internet in 2018 to see how it works. The output here will be the mean and standard deviation for Kentucky. (We’ll use the standard error to calculate confidence intervals once we start displaying actual results.)
#set seed
set.seed(42)
# Filter to just 2018
# Exclude NA values
# Recode as numeric vector of 1 and 0
# The numeric 1 and 0 form will make it much easier to get means without pivoting, which matters a lot when doing this 1000 times
<- df %>%
df2018 filter(YEAR == 2018 & !is.na(hspd_int)) %>%
mutate(hspd_num = if_else(hspd_int == "Yes", 1, 0)) %>%
select(hspd_num, PERWT)
# Write a function so I can map over it.
# In this case, we need the function to do the same thing X number of times and assign an ID that we can use as a grouping variable
<- function(sample_id){
create_samples <- df2018[sample(nrow(df2018), nrow(df2018), replace = TRUE) , ] %>%
df_out as_tibble()
$sample_id <- sample_id
df_outreturn(df_out)
}
<- as.list(seq(1, 1000, by = 1))
nlist <- purrr::map_df(nlist, create_samples)
samples
<- samples %>%
sample_summary group_by(sample_id) %>%
mutate(ind_weight = PERWT / sum(PERWT),
hspd_weight = hspd_num * ind_weight) %>% # PERWT is population and doesn't sum to 1. Rescale it to sum to one
summarize(group_mean = sum(hspd_weight),
weight_check = sum(ind_weight), .groups = "drop") # Check that my weights add up to one
<- tibble(
display_tbl mean = mean(sample_summary$group_mean),
sd = sd(sample_summary$group_mean)
)
mean | sd |
---|---|
0.7053407 | 0.002866898 |
We can also take a look at our bootstrap graphically. We want to check that the distribution of the sample is roughly normal. If it’s not, that means we didn’t do enough bootstrap samples for the Central Limit Theorem to kick in.
#Check that the distribution is normal and than the middle of the distribution is close to the 70.5% we estimated had internet access above
ggplot(sample_summary, aes(group_mean)) +
geom_density() + theme_bw() +
labs(title = "Bootstrapped means of High Speed Internet Access",
x = "Mean",
y = "Kernel Density")
Checking our results against the survey package
Above we found a mean of 0.705 for 2018 and and standard error of 0.0029 based on our bootstrap analysis. It’s worth checking that this is the same result we’d get using an analytic approach (instead of bootstrap).
# Here we're assuming a simple design.
# Survey requires the creation of a design object and then has functions that work with that object.
# You can get more complicated, which is when the survey package would be most useful.
<- svydesign(ids = ~ 1, weights = ~PERWT, data = df2018)
svy_df
# Taking the mean and standard error from our design object
<- svymean(~hspd_num, design = svy_df)
hint_tbl
<- as_tibble(hint_tbl)
hint_tbl names(hint_tbl) <- c("mean", "sd") #The names weren't coerced correctly when transforming into a tibble.
mean | sd |
---|---|
0.7051774 | 0.00293509 |
These results are very similar. Following the IPUMS recommendation we’ll continue on with the bootstrap, but it’s good to know the results are the same for practical purposes. So now instead of just doing 2018, we’ll need to do every year. We already know the mean values for every year, and they’re still saved in the df_wide
variable right now. So let’s write a function for bootstrap that will let us find standard errors for every year or for any other grouping we choose.
Writing a bootstrap function
# Create a helper function
# It needs to have a way to recieve the dataframe from the function that calls it, so we've added a second argument
<- function(sample_id, df){
create_samples
<- df[sample(nrow(df), nrow(df), replace = TRUE) , ] %>%
df_out as_tibble()
$sample_id <- sample_id
df_out
return(df_out)
}
# Need to be able to take in grouping variables so that the summaries can be specific to the groups
# Using the embrace {{}} notation allows use to pass in unquoted variables to the function.
<- function(df, num_samples, group_vars) {
bootstrap_pums
<- as.list(seq(1, num_samples, by = 1))
nlist <- purrr::map_df(nlist, create_samples, df)
samples
<- samples %>%
sample_summary group_by( sample_id, across( {{group_vars}} )) %>%
mutate(ind_weight = PERWT / sum(PERWT),
hspd_weight = hspd_n * ind_weight) %>% # PERWT sums to population instead of to 1. Rescale it to sum to 1.
summarize(group_mean = sum(hspd_weight), .groups = "drop") # Not dropping .groups here results in problems in the next group_by call.
<- sample_summary %>%
sample_sd group_by( across( {{ group_vars }} )) %>%
summarize(sd = sd(group_mean), .groups = "drop")
}
# We do need to prep the data a little so that we're not carrying through the whole dataframe.
<- df %>%
df_in filter(!is.na(hspd_int)) %>%
mutate(hspd_n = if_else(hspd_int == "Yes", 1, 0)) %>%
select(hspd_n, PERWT, YEAR)
# And finally we can call the function
<- bootstrap_pums(df = df_in, num_samples = 500, group_vars = YEAR) boot_results
Now that we have our bootstrap standard errors we can combine them with the data and plot them. We’ll use 95% confidence intervals, which we get by multiplying the standard error by 1.96 (the number of standard deviations that corresponds to a 95% confidence interval).
<- df_wide %>%
df_plt full_join(boot_results, by = "YEAR") %>%
transmute(Year = YEAR,
Percent = 100 * percent_hspd,
me = 100 * 1.96 * sd)
<- ggplot(df_plt, aes(x = Year, y = Percent)) +
plt_int geom_errorbar(aes(ymin = Percent - me, ymax = Percent + me), width = .1) +
geom_line() +
geom_point() +
theme_bw() +
labs(title = "High Speed Internet Access") +
theme(legend.position = "bottom")
plt_int
Race, Poverty, Age, and Geography
Going a little deeper we can use microdata to get results by custom groupings. I’ll show examples for race, poverty, age, and geography, but for any group you can construct with available Census data you can produce an estimate for their internet acess.
Race
Next we’ll build a table by race and year.
# Let's build a table first and then we'll do the standard errors
# Coding a race variable using case_when
<- df %>%
df mutate(race = case_when(
== 1 ~ "White",
RACE == 2 ~ "Black",
RACE > 3 & RACE < 7 ~ "Asian",
RACE > 0 & HISPAN < 5 ~ "Hispanic",
HISPAN TRUE ~ "All Others"
))
<- df %>%
df_group group_by(hspd_int, race, YEAR) %>%
summarize(count = sum(PERWT), .groups = "drop")
# Pivot for easier percent calculations
<- df_group %>%
df_wide pivot_wider(id_cols = c(race, YEAR), names_from = hspd_int, values_from = count) %>%
mutate(percent_hspd = (Yes / (Yes + No)),
percent_no = 1 - percent_hspd,
percent_NA = (`NA` / (Yes + No + `NA`)))
Table 6: High Speed Internet Access | ||||||
By Race and Ethnicity | ||||||
Year | Percent | Number of People | ||||
---|---|---|---|---|---|---|
Yes | No | NA | Yes | No | NA | |
Asian | ||||||
2013 | 84% | 16% | 4% | 42.2K | 8.03K | 2.06K |
2014 | 76% | 24% | 2% | 41.6K | 12.9K | 1.19K |
2015 | 83% | 17% | 3% | 48.8K | 10.3K | 1.76K |
2016 | 77% | 23% | 10% | 43.8K | 13.0K | 6.55K |
2017 | 80% | 20% | 1% | 52.5K | 12.9K | 703 |
2018 | 82% | 18% | 3% | 52.8K | 11.7K | 1.66K |
Black | ||||||
2013 | 64% | 36% | 9% | 192K | 106K | 28.7K |
2014 | 58% | 42% | 9% | 171K | 122K | 30.3K |
2015 | 62% | 38% | 12% | 183K | 113K | 40.5K |
2016 | 62% | 38% | 5% | 204K | 127K | 15.6K |
2017 | 64% | 36% | 2% | 216K | 122K | 8.04K |
2018 | 62% | 38% | 3% | 200K | 122K | 9.90K |
Hispanic | ||||||
2013 | 48% | 52% | 4% | 21.6K | 23.8K | 1.90K |
2014 | 38% | 62% | 12% | 15.7K | 25.2K | 5.66K |
2015 | 47% | 53% | 10% | 17.2K | 19.5K | 4.01K |
2016 | 59% | 41% | 4% | 23.0K | 15.8K | 1.59K |
2017 | 57% | 43% | 1% | 24.7K | 18.5K | 421 |
2018 | 59% | 41% | 1% | 30.9K | 21.1K | 616 |
White | ||||||
2013 | 68% | 32% | 4% | 2.45M | 1.16M | 148K |
2014 | 68% | 32% | 4% | 2.44M | 1.16M | 152K |
2015 | 70% | 30% | 4% | 2.54M | 1.07M | 164K |
2016 | 69% | 31% | 3% | 2.54M | 1.13M | 102K |
2017 | 70% | 30% | 3% | 2.54M | 1.11M | 113K |
2018 | 71% | 29% | 3% | 2.61M | 1.06M | 108K |
All Other Races | ||||||
2013 | 72% | 28% | 6% | 58.3K | 22.3K | 4.90K |
2014 | 70% | 30% | 4% | 65.9K | 27.7K | 3.91K |
2015 | 71% | 29% | 7% | 59.6K | 24.3K | 6.19K |
2016 | 75% | 25% | 4% | 63.2K | 21.3K | 3.12K |
2017 | 71% | 29% | 2% | 68.9K | 28.2K | 2.23K |
2018 | 72% | 28% | 5% | 73.9K | 29.4K | 5.28K |
Yes and No percentages are calculated out of those who answered. NA is reported out of all the data to provide context on how much data is missing. | ||||||
Source: Author's analysis of IPUMS data. |
While we do see high NA values in some years, overall they’re at reasonable levels and seem to be lowest in 2018. This table is quite long though. While that works for exploring the data, for an actual display we’d probably want to focus on just 2018 and show the history in a graph. Below is the table filtered to just 2018 and slightly reformatted.
Table 7: High Speed Internet Access in 2018 | ||||||
By Race and Ethnicity | ||||||
Race/Ethnicity | Percent | Number of People | ||||
---|---|---|---|---|---|---|
Yes | No | NA | Yes | No | NA | |
All Others | 72% | 28% | 5% | 73.9K | 29.4K | 5.28K |
Asian | 82% | 18% | 3% | 52.8K | 11.7K | 1.66K |
Black | 62% | 38% | 3% | 200K | 122K | 9.90K |
Hispanic | 59% | 41% | 1% | 30.9K | 21.1K | 616 |
White | 71% | 29% | 3% | 2.61M | 1.06M | 108K |
Yes and No percentages are calculated out of those who answered. NA is reported out of all the data to provide context on how much data is missing. | ||||||
Source: Author's analysis of IPUMS data. |
Now let’s add standard errors and graph the data. I’ll write the code for graphing the data as a function since we will use it again.
# We do need to prep the data a little so that we're not carrying through the whole dataframe.
<- df %>%
df_in filter(!is.na(hspd_int)) %>%
mutate(hspd_n = if_else(hspd_int == "Yes", 1, 0)) %>%
select(hspd_n, PERWT, YEAR, race)
# And we can call the bootstrap function
<- bootstrap_pums(df = df_in, num_samples = 500, group_vars = c(YEAR, race))
boot_results
<- df_wide %>%
df_plt full_join(boot_results, by = c("race", "YEAR")) %>%
transmute(Year = YEAR,
Race = race,
Percent = 100 * percent_hspd,
me = 100 * 1.96 * sd) %>%
filter(Race != "All Others") # When plotting All Others overlaps White and having five lines makes it quite hard to read.
# At this point I'll introduce a function to plot multiple groups over time, since we'll use this again
<- function(df, group_var, title_text = "") {
plt_by
<- ggplot(data = df, aes(x = Year, y = Percent, group = {{group_var}}, colour = {{group_var}})) +
plt geom_errorbar(aes(ymin = Percent - me, ymax = Percent + me), width = .1) +
geom_point() +
geom_line() +
theme_bw() +
labs(title = title_text, x = "Year", y = "Percent") +
theme(legend.position = "bottom")
plt
}
<- plt_by(df_plt, Race, title_text = "High Speed Internet Access by Race and Ethnicity")
plt_race
plt_race
Poverty Status
I’m going to skip showing the code for Poverty, Age, and Geography because it’s extremely similar to the code used for Race. The poverty variable in the IPUMS data is measured in income as a percent of the poverty line. So for this analysis I code under 100 as being in poverty, between 100 and 200 percent of the poverty line as near poverty, and above 200 percent as not being in poverty.
Table 8: High Speed Internet Access in 2018 | ||||||
By Poverty Status | ||||||
Poverty Status | Percent | Number of People | ||||
---|---|---|---|---|---|---|
Yes | No | NA | Yes | No | NA | |
In Poverty | 53% | 47% | 4% | 376K | 335K | 30.6K |
Near Poverty | 62% | 38% | 4% | 499K | 311K | 29.7K |
Not in Poverty | 78% | 22% | 2% | 2.09M | 595K | 64.8K |
Yes and No percentages are calculated out of those who answered. NA is reported out of all the data to provide context on how much data is missing. | ||||||
Source: Author's analysis of IPUMS data. |
Age
The age varible in IPUMS is the most straightforward, it’s just numeric age. It is worth pointing out that this is based on individual ages, while high speed internet is a household level variable. If we really wanted to do a deep dive into just age, we’d want to look at the age composition of the whole household. The really nice thing about microdata is you can slice and dice it any way you think is appropriate. So if you want to look at households that have individuals under 18 and over 65, you can. If you want to look at households with no one under 65, you can. I’ve just taken a high level cut of the data here though.
Table 9: High Speed Internet Access in 2018 | ||||||
By Age Group | ||||||
Age Group | Percent | Number of People | ||||
---|---|---|---|---|---|---|
Yes | No | NA | Yes | No | NA | |
18 and under | 74% | 26% | 2% | 760K | 263K | 26.0K |
19 to 64 | 72% | 28% | 3% | 1.81M | 691K | 76.7K |
65+ | 58% | 42% | 3% | 398K | 288K | 22.4K |
Yes and No percentages are calculated out of those who answered. NA is reported out of all the data to provide context on how much data is missing. | ||||||
Source: Author's analysis of IPUMS data. |
Geography
IPUMs provides categorizations of where people live as either being in a prinicipal city, in the metro but not the principal city, or outside of a metro area. I have chosen to present this as the more familiar (and shorter) urban, suburban, and rural. IPUMS also has a direct measure of density that would be useful in an analysis - but for tables and graphs the categorial variable is better.
Table 10: High Speed Internet Access in 2018 | ||||||
By Metropolitan Status | ||||||
Metropolitan Status | Percent | Number of People | ||||
---|---|---|---|---|---|---|
Yes | No | NA | Yes | No | NA | |
City | 85% | 15% | 3% | 257K | 46.2K | 8.16K |
Mixed/Unknown | 68% | 32% | 3% | 1.60M | 752K | 74.7K |
Rural | 67% | 33% | 3% | 694K | 341K | 35.4K |
Suburbs | 81% | 19% | 1% | 420K | 102K | 6.77K |
Yes and No percentages are calculated out of those who answered. NA is reported out of all the data to provide context on how much data is missing. | ||||||
Source: Author's analysis of IPUMS data. |
Mapping the Data
To map the data you’ll need to first get shapefiles for Public Use Microdata Areas. The tradeoff for using microdata is that it’s not available down to the tract level. It’s individual responses, so to protect the privacy of respondents all geographic areas have to include at least 100,000 people. Those areas are drawn by the census and are the public use microdata areas mentioned above. You can get shapefiles for them from the census here. You’ll need to unzip the folder and then the rgdal
package points at the entire unzipped folder, not just an individual file.
There are a lot of options for mapping in R. I’ve chosen to use sf
and ggplot2
because it works well with the tidyverse and uses familiar ggplot2 syntax. I really like the leaflet
package for making interactive maps, but static maps also have their place, and that’s what I’ve gone with here.
# Set up shapefile as an sf object
<- readOGR("tl_2019_21_puma10", layer = "tl_2019_21_puma10", GDAL1_integer64_policy = TRUE, verbose = FALSE)
ky_shp @data$PUMA <- as.numeric(as.character(ky_shp@data$PUMACE10))
ky_shp<- st_as_sf(ky_shp)
ky_sf
# Calculate internet access at the PUMA level
<- df %>%
df_group filter(YEAR == 2018) %>%
group_by(hspd_int, PUMA) %>%
summarize(count = sum(PERWT), .groups = "drop")
# Pivot for easier percent calculations
<- df_group %>%
df_wide pivot_wider(id_cols = PUMA, names_from = hspd_int, values_from = count) %>%
mutate(percent_hspd = (Yes / (Yes + No)),
percent_no = 1 - percent_hspd)
<- tibble(
int_puma PUMA = df_wide$PUMA,
int_per = df_wide$percent_no * 100
)
<- full_join(ky_sf, int_puma, by = "PUMA")
ky_sf
ggplot(ky_sf) +
geom_sf(aes(fill=int_per)) +
scale_fill_gradient(low = "blue", high = "purple", name = "Percent") +
theme_bw() +
theme(panel.grid = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.border = element_blank()) +
labs(title = "People in Households without High Speed Internet Access",
caption = "Map is shaded by percent of people without high speed access in each Public Use Microdata Area")
School Age Children
As I’ve mentioned, a nice feature of using microdata is that you can look at the data in ways that aren’t available in the premade Census tabulations. With a lot of school districts using online learning, a look at the percent of school age children (ages 5-18) without high speed internet access at home could be useful for policymakers. Here we can see that there are parts of Western Kentucky where up to 60% do not have high speed access at home.
#need to recreate the sf object for the joins to work correctly
<- st_as_sf(ky_shp)
ky_sf
# Calculate internet access at the PUMA level
<- df %>%
df_group filter(YEAR == 2018 & AGE >= 5 & AGE <= 18) %>%
group_by(hspd_int, PUMA) %>%
summarize(count = sum(PERWT), .groups = "drop")
# Pivot for easier percent calculations
<- df_group %>%
df_wide pivot_wider(id_cols = PUMA, names_from = hspd_int, values_from = count) %>%
mutate(percent_hspd = (Yes / (Yes + No)),
percent_no = 1 - percent_hspd)
<- tibble(
int_puma PUMA = df_wide$PUMA,
int_per = df_wide$percent_no * 100,
int_num = df_wide$No
)
<- full_join(ky_sf, int_puma, by = "PUMA")
ky_sf
ggplot(ky_sf) +
geom_sf(aes(fill=int_per)) +
scale_fill_gradient(low = "blue", high = "purple", name = "Percent") +
theme_bw() +
theme(panel.grid = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.border = element_blank()) +
labs(title = "Children ages 5-18 in Households without High Speed Internet Access",
caption = "Map is shaded by percent of children without high speed access in each Public Use Microdata Area")
Louisville (Jefferson County) with Labels
#need to recreate the sf object for the joins to work correctly
<- st_as_sf(ky_shp) %>%
ky_sf filter(PUMA %in% c("1701", "1702", "1703", "1704", "1705", "1706"))
# Calculate internet access at the PUMA level
# Filter to only the PUMAs in Jefferson County
<- df %>%
df_group filter(YEAR == 2018 & AGE >= 5 & AGE <= 18 & PUMA %in% c("1701", "1702", "1703", "1704", "1705", "1706")) %>%
group_by(hspd_int, PUMA) %>%
summarize(count = sum(PERWT), .groups = "drop")
# Pivot for easier percent calculations
<- df_group %>%
df_wide pivot_wider(id_cols = PUMA, names_from = hspd_int, values_from = count) %>%
mutate(percent_hspd = (Yes / (Yes + No)),
percent_no = 1 - percent_hspd)
<- tibble(
int_puma PUMA = df_wide$PUMA,
int_per = df_wide$percent_no * 100,
int_num = formattable::comma(round(df_wide$No, -2), digits = 0)
)
<- full_join(ky_sf, int_puma, by = "PUMA")
ky_sf
ggplot(ky_sf) +
geom_sf(aes(fill=int_per)) +
geom_sf_label(aes(label = int_num, fontface = "bold")) +
scale_fill_gradient(low = "blue", high = "purple", name = "Percent") +
theme_bw() +
theme(panel.grid = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.border = element_blank()) +
labs(title = "Children ages 5-18 in Households without High Speed Internet Access in Jefferson County",
caption = "Map is shaded by percent of children without access in each Public Use Microdata Area and
the number of children without access is given by the label rounded to the nearest 100")
Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not
give correct results for longitude/latitude data
Conclusion and next steps
The one thing we haven’t really touched on in this blog is applying this data. The focus is here is intentionally on getting to know the data and making descriptive statistics. A clear next step to making the analysis useful though is to start talking to subject matter experts. Is internet access better in other states? Other countries? Are there cities where access is higher? How did they do it? What’s the history of our broadband infrastructure? I recommend this article from the Atlantic as a starting point if you’re interested in the high speed internet question. No matter what data you’re analyzing though, you’ll need to put it in context both to know how to explore and model it, and also to communicate the results in a way that will be meaningful to other people.
Once you get the hang of working with Census microdata it unlocks the ability to answer some really interesting questions. Like all analytics work, it’s important to first get to know the data, which is why we spent so long at the top of this post making wrong tables until figuring out the data structure. Visualizing uncertainty is also important when working with survey data. There are still a few places this analysis could be cleaned up. As one example, while I made functions for bootstrapping standard errors and for graphing the data over time, some of the cleaning code could be functionalized rather than repeated. The code should also be rearranged so that all the data gets recoded at the beginning rather than doing it piecemeal on an as needed basis. We could also start to look at how the categories we’ve analyzed (race, poverty, age, geography) overlap, either using more detailed crosstabs or by building some models. The goal here though was to show what a full analysis would look like, including stumbling through some of the initial exploration of the data.