Mapping Census Data with Python

This post covers two useful skills: making maps and accessing Census data. Most of the python map-making tutorials I found online showed how to make maps using data that already came with the library, and didn’t have many notes on how the data is formatted or how to bring in new data.

In addition to the standard analysis libraries numpy and pandas, we’re going to use GeoPandas for mapping data and the requests library for getting Census data.

import numpy as np
import pandas as pd
import geopandas as gpd
import requests
The Initial Map Layer
The Census Bureau has shapefiles here, and I’ve downloaded a zip file of all counties in the United States. I’ll filter it down to just Kentucky for this example.

GeoPandas works well with census shapefiles out of the box. Don’t unzip the downloaded file; it can be read straight into a GeoDataFrame, which works like a pandas dataframe but with an additional column of geographic attributes.
county_shp = gpd.read_file('cb_2021_us_county_500k.zip')
ky_counties = county_shp[county_shp.STATE_NAME == 'Kentucky']
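If you’d rather not download the file manually, recent versions of GeoPandas can usually read the zip straight from the Census Bureau’s server. A minimal sketch, assuming the 2021 cartographic boundary files are still hosted at their usual location (verify the URL before relying on it):

#read the county boundaries directly from the Census Bureau (URL is an assumption)
county_shp = gpd.read_file('https://www2.census.gov/geo/tiger/GENZ2021/shp/cb_2021_us_county_500k.zip')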
GeoPandas also provides built-in integrations with both matplotlib (for static maps) and folium (for interactive maps). Folium is a python wrapper for Leaflet, the JavaScript library behind many interactive web maps. Producing the initial map layer is straightforward using the built-in explore method.
ky_counties.explore()
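For comparison, the static matplotlib route is about as short. A minimal sketch, assuming matplotlib is installed (it isn’t used elsewhere in this post):

import matplotlib.pyplot as plt

#static map of Kentucky county outlines
ax = ky_counties.plot(color = 'white', edgecolor = 'black', figsize = (10, 6))
ax.set_axis_off()
plt.show()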
Getting Census Data
I’m going to use internet access as our example data. The American Community Survey collects information about internet access in table S2801. However, downloading data from data.census.gov is a formatting nightmare, because there’s no clear geography column to match on and the data seems to only be available in an oddly nested structure. Instead, I’m going to use the Census API.
I used data.census.gov to figure out which tables contained information on internet access. Then I found the variables table for ACS 5-year data and searched for table S2801. It wasn’t there, but B28002 was, and it contains the same information as raw counts without the summary percentages. It’ll be easy enough to recompute the percentages.
I want to know what percent of households have internet access, not including cell-phone-only internet access or dial-up. Variable B28002_007E tells me how many households have an internet subscription that isn’t dial-up or cellular. Variable B28002_001E gives me the total number of households.
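If you’d rather search the variable list programmatically than scroll the HTML table, the same metadata is published as JSON. A small sketch (the filtering here is my own addition, not part of the original workflow):

import requests

#the full ACS 5-year variable dictionary for 2021
variables = requests.get('https://api.census.gov/data/2021/acs/acs5/variables.json').json()['variables']

#print every variable in table B28002 along with its human-readable label
for name, info in sorted(variables.items()):
    if name.startswith('B28002'):
        print(name, '-', info['label'])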
There is a census python package, but it doesn’t make things much simpler than using the API directly with the more popular and generic requests library. You need your own Census API key, which is available here. I saved mine in a text file to avoid putting it into the blog directly.
Putting together the request to the API involves constructing a long URL string out of multiple pieces. Building the URL from separate variables makes it easier to modify this example to suit your own use case.
#load API key
with open('census_api_key.txt') as key:
    api_key = key.read().strip()
#specify the data source by year and survey
year = '2021'
dsource = 'acs'   #American Community Survey
dname = 'acs5'    #5 year average from American Community Survey

base_url = f'https://api.census.gov/data/{year}/{dsource}/{dname}'
#unique to this specific data request
cols = 'NAME,B28002_001E,B28002_007E'  #NAME of geography as well as the variables I want to pull
geo = 'county'      #county geography level
state = 'state:21'  #21 is the FIPS code for Kentucky
#add unique request features to the base_url
data_url = f'{base_url}?get={cols}&for={geo}&in={state}&key={api_key}'
#go get the data
response = requests.get(data_url)
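The API returns JSON on success, but a bad key or malformed query comes back as an error response, so it’s worth failing fast before parsing. This check is my addition, not part of the original post:

#stop here with an informative error if the request failed
response.raise_for_status()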
#take the data in json and format it into a dataframe
data = response.json()

df = (pd.DataFrame(data = data[1:], columns = data[0])  #first row is column names, everything else is data.
        .rename(columns = {'NAME' : 'county_name',
                           'B28002_001E' : 'population',
                           'B28002_007E' : 'pop_int_access',
                           'state' : 'state_fips',
                           'county' : 'county_fips'}))
#the fips for a county is a concatenation of state and county fips
df['fips'] = df['state_fips'] + df['county_fips']  #make sure these are strings so it concatenates and doesn't add.
#changing the data to be numeric, since everything starts as string
df[['population', 'pop_int_access']] = df[['population', 'pop_int_access']].apply(pd.to_numeric)
#calculating percent with internet access
df['percent_int'] = np.round(100 * (df['pop_int_access']/df['population']), 1)
df.head()
|   | county_name | population | pop_int_access | state_fips | county_fips | fips | percent_int |
|---|---|---|---|---|---|---|---|
| 0 | Adair County, Kentucky | 6890 | 3989 | 21 | 001 | 21001 | 57.9 |
| 1 | Allen County, Kentucky | 7629 | 4559 | 21 | 003 | 21003 | 59.8 |
| 2 | Anderson County, Kentucky | 9063 | 5032 | 21 | 005 | 21005 | 55.5 |
| 3 | Ballard County, Kentucky | 2945 | 1610 | 21 | 007 | 21007 | 54.7 |
| 4 | Barren County, Kentucky | 17307 | 12309 | 21 | 009 | 21009 | 71.1 |
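As a quick sanity check (my addition, not in the original post), Kentucky has 120 counties, so we should have one row per county and a computed percentage for each:

#one row per Kentucky county, and every row has a computed percentage
assert len(df) == 120
assert df['percent_int'].notna().all()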
Mapping the Data
Now that we have our data, we need to combine it back into our map dataset. A nice thing about GeoPandas is that we can work with it the same way we work with a pandas dataframe.
We’ll still use the explore method from GeoPandas to map, and to create a choropleth map we set the column argument to the name of the column we want to use to shade the map. We can specify what information goes in the tooltip by passing a list of column names. I’ve also set a few other options to change the map background to something a bit more muted and to bring out the lines between the counties a bit more cleanly.
#construct the fips code to match with the geography data
#matching on fips is generally preferable to matching on county name because of potential formatting differences in names.
ky_counties['fips'] = ky_counties['STATEFP'] + ky_counties['COUNTYFP']
#merge in the new data
#clean up the names to be a bit more presentable
df_map = (ky_counties.merge(df, how = 'left', on = 'fips')
            .rename(columns = {'NAME' : 'County Name',
                               'percent_int' : 'Home Internet Access (%)',
                               'population' : 'County Population'}))

df_map.explore(column = 'Home Internet Access (%)',
               tooltip = ['County Name', 'Home Internet Access (%)', 'County Population'],
               tiles = 'CartoDB positron',         #fades into the background better than the default
               style_kwds = dict(color = "black")  #outlines stay black for a crisper feel
               )
Tract Level Analysis
To show the flexibility of this approach and out of personal curiosity, let’s switch from county level data to tract level data and zoom in on Louisville, Kentucky (Jefferson County). The Census Bureau’s example API queries are extremely useful once you’ve mastered the basics. Here we have to change the geography request from county to tract, and then add a filter to restrict it to Jefferson County, Kentucky.
The tract numbers aren’t easily interpretable, which is one place where the interactive background map can be very helpful. You can zoom in and see the default neighborhood names in the background map, as well as major roads and interstates that allow people familiar with Louisville to understand where each census tract is located.
The code here is a little repetitive compared to the Kentucky county example, and if we were doing this frequently we’d want to factor the common parts into functions (i.e. a function that does the routine data processing after the API call and another that sets the default styles for the map), as sketched below.
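As a rough sketch of what that first helper might look like (the function name, signature, and column names here are mine, not from the original code):

def tidy_acs_response(data, rename_map, fips_cols):
    #turn the raw JSON rows from the Census API into a tidy dataframe
    out = pd.DataFrame(data = data[1:], columns = data[0]).rename(columns = rename_map)
    #concatenate the geography identifiers (still strings) into a single fips column
    out['fips'] = out[fips_cols].agg(''.join, axis = 1)
    out[['population', 'pop_int_access']] = out[['population', 'pop_int_access']].apply(pd.to_numeric)
    out['percent_int'] = np.round(100 * (out['pop_int_access']/out['population']), 1)
    return out

#e.g. tidy_acs_response(response.json(), rename_map, ['state_fips', 'county_fips', 'tract_fips'])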
## First we need a tract level map, downloaded from the same census collection
# if downloading from census, read in as a zip file: https://geopandas.org/en/stable/docs/user_guide/io.html
ky_shp = gpd.read_file('cb_2021_21_tract_500k.zip')
lou_shp = ky_shp[ky_shp['COUNTYFP'] == '111']
## Now pulling Census data
#we can use the same base_url, but the data_url needs to be updated
#the variables we want to pull stay the same
cols = 'NAME,B28002_001E,B28002_007E'  #NAME of geography as well as the variables I want to pull
geo = 'tract'          #tract geography level
state = 'state:21'     #21 is the FIPS code for Kentucky
county = 'county:111'  #111 is the FIPS code for Jefferson County, KY
#add unique request features to the base_url
data_url = f'{base_url}?get={cols}&for={geo}&in={state}&in={county}&key={api_key}'
#go get the data
response = requests.get(data_url)
#clean up the data into a dataframe
data = response.json()
df = (pd.DataFrame(data = data[1:], columns = data[0])  #first row is column names, everything else is data.
        .rename(columns = {'NAME' : 'tract_name',
                           'B28002_001E' : 'population',
                           'B28002_007E' : 'pop_int_access',
                           'state' : 'state_fips',
                           'county' : 'county_fips',
                           'tract' : 'tract_fips'}))
#the fips for a tract is a concatenation of state, county, and tract fips
df['fips'] = df['state_fips'] + df['county_fips'] + df['tract_fips']  #make sure these are strings so it concatenates and doesn't add.
#changing the data to be numeric, since everything starts as string
df[['population', 'pop_int_access']] = df[['population', 'pop_int_access']].apply(pd.to_numeric)
#calculating percent with internet access
df['percent_int'] = np.round(100 * (df['pop_int_access']/df['population']), 1)
#tract fips is in our shape file already as GEOID
lou_map = (lou_shp.merge(df, how = 'left', left_on = 'GEOID', right_on = 'fips')
             .rename(columns = {'tract_name' : 'Tract Number',
                                'percent_int' : 'Home Internet Access (%)',
                                'population' : 'Tract Population'}))

#tracts are pretty small in the map, so I've opted against the black boundary outlines
lou_map.explore(column = 'Home Internet Access (%)',
                tooltip = ['Tract Number', 'Home Internet Access (%)', 'Tract Population'],
                tiles = 'CartoDB positron'  #fades into the background better than the default
                )
More on Internet Access
When schools were moving to remote learning during Covid, I made a much longer post on how to use Census microdata to analyze, at an individual level, how internet access relates not only to geography but also to race, poverty, and age. Internet access remains an important consideration. There seems to be a common assumption that internet access is nearly universal in the United States. While people may have some access through libraries and cell phone data, there is a large gap in who has reliable home internet access that can be used for work and homework.