Three-digit ZIP Codes appear frequently in real world health care data. Since patient registration and medical billing rely on patient addresses, they are common data elements in EHR and medical claims information systems. Providing the first three digits of a ZIP Code is a common data strategy vendors seek to provide geographic data while protecting patient privacy. Unfortunately, ZIP Codes are difficult to work with, and utilizing three-digit versions offers additional challenges.
Three-digit ZIP Codes refer to a group of ZIP Codes that share the same first three digits. For example, the St. Louis, Missouri ZIP Codes 63101, 63102, and 63103 would all be part of the 631 three-digit ZIP Code. These first three digits correspond to “sectional center facilities” (SCFs) operated by the United States Postal Service (USPS). Sectional center facilities sit between larger “network distribution centers” (NDCs) and local post offices, sorting and distributing mail. Each SCF has one more three-digit ZIP Codes associated with it. The SCF for St. Louis is in St. Louis City, Missouri, and it services approximately a dozen three-digit ZIP Codes in Eastern Missouri and Southern Illinois.
Unlike five-digit ZIP Codes, which have the Census Bureau analogue of ZIP Code Tabulation Areas (ZCTAs), there is no Census equivalent for three-digit ZIP Codes. This is because three-digit ZIP Codes are not geographic areas, but rather mail sorting facilities. Aggregating ZCTAs using their first three digits illustrate yet another challenge - the boundaries of three-digit ZCTAs are not contiguous. This means that some ZCTAs are split into multiple pieces that are not adjacent to each other.
When the first three-digits are the only three digits given, it is
not possible to use the ZIP to ZCTA crosswalk files included in
zippeR
. This increases the misclassification rate, because
some of the observations will be assigned to the wrong three-digit ZCTA.
For example, the ZIP Code 64999 in Kansas City is part of the 649
three-digit ZIP Code, but it is not part of the 649 three-digit ZCTA.
According to the 2022 UDS crosswalk file, the appropriate ZCTA for 64999
is 64108, which has the 641 three-digit ZIP Code.
zippeR
provides several functions for downloading and
using three-digit ZCTA data. They should be used with caution and the
user should be aware of the limitations of the data described above.
The zi_load_labels()
function can be used to load a set
of labels for three-digit ZIP Codes. The function requires a
type
argument, which should be set to "zip3"
.
The function will return a tibble with the area and state associated
with the SCF assigned to a particular three-digit ZIP.
> zi_load_labels(source = "USPS", type = "zip3", vintage = 202408)
# A tibble: 931 × 3
zip3 label_area label_state
<chr> <chr> <chr>
1 005 MID-ISLAND NY
2 006 SAN JUAN PR
3 007 SAN JUAN PR
4 008 SAN JUAN PR
5 009 SAN JUAN PR
6 010 HARTFORD CT
7 011 HARTFORD CT
8 012 HARTFORD CT
9 013 CENTRAL MA
10 014 CENTRAL MA
# ℹ 921 more rows
# ℹ Use `print(n = ...)` to see more rows
Use these values with caution - the area and state may not correspond
to the physical location of associated five-digit ZIP Codes. For
example, the three-digit ZIP 010
covers Western
Massachusetts. However, the SCF that serves it is located in Hartford,
CT. The label_area
and label_state
values are
based on the SCF location, not the geographic area served by the
three-digit ZIP Code.
The zi_label()
function can be used to label your data
with these values. If you have five-digit ZIP Codes and you want to
convert them to three-digit ZIPs, the zi_convert()
function
is a helpful tool for shortening those values quickly.
Three-digit ZCTA geometric data can be downloaded using
zi_get_geometry()
. The following syntax downloads all ZCTA3
for the United States, excluding overseas territories:
Optionally, you can specify a specific state, county, or territory to limit your data object’s extent:
mo_zcta3 <- zi_get_geometry(year = 2020, style = "zcta3", state = "MO", territory = NULL, method = "intersect")
The zi_get_geometry()
function downloads pre-made
geometric data from the Census Bureau’s TIGER/Line Shapefiles, which
were created by downloading the ZCTA data, grouping features by the
first three digits of the ZCTA, and then summarizing the features to
dissolve them. Finally,
sf::st_simplify(out, preserveTopology = TRUE, dTolerance = 20)
was used to simplify the features and reduce the size of each file.
Data are available from 2010 through 2023, excluding 2011. If a
specific state or county is requested using those optional arguments,
included ZCTAs are defined using either
method = "intersect"
or method = "centroid"
.
The "intersect"
approach includes any ZCTA that touches a
given state or county with an area greater than 0
, while
the "centroid"
approach includes any ZCTA whose geographic
midpoint lies within the requested state or county.
Creating a master list of three-digit ZCTAs is a pre-requisite for
creating demographic estimates for these geographies. The object we
created above, mo_zcta3
, has a ZCTA3
column
that can serve as that reference. Once you have your list, you should
download demographic data using zi_get_demographics()
. For
example, to download population estimates for 2020, you would use the
following code:
Be sure not to limit your download with the zcta
argument. It is important that all ZCTAs are included in the download,
even if they are not in the list of three-digit ZCTAs. If only the
five-digit ZCTAs that overlap with your state or county of interest are
included, you will get incorrect values for ZCTAs that are split across
multiple jurisdictions.
Once these are obtained, we can pass the object to
zi_aggregate()
and can specify an input for
zcta
at this stage:
mo_pop20 <- zi_aggregate(mo_pop20, year = 2020, extensive = "B01003_001", survey = "acs5", zcta = mo_zcta3$ZCTA3)
This will aggregate the population estimates for the five-digit ZCTAs to the three-digit ZCTAs.
The zi_aggregate()
function requires that you specify
two sets of variable lists - those that are extensive
(i.e. count data) and those that are intensive
(i.e. ratio
or median data). For extensive
data,
zi_aggregate()
sums the estimates and applies a formula to
the margins of error (the square root of the sum of squared margins of
error for each five-digit ZCTA within a three-digit region). For
intensive
variables, a weighted mean or median is used for
both the estimate and the margin of error. Note that you can pipe this
workflow and can specify multiple variables at once for aggregation:
zi_get_demographics(year = 2020, variables = c("B01003_001", "B19083_001"), survey = "acs5") %>%
zi_aggregate(year = 2020, extensive = "B01003_001", intensive = "B19083_001", survey = "acs5") -> demo20
The variables
, table
(which can be used in
place of variables
for zi_get_demographics()
),
extensive
, and intensive
arguments are not
validated before being passed via tidycensus
to the Census
Bureau, so incorrectly formatted variable or table names will generate
potentially cryptic errors.
Three-digit ZIP codes are common, especially in health care data, but
are challenging to work with. While zippeR
provides a set
of tools for calculating demographic estimates from the American
Community Survey and mapping them, this should be done with caution
based on the limitations described above.