The Panel Study of Income Dynamics (PSID) is the longest-running longitudinal household survey in the world. It provides invaluable data on numerous topics, including marriage, income, wealth, and health. However, converting raw PSID data files into analysis-ready datasets is complex and challenging, especially for new users.
This package was developed to address these challenges entirely within R, without help from other statistical software. By bridging these gaps, the package aims to make PSID datasets more usable and manageable for researchers and analysts.
The package is available on GitHub. To install it, run this code in your R console:
devtools::install_github(repo = "Qcrates/psidread")
library(DiagrammeR)
library(psidread)
File Structure of PSID: The main PSID data come in two types of files: 1. single-year family files and 2. a cross-year individual file. The single-year family files contain data collected in each wave from 1968 through 2021, with one record for each family interviewed in the specified year. These files include family-level variables and are identified by the family Interview Number for that year. The cross-year individual file, on the other hand, contains all individual-level variables collected from 1968 to 2021 in one single file. This file includes data for both respondents and non-respondents, identified by the 1968 family Interview Number and Person Number (`ER30001` and `ER30002`). Therefore, if family-level variables are involved, datasets from multiple waves must be merged before any further analysis.
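The merge described above can be sketched in base R. This is only an illustration of the idea, not the package's code: the data frames and the column names (`intnum_2013`, `fam_income`, `id68`, `pernum`) are hypothetical stand-ins for a family file and the individual file.

```r
# Toy stand-ins for PSID files; all column names and values are hypothetical.
# One record per family interviewed in 2013.
fam_2013 <- data.frame(
  intnum_2013 = c(101, 102),      # family Interview Number for 2013
  fam_income  = c(52000, 31000)   # a family-level variable
)

# One record per individual, carrying the 2013 interview number of the
# family that the person belonged to in that wave.
ind <- data.frame(
  id68        = c(4, 4, 7),       # 1968 family Interview Number
  pernum      = c(1, 2, 1),       # Person Number
  intnum_2013 = c(101, 101, 102)
)

# Left join: keep every individual and attach the family-level variable.
merged <- merge(ind, fam_2013, by = "intnum_2013", all.x = TRUE)
merged
```

With several waves, the same join has to be repeated once per family file, which is the repetitive work the package is meant to take over.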
Data Downloading Approach: PSID's website offers two primary methods to download the data: 1. packaged files and 2. a customized shopping cart with only selected variables. Both methods have pros and cons:
 | Packaged Files | Customized File |
---|---|---|
Pros | | |
Cons | | |
Variable Name: A significant challenge when analyzing PSID data is that its variable names are not intuitive or interpretable (e.g. `ER00000`, `V000`). Renaming these variables manually can be a heavy workload for researchers working with multiple waves of data.
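To make the renaming burden concrete, here is a minimal base-R sketch of renaming by hand with a lookup vector. The data values are made up, and the readable names are only illustrative choices.

```r
# Toy data frame with two PSID-style variable codes; the values are made up.
df <- data.frame(ER58223 = c(12, 16), ER53020 = c(0, 2))

# Lookup: names are PSID codes, values are the readable names we want.
lookup <- c(ER58223 = "hh_educ_2013", ER53020 = "num_child_2013")

# Rename only the columns found in the lookup; leave any others untouched.
hit <- names(df) %in% names(lookup)
names(df)[hit] <- unname(lookup[names(df)[hit]])
names(df)
```

A lookup like this needs one entry per variable per wave, so it grows quickly as waves are added.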
Missing Waves: In the PSID, survey questions vary across different waves, leading to some variables not being consistently available in all waves. Detailed information about the inclusion of specific questions in each wave is accessible only on the cross-year index webpage, a method that is not user-friendly for quick reference. Manually creating a list of variables for different years is an option, but it is tedious and inconvenient.
What the `psidread` package is created to help with:
- Create a table of data structure across multiple waves using text that can be copied and pasted from the website
- Unzip and convert the zipped files without additional help from other software
- Read and merge the data files from multiple waves
- Rename and reshape the dataset to fit the needs of advanced analysis
`psidread` Package
While users have the option to jump directly to a specific step in the process, I strongly advise following the procedure sequentially without skipping any steps. This approach ensures replicable code for importing the PSID dataset. Skipping steps may also cause later steps to fail if certain prerequisites for the code's operation are not met.
`psid_str()`: Build Your Table of Structure
This step is required regardless of your dataset type, because it generates an important object used in the following steps: the table of data structure. The basic format of this output is shown below:
Year | hh_educ | num_child |
---|---|---|
2013 | ER58223 | ER53020 |
2015 | ER65459 | ER60021 |
2017 | ER71538 | ER66021 |
2019 | ER77599 | ER72021 |
You can generate this table with the code below:
psid_str(
varlist = c("hh_educ || [13]ER58223 [15]ER65459 [17]ER71538 [19]ER77599",
"num_child || [13]ER53020 [15]ER60021 [17]ER66021 [19]ER72021"),
type = "separated"
)
## year hh_educ num_child
## 2 2013 ER58223 ER53020
## 3 2015 ER65459 ER60021
## 4 2017 ER71538 ER66021
## 5 2019 ER77599 ER72021
It's easy to find the `[YY]VARCODE` text on the PSID website. Any codebook includes a full list of the variables in the "Years Available" part. All you need to do is define the variable name that you want to use in your analysis code (e.g. `num_child`).
It is recommended to format the variable list as in the code above, following this syntax: `c("varname1 || [YY]VARCODE [YY]VARCODE [YY]VARCODE","varname2 || [YY]VARCODE [YY]VARCODE [YY]VARCODE")`. You do not need to pay specific attention to the spacing in the text, but the syntax itself is mandatory, in particular the `"||"` separator between each variable name and its `[YY]VARCODE` entries. Please leave the `type` argument at its default value, `"separated"`.
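To show how this syntax maps onto a year-by-code table, the following base-R sketch parses one entry by hand. It is an illustration only and not the code inside `psid_str()`. The helper name `parse_entry` is hypothetical, and the rule for expanding two-digit years is an assumption based on the survey starting in 1968.

```r
# Parse one "name || [YY]CODE [YY]CODE ..." entry into a year/code table.
# Hypothetical helper for illustration; not the implementation of psid_str().
parse_entry <- function(entry) {
  parts <- strsplit(entry, "||", fixed = TRUE)[[1]]
  name  <- trimws(parts[1])

  # Pull out every "[YY]CODE" token, regardless of surrounding spaces.
  tokens <- regmatches(
    parts[2],
    gregexpr("\\[[0-9]{2}\\][A-Za-z0-9]+", parts[2])
  )[[1]]
  yy   <- as.integer(sub("^\\[([0-9]{2})\\].*$", "\\1", tokens))
  code <- sub("^\\[[0-9]{2}\\]", "", tokens)

  # Assumption: 68-99 mean 19YY and 00-67 mean 20YY.
  year <- ifelse(yy >= 68, 1900L + yy, 2000L + yy)

  out <- data.frame(year = year, code = code, stringsAsFactors = FALSE)
  names(out)[2] <- name
  out
}

res <- parse_entry("hh_educ || [13]ER58223 [15]ER65459 [17]ER71538 [19]ER77599")
res
```

Because the tokens are matched with a regular expression, extra spaces between them do not matter, while the `"||"` separator is what splits the name from the codes.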
This input format is inspired by the `psidtools` package developed by Professor Ulrich Kohler in Stata. Therefore, this function also offers an option for users who would like to transfer their work from Stata to R. You can directly copy and paste your Stata code after `psid use` without making any changes. The only effort required here is to set the `type` argument to `"integrated"`.
For example:
psid_varlist <- "|| religion_hh /// Household head's religious preference
[97]ER11895 [99]ER15977 [01]ER20038 [03]ER23474 [05]ER27442 [07]ER40614 ///
|| denom_hh /// Household head's religious denominations
[97]ER11896 [99]ER15978 [03]ER23475 [05]ER27443 [07]ER40615 ///"
psid_str(
varlist = psid_varlist,
type = "integrated"
)
## year religion_hh denom_hh
## 2 1997 ER11895 ER11896
## 3 1999 ER15977 ER15978
## 4 2001 ER20038 <NA>
## 5 2003 ER23474 ER23475
## 6 2005 ER27442 ER27443
## 7 2007 ER40614 ER40615
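The `<NA>` for `denom_hh` in 2001 is how a missing wave shows up in the table of structure. As a rough base-R sketch of the same idea (not the package's code), a full outer join of per-variable tables on year produces the same pattern:

```r
# Per-variable year/code tables, copied from the example above.
religion <- data.frame(
  year = c(1997, 1999, 2001, 2003, 2005, 2007),
  religion_hh = c("ER11895", "ER15977", "ER20038",
                  "ER23474", "ER27442", "ER40614")
)
denom <- data.frame(
  year = c(1997, 1999, 2003, 2005, 2007),
  denom_hh = c("ER11896", "ER15978", "ER23475", "ER27443", "ER40615")
)

# Full outer join on year: a wave missing from one variable becomes NA.
str_tbl <- merge(religion, denom, by = "year", all = TRUE)

# Show only the waves where denom_hh is unavailable.
str_tbl[is.na(str_tbl$denom_hh), ]
```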
Please note that it is the user's responsibility to make sure that the years and variable codes are correct. Do not include any ALL-YEAR variables (e.g. individual's sex, individual's birth order) in this function. They should be declared in the `idvars` argument of `psid_read()` instead.
`psid_unzip()`: Prepare Data Files
This function helps to unzip the data files downloaded from the PSID website and convert them to `.rda` files, a data format that is easier to manage in R.
Please put your packaged files in `.zip` format in one directory. The input and output directories can be the same, or you can set `exdir` to another directory so that the output `.rda` files are exported separately from the directory holding the original downloaded data files.
Please note that the example below uses `system.file...` and `tempdir()` only because it reads the data files stored in the package folder. In practice, use your own directory path, in a format like `"your/directory/pathway/psid/file/folder"`.
input_directory <- system.file(package = "psidread","extdata")
output_directory <- tempdir()
psid_unzip(indir = input_directory,
exdir = output_directory,
zipped = TRUE,
type = "package",
filename = NA)
If you have already unzipped ALL the `.zip` data files, you can skip the unzipping procedure by setting the `zipped` argument to `FALSE`:
psid_unzip(indir = input_directory,
exdir = output_directory,
zipped = FALSE,
type = "package",
filename = NA)
It takes some time to unzip and convert all the packaged files if your analysis involves numerous waves of data. Therefore, once this function has been executed and all the `.rda` files needed to generate your dataset are ready, you do not have to run it again each time before you run the `psid_read()` and `psid_reshape()` functions.
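The convert-once pattern described here can be sketched with base R's `save()` and `load()`. The helper `save_once()` and the toy data are hypothetical and only illustrate the idea of skipping work when the `.rda` file already exists. They are not part of `psidread`.

```r
# Hypothetical helper: write a data frame to .rda only if the file is missing.
save_once <- function(df, path) {
  if (file.exists(path)) {
    return(FALSE)          # already converted, nothing to do
  }
  save(df, file = path)    # the object is stored under the name "df"
  TRUE
}

rda_path <- file.path(tempdir(), "toy_wave.rda")
unlink(rda_path)           # start clean for the demonstration
toy <- data.frame(id = 1:3, age = c(55, 53, 39))

first  <- save_once(toy, rda_path)   # writes the file
second <- save_once(toy, rda_path)   # skips, file already present

# load() restores the saved object into the given environment.
env <- new.env()
load(rda_path, envir = env)
head(env$df)
```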
If you download the dataset from your shopping cart with selected variables, you can also use this function to unzip and convert the files. Note that you should choose `ASCII Data With SAS Statements` when downloading. Compared to packaged files, you will need to:
- Specify the file name in the `filename` argument
- Set the `type` argument to `"single"`
For example:
psid_unzip(indir = input_directory,
exdir = output_directory,
zipped = TRUE,
type = "single",
filename = "J327825.zip")
You can also use `psid_unzip()` in this way to unzip and convert specific packaged data files, especially when you are adding one wave to your dataset but do not want to go through the whole directory again.
`psid_read()`: Read Data
Please make sure you have completed the checklist below before you run the `psid_read()` function:
- Run the `psid_str()` function and store the resulting table of data structure in the global environment.
- Run the `psid_unzip()` function and have all the data files prepared in `.rda` format.
- Download the cross-year individual packaged file (if you use packaged files), or include at least one individual-level variable in your customized dataset (if you use a customized file). Please do this even if you do not use individual-level variables. If needed, the package can collapse your dataset to household level in `psid_reshape()`.
All the items above checked? Let's move on to this core step!
The package is especially helpful when processing data across multiple packaged datasets. One example:
psid_varlist <- c(" hh_age || [13]ER53017 [17]ER66017", " p_age || [13]ER34204")
str_df <- psid_str(varlist = psid_varlist, type = "separated")
input_directory <- system.file(package = "psidread", "extdata")
psid_df <- psid_read(indir = input_directory, str_df = str_df, idvars = c("ER30000"), type = "package", filename = NA)
## Data for year 2013 has been added!
## Data for year 2017 has been added!
str(psid_df)
## 'data.frame': 5 obs. of 11 variables:
## $ ER34201: num 8684 8684 7300 8569 8698
## $ ER34501: num 5620 8559 6510 8691 7682
## $ ER34202: num 1 2 2 1 3
## $ ER34502: num 1 1 1 2 2
## $ ER34203: num 10 40 20 10 30
## $ ER34503: num 10 10 10 20 30
## $ ER30000: num 1 1 1 1 1
## $ pid : num 4006 4007 4031 4038 4049
## $ ER34204: num 55 53 39 25 4
## $ ER53017: num 55 55 41 25 32
## $ ER66017: num 59 57 43 34 27
## - attr(*, "problems")=<externalptr>
If you are reading all the variables from one single file, the only things you need to change here are:
- Set the `type` argument to `"single"`
- Specify the file name in the `filename` argument
psid_df <- psid_read(indir = input_directory, str_df = str_df, idvars = c("ER30000"), type = "single", filename = "J327825")
str(psid_df)
## 'data.frame': 5 obs. of 11 variables:
## $ ER53017: num 55 55 41 25 32
## $ ER66017: num 59 57 43 34 27
## $ ER34204: num 55 53 39 25 4
## $ ER34202: num 1 2 2 1 3
## $ ER34502: num 1 1 1 2 2
## $ ER34203: num 10 40 20 10 30
## $ ER34503: num 10 10 10 20 30
## $ ER34201: num 8684 8684 7300 8569 8698
## $ ER34501: num 5620 8559 6510 8691 7682
## $ ER30000: num 1 1 1 1 1
## $ pid : num 4006 4007 4031 4038 4049
## - attr(*, "problems")=<externalptr>
Please note that the `indir` argument in this function should be the directory where you store the `.rda` files. Therefore, it should be the `exdir` used in `psid_unzip()` if you used that function to prepare the data.
You may notice that some additional variables, which are not declared in your table of data structure, are also added to the data frame:
- `pid`: The individual-level identification key, equal to `ER30001 * 1000 + ER30002`
The other added columns (`ER34201`, `ER34202`, and `ER34203` and their 2017 counterparts in the output above) are survey information variables. Please do not drop them before you run `psid_reshape()`. I strongly recommend keeping them in the final output because they can be very useful in the analysis.
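The `pid` formula can be checked by hand with a few lines of base R. The input values below are toy values chosen to match the `pid` values in the output above. Recovering the two parts from `pid` assumes the Person Number is always below 1000.

```r
# Toy values: ER30001 is the 1968 family Interview Number,
# ER30002 is the Person Number.
ids <- data.frame(ER30001 = c(4, 4, 4), ER30002 = c(6, 7, 31))

# Individual-level key as defined above.
ids$pid <- ids$ER30001 * 1000 + ids$ER30002
ids$pid
# 4006 4007 4031

# Assuming Person Numbers stay below 1000, both parts can be recovered.
ids$pid %/% 1000   # 1968 family Interview Number
ids$pid %% 1000    # Person Number
```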
`psid_reshape()`: Format Data
We finally come to the last step! The `psid_reshape()` function renames and reshapes the data into the final output, ready for your next-step analysis!
All the variables will be renamed following the variable names you defined in `psid_str()`. You can also reshape the dataset to a long version if you want to further process the data of multiple waves together. For example:
df <- psid_reshape(psid_df = psid_df, str_df = str_df, shape = "long", level = "individual")
df
## # A tibble: 10 × 8
## ER30000 pid year hh_age p_age xsqnr rel2hh indfid
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 4006 2013 55 55 1 10 8684
## 2 1 4006 2017 59 NA 1 10 5620
## 3 1 4007 2013 55 53 2 40 8684
## 4 1 4007 2017 57 NA 1 10 8559
## 5 1 4031 2013 41 39 2 20 7300
## 6 1 4031 2017 43 NA 1 10 6510
## 7 1 4038 2013 25 25 1 10 8569
## 8 1 4038 2017 34 NA 2 20 8691
## 9 1 4049 2013 32 4 3 30 8698
## 10 1 4049 2017 27 NA 2 30 7682
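For readers who want to see what the long reshape does conceptually, here is a hand-written base-R sketch on a toy wide table. It mirrors the first two persons in the output above, but it is not how `psid_reshape()` is implemented, and the object names are hypothetical.

```r
# Wide toy data: one row per person, one column per variable and wave.
wide <- data.frame(
  pid         = c(4006, 4007),
  hh_age_2013 = c(55, 55),
  hh_age_2017 = c(59, 57),
  p_age_2013  = c(55, 53)    # p_age has no 2017 column
)

years <- c(2013, 2017)
vars  <- c("hh_age", "p_age")

# Build one block of rows per wave, filling NA where a wave lacks a variable.
blocks <- lapply(years, function(y) {
  block <- data.frame(pid = wide$pid, year = y)
  for (v in vars) {
    col <- paste0(v, "_", y)
    block[[v]] <- if (col %in% names(wide)) {
      wide[[col]]
    } else {
      rep(NA_real_, nrow(wide))
    }
  }
  block
})

# Stack the waves and sort so each person's waves sit together.
long <- do.call(rbind, blocks)
long <- long[order(long$pid, long$year), ]
long
```

As in the output above, `p_age` is `NA` in 2017 because that variable was declared only for 2013.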
If you would like to keep the wide shape of the data frame, the variable names will take the form `varname_YYYY`. For example,
df <- psid_reshape(psid_df = psid_df, str_df = str_df, shape = "wide", level = "individual")
df
## hh_age_2013 hh_age_2017 p_age_2013 xsqnr_2013 xsqnr_2017 rel2hh_2013
## 1 55 59 55 1 1 10
## 2 55 57 53 2 1 40
## 3 41 43 39 2 1 20
## 4 25 34 25 1 2 10
## 5 32 27 4 3 2 30
## rel2hh_2017 indfid_2013 indfid_2017 ER30000 pid
## 1 10 8684 5620 1 4006
## 2 10 8684 8559 1 4007
## 3 10 7300 6510 1 4031
## 4 20 8569 8691 1 4038
## 5 30 8698 7682 1 4049
You can also collapse the data to household level in this step. Only one record will be kept here for each household at each wave:
df <- psid_reshape(psid_df = psid_df, str_df = str_df, shape = "long", level = "household")
df
## # A tibble: 5 × 8
## ER30000 pid year hh_age p_age xsqnr rel2hh indfid
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 4006 2013 55 55 1 10 8684
## 2 1 4006 2017 59 NA 1 10 5620
## 3 1 4007 2017 57 NA 1 10 8559
## 4 1 4031 2017 43 NA 1 10 6510
## 5 1 4038 2013 25 25 1 10 8569
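In the output above, the rows that remain are exactly those with `rel2hh` equal to 10 in the individual-level long table. Assuming that this code marks the household's reference person, one way to reproduce the collapse by hand in base R is sketched below. It is not necessarily how `psid_reshape()` does it.

```r
# Long-format rows copied from the individual-level example above.
long <- data.frame(
  pid    = c(4006, 4006, 4007, 4007, 4031, 4031, 4038, 4038, 4049, 4049),
  year   = rep(c("2013", "2017"), times = 5),
  hh_age = c(55, 59, 55, 57, 41, 43, 25, 34, 32, 27),
  rel2hh = c(10, 10, 40, 10, 20, 10, 10, 20, 30, 30),
  indfid = c(8684, 5620, 8684, 8559, 7300, 6510, 8569, 8691, 8698, 7682)
)

# Assumption: rel2hh == 10 identifies the household's reference person.
heads <- long[long$rel2hh == 10, ]

# Guard against duplicates: keep the first row per family ID and wave.
household <- heads[!duplicated(heads[c("indfid", "year")]), ]
household[, c("pid", "year", "hh_age", "indfid")]
```

Households without a row coded 10 in a given wave drop out under this rule, which matches the five rows shown above.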
Feel free to reshape the data based on your own needs!