Step 1. Generate a sequence cohort

Introduction

In this vignette we will explore the functionalities of generateSequenceCohortSet().

Create a cdm object

The CohortSymmetry package is designed to work with data mapped to the OMOP CDM, so the first step is to create a reference to the data using the CDMConnector package. We will use the Eunomia dataset for the subsequent examples.

library(CDMConnector)
library(dplyr)
library(DBI)
library(CohortSymmetry)
library(duckdb)

db <- DBI::dbConnect(duckdb::duckdb(), 
                     dbdir = CDMConnector::eunomia_dir())
cdm <- cdm_from_con(
  con = db,
  cdm_schema = "main",
  write_schema = "main"
)

Instantiate two cohorts in the cdm reference

The CohortSymmetry package requires that the cdm object contains two cohort tables: the index cohort and the marker cohort. There are many ways to create these cohorts, and the right one depends on what the index cohort and the marker cohort represent. Here, we use the DrugUtilisation package to generate two drug cohorts in the cdm object. For illustrative purposes, we will carry out a sequence symmetry analysis (SSA) on aspirin (index_cohort) against acetaminophen (marker_cohort).

library(DrugUtilisation)
cdm <- DrugUtilisation::generateIngredientCohortSet(
  cdm = cdm,
  name = "aspirin",
  ingredient = "aspirin")

cdm <- DrugUtilisation::generateIngredientCohortSet(
  cdm = cdm,
  name = "acetaminophen",
  ingredient = "acetaminophen")

cdm$aspirin |> 
  dplyr::glimpse()
#> Rows: ??
#> Columns: 4
#> Database: DuckDB v0.10.1 [xihangc@Windows 10 x64:R 4.3.1/C:\Users\xihangc\AppData\Local\Temp\RtmpE3SweR\file5a902245fbd.duckdb]
#> $ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ subject_id           <int> 148, 245, 263, 449, 895, 1046, 1232, 1338, 1376, …
#> $ cohort_start_date    <date> 1980-09-04, 1917-03-15, 1958-03-31, 1976-03-03, …
#> $ cohort_end_date      <date> 1980-09-18, 1917-03-29, 1958-04-14, 1976-03-17, …

cdm$acetaminophen |> 
  dplyr::glimpse()
#> Rows: ??
#> Columns: 4
#> Database: DuckDB v0.10.1 [xihangc@Windows 10 x64:R 4.3.1/C:\Users\xihangc\AppData\Local\Temp\RtmpE3SweR\file5a902245fbd.duckdb]
#> $ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ subject_id           <int> 35, 64, 165, 224, 235, 326, 408, 693, 715, 719, 7…
#> $ cohort_start_date    <date> 1960-06-20, 1985-02-28, 1974-02-11, 1992-02-19, …
#> $ cohort_end_date      <date> 1960-07-04, 1985-03-14, 1974-02-25, 1992-03-11, …

Generate a sequence cohort

To initiate the calculations, the two cohort tables need to be intersected using generateSequenceCohortSet(). This process outputs all the individuals who appear in both tables, subject to a set of requirements, each controlled by one parameter. The parameters for this function include cohortDateRange, daysPriorObservation, washoutWindow, indexMarkerGap and combinationWindow. Let’s go through examples to see how each parameter works.

No specific requirements

Let’s study the simplest case, where no requirements are imposed. The figure below shows an example of an analysis containing six different participants.

Note that only the first event/episode (for both the index and the marker) is included in the analysis. As there are no restriction criteria and all the individuals have an episode in both the index and the marker cohort, all the subjects are included in the analysis. We can get a sequence cohort without imposing any particular requirement like so:

cdm <- generateSequenceCohortSet(
  cdm = cdm,
  indexTable = "aspirin",
  markerTable = "acetaminophen",
  name = "intersect",
  cohortDateRange = as.Date(c(NA, NA)), #default
  daysPriorObservation = 0, #default
  washoutWindow = 0, #default
  indexMarkerGap = NULL, #default
  combinationWindow = c(0,Inf))

cdm$intersect |> 
  dplyr::glimpse()
#> Rows: ??
#> Columns: 6
#> Database: DuckDB v0.10.1 [xihangc@Windows 10 x64:R 4.3.1/C:\Users\xihangc\AppData\Local\Temp\RtmpE3SweR\file5a902245fbd.duckdb]
#> $ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ subject_id           <int> 6, 16, 42, 35, 40, 53, 49, 11, 32, 43, 12, 17, 63…
#> $ cohort_start_date    <date> 1965-06-23, 1972-04-10, 1914-07-09, 1960-06-20, …
#> $ cohort_end_date      <date> 1969-12-20, 1974-06-11, 1937-09-07, 1993-04-28, …
#> $ index_date           <date> 1965-06-23, 1972-04-10, 1914-07-09, 1993-04-28, …
#> $ marker_date          <date> 1969-12-20, 1974-06-11, 1937-09-07, 1960-06-20, …

Important Observations

See that the generated table has the format of an OMOP CDM cohort, but it also includes two additional columns: the index_date and the marker_date, which are the cohort_start_date of the index and marker episode respectively. The cohort_start_date and the cohort_end_date are defined as:

  • cohort_start_date: earliest cohort_start_date between the index and the marker events.
  • cohort_end_date: latest cohort_start_date between the index and the marker events.
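
The two definitions above can be sketched in base R on toy data. The data frame, subject ids and dates below are made up for illustration, and this is not the package's internal code; it only shows how the earliest and latest of the two start dates would be derived.

```r
# Toy first-episode start dates for three hypothetical subjects.
toy <- data.frame(
  subject_id  = c(1, 2, 3),
  index_date  = as.Date(c("2000-01-10", "2001-06-01", "2002-03-15")),
  marker_date = as.Date(c("2000-03-01", "2001-02-01", "2002-03-20"))
)

# cohort_start_date: the earlier of the two start dates.
toy$cohort_start_date <- pmin(toy$index_date, toy$marker_date)

# cohort_end_date: the later of the two start dates.
toy$cohort_end_date <- pmax(toy$index_date, toy$marker_date)

toy
```

For subject 2 the marker comes first, so cohort_start_date equals marker_date and cohort_end_date equals index_date.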

The cohort_definition_id in the output is associated with the cohort_definition_id of the index table (indexId) and the cohort_definition_id of the marker table (markerId). To see the correspondence, one could do the following:

attr(cdm$intersect, "cohort_set")
#> # Source:   table<main.intersect_set> [1 x 10]
#> # Database: DuckDB v0.10.1 [xihangc@Windows 10 x64:R 4.3.1/C:\Users\xihangc\AppData\Local\Temp\RtmpE3SweR\file5a902245fbd.duckdb]
#>   cohort_definition_id cohort_name     index_id index_name marker_id marker_name
#>                  <int> <chr>              <int> <chr>          <int> <chr>      
#> 1                    1 index_1191_asp…        1 1191_aspi…         1 161_acetam…
#> # ℹ 4 more variables: days_prior_observation <dbl>, washout_window <dbl>,
#> #   index_marker_gap <chr>, combination_window <chr>

The user may also wish to subset the index table and marker table based on their cohort_definition_id using indexId and markerId respectively. For example, the following code only includes cohort_definition_id \(= 1\) from both the index and the marker table.

cdm <- generateSequenceCohortSet(
  cdm = cdm,
  indexTable = "aspirin",
  markerTable = "acetaminophen",
  name = "intersect",
  cohortDateRange = as.Date(c(NA, NA)),
  indexId = 1,
  markerId = 1,
  daysPriorObservation = 0,
  washoutWindow = 0,
  indexMarkerGap = NULL,
  combinationWindow = c(0,Inf))

Specified study period

We can restrict the study period of the analysis to only include episodes or events happening during a specific period of time. The figure below shows an example of an analysis containing six different participants.

Notice that, by imposing a restriction on the study period, some of the participants might be excluded. For example, participant 4 is excluded because their only index episode is outside of the study period, whereas participant 6 is included because they do have an index episode within the study period.

The study period can be restricted using the cohortDateRange argument, which is defined as:

cohortDateRange = c(start_of_the_study_period, end_of_the_study_period)

See an example of the usage below, where we restrict cohortDateRange to the period from 1950-01-01 to 1969-01-01. Consequently, all cohort dates fall within the pre-specified period:

cdm <- generateSequenceCohortSet(
  cdm = cdm,
  indexTable = "aspirin",
  markerTable = "acetaminophen",
  name = "intersect",
  cohortDateRange = as.Date(c("1950-01-01","1969-01-01")),
  combinationWindow = c(0,Inf))

cdm$intersect |>  
  dplyr::summarise(min_cohort_start_date = min(cohort_start_date), 
            max_cohort_start_date = max(cohort_start_date),
            min_cohort_end_date   = min(cohort_end_date),
            max_cohort_end_date   = max(cohort_end_date)) |> 
  dplyr::glimpse()
#> Rows: ??
#> Columns: 4
#> Database: DuckDB v0.10.1 [xihangc@Windows 10 x64:R 4.3.1/C:\Users\xihangc\AppData\Local\Temp\RtmpE3SweR\file5a902245fbd.duckdb]
#> $ min_cohort_start_date <date> 1950-01-02
#> $ max_cohort_start_date <date> 1968-09-08
#> $ min_cohort_end_date   <date> 1950-07-19
#> $ max_cohort_end_date   <date> 1969-01-01
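
As a rough illustration of what the study period restriction does, the sketch below filters toy episodes in base R and keeps the first in-period episode per subject. The episodes are hypothetical, and treating both bounds as inclusive is an assumption of this sketch rather than a statement about the package internals.

```r
# Hypothetical episodes: subject 1 has one episode before the study period
# and one inside it; subject 3 only has an episode after the study period.
episodes <- data.frame(
  subject_id        = c(1, 1, 2, 3),
  cohort_start_date = as.Date(c("1945-03-01", "1955-07-10",
                                "1960-01-05", "1975-09-30"))
)
study_period <- as.Date(c("1950-01-01", "1969-01-01"))

# Keep episodes starting inside the study period (bounds assumed inclusive).
in_period <- episodes[
  episodes$cohort_start_date >= study_period[1] &
    episodes$cohort_start_date <= study_period[2],
]

# Keep the first in-period episode of each subject.
in_period <- in_period[order(in_period$subject_id, in_period$cohort_start_date), ]
first_in_period <- in_period[!duplicated(in_period$subject_id), ]

first_in_period
```

Subject 1 enters with the 1955 episode rather than the 1945 one, and subject 3 drops out entirely.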

Specified study period and prior history requirement

We can also specify the minimum prior history that an individual must have before the start of the first event. Individuals without enough prior history are excluded. In the figure below, suppose the required prior observation is set to 31 days; participant 5 would then be excluded because the first event within the study period does not have at least 31 days of prior history:

The number of days of prior history required is set using the argument daysPriorObservation. See an example below, where we focus on two different individuals: participants 2 and 53. Without a prior history requirement, both individuals are included in the analysis:

cdm <- generateSequenceCohortSet(
  cdm = cdm,
  indexTable = "aspirin",
  markerTable = "acetaminophen",
  name = "intersect",
  cohortDateRange = as.Date(c("1950-01-01","1969-01-01")),
  daysPriorObservation = 0,
  combinationWindow = c(0,Inf))

cdm$intersect |> 
  dplyr::inner_join(
    cdm$observation_period |> 
      dplyr::select("subject_id" = "person_id", "observation_period_start_date")
  ) |> 
  dplyr::filter(subject_id %in% c(2,53)) |> 
  dplyr::mutate(daysPriorObservation = cohort_start_date - observation_period_start_date) |> 
  dplyr::glimpse()
#> Rows: ??
#> Columns: 8
#> Database: DuckDB v0.10.1 [xihangc@Windows 10 x64:R 4.3.1/C:\Users\xihangc\AppData\Local\Temp\RtmpE3SweR\file5a902245fbd.duckdb]
#> $ cohort_definition_id          <int> 1, 1
#> $ subject_id                    <int> 53, 2
#> $ cohort_start_date             <date> 1962-12-13, 1952-07-13
#> $ cohort_end_date               <date> 1965-05-01, 1955-10-22
#> $ index_date                    <date> 1965-05-01, 1952-07-13
#> $ marker_date                   <date> 1962-12-13, 1955-10-22
#> $ observation_period_start_date <date> 1962-08-15, 1920-06-01
#> $ daysPriorObservation          <dbl> 120, 11730

Now we impose a prior history requirement of 365 days. As a result, participant 53 is excluded because they have only 120 days of prior observation before the included event.

cdm <- generateSequenceCohortSet(
  cdm = cdm,
  indexTable = "aspirin",
  markerTable = "acetaminophen",
  name = "intersect",
  cohortDateRange = as.Date(c("1950-01-01","1969-01-01")),
  daysPriorObservation = 365,
  combinationWindow = c(0,Inf))

cdm$intersect |> 
  dplyr::inner_join(
    cdm$observation_period |> 
      dplyr::select("subject_id" = "person_id", "observation_period_start_date")
  ) |> 
  dplyr::filter(subject_id %in% c(2,53)) |>
  dplyr::glimpse()
#> Rows: ??
#> Columns: 7
#> Database: DuckDB v0.10.1 [xihangc@Windows 10 x64:R 4.3.1/C:\Users\xihangc\AppData\Local\Temp\RtmpE3SweR\file5a902245fbd.duckdb]
#> $ cohort_definition_id          <int> 1
#> $ subject_id                    <int> 2
#> $ cohort_start_date             <date> 1952-07-13
#> $ cohort_end_date               <date> 1955-10-22
#> $ index_date                    <date> 1952-07-13
#> $ marker_date                   <date> 1955-10-22
#> $ observation_period_start_date <date> 1920-06-01
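
The exclusion of participant 53 can be reproduced by hand. The sketch below copies the dates for subjects 2 and 53 from the output above into a local data frame and applies the "at least 365 days" rule in base R; it is a check on the arithmetic, not the package's implementation.

```r
# Dates copied from the output above for subjects 53 and 2.
check <- data.frame(
  subject_id                    = c(53, 2),
  cohort_start_date             = as.Date(c("1962-12-13", "1952-07-13")),
  observation_period_start_date = as.Date(c("1962-08-15", "1920-06-01"))
)

# Days of observation before the first included event.
check$prior_days <- as.numeric(
  check$cohort_start_date - check$observation_period_start_date
)

# A subject is kept when prior_days is at least the required number of days.
required_days <- 365
check$keep <- check$prior_days >= required_days

check
```

Subject 53 has 120 days of prior observation and is dropped, while subject 2 has 11730 days and is kept.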

Specified study period, prior history requirement and washout period

We can also specify the minimum washout period required for an event or episode to be included. In the following figure, we exclude participant 6 because another episode took place within the washout period. The washout period is applied to the index and the marker cohorts separately.

This functionality is implemented through the washoutWindow argument. See an example below, where we analyse the cases of subject_id 1936 and 3565.

cdm <- generateSequenceCohortSet(
  cdm = cdm,
  indexTable = "aspirin",
  markerTable = "acetaminophen",
  name = "intersect",
  cohortDateRange = as.Date(c("1950-01-01","1969-01-01")),
  daysPriorObservation = 365,
  washoutWindow = 0,
  combinationWindow = c(0, Inf))

cdm$aspirin |> 
  dplyr::filter(subject_id %in% c(1936,3565)) |> 
  dplyr::group_by(subject_id) |> 
  dplyr::arrange(cohort_start_date)
#> # Source:     SQL [6 x 4]
#> # Database:   DuckDB v0.10.1 [xihangc@Windows 10 x64:R 4.3.1/C:\Users\xihangc\AppData\Local\Temp\RtmpE3SweR\file5a902245fbd.duckdb]
#> # Groups:     subject_id
#> # Ordered by: cohort_start_date
#>   cohort_definition_id subject_id cohort_start_date cohort_end_date
#>                  <int>      <int> <date>            <date>         
#> 1                    1       1936 1950-05-06        1950-05-20     
#> 2                    1       3565 1950-01-02        1950-01-30     
#> 3                    1       3565 1957-11-19        1957-12-24     
#> 4                    1       1936 1949-10-08        1949-11-05     
#> 5                    1       3565 1945-12-24        1946-01-07     
#> 6                    1       1936 1955-08-04        1955-08-25

cdm$intersect |> 
  dplyr::filter(subject_id %in% c(1936,3565)) |> 
  dplyr::glimpse()
#> Rows: ??
#> Columns: 6
#> Database: DuckDB v0.10.1 [xihangc@Windows 10 x64:R 4.3.1/C:\Users\xihangc\AppData\Local\Temp\RtmpE3SweR\file5a902245fbd.duckdb]
#> $ cohort_definition_id <int> 1, 1
#> $ subject_id           <int> 1936, 3565
#> $ cohort_start_date    <date> 1950-05-06, 1950-01-02
#> $ cohort_end_date      <date> 1951-09-26, 1951-08-27
#> $ index_date           <date> 1950-05-06, 1950-01-02
#> $ marker_date          <date> 1951-09-26, 1951-08-27

Notice that with a washout window of 0, both participants are included. However, the included episode of participant 1936 starts on 1950-05-06, and a previous episode (which falls outside the study period) started just 210 days earlier (1949-10-08). Hence, with a washoutWindow of 365, this participant is excluded from the analysis:

cdm <- generateSequenceCohortSet(
  cdm = cdm,
  indexTable = "aspirin",
  markerTable = "acetaminophen",
  name = "intersect",
  cohortDateRange = as.Date(c("1950-01-01","1969-01-01")),
  daysPriorObservation = 365,
  washoutWindow = 365,
  combinationWindow = c(0, Inf))

cdm$intersect |> 
  dplyr::filter(subject_id %in% c(1936,3565)) |>
  dplyr::arrange(subject_id, cohort_start_date) |>
  dplyr::glimpse()
#> Rows: ??
#> Columns: 6
#> Database: DuckDB v0.10.1 [xihangc@Windows 10 x64:R 4.3.1/C:\Users\xihangc\AppData\Local\Temp\RtmpE3SweR\file5a902245fbd.duckdb]
#> Ordered by: subject_id, cohort_start_date
#> $ cohort_definition_id <int> 1
#> $ subject_id           <int> 3565
#> $ cohort_start_date    <date> 1950-01-02
#> $ cohort_end_date      <date> 1951-08-27
#> $ index_date           <date> 1950-01-02
#> $ marker_date          <date> 1951-08-27
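
The washout reasoning above can be checked with a few lines of base R. The aspirin start dates are copied from the table shown earlier in this section. Measuring the gap between start dates (as the text does) and using "at least 365 days" as the pass condition are assumptions of this sketch, not a description of the package internals.

```r
# Aspirin episode start dates for subjects 1936 and 3565 (from the output above).
aspirin <- data.frame(
  subject_id        = c(1936, 1936, 1936, 3565, 3565, 3565),
  cohort_start_date = as.Date(c("1949-10-08", "1950-05-06", "1955-08-04",
                                "1945-12-24", "1950-01-02", "1957-11-19"))
)
study_start    <- as.Date("1950-01-01")
washout_window <- 365

# For each subject: days between the first in-period episode and the
# closest earlier episode (Inf when there is no earlier episode).
gap_to_previous <- sapply(split(aspirin, aspirin$subject_id), function(d) {
  starts          <- sort(d$cohort_start_date)
  first_in_period <- min(starts[starts >= study_start])
  previous        <- starts[starts < first_in_period]
  if (length(previous) == 0) return(Inf)
  as.numeric(first_in_period - max(previous))
})

gap_to_previous
gap_to_previous >= washout_window
```

Subject 1936 has a gap of 210 days and fails the 365-day washout, whereas the previous episode of subject 3565 lies several years back.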

Specified study period, prior history requirement and combination window

We define the combination window as the minimum and the maximum number of days allowed between the start of the first event (whether it is the index or the marker) and the start of the next event. In other words:

\(x =\) second_episode(start_date) \(-\) first_episode(start_date);

combinationWindow[1] \(< x \leq\) combinationWindow[2]

The figure below shows an example where we define combinationWindow = c(0,20). This means that the gap between the start date of the second episode and the start date of the first episode should be larger than 0 and less than or equal to 20 days. As participants 2 and 3 do not fulfill this condition, they are excluded from the analysis.

In the generateSequenceCohortSet() function, this is implemented using the combinationWindow argument. Notice that in the previous examples, as we did not want any combination window requirement, we set this argument to combinationWindow = c(0,Inf), because the default is combinationWindow = c(0, 365). In the following example, we explore subject_id 80 and 187 to see the functionality of this argument. When using no restriction for the combination window, both are included in the intersect cohort:

cdm <- generateSequenceCohortSet(
  cdm = cdm,
  indexTable = "aspirin",
  markerTable = "acetaminophen",
  name = "intersect",
  cohortDateRange = as.Date(c("1950-01-01","1969-01-01")),
  daysPriorObservation = 365,
  combinationWindow = c(0, Inf))

cdm$intersect |>
  dplyr::filter(subject_id %in% c(80,187)) |>
  dplyr::mutate(combinationWindow = pmax(index_date, marker_date) - pmin(index_date, marker_date))
#> # Source:   SQL [2 x 7]
#> # Database: DuckDB v0.10.1 [xihangc@Windows 10 x64:R 4.3.1/C:\Users\xihangc\AppData\Local\Temp\RtmpE3SweR\file5a902245fbd.duckdb]
#>   cohort_definition_id subject_id cohort_start_date cohort_end_date index_date
#>                  <int>      <int> <date>            <date>          <date>    
#> 1                    1         80 1953-02-24        1953-12-31      1953-12-31
#> 2                    1        187 1957-04-20        1965-10-07      1957-04-20
#> # ℹ 2 more variables: marker_date <date>, combinationWindow <dbl>

However, when restricting the combination window to a maximum of 365 days, participant 187 is excluded from the analysis, as the difference between the two initiations is greater than 365 days:

cdm <- generateSequenceCohortSet(
  cdm = cdm,
  indexTable = "aspirin",
  markerTable = "acetaminophen",
  name = "intersect",
  cohortDateRange = as.Date(c("1950-01-01","1969-01-01")),
  daysPriorObservation = 365,
  combinationWindow = c(0, 365))

cdm$intersect |>
  dplyr::filter(subject_id %in% c(80,187))
#> # Source:   SQL [1 x 6]
#> # Database: DuckDB v0.10.1 [xihangc@Windows 10 x64:R 4.3.1/C:\Users\xihangc\AppData\Local\Temp\RtmpE3SweR\file5a902245fbd.duckdb]
#>   cohort_definition_id subject_id cohort_start_date cohort_end_date index_date
#>                  <int>      <int> <date>            <date>          <date>    
#> 1                    1         80 1953-02-24        1953-12-31      1953-12-31
#> # ℹ 1 more variable: marker_date <date>
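
The inequality defined at the start of this section can also be checked by hand for subjects 80 and 187. The helper in_combination_window() below is a hypothetical name introduced only for this sketch; the dates are the cohort_start_date and cohort_end_date values shown above, which correspond to the starts of the first and second episodes.

```r
# Hypothetical helper: TRUE when window[1] < x <= window[2], where x is the
# number of days between the start of the first and the second episode.
in_combination_window <- function(first_start, second_start, window) {
  x <- as.numeric(second_start - first_start)
  x > window[1] & x <= window[2]
}

# Start dates of the first and second episodes for subjects 80 and 187.
first_start  <- as.Date(c("1953-02-24", "1957-04-20"))
second_start <- as.Date(c("1953-12-31", "1965-10-07"))

# Days between the two initiations.
as.numeric(second_start - first_start)

# No upper limit: both subjects satisfy the condition.
in_combination_window(first_start, second_start, c(0, Inf))

# Upper limit of 365 days: only subject 80 satisfies the condition.
in_combination_window(first_start, second_start, c(0, 365))
```

Subject 80 has 310 days between the two initiations, while subject 187 has more than eight years, which is why only subject 80 satisfies a 365-day limit.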

Specified study period, prior history requirement and index gap

We define the index-marker gap as the maximum number of days allowed between the end of the first episode and the start of the second episode. That means:

\(x =\) second_episode(cohort_start_date) \(-\) first_episode(cohort_end_date);

x \(\leq\) indexMarkerGap
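
The inequality above can be sketched in base R. The three subjects and their dates below are entirely hypothetical, chosen so that one gap falls exactly on the limit; the column names mirror the ones used later in this section.

```r
# Hypothetical end of the first episode and start of the second episode.
toy <- data.frame(
  subject_id                = c(1, 2, 3),
  first_episode_end_date    = as.Date(c("2000-01-31", "2000-01-31", "2000-01-31")),
  second_episode_start_date = as.Date(c("2000-02-10", "2000-03-01", "2000-04-15"))
)
index_marker_gap <- 30

# x = second_episode(cohort_start_date) - first_episode(cohort_end_date)
toy$gap_days <- as.numeric(
  toy$second_episode_start_date - toy$first_episode_end_date
)

# A subject is kept when x <= indexMarkerGap.
toy$keep <- toy$gap_days <= index_marker_gap

toy
```

With a limit of 30 days, the subjects with gaps of 10 and 30 days are kept and the subject with a gap of 75 days is dropped.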

In the figure below, all participants with an index-marker gap greater than 30 days are excluded from the analysis (participants 2, 3 and 6):

Use the indexMarkerGap argument to impose this restriction. See how this affects participants 80 and 754 in the example below:

cdm <- generateSequenceCohortSet(
  cdm = cdm,
  indexTable = "aspirin",
  markerTable = "acetaminophen",
  name = "intersect",
  cohortDateRange = as.Date(c("1950-01-01","1969-01-01")),
  daysPriorObservation = 365,
  indexMarkerGap = NULL)

cdm$intersect |>
  dplyr::filter(subject_id %in% c(80,754)) |>
  dplyr::inner_join(
    # For both subjects, acetaminophen (the marker) is the first event:
    cdm$acetaminophen |> 
      dplyr::select("subject_id", 
             "marker_date" = "cohort_start_date", 
             "first_episode_end_date" = "cohort_end_date"),
    by = c("subject_id", "marker_date")
  ) |>
  dplyr::inner_join(
    cdm$aspirin |> 
      dplyr::select("subject_id", 
             "index_date" = "cohort_start_date",
             "second_episode_start_date" = "cohort_start_date"),
    by = c("subject_id", "index_date")
  ) |>
  dplyr::mutate(indexMarkerGap = second_episode_start_date - first_episode_end_date)
#> # Source:   SQL [2 x 9]
#> # Database: DuckDB v0.10.1 [xihangc@Windows 10 x64:R 4.3.1/C:\Users\xihangc\AppData\Local\Temp\RtmpE3SweR\file5a902245fbd.duckdb]
#>   cohort_definition_id subject_id cohort_start_date cohort_end_date index_date
#>                  <int>      <int> <date>            <date>          <date>    
#> 1                    1        754 1950-08-12        1950-09-20      1950-09-20
#> 2                    1         80 1953-02-24        1953-12-31      1953-12-31
#> # ℹ 4 more variables: marker_date <date>, first_episode_end_date <date>,
#> #   second_episode_start_date <date>, indexMarkerGap <dbl>

By using indexMarkerGap = 30, participant 80 is excluded from the analysis, as their index-marker gap is larger than 30 days:

cdm <- generateSequenceCohortSet(
  cdm = cdm,
  indexTable = "aspirin",
  markerTable = "acetaminophen",
  name = "intersect",
  cohortDateRange = as.Date(c("1950-01-01","1969-01-01")),
  daysPriorObservation = 365,
  indexMarkerGap = 30)

cdm$intersect |>
  dplyr::filter(subject_id %in% c(80,754)) 
#> # Source:   SQL [1 x 6]
#> # Database: DuckDB v0.10.1 [xihangc@Windows 10 x64:R 4.3.1/C:\Users\xihangc\AppData\Local\Temp\RtmpE3SweR\file5a902245fbd.duckdb]
#>   cohort_definition_id subject_id cohort_start_date cohort_end_date index_date
#>                  <int>      <int> <date>            <date>          <date>    
#> 1                    1        754 1950-08-12        1950-09-20      1950-09-20
#> # ℹ 1 more variable: marker_date <date>

CDMConnector::cdmDisconnect(cdm = cdm)