The smoothy package is an R package that provides a collection of functions to smooth the drug exposition using electronic health records. Also, the package helps to prepare the electronic drug records to model the drug exposition and perform longitudinal treatment pattern analysis. This vignette serves as a guide to understanding the main functionalities of the package and provides examples of how to use the functions.
You can install the smoothy
package from CRAN using the
following command:
The smoothy
package includes the following main
functions:
smooth_algorithm()
: This function performs the smooth
algorithm as descrubibed in Ouchi et.al.[ref]smooth_parse()
: This function transforms the data to
obtain the daily treatment.smooth_deparse()
: This function returns to the original
format but collapsing all the drugs exposed on the same day.smooth_diff()
: This function computes the differences
between the original treatment and the smoothed treatment.The algorithm described in the published paper [ref] is based on moving averages commonly used in time series analysis. Its purpose is to facilitate modeling of drug exposure using electronic drug records. The algorithm calculates the average or most probable treatment exposure for each day of the follow-up period.
The package contains all the necessary functions to execute the algorithm, which can be divided into three main steps:
smooth_parse()
functionsmooth_algorithm()
functionsmooth_deparse()
functionAdditionally, the package include one function to validate and better assess the impact of the algorithm on the original records:
smooth_diff()
: Calculates the differences between the
original data and the smoothed dataThe raw data (input data) should be structured as a data frame with columns representing unique identifiers for each drug administration event (id), the start date of drug administration (start_date), the end date of drug administration (end_date), and the name of the drug administered (drug).
The smoothy package comes with an example dataset called drugstreatment. This dataset contains information of 387 patients and the prescribed drugs. It has the following variables:
You can load the example dataset using the following command:
id | start_date | end_date | drug |
---|---|---|---|
867e99f2-6cc1-492c-a1c0-e36f48c7aa95 | 1971-01-02 | 1971-01-04 | C |
867e99f2-6cc1-492c-a1c0-e36f48c7aa95 | 1971-01-05 | 1971-03-09 | C |
867e99f2-6cc1-492c-a1c0-e36f48c7aa95 | 1971-03-10 | 1971-04-07 | C |
867e99f2-6cc1-492c-a1c0-e36f48c7aa95 | 1971-04-08 | 1971-04-28 | C |
867e99f2-6cc1-492c-a1c0-e36f48c7aa95 | 1971-04-29 | 1971-06-05 | C |
867e99f2-6cc1-492c-a1c0-e36f48c7aa95 | 1971-06-06 | 1971-08-24 | C |
For this example, we are going to use a single patient:
We will expand the original drug records to represent daily active treatment during the defined study period.
structured_df <- smooth_parse(
id = my_data$id,
start_date = my_data$start_date,
end_date = my_data$end_date,
drug = my_data$drug,
study_from = "1970-01-01",
study_to = "1975-01-01"
)
The smooth_parse()
function will return a new data frame
with four columns:
id | day | N | treatment |
---|---|---|---|
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 | 1970-01-01 | 0 | None |
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 | 1970-01-02 | 0 | None |
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 | 1970-01-03 | 0 | None |
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 | 1970-01-04 | 0 | None |
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 | 1970-01-05 | 0 | None |
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 | 1970-01-06 | 0 | None |
Next, we will apply the smooth algorithm to the parsed data.
smoothed <- smooth_algorithm(
id = structured_df$id,
treatment = structured_df$treatment,
day = structured_df$day,
N = structured_df$N,
width = 61
)
This function performs the algorithm on each patient and it returns the same data frame but with a new column smoothed_treatment:
id | treatment | day | N | smoothed_treatment |
---|---|---|---|---|
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 | None | 1970-01-01 | 0 | None |
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 | None | 1970-01-02 | 0 | None |
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 | None | 1970-01-03 | 0 | None |
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 | None | 1970-01-04 | 0 | None |
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 | None | 1970-01-05 | 0 | None |
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 | None | 1970-01-06 | 0 | None |
To perform analysis on the treatment pattern, is better to collapse the treatments according to the exposition periods (end date and start date).
The smooth_parse()
function use the smoothed data frame
as input and returns a new one very similar to the original dataset that
we are more commonly using in analysis. It has four columns:
id | start_date | end_date | treatment |
---|---|---|---|
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 | 1970-01-01 | 1972-03-01 | None |
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 | 1972-03-02 | 1972-04-13 | A |
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 | 1972-04-14 | 1972-10-31 | A+C |
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 | 1972-11-01 | 1973-06-21 | A+B+C |
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 | 1973-06-22 | 1973-12-21 | A+B |
62eb1ebd-1b49-4ba7-9af9-ec0c4ada0956 | 1973-12-22 | 1975-01-01 | None |
Also, the function can be used on the non-smoothed treatment pattern (column treatment) which is the same as the initial data but including the untreated periods.
The validation includes assessing the differences before and after applying the smooth algorithm, measuring the number of days changed and the proportion of total days, and visualizing the longitudinal treatment pattern to see the impact of the algorithm in the overall pattern.
We will calculate the number of days changed and the proportion of total days in overall, exposition period and at each treatment.
type | days_changed | proportion_of_change |
---|---|---|
Global | 54 | 0.0295567 |
Expousure period | 53 | 0.0804249 |
drug change A | 1 | 0.0333333 |
drug change A+B | 22 | 0.1073171 |
drug change A+B+C | 0 | 0.0000000 |
drug change A+C | 0 | 0.0000000 |
drug change C | 31 | 1.0000000 |
Using the following code, in ggplot2, we can visualize the longitudinal treatment pattern of each individual before and after using the smooth algorithm.
require(ggplot2)
#> Loading required package: ggplot2
require(gridExtra)
#> Loading required package: gridExtra
tts = unique(deparsed$treatment)
tts = tts[tts!='None']
yorder = c('None',tts[order(nchar(tts), tts)])
p = ggplot(deparsed, aes(x = start_date, y = treatment)) + geom_segment(aes(x = start_date,
xend = end_date, y = treatment, yend = treatment), size = 2, alpha = 0.85,
col = 'grey20') + theme_bw() + scale_x_date(date_breaks = '6 months') +
scale_y_discrete(limits = yorder) + theme(axis.text.x = element_text(angle = 60,
hjust = 1), legend.position = "none") + ylab("") + xlab("") +
ggtitle(paste0('Original treatment'))
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#> ℹ Please use `linewidth` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
tts = unique(deparsed.sm$treatment)
tts = tts[tts!='None']
yorder = c('None',tts[order(nchar(tts), tts)])
p.sm = ggplot(deparsed.sm, aes(x = start_date, y = treatment)) + geom_segment(aes(x = start_date,
xend = end_date, y = treatment, yend = treatment), size = 2, alpha = 0.85,
col = 'grey20') + theme_bw() + scale_x_date(date_breaks = '6 months') +
scale_y_discrete(limits = yorder) + theme(axis.text.x = element_text(angle = 60,
hjust = 1), legend.position = "none") + ylab("") + xlab("") +
ggtitle(paste0('Smoothed treatment'))
grid.arrange(p,p.sm,ncol=1)
The default window size is set to 61 days. This value is based on various simulation studies and real-world analyses, which have shown that 61 days effectively captures treatment patterns without omitting critical treatment periods.
However, it is essential to note that this window size may not be suitable for all patients. For cases where there are significant changes in the treatment pattern, it may be beneficial to manually review and assess whether the algorithm oversimplifies the treatment pattern.
In the following example, we have selected a patient with frequent treatment changes to illustrate the effects of different window sizes:
my_data <- dplyr::filter(drugstreatment, id == '25094328-3819-4061-941d-191c4e0bc939')
structured_df <- smooth_parse(
id = my_data$id,
start_date = my_data$start_date,
end_date = my_data$end_date,
drug = my_data$drug,
study_from = "1970-01-01",
study_to = "1975-01-01"
)
# smooth
smoothed <- smooth_algorithm(id = structured_df$id,
treatment = structured_df$treatment,
day = structured_df$day,
N = structured_df$N,
width = 31)
smoothed45 <- smooth_algorithm(id = structured_df$id,
treatment = structured_df$treatment,
day = structured_df$day,
N = structured_df$N,
width = 45)
smoothed61 <- smooth_algorithm(id = structured_df$id,
treatment = structured_df$treatment,
day = structured_df$day,
N = structured_df$N,
width = 61)
deparsed <- smooth_deparse(smoothed$id, smoothed$day, smoothed$treatment)
deparsed.sm <- smooth_deparse(smoothed$id, smoothed$day, smoothed$smoothed_treatment)
deparsed.sm45 <- smooth_deparse(smoothed45$id, smoothed45$day, smoothed45$smoothed_treatment)
deparsed.sm61 <- smooth_deparse(smoothed61$id, smoothed61$day, smoothed61$smoothed_treatment)
Executing the entire workflow in a real study can be computationally intensive. Without a workstation equipped with ample RAM and high-speed processors, applying the smooth algorithm without parallel computing may prove to be unfeasible.
To overcome this challenge, we demonstrate how to run the algorithm on the entire example dataset by dividing it into manageable chunks and parallelizing the calculations in R.
To achieve this, make sure to have the following R packages installed:
Below is a script demonstrating how to run the entire workflow in parallel in R:
In this example, the chunk size is 50, but in a big data set we recommend to use chunks of size 1000.
Parameters to pass into the foreach()
function.
Additionally, the progress bar feature provides a helpful visual
representation of the ongoing computation, allowing you to monitor the
progress and estimate the remaining time until completion
The entire workflow is encompassed within the foreach() function, executed in a loop across all the data chunks using %dopar%. During each iteration, the computed output is saved in the tempdir() location in the RDS format. This allows for parallel processing, optimizing the performance and enhancing computational efficiency.
# path to temporal directory:
tmp.path = tempdir()
diff=FALSE
l <- foreach(c = 1:chunks,
.packages = c("Kendall", "smoothy", "data.table", "anytime", "dplyr"),
.options.snow = opts,
.multicombine = F) %dopar% {
chunk.id <- unique(data$id)[inds[[c]]]
# run the workflow in each individual from the chunk:
df <- filter(data, id %in% chunk.id)
# 1) parse data:
structured_df <- smooth_parse(
id = df$id,
start_date = df$start_date,
end_date = df$end_date,
drug = df$drug,
study_from = "1970-01-01",
study_to = "1975-01-01"
)
# 2) smooth algorithm:
width <- 61
smoothed <- smooth_algorithm(
id = structured_df$id,
treatment = structured_df$treatment,
day = structured_df$day,
N = structured_df$N,
width = width
)
# 3) deparse data (original format):
deparsed_smoothed <- smooth_deparse(
smoothed$id,
smoothed$day,
smoothed$smoothed_treatment
)
# 4) Per patient changes due to smooth algorithm:
if(diff){
# Calculate differences by patient mapping with the group_map function:
df <- smoothed %>%
group_by(id) %>%
group_map(~ smooth_diff(.$treatment,.$smoothed_treatment)) %>%
bind_rows(.id = "group_id") %>%
data.frame
# Format output and filter global, exposure period:
df <- df %>%
mutate(percentage_of_change = round(proportion_of_change*100,2)) %>%
filter(type%in%c('Global','Exposure period')) %>%
mutate(type = factor(type,levels=c('Global','Exposure period'),
labels=c('total_change','exposure_change')))
# add 'id' and reshape:
df <- df %>%
left_join(data.frame(id=unique(smoothed$id),group_id = as.character(seq(1,n_distinct(smoothed$id))))) %>%
reshape2::dcast(id~type,value.var='percentage_of_change')
# attach to deparsed_smoothed dataframe:
deparsed_smoothed <- left_join(
deparsed_smoothed,
df
)
rm(df)
}
# Save chunk output to a temporary folder:
saveRDS(deparsed_smoothed,paste0(tmp.path,"/chunk_",c,".rds"))
rm(df,structured_df,smoothed,deparsed_smoothed);gc()
}
Within the loop, there is an optional code segment used to calculate the differences after applying the smooth algorithm at a patient level. This step is entirely optional, but it can be useful for validation, especially in cases where patients exhibit significant changes (approximately 10% of change).
# Calculate differences by patient mapping with the group_map function:
aux <- smoothed %>%
group_by(id) %>%
group_map(~ smooth_diff(.$treatment,.$smoothed_treatment)) %>%
bind_rows(.id = "group_id") %>%
data.frame
# Format output and filter global, exposure period:
aux <- aux %>%
mutate(percentage_of_change = round(proportion_of_change*100,2)) %>%
filter(type%in%c('Global','Exposure period')) %>%
mutate(type = factor(type,levels=c('Global','Exposure period'),
labels=c('total_change','exposure_change')))
# add 'id' and reshape:
aux <- aux %>%
left_join(data.frame(id=unique(smoothed$id),group_id = as.character(seq(1,n_distinct(smoothed$id))))) %>%
reshape2::dcast(id~type,value.var='percentage_of_change')
# join to deparsed_smoothed dataframe:
deparsed_smoothed <- left_join(
deparsed_smoothed,
aux
)
After completing the process, it is essential to close the clusters to release the computing resources. Additionally, you may want to calculate the total computation time for the entire workflow.
To complete the process, we read the .rds files stored in the temporary folder and combine them into a single data.frame using the bind_rows function:
# Import and combine all chunks into a single data.frame:
rds_files <- list.files(tmp.path, pattern = "chunk_", full.names = TRUE)
all_chunks <- bind_rows(lapply(rds_files, readRDS))
The computation time may vary depending on the number of samples and the study period. We conducted several tests on both Mac and Windows platforms, and the performance was similar.
As a reference:
The entire process took 16 hours to finish.