
Documentation • Get Started • Issues • Contributing
Languages: English | 简体中文
ukbflow provides a streamlined, RAP-native R workflow for UK Biobank analysis, from phenotype extraction and disease derivation to association analysis and publication-quality figures.
UK Biobank Data Policy (2024+): Individual-level data must remain within the RAP environment. Only summary-level outputs may be downloaded locally. All ukbflow functions are designed with this constraint in mind.
```r
library(ukbflow)

# Simulate UKB-style data locally (on RAP: replace with extract_batch() + job_wait())
data <- ops_toy(n = 5000, seed = 2026) |>
  derive_missing()

# Derive lung cancer outcome (ICD-10 C34) and follow-up time
data <- data |>
  derive_icd10(name = "lung", icd10 = "C34",
               source = c("cancer_registry", "hes")) |>
  derive_followup(name = "lung",
                  event_col    = "lung_icd10_date",
                  baseline_col = "p53_i0",
                  censor_date  = as.Date("2022-10-31"),
                  death_col    = "p40000_i0")

# Define exposure: ever vs. never smoker
data[, smoking_ever := factor(
  ifelse(p20116_i0 == "Never", "Never", "Ever"),
  levels = c("Never", "Ever")
)]

# Cox regression: smoking → lung cancer (3-model adjustment)
res <- assoc_coxph(data,
                   outcome_col  = "lung_icd10",
                   time_col     = "lung_followup_years",
                   exposure_col = "smoking_ever",
                   covariates   = c("p21022", "p31", "p22189"))

# Forest plot
res_df <- as.data.frame(res)
plot_forest(
  data      = res_df,
  est       = res_df$HR,
  lower     = res_df$CI_lower,
  upper     = res_df$CI_upper,
  ci_column = 2L
)
```

```r
# Recommended
pak::pkg_install("evanbio/ukbflow")

# or
remotes::install_github("evanbio/ukbflow")
```

Requirements: R ≥ 4.1 · dxpy (dx-toolkit, required for RAP interaction)
```bash
pip install dxpy
```

| Layer | Key Functions | Description |
|---|---|---|
| Connection | auth_login, auth_select_project | Authenticate to RAP via dx-toolkit |
| Data Access | fetch_metadata, extract_batch, job_wait | Retrieve phenotype data from UKB dataset on RAP |
| Data Processing | decode_names, decode_values, derive_icd10, derive_followup, derive_case | Harmonize multi-source records; derive analysis-ready cohort |
| Association Analysis | assoc_coxph, assoc_logistic, assoc_subgroup | Three-model adjustment; subgroup & trend analysis |
| Genomic Scoring | grs_bgen2pgen, grs_score, grs_standardize | Distributed plink2 scoring on RAP worker nodes |
| Visualization | plot_forest, plot_tableone | Publication-ready figures & tables |
| Utilities | ops_setup, ops_toy, ops_na, ops_snapshot, ops_withdraw | Environment check, synthetic data, pipeline diagnostics, and cohort management |
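The "three-model adjustment" mentioned in the table can be pictured as fitting the same exposure under progressively richer covariate sets. The sketch below is not ukbflow code and does not claim to reproduce what assoc_coxph() does internally. It uses the survival package on simulated data, and the covariate names (age, sex, tdi) and the nesting scheme (unadjusted, age + sex, fully adjusted) are assumptions chosen for illustration.

```r
# Illustrative only: survival::coxph on simulated data, not ukbflow internals.
library(survival)

set.seed(2026)
n <- 2000
toy <- data.frame(
  age    = rnorm(n, mean = 57, sd = 8),
  sex    = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  tdi    = rnorm(n),  # hypothetical deprivation-style covariate
  smoker = factor(sample(c("Never", "Ever"), n, replace = TRUE),
                  levels = c("Never", "Ever"))
)

# Simulate event times with a higher hazard for ever-smokers and older people
rate       <- 0.01 * exp(0.7 * (toy$smoker == "Ever") + 0.03 * (toy$age - 57))
event_time <- rexp(n, rate = rate)
toy$time   <- pmin(event_time, 15)           # administrative censoring at 15 years
toy$status <- as.integer(event_time <= 15)   # 1 = event observed, 0 = censored

# Three nested adjustment sets (an assumed, commonly used scheme)
models <- list(
  unadjusted = Surv(time, status) ~ smoker,
  age_sex    = Surv(time, status) ~ smoker + age + sex,
  full       = Surv(time, status) ~ smoker + age + sex + tdi
)

# Fit each model and keep the hazard ratio for the exposure term only
three_model <- do.call(rbind, lapply(names(models), function(m) {
  fit <- coxph(models[[m]], data = toy)
  ci  <- exp(confint(fit))["smokerEver", ]
  data.frame(model    = m,
             HR       = unname(exp(coef(fit)["smokerEver"])),
             CI_lower = unname(ci[1]),
             CI_upper = unname(ci[2]))
}))
print(three_model)
```

The result has one row per model with HR, CI_lower and CI_upper columns, which is the kind of summary-level table a forest plot takes as input.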
- auth_login(), auth_status(), auth_logout(), auth_list_projects(), auth_select_project(): RAP authentication

- fetch_ls(), fetch_tree(), fetch_url(), fetch_file(): RAP file system
- fetch_metadata(), fetch_field(): UKB metadata shortcuts

- extract_ls(), extract_pheno(), extract_batch(): phenotype extraction

- decode_values(): integer codes → human-readable labels
- decode_names(): field IDs → snake_case column names

- job_status(): query job status by ID
- job_wait(): block until job completes (with timeout)
- job_path(): get output path of a completed job
- job_result(): retrieve job result object
- job_ls(): list recent jobs

- derive_missing(): handle "Do not know" / "Prefer not to answer"
- derive_covariate(): type conversion + summary
- derive_cut(): bin continuous variables
- derive_selfreport(): self-reported disease status + date
- derive_hes(): HES inpatient ICD-10
- derive_first_occurrence(): First Occurrence fields
- derive_cancer_registry(): cancer registry
- derive_death_registry(): death registry
- derive_icd10(): combine sources (wrapper)
- derive_case(): merge self-report + ICD-10
- derive_timing(): prevalent vs. incident classification
- derive_age(): age at event
- derive_followup(): follow-up end date and duration

- assoc_coxph() / assoc_cox(): Cox proportional hazards (HR)
- assoc_logistic() / assoc_logit(): logistic regression (OR)
- assoc_linear() / assoc_lm(): linear regression (β)
- assoc_coxph_zph(): proportional hazards assumption test
- assoc_subgroup(): stratified analysis + interaction LRT
- assoc_trend(): dose-response trend + p_trend
- assoc_competing(): Fine-Gray competing risks (SHR)
- assoc_lag(): lagged exposure sensitivity analysis

- plot_forest(): forest plot (PNG / PDF / JPG / TIFF, 300 dpi)
- plot_tableone(): Table 1 (DOCX / HTML / PDF / PNG)

- ops_setup(): environment health check (dx CLI, RAP auth, R packages)
- ops_toy(): generate synthetic UKB-like data for development and testing
- ops_na(): summarise missing values (NA and "") across all columns
- ops_snapshot(): record pipeline checkpoints and track dataset changes
- ops_snapshot_cols(): retrieve column list from a saved snapshot
- ops_snapshot_diff(): compare columns between two snapshots
- ops_snapshot_remove(): remove columns added after a given snapshot
- ops_set_safe_cols(): define protected columns that ops_snapshot_remove will not drop
- ops_withdraw(): exclude UKB withdrawn participants from a cohort

- grs_check(): validate SNP weights file
- grs_bgen2pgen(): convert BGEN → PGEN on RAP (submits cloud jobs)
- grs_score(): score GRS across chromosomes with plink2
- grs_standardize() / grs_zscore(): Z-score standardisation
- grs_validate(): OR/HR per SD, high vs low, trend, AUC/C-index

Full vignettes and function reference:
https://evanbio.github.io/ukbflow/
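For readers new to survival-style cohorts, the "follow-up end date and duration" that derive_followup() is described as producing can be pictured with a few lines of base R. This is a hedged sketch of the general idea, not the package's implementation: the column names (baseline, event, death) and the dates are made up, and it assumes that follow-up ends at the earliest of the event date, the death date, or an administrative censoring date.

```r
# Illustrative only: base R, hypothetical column names and made-up dates.
cohort <- data.frame(
  baseline = as.Date(c("2008-03-01", "2009-06-15", "2010-01-10")),
  event    = as.Date(c("2015-03-01", NA, NA)),   # NA = no event recorded
  death    = as.Date(c(NA, "2012-06-15", NA))    # NA = no death recorded
)
censor_date <- as.Date("2022-10-31")

# Follow-up ends at the earliest of event, death, or administrative censoring
cohort$end <- pmin(cohort$event, cohort$death, censor_date, na.rm = TRUE)

# Duration in years from baseline to end of follow-up
cohort$followup_years <- as.numeric(cohort$end - cohort$baseline) / 365.25

# Event indicator: 1 only if the event occurred on or before the end date
cohort$status <- as.integer(!is.na(cohort$event) & cohort$event <= cohort$end)

print(cohort)
```

In this toy cohort the first participant has the event after roughly seven years, the second is censored at death, and the third is followed to the administrative censoring date.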
Bug reports, feature requests, and pull requests are welcome. See CONTRIBUTING.md.
MIT License © 2026 Yibin Zhou
Made with ❤️ by Yibin Zhou