Since high-throughput sequencing (HTS) was first used to analyze
immunoglobulin (B-cell receptor, antibody) and T-cell receptor
repertoires (Freeman et al, 2009; Robins et al, 2009; Weinstein et al,
2009), the growing number of studies using this technique has produced
enormous amounts of data, creating a pressing need to develop and adopt
common standards, protocols, and policies for generating and sharing
data sets. The Adaptive Immune Receptor Repertoire (AIRR) Community
formed in 2015 to address this challenge (Breden et al, 2017) and has
established the set of minimal metadata elements (MiAIRR) required for
describing published AIRR datasets (Rubelt et al, 2017), as well as file
formats to represent these data in a machine-readable form. The airr R
package provides read, write, and validation functions for data that
follow the AIRR Data Representation schemas. This vignette provides a
set of simple usage examples.
The AIRR Community’s recommendations for a minimal set of metadata that should be used to describe an AIRR-seq data set when published or deposited in an AIRR-compliant public repository are described in Rubelt et al, 2017. The primary aim of this effort is to make published AIRR datasets FAIR (findable, accessible, interoperable, reusable), with sufficient detail that a person skilled in the art of AIRR sequencing and data analysis will be able to reproduce the experiment and the data analyses that were performed.
Following these principles, V(D)J reference alignment annotations are saved in standard tab-delimited files (TSV), with associated metadata provided in accompanying YAML-formatted files. The column names and field names in these files have been defined by the AIRR Data Representation Working Group using a controlled vocabulary of standardized terms and types to refer to each piece of information.
The airr package contains the function read_rearrangement to read and
validate files containing AIRR Rearrangement records, where a
Rearrangement record describes the collection of optimal annotations on
a single sequence that has undergone V(D)J reference alignment. Usage is
straightforward, as the file format is a typical tab-delimited file. The
argument that needs attention is base, with possible values "0" and "1".
The base argument denotes the starting index for positional fields in
the input file. Positional fields are those that contain alignment
coordinates and have names ending in “_start” and “_end”. If the input
file uses 1-based closed intervals (R style), as defined by the
standard, then positional fields will not be modified under the default
setting of base="1". If the input file uses 0-based coordinates with
half-open intervals (Python style), then positional fields can be
converted to 1-based closed intervals using the argument base="0".
# Imports
library(airr)
library(tibble)
# Read Rearrangement example file
f1 <- system.file("extdata", "rearrangement-example.tsv.gz", package="airr")
rearrangement <- read_rearrangement(f1)
glimpse(rearrangement)
## Rows: 101
## Columns: 33
## $ sequence_id <chr> "SRR765688.7787", "SRR765688.35420", "SRR765688.366…
## $ sequence <chr> "NNNNNNNNNNNNNNNNNNNNGCTGACCTGCACCTTCTCTGGATTCTCACT…
## $ rev_comp <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ productive <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, T…
## $ vj_in_frame <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
## $ stop_codon <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
## $ v_call <chr> "IGHV2-5*02", "IGHV5-51*01", "IGHV7-4-1*02", "IGHV7…
## $ d_call <chr> "IGHD5-24*01", "IGHD3-16*02,IGHD3-3*01,IGHD3-3*02",…
## $ j_call <chr> "IGHJ4*02", "IGHJ6*02,IGHJ6*04", "IGHJ4*02", "IGHJ6…
## $ c_call <chr> "IGHG", "IGHG", "IGHG", "IGHG", "IGHG", "IGHA", "IG…
## $ sequence_alignment <chr> "..................................................…
## $ germline_alignment <chr> "CAGATCACCTTGAAGGAGTCTGGTCCT...ACGCTGGTGAAACCCACACA…
## $ junction <chr> "TGTGCACACAGTGCGGGATGGCTGCCTGATTACTGG", "TGTGCGAGGC…
## $ junction_aa <chr> "CAHSAGWLPDYW", "CARHGLYGCDHTGCYTSFYYYGMDVW", "CARE…
## $ v_cigar <chr> "20S56N21=1X11=1X7=1X9=3X62=6D2=1X1=2X2=2X50=1X7=1X…
## $ d_cigar <chr> "274S5N7=", "305S29N7=", "293S13N12=", "290S9N8=", …
## $ j_cigar <chr> "288S11N32=1X4=", "318S7N12=1X15=", "305S5N6=1X14=1…
## $ v_sequence_start <int> 21, 21, 21, 21, 21, 21, 21, 20, 22, 21, 21, 20, 21,…
## $ v_sequence_end <int> 269, 276, 283, 283, 283, 264, 283, 259, 281, 266, 2…
## $ v_germline_start <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ v_germline_end <int> 320, 320, 320, 320, 320, 320, 320, 318, 318, 320, 3…
## $ d_sequence_start <int> 275, 306, 294, 291, 284, 274, 290, 268, 282, 279, 2…
## $ d_sequence_end <int> 281, 312, 305, 298, 290, 281, 295, 276, 286, 291, 2…
## $ d_germline_start <int> 6, 30, 14, 10, 5, 13, 7, 10, 8, 8, 9, 10, 10, 7, 22…
## $ d_germline_end <int> 12, 36, 25, 17, 11, 20, 12, 18, 12, 20, 15, 16, 17,…
## $ j_sequence_start <int> 289, 319, 306, 322, 291, 297, 312, 281, 300, 301, 2…
## $ j_sequence_end <int> 325, 346, 348, 368, 309, 344, 349, 326, 339, 347, 3…
## $ j_germline_start <int> 12, 8, 6, 16, 18, 1, 12, 3, 9, 2, 5, 20, 9, 14, 15,…
## $ j_germline_end <int> 48, 35, 48, 62, 36, 48, 49, 48, 48, 48, 51, 62, 59,…
## $ junction_length <int> 36, 78, 45, 66, 33, 60, 48, 45, 36, 61, 51, 48, 51,…
## $ np1_length <int> 5, 29, 10, 7, 0, 9, 6, 8, 0, 12, 13, 3, 7, 8, 27, 5…
## $ np2_length <int> 7, 6, 0, 23, 0, 15, 16, 4, 13, 9, 4, 14, 2, 3, 9, 9…
## $ duplicate_count <int> 3, 3, 13, 3, 2, 2, 4, 2, 2, 2, 4, 2, 2, 2, 2, 2, 3,…
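To make the effect of the base argument more concrete, the following base-R sketch illustrates the coordinate conversion described above. It does not call airr: the data frame and the helper to_one_based are hypothetical, and the sketch assumes that moving from 0-based half-open intervals to 1-based closed intervals only requires adding 1 to each “_start” field while leaving “_end” fields unchanged.

```r
# Hypothetical positional fields stored as 0-based, half-open intervals
zero_based <- data.frame(
  v_sequence_start = c(20L, 20L),
  v_sequence_end   = c(269L, 276L)
)

# Hypothetical helper: shift every "_start" column by one.
# [start, end) in 0-based terms covers the same positions as
# [start + 1, end] in 1-based closed terms, so "_end" stays as is.
to_one_based <- function(data) {
  start_cols <- grep("_start$", names(data), value = TRUE)
  data[start_cols] <- lapply(data[start_cols], function(x) x + 1L)
  data
}

one_based <- to_one_based(zero_based)
print(one_based)
```

The covered length is the same in both conventions: end - start for the 0-based half-open form and end - start + 1 for the 1-based closed form, which is 249 for the first row either way.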
AIRR Data Model records, such as Repertoire and GermlineSet, can be read from either a YAML or JSON formatted file into a nested list.
# Read Repertoire example file
f2 <- system.file("extdata", "repertoire-example.yaml", package="airr")
repertoire <- read_airr(f2)
glimpse(repertoire)
## List of 1
## $ Repertoire:List of 3
## ..$ :List of 5
## .. ..$ repertoire_id : chr "1841923116114776551-242ac11c-0001-012"
## .. ..$ study :List of 13
## .. ..$ subject :List of 15
## .. ..$ sample :List of 1
## .. ..$ data_processing:List of 1
## ..$ :List of 5
## .. ..$ repertoire_id : chr "1602908186092376551-242ac11c-0001-012"
## .. ..$ study :List of 13
## .. ..$ subject :List of 15
## .. ..$ sample :List of 1
## .. ..$ data_processing:List of 1
## ..$ :List of 5
## .. ..$ repertoire_id : chr "2366080924918616551-242ac11c-0001-012"
## .. ..$ study :List of 13
## .. ..$ subject :List of 15
## .. ..$ sample :List of 1
## .. ..$ data_processing:List of 1
# Read GermlineSet example file
f3 <- system.file("extdata", "germline-example.json", package="airr")
germline <- read_airr(f3)
glimpse(germline)
## List of 2
## $ GermlineSet:List of 1
## ..$ :List of 17
## .. ..$ germline_set_id : chr "OGRDB:G00007"
## .. ..$ author : chr "William Lees"
## .. ..$ lab_name : chr ""
## .. ..$ lab_address : chr "Birkbeck College, University of London, Malet Street, London"
## .. ..$ acknowledgements : list()
## .. ..$ release_version : int 1
## .. ..$ release_description : chr ""
## .. ..$ release_date : chr "2021-11-24"
## .. ..$ germline_set_name : chr "CAST IGH"
## .. ..$ germline_set_ref : chr "OGRDB:G00007.1"
## .. ..$ pub_ids : chr ""
## .. ..$ species :List of 2
## .. ..$ species_subgroup : chr "CAST_EiJ"
## .. ..$ species_subgroup_type: chr "strain"
## .. ..$ locus : chr "IGH"
## .. ..$ allele_descriptions :List of 2
## .. ..$ curation : NULL
## $ GenotypeSet:List of 1
## ..$ :List of 2
## .. ..$ receptor_genotype_set_id: chr "1"
## .. ..$ genotype_class_list :List of 1
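Because these records are held as plain nested lists, individual fields can be reached with standard list indexing. The sketch below uses a small hand-built list that only mimics the shape of the Repertoire output shown above, so it runs without the package. The repertoire_id values are copied from that output; the study_id values are placeholders invented for illustration.

```r
# Hand-built stand-in for the nested list structure of a Repertoire file.
# The study_id values are placeholders, not real identifiers.
repertoire <- list(
  Repertoire = list(
    list(repertoire_id = "1841923116114776551-242ac11c-0001-012",
         study = list(study_id = "example-study")),
    list(repertoire_id = "1602908186092376551-242ac11c-0001-012",
         study = list(study_id = "example-study"))
  )
)

# Access one field of the first record
first_id <- repertoire$Repertoire[[1]]$repertoire_id

# Collect the same field from every record into a character vector
all_ids <- vapply(repertoire$Repertoire,
                  function(r) r$repertoire_id,
                  character(1))
print(all_ids)
```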
The airr package contains the function write_rearrangement to write
Rearrangement records to the AIRR TSV format.
AIRR Data Model records can be written to either YAML or JSON using the
write_airr function.
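A minimal sketch of both writers follows. It assumes the airr package is installed and reuses the example files bundled with it; the output paths under tempdir() are arbitrary choices made for illustration, and the explicit format argument to write_airr is used here on the assumption that it accepts "yaml" and "json".

```r
library(airr)

# Load the example data shipped with the package
f1 <- system.file("extdata", "rearrangement-example.tsv.gz", package="airr")
f2 <- system.file("extdata", "repertoire-example.yaml", package="airr")
f3 <- system.file("extdata", "germline-example.json", package="airr")
rearrangement <- read_rearrangement(f1)
repertoire <- read_airr(f2)
germline <- read_airr(f3)

# Write Rearrangement records to a TSV file (illustrative path)
x1 <- file.path(tempdir(), "airr_out.tsv")
write_rearrangement(rearrangement, x1)

# Write AIRR Data Model records to YAML and JSON (illustrative paths)
x2 <- file.path(tempdir(), "airr_repertoire_out.yaml")
write_airr(repertoire, x2, format="yaml")
x3 <- file.path(tempdir(), "airr_germline_out.json")
write_airr(germline, x3, format="json")

# Read the TSV back to confirm the row count is preserved
roundtrip <- read_rearrangement(x1)
```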
The airr package contains the functions validate_rearrangement and
validate_airr to validate tabular (data.frame) Rearrangement records and
AIRR Data Model objects, respectively.
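The calls behind the output that follows are not shown in this text, so the sketch below is a plausible reconstruction rather than the vignette's exact code. It assumes the airr package is installed, that validate_airr is the function used for Data Model objects, and that its each argument returns one result per top-level object instead of a single value.

```r
library(airr)

# Load the example data shipped with the package
rearrangement <- read_rearrangement(
  system.file("extdata", "rearrangement-example.tsv.gz", package="airr"))
repertoire <- read_airr(
  system.file("extdata", "repertoire-example.yaml", package="airr"))
germline <- read_airr(
  system.file("extdata", "germline-example.json", package="airr"))

# Validate a Rearrangement data.frame
ok_rearrangement <- validate_rearrangement(rearrangement)

# Validate an AIRR Data Model object as a whole
ok_repertoire <- validate_airr(repertoire)

# Validate each top-level object separately (assumed use of each=TRUE)
ok_germline <- validate_airr(germline, each=TRUE)

print(ok_rearrangement)
print(ok_repertoire)
print(ok_germline)
```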
## [1] TRUE
## [1] TRUE
## GenotypeSet GermlineSet
## TRUE TRUE