Here we visualize chemical metrics for archaeal and bacterial taxa and viruses using precomputed reference proteomes in the chem16S package.
Chemical metrics are molecular properties computed from elemental compositions – inferred from amino acid compositions of proteins – and include carbon oxidation state (ZC) and stoichiometric hydration state (nH2O), as described by Dick et al. (2020).
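To illustrate how a chemical metric follows from an elemental composition, the carbon oxidation state of a species with formula CcHhNnOoSs and net charge z can be written as ZC = (-h + 3n + 2o + 2s + z) / c. The sketch below applies this to a few free amino acids. The helper name `formula_Zc` is hypothetical and is not part of chem16S or canprot, which compute metrics from the amino acid compositions of whole proteins.

```r
# Hypothetical helper (not part of chem16S or canprot):
# carbon oxidation state from element counts and net charge
formula_Zc <- function(C, H, N = 0, O = 0, S = 0, Z = 0) {
  (-H + 3 * N + 2 * O + 2 * S + Z) / C
}

# Elemental compositions of some neutral free amino acids
amino_acids <- data.frame(
  row.names = c("glycine", "alanine", "cysteine", "methionine"),
  C = c(2, 3, 3, 5),
  H = c(5, 7, 7, 11),
  N = c(1, 1, 1, 1),
  O = c(2, 2, 2, 2),
  S = c(0, 0, 1, 1)
)

# One Zc value per amino acid
Zc <- with(amino_acids, formula_Zc(C, H, N, O, S))
names(Zc) <- rownames(amino_acids)
round(Zc, 3)
```

Glycine comes out as the most oxidized of the four and methionine as the most reduced, which is the kind of contrast that ZC summarizes for whole proteomes.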
This vignette uses the RefSeq reference database for reference proteomes. The GTDB reference database is also available in chem16S (and is the default for functions in the package), but doesn’t have viral reference proteomes, which are visualized below.
This vignette was compiled on 2024-07-01 with chem16S version 1.1.0.
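library(chem16S)
# Read amino acid compositions of reference proteomes
taxon_AA <- read.csv(system.file("RefDB/RefSeq_206/taxon_AA.csv.xz", package = "chem16S"))
# The "protein" column holds the taxonomic rank of each entry
ranks <- taxon_AA$protein
# Count the taxa at each rank, in order of first appearance
table(ranks)[unique(ranks)]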
## ranks
##      species        genus       family        order        class       phylum
##            1         4737          757          299          138           68
## superkingdom
##            3
taxnames <- read.csv(system.file("RefDB/RefSeq_206/taxonomy.csv.xz", package = "chem16S"))
phylum_to_genus <- function(phylum) na.omit(unique(taxnames$genus[taxnames$phylum == phylum]))
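# Zc of every reference proteome (same pattern as used for nH2O below)
taxon_Zc <- canprot::calc_metrics(taxon_AA, "Zc")[, 1]
get_Zc <- function(genera) na.omit(taxon_Zc[match(genera, taxon_AA$organism)])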
sapply(sapply(sapply(c("Crenarchaeota", "Euryarchaeota"), phylum_to_genus), get_Zc), mean)
## Crenarchaeota Euryarchaeota
## -0.2148740 -0.1220774
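The lookup pattern used above (taxon names to genera, genera to metric values, values to a mean) can be tried without the package data. The following is a minimal sketch on made-up data; every name prefixed with `toy_` is hypothetical and only mirrors the structure of `taxnames`, `taxon_AA$organism`, and `taxon_Zc`.

```r
# Toy stand-ins for the taxonomy table and per-taxon Zc values (made-up numbers)
toy_taxnames <- data.frame(
  phylum = c("PhylumA", "PhylumA", "PhylumB", "PhylumB", NA),
  genus  = c("Genus1", "Genus2", "Genus3", NA, "Genus4")
)
toy_organism <- c("Genus1", "Genus2", "Genus3")
toy_Zc <- c(-0.20, -0.10, -0.12)

# Unique, non-missing genera in a phylum
toy_phylum_to_genus <- function(phylum) {
  na.omit(unique(toy_taxnames$genus[toy_taxnames$phylum == phylum]))
}
# Zc values for genera that have a reference proteome; unmatched genera are dropped
toy_get_Zc <- function(genera) na.omit(toy_Zc[match(genera, toy_organism)])

phyla <- c("PhylumA", "PhylumB")
# Keep intermediate results as lists, one element per phylum
genera_list <- sapply(phyla, toy_phylum_to_genus, simplify = FALSE)
Zc_list <- lapply(genera_list, toy_get_Zc)
toy_means <- sapply(Zc_list, mean)
toy_means
```

Using `simplify = FALSE` and `lapply()` keeps the intermediate results as lists even when two groups happen to contain the same number of genera, a case in which `sapply()` would return a matrix instead.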
Within the Euryarchaeota, there are classes with extremely high and low ZC (see below and Dick and Tan, 2023). Let’s look at a couple of them:
class_to_genus <- function(class) na.omit(unique(taxnames$genus[taxnames$class == class]))
sapply(sapply(sapply(c("Methanococci", "Halobacteria"), class_to_genus), get_Zc), mean)
## Methanococci Halobacteria
## -0.22562725 -0.08043995
Read full list of taxonomic names; remove viruses; prune to keep unique genera; count number of genera in each phylum; calculate ZC for genera in 20 most highly represented phyla of Bacteria and Archaea; order according to mean ZC; make boxplots:
taxnames2 <- taxnames[taxnames$superkingdom != "Viruses", ]
taxnames3 <- taxnames2[!duplicated(taxnames2$genus), ]
(top20_phyla <- head(sort(table(taxnames3$phylum), decreasing = TRUE), 20))
##
##              Proteobacteria                  Firmicutes
##                        1375                         672
##              Actinobacteria               Bacteroidetes
##                         419                         354
##               Cyanobacteria               Euryarchaeota
##                         113                         107
##              Planctomycetes                 Chloroflexi
##                          62                          38
##             Verrucomicrobia               Crenarchaeota
##                          32                          28
##               Acidobacteria                Spirochaetes
##                          25                          19
##                 Thermotogae                   Aquificae
##                          14                          13
##               Synergistetes                  Chlamydiae
##                          13                          12
##                Fusobacteria                 Tenericutes
##                          12                          12
##              Thaumarchaeota Candidatus Thermoplasmatota
##                          12                          10
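The counting, ordering, and boxplot steps listed above can be sketched on toy data as follows. All names and values here are made up for illustration, and `split()` is used in place of the vignette's `match()`-based lookup to keep the example short.

```r
# Toy taxonomy with made-up names; one row per genus
toy_tax <- data.frame(
  phylum = c("PhylumA", "PhylumA", "PhylumA", "PhylumB", "PhylumB", "PhylumC"),
  genus  = paste0("Genus", 1:6)
)
# Made-up Zc values, one per genus
toy_Zc <- c(-0.12, -0.10, -0.14, -0.22, -0.20, -0.05)

# Count genera per phylum and keep the two most highly represented phyla
top2 <- head(sort(table(toy_tax$phylum), decreasing = TRUE), 2)

# Collect the values for each retained phylum into a named list
Zc_by_phylum <- split(toy_Zc, toy_tax$phylum)[names(top2)]

# Order the groups by their mean so the boxes run from low to high Zc
Zc_by_phylum <- Zc_by_phylum[order(sapply(Zc_by_phylum, mean))]
names(Zc_by_phylum)

# Horizontal boxplot with one box per phylum
boxplot(Zc_by_phylum, horizontal = TRUE, las = 1, xlab = "Zc")
```

Ordering the list before plotting is what places the group with the lowest mean at the bottom of a horizontal boxplot.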
Notice the large range for Euryarchaeota and Proteobacteria. Let’s take a closer look at the classes within each phylum.
opar <- par(mfrow = c(1, 2), mar = c(4, 10, 1, 1))
for(phylum in c("Euryarchaeota", "Proteobacteria")) {
  # Classes within this phylum
  taxnames4 <- taxnames3[taxnames3$phylum == phylum, ]
  classes <- na.omit(unique(taxnames4$class))
  # Zc of genera in each class, ordered by class mean
  Zc_list <- sapply(sapply(classes, class_to_genus), get_Zc)
  order_Zc <- order(sapply(Zc_list, mean))
  Zc_list <- Zc_list[order_Zc]
  boxplot(Zc_list, horizontal = TRUE, las = 1, xlab = chemlab("Zc"))
}
# Restore the previous graphical parameters
par(opar)
See take-home message #1.
As above, but calculate nH2O instead of ZC.
taxon_nH2O <- canprot::calc_metrics(taxon_AA, "nH2O")[, 1]
get_nH2O <- function(genera) na.omit(taxon_nH2O[match(genera, taxon_AA$organism)])
opar <- par(mfrow = c(1, 2), mar = c(4, 10, 1, 1))
for(phylum in c("Euryarchaeota", "Proteobacteria")) {
  # Classes within this phylum
  taxnames4 <- taxnames3[taxnames3$phylum == phylum, ]
  classes <- na.omit(unique(taxnames4$class))
  # nH2O of genera in each class, ordered by class mean
  nH2O_list <- sapply(sapply(classes, class_to_genus), get_nH2O)
  order_nH2O <- order(sapply(nH2O_list, mean))
  nH2O_list <- nH2O_list[order_nH2O]
  boxplot(nH2O_list, horizontal = TRUE, las = 1, xlab = chemlab("nH2O"))
}
# Restore the previous graphical parameters
par(opar)
See take-home message #2.
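# Assumption: top50_phyla holds genus counts for the 50 most highly represented phyla,
# with viruses included (its definition is not shown in this section)
top50_phyla <- head(sort(table(taxnames$phylum[!duplicated(taxnames$genus)]), decreasing = TRUE), 50)
Zc_mean <- sapply(sapply(sapply(names(top50_phyla), phylum_to_genus), get_Zc), mean)
nH2O_mean <- sapply(sapply(sapply(names(top50_phyla), phylum_to_genus), get_nH2O), mean)
domain <- taxnames$superkingdom[match(names(top50_phyla), taxnames$phylum)]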
pchs <- c(24, 21, 23)
pch <- sapply(domain, switch, Archaea = pchs[1], Bacteria = pchs[2], Viruses = pchs[3])
bgs <- topo.colors(3, alpha = 0.5)
bg <- sapply(domain, switch, Archaea = bgs[1], Bacteria = bgs[2], Viruses = bgs[3])
opar <- par(mar = c(4, 4, 1, 1))
plot(Zc_mean, nH2O_mean, xlab = chemlab("Zc"), ylab = chemlab("nH2O"), pch = pch, bg = bg)
ilow <- nH2O_mean < -0.77 & domain == "Bacteria"
xadj <- c(-0.9, -0.8, 0.8, 1, -0.8)
yadj <- c(0, 1, 1, -1, -1)
text(Zc_mean[ilow] + 0.02 * xadj, nH2O_mean[ilow] + 0.005 * yadj, names(top50_phyla[ilow]), cex = 0.9)
legend("bottomleft", c("Archaea", "Bacteria", "Viruses"), pch = pchs, pt.bg = bgs)
# Restore the previous graphical parameters
par(opar)
See take-home message #3.
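The plotting code above assigns a symbol to each point by applying `switch()` to its domain label. A minimal sketch of that mapping on made-up labels, next to an alternative that indexes a named vector, is shown here; `toy_domain` and the other variable names are hypothetical.

```r
# Made-up group labels for five points
toy_domain <- c("Archaea", "Bacteria", "Bacteria", "Viruses", "Archaea")

# One plotting symbol per group
pchs <- c(Archaea = 24, Bacteria = 21, Viruses = 23)

# Approach used above: apply switch() to each label
pch_switch <- sapply(toy_domain, switch, Archaea = 24, Bacteria = 21, Viruses = 23)

# Alternative: index a named vector by label
pch_lookup <- pchs[toy_domain]

unname(pch_switch)
unname(pch_lookup)
```

Both approaches give the same symbols here. They differ for a label that matches none of the groups: `switch()` returns `NULL`, so `sapply()` falls back to a list, whereas the named-vector lookup returns `NA`.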
Besides ZC and nH2O, the calc_metrics() function in canprot can calculate elemental ratios (H/C, N/C, O/C, and S/C), grand average of hydropathicity (GRAVY), isoelectric point (pI), average molecular weight of amino acid residues (MW), and protein length.
# classes still holds the Proteobacteria classes from the last loop iteration above
AAcomp <- taxon_AA[match(classes, taxon_AA$organism), ]
metrics <- canprot::calc_metrics(AAcomp, c("HC", "OC", "NC", "SC", "GRAVY", "pI", "MW", "plength"))
layout(rbind(c(1, 2, 5), c(3, 4, 5)), widths = c(2, 2, 1.5))
opar <- par(mar = c(4.5, 4, 1, 1), cex = 1)
plot(metrics$OC, metrics$HC, col = 1:10, pch = 1:10, xlab = "O/C", ylab = "H/C")
plot(metrics$NC, metrics$SC, col = 1:10, pch = 1:10, xlab = "N/C", ylab = "S/C")
plot(metrics$pI, metrics$GRAVY, col = 1:10, pch = 1:10, xlab = "pI", ylab = "GRAVY")
plot(metrics$plength, metrics$MW, col = 1:10, pch = 1:10, xlab = "Length", ylab = "MW")
plot.new()
legend("right", classes, col = 1:10, pch = 1:10, bty = "n", xpd = NA)
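As an example of how one of these metrics is defined, GRAVY is the mean of the Kyte-Doolittle hydropathy values over all residues in a sequence. The following is a minimal sketch computed from amino acid counts, not canprot's implementation; the helper name `toy_GRAVY` and the example composition are hypothetical.

```r
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
hydropathy <- c(
  A = 1.8, R = -4.5, N = -3.5, D = -3.5, C = 2.5,
  Q = -3.5, E = -3.5, G = -0.4, H = -3.2, I = 4.5,
  L = 3.8, K = -3.9, M = 1.9, F = 2.8, P = -1.6,
  S = -0.8, T = -0.7, W = -0.9, Y = -1.3, V = 4.2
)

# Hypothetical helper: GRAVY from a named vector of amino acid counts
toy_GRAVY <- function(counts) {
  sum(counts * hydropathy[names(counts)]) / sum(counts)
}

# Made-up composition: 2 Ala, 1 Leu, 1 Gly
gravy <- toy_GRAVY(c(A = 2, L = 1, G = 1))
gravy
```

Positive values indicate a composition that is hydrophobic on average and negative values a hydrophilic one.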
Respectively, these findings suggest genomic adaptation by Methanococci and Epsilonproteobacteria – now known as Campylobacterota – to reducing environments (which may be found in submarine hot springs and anoxic zones of sediments), by Gammaproteobacteria to lower water availability in certain habitats, and by viruses to lower water availability in their environment. A notable observation in this regard is that viruses without an envelope have lower water content than bacterial cells (Matthews, 1975).
In summary, chemical metrics provide insight into how environmental factors shape the amino acid and elemental composition of proteins.
Dick JM, Tan J. 2023. Chemical links between redox conditions and estimated community proteomes from 16S rRNA and reference protein sequences. Microbial Ecology 85(4): 1338–1355. doi: 10.1007/s00248-022-01988-9
Dick JM, Yu M, Tan J. 2020. Uncovering chemical signatures of salinity gradients through compositional analysis of protein sequences. Biogeosciences 17(23): 6145–6162. doi: 10.5194/bg-17-6145-2020
Matthews REF. 1975. A classification of virus groups based on the size of the particle in relation to genome size. Journal of General Virology 27(2): 135–149. doi: 10.1099/0022-1317-27-2-135