Metabolomic studies involve the measurement of spectra from biological samples collected by researchers from their study cohort and the analysis of the resulting data. Often the goal of such studies is to identify metabolites which exhibit different behaviour in samples belonging to different groups in the study cohort, so that they may act as identifiers of the particular feature being studied. A study must use a sample size large enough to achieve the statistical power needed to support its conclusions, and this sample size must be determined before the study begins. In many biological fields an optimal sample size is determined by performing a pilot study; however, this is not always possible for metabolomic studies due to factors such as experiment cost, availability of resources, and availability of subjects. MetSizeR (Nyamundanda et al. 2013) offers a tool to estimate the sample size needed for an experiment to achieve a desired statistical power, without the use of pilot data or historical data.
MetSizeR takes an analysis-informed approach, meaning that the sample size for an experiment is estimated based on the method of analysis the researcher plans to use. Two analysis methods are currently supported by MetSizeR: probabilistic principal components analysis (PPCA), originally developed by Tipping and Bishop (1999), and probabilistic principal components and covariates analysis (PPCCA), developed by Nyamundanda et al. (2010). The application uses simulated data in place of pilot data to perform its estimation. The data are simulated from the analysis model the researcher plans to use, and the sample size which yields the desired false discovery rate (FDR) for the study is returned to the user.
Full details of the MetSizeR algorithm can be found in Nyamundanda et al. (2013).
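The general idea of replacing pilot data with simulated data can be sketched in a few lines of R. The following is a deliberately simplified illustration and not the MetSizeR algorithm: it simulates independent normal bins rather than data from a PPCA or PPCCA model, declares the top-ranked bins significant using a plain two-sample t-statistic, and compares the median simulated false discovery proportion to the target. The function names, the effect size, and all default values are assumptions made for this example.

```r
# Simplified illustration only; this is NOT the MetSizeR algorithm.
# Simulate one two-group experiment and return the proportion of false
# discoveries among the bins declared significant.
simulate_fdp <- function(n_per_group, n_bins = 100, prop_sig = 0.2, effect = 1) {
  n_sig <- max(1, round(n_bins * prop_sig))
  truly_sig <- seq_len(n_bins) <= n_sig        # the first n_sig bins truly differ

  # Rows are samples, columns are bins (independent normal noise, an assumption)
  g1 <- matrix(rnorm(n_per_group * n_bins), nrow = n_per_group)
  g2 <- matrix(rnorm(n_per_group * n_bins), nrow = n_per_group)
  g2[, truly_sig] <- g2[, truly_sig] + effect

  t_stat <- vapply(seq_len(n_bins),
                   function(j) unname(abs(t.test(g1[, j], g2[, j])$statistic)),
                   numeric(1))

  # Declare the n_sig bins with the largest |t| significant
  called <- rank(-t_stat, ties.method = "first") <= n_sig
  mean(!truly_sig[called])
}

# Return the smallest per-group size whose median simulated FDP meets the target.
estimate_n <- function(target_fdr, sizes, n_sim = 20,
                       n_bins = 100, prop_sig = 0.2, effect = 1) {
  median_fdp <- vapply(sizes, function(n) {
    fdp <- vapply(seq_len(n_sim),
                  function(i) simulate_fdp(n, n_bins, prop_sig, effect),
                  numeric(1))
    median(fdp)
  }, numeric(1))
  meets <- median_fdp <= target_fdr
  n_hat <- if (any(meets)) sizes[which(meets)[1]] else NA
  list(sizes = sizes, median_fdp = median_fdp, n_per_group = n_hat)
}

set.seed(1)
res <- estimate_n(target_fdr = 0.05, sizes = seq(4, 40, by = 4))
res$n_per_group
```

In this toy setting the false discovery proportion tends to fall as the sample size grows, which is the behaviour the estimation relies on. The model-based simulation that MetSizeR actually performs is described in Nyamundanda et al. (2013).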
A Shiny application serves as the graphical user interface (GUI), to encourage use in the field without requiring experience with R.
The MetSizeR package provides an R Shiny application which allows users to estimate the sample size required for their study to achieve a desired statistical power. The tool is built to estimate the sample size required for both targeted and untargeted metabolomic experiments. Sample size estimation is performed without the need for pilot data; however, if pilot data are available then these can be uploaded to the application to aid the estimation. Several inputs are required from the user; it is expected that these inputs should be informed by the user’s own knowledge of their field. Additionally, where pilot data are present it is expected that these data have been sourced and treated appropriately in the context of the user’s research.
To install the MetSizeR package, the user should have the R software environment installed on their machine. To obtain R, visit the website for The R Project for Statistical Computing and follow the instructions to download and install.
When in R, the following command will allow the user to install the MetSizeR package:
install.packages("MetSizeR")
Package dependencies will also be installed if they are not already installed on the user’s computer. These consist of the following:
Alternatively, users who prefer to do so can download and install MetSizeR manually from its entry on The Comprehensive R Archive Network (CRAN).
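When setting up a script that may be run more than once, it can be convenient to install the package only if it is missing. A minimal sketch is given below; the helper name `ensure_installed` is hypothetical and is not part of MetSizeR, while `requireNamespace()` and `install.packages()` are standard R functions.

```r
# Hypothetical helper: install a package from CRAN only if it is not available.
# Returns (invisibly) TRUE if an installation was attempted, FALSE otherwise.
ensure_installed <- function(pkg, install = utils::install.packages) {
  missing_pkg <- !requireNamespace(pkg, quietly = TRUE)
  if (missing_pkg) install(pkg)
  invisible(missing_pkg)
}

# Usage (requires an internet connection if the package is missing):
# ensure_installed("MetSizeR")
```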
Once installed, the MetSizeR package can be loaded into the user’s current R session by the following command:
library(MetSizeR)
The MetSizeR R package is designed to perform all functionality through the associated R Shiny application. The Shiny application can be launched by running the following command:
MetSizeR()
The application will then launch, either in the user’s external browser, a pop-up window, or in the viewer pane of the user’s R IDE. This depends on how the user has configured their own settings.
Once the application has launched, the initial landing page is an About page which contains information about the package, its functionality, and references for the methods used within. Navigation through the application occurs through the navigation bar at the top of the page, with options About, Sample Size Estimation, and Vary Proportion of Significant Spectral Bins leading to the about page, the page on which to perform sample size estimation for a single experiment, and the page on which to estimate the sample size for varying proportions of significant bins or metabolites, respectively.
The core functionality of MetSizeR is the ability to estimate the sample size required for a metabolomic experiment to achieve a desired statistical power. MetSizeR can perform this estimation for targeted or untargeted experiments, whether or not pilot data are available. Sample size estimation takes place based on the intended method of analysis in the experiment; at present the tool is available for PPCA or PPCCA. To navigate to the sample size estimation tool, select Sample Size Estimation on the navigation bar at the top of the page.
MetSizeR can estimate the sample size required for a study to achieve a desired statistical power even when pilot data are unavailable. In this case, the tool can be used for both targeted and untargeted analysis. The user must specify several parameters for the algorithm to use, given as follows. These are specified by selecting or inputting options on the sidebar located on the left side of the page.
Are experimental pilot data available? checkbox.
Once the input values have been entered to the user’s specifications, the Estimate Optimal Sample Size button at the bottom of the sidebar must be clicked to start the MetSizeR process. Note that the process may take several minutes for larger numbers of bins. The user will know the algorithm is running as a notification will appear on the bottom right of the screen.
When results are ready they will appear on the right side of the screen. A plot of FDR versus sample size will appear, showing the 90th, 50th, and 10th percentiles of the FDR values calculated for each sample size tested. The estimated optimal sample size will be indicated by a blue vertical line, and also in text in the legend of the plot. The target FDR is given by a black dotted line on the plot.
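The percentile curves described above can be illustrated with a small sketch. Given a set of simulated FDR values for each candidate sample size, the 10th, 50th, and 90th percentiles are computed per sample size. The FDR values here are invented draws used only to make the example run, and the object and column names are assumptions rather than names used by the application.

```r
set.seed(42)
sizes <- c(10, 20, 30, 40)

# Invented FDR draws: 50 simulation runs per sample size (one column each),
# generated so that the values tend to shrink as the sample size grows.
fdr_draws <- sapply(sizes, function(n) rbeta(50, shape1 = 2, shape2 = n))
colnames(fdr_draws) <- sizes

# One row per sample size, one column per percentile
pct <- t(apply(fdr_draws, 2, quantile, probs = c(0.1, 0.5, 0.9)))

fdr_table <- data.frame(sample_size = sizes,
                        fdr_10 = pct[, 1],
                        fdr_50 = pct[, 2],
                        fdr_90 = pct[, 3],
                        row.names = NULL)
fdr_table
```

A table of this shape, with one row per sample size tested, corresponds to the three curves drawn on the plot.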
Below the plot, results will be displayed in text. The estimated optimal sample size will be printed, along with the per-group breakdown of the sample sizes, in the same ratio as input by the researcher. There is an option to download the plot, by clicking the Download Plot button. The plot will then download to the location of choice on the user’s computer. There is also an option to view the exact values from the points on the plot by clicking the Show values from plot? checkbox. This opens a table showing the values of the 90th, 50th, and 10th percentiles of the FDR values calculated alongside the relevant sample size. The user can download these data as a CSV file by clicking the Download Plot Data as .csv button, which will download the data to the location of the user’s choosing.
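As a small arithmetic illustration of a per-group breakdown, a total sample size can be divided between two groups according to a ratio. The helper below is hypothetical; the form in which the ratio is supplied and the rounding rule are assumptions, and the application may handle both differently.

```r
# Hypothetical helper: split a total sample size between two groups by a ratio.
split_by_ratio <- function(n_total, ratio = c(1, 1)) {
  n1 <- round(n_total * ratio[1] / sum(ratio))
  c(group1 = n1, group2 = n_total - n1)   # the two groups always sum to n_total
}

split_by_ratio(30, c(1, 2))   # group1 = 10, group2 = 20
```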
If the user wishes to change any of the inputs and re-run the algorithm, they can simply change the relevant inputs and click Estimate Optimal Sample Size once more, and the results will update. The user will know the algorithm is running as a notification will appear on the bottom right of the screen.
If pilot data are available then MetSizeR can use them to aid its estimation of the optimal sample size required for the study to achieve a desired level of power. To estimate the sample size in this manner, navigate to the Sample Size Estimation page on the navigation bar at the top of the page. This page contains a sidebar on the left side which allows the user to input specifications of their planned study as well as upload pilot data. The following specifications are required from the user:
Are experimental pilot data available? checkbox. This will open several options to the user. Firstly, the user must specify whether their data file contains a header as the first row; if so, click the Does data contain a header? checkbox. The option to upload data as a CSV file is then given. To select the data file, click Browse, navigate to the file’s location and select the file. A blue confirmation bar should appear under the file upload section, saying Upload complete, and the name of the file should appear in the box beside the Browse button.
Note that the number of spectral bins (untargeted analysis) or metabolites (targeted analysis) does not need to be specified by the user when pilot data are present as this is read from the uploaded file. When the input values are correctly specified, the Estimate Optimal Sample Size button at the bottom of the sidebar must be clicked to start the MetSizeR process. Note that the process may take several minutes for larger data. The user will know the algorithm is running as a notification will appear on the bottom right of the screen.
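The effect of the header setting, and how a number of bins can be read from a file, can be checked in R before uploading. The sketch below assumes a layout of one row per sample and one column per spectral bin; this layout, the file contents, and the column names are assumptions for illustration, and the layout expected by the application should be confirmed in the package documentation.

```r
# Assumed layout: one row per sample, one column per spectral bin.
set.seed(7)
pilot <- matrix(rnorm(6 * 5), nrow = 6,
                dimnames = list(NULL, paste0("bin_", 1:5)))

# Write an example CSV file with a header row to a temporary location
path <- tempfile(fileext = ".csv")
write.csv(pilot, path, row.names = FALSE)

# Read with the header setting matching the file: 6 samples, 5 bins
with_header <- read.csv(path, header = TRUE)
dim(with_header)

# Reading the same file without a header treats the column names as a data row
no_header <- read.csv(path, header = FALSE)
dim(no_header)
```

With the header setting matched to the file, the number of columns gives the number of bins under the assumed layout; a mismatched setting adds a spurious row of names to the data.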
When results are ready they will appear on the right side of the screen. A plot of FDR versus sample size will appear, showing the 90th, 50th, and 10th percentiles of the FDR values calculated for each sample size tested. The estimated optimal sample size will be indicated by a blue vertical line, and also in text in the legend of the plot. The target FDR is given by a black dotted line on the plot.
Below the plot, results will be displayed in text. The estimated optimal sample size will be printed, along with the per-group breakdown of the sample sizes, in the same ratio as input by the researcher. There is an option to download the plot, by clicking the Download Plot button. The plot will then download to the location of choice on the user’s computer. There is also an option to view the exact values from the points on the plot by clicking the Show values from plot? checkbox. This opens a table showing the values of the 90th, 50th, and 10th percentiles of the FDR values calculated alongside the relevant sample size. The user can download these data as a CSV file by clicking the Download Plot Data as .csv button, which will download the data to the location of the user’s choosing.
If the user wishes to change any of the inputs and re-run the algorithm, they can simply change the relevant inputs and click Estimate Optimal Sample Size once more, and the results will update. The user will know the algorithm is running as a notification will appear on the bottom right of the screen.
If the user wishes to test different proportions of significant bins to find the optimal sample size, they can navigate to the Vary Proportion of Significant Spectral Bins tab on the navigation bar at the top of the page. Here, up to four different significance proportions can be tested for the same experimental design.
First the user should specify whether they intend to perform targeted or untargeted analysis by selecting the applicable option at the top of the sidebar on the left of the page. There are then boxes on the sidebar where the user can enter up to four proportions. One proportion should be entered in each box. If fewer than four proportions are required, zero should be entered in the remaining boxes. The values can be input by either typing the proportion as a decimal, or using the arrows on the right of the boxes to change the values.
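The convention of filling unused boxes with zero, and what a proportion means in terms of bins, can be sketched as follows. The helper name and the rounding used to convert a proportion into a count of bins are assumptions for illustration and are not taken from the package.

```r
# Hypothetical helper: keep up to four proportions, padding unused slots with 0.
pad_proportions <- function(props, n_slots = 4) {
  stopifnot(length(props) <= n_slots, all(props >= 0 & props <= 1))
  c(props, rep(0, n_slots - length(props)))
}

props <- pad_proportions(c(0.1, 0.2))
props                                   # 0.1 0.2 0.0 0.0

# Number of significant bins implied by each non-zero proportion of 200 bins
n_bins <- 200
round(props[props > 0] * n_bins)        # 20 and 40
```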
The user must then also choose the number of spectral bins (untargeted analysis) or metabolites (targeted analysis) to test by typing the desired number into the relevant box or using the arrows. They must then select the desired model to use, either PPCA or PPCCA, by selecting the button for their chosen model. If PPCCA is selected, the number of numeric and categorical covariates must be specified, along with the number of levels of any categorical covariates. The target FDR must then be specified, by typing in the desired value or using the arrows to select. Once the inputs are entered to the user’s specifications, the Run MetSizeR for Varied Proportions button should be clicked to start the process. Note that this process can take some time, especially for larger numbers of bins. A notification will appear on the bottom right of the screen, indicating that the algorithm is running. This process will usually take longer than the single sample size calculation performed on the Sample Size Estimation tab, as there are more calculations to be performed.
When results are ready, one plot for each of the specified proportions will appear on the right side of the screen, along with a statement of their respective proportions. A download button is available for each plot which, when clicked, will allow the user to download the plot as a PNG file to the location of their choosing.
Nyamundanda, G., Brennan, L., & Gormley, I. C. (2010). Probabilistic principal component analysis for metabolomic data. BMC Bioinformatics, 11(1), 571. https://doi.org/10.1186/1471-2105-11-571
Nyamundanda, G., Gormley, I. C., Fan, Y., Gallagher, W. M., & Brennan, L. (2013). MetSizeR: Selecting the optimal sample size for metabolomic studies using an analysis based approach. BMC Bioinformatics, 14(1), 338. https://doi.org/10.1186/1471-2105-14-338
Tipping, M. E., & Bishop, C. M. (1999). Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611–622. https://doi.org/10.1111/1467-9868.00196