Running Haplin on cluster

Julia Romanowska

2024-08-20

General remarks on running analyses on a cluster

NB: Running on a cluster usually requires some scripting and coding skills. However, with the VPN graphical connections, it is becoming easier for non-programmers to run any software. Below, we provide some example scripts that can usually be copied and used with small modifications on many clusters. If in doubt, check with your administrator and/or write to us!

Extra requirements

To run Haplin on a cluster, you will need an MPI implementation and the Rmpi package, both installed manually before installing the Haplin package. How to install extra R packages varies from cluster to cluster, so check the manual!
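Before submitting anything, it can help to check that these requirements are visible in your session. The following is a minimal sketch, not part of Haplin: the helper names have_cmd and have_r_package are made up for this example, and it assumes that R, Rscript and mpiexec end up on the PATH after the relevant modules are loaded. The R package check only looks for an installed package and does not load it, so MPI is not initialised on the login node.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the named command is on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: succeed if the named R package is installed.
# system.file() returns an empty string for a package that is not installed.
have_r_package() {
    have_cmd Rscript || return 1
    Rscript -e "quit(status = as.integer(!nzchar(system.file(package = '$1'))))" \
        >/dev/null 2>&1
}

# Report which of the required programs are available.
for cmd in R mpiexec; do
    if have_cmd "$cmd"; then
        echo "found:   $cmd"
    else
        echo "missing: $cmd (try 'module load' or ask your administrator)"
    fi
done

# Report which of the required R packages are installed.
for pkg in Rmpi Haplin; do
    if have_r_package "$pkg"; then
        echo "found:   R package $pkg"
    else
        echo "missing: R package $pkg"
    fi
done
```

If anything is reported as missing, load the corresponding module or install the package first, following the instructions for your cluster.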

Job submission

To run a job on a cluster, usually one needs to submit a script to a job queue. The submission method varies depending on the queue system used, so check the help pages of your cluster. Here, we present the quite popular SLURM queueing system.

Below is an example script that sets up a SLURM job:

#!/bin/bash

#SBATCH --job-name=haplin_cluster_run
#SBATCH --output=haplin_cluster_run.out
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=8
#SBATCH --time=8:00:00
#SBATCH --mem-per-cpu=100
#SBATCH --mail-user=user@domain.com
#SBATCH --mail-type=ALL

module load R
module load openmpi

echo "nodes: $SLURM_JOB_NODELIST"
myhostfile="cur_nodes.dat"

echo "----STARTING THE JOB----"
date
echo "------------------------"

mpiexec --hostfile $myhostfile -n 1 R --save < haplin_cluster_run.r >& mpi_run.out

exit_status=$?
echo "----JOB EXITED WITH STATUS---: $exit_status"
echo "----DONE----"
exit $exit_status

Here, the important part is the mpiexec line, where the R session is started so that it can run in parallel on several cores. To achieve this with the Rmpi package, one needs to provide a list of the cores currently available to the user, which is done through the --hostfile $myhostfile part. This means that the given file should hold that list. If it is not created automatically on the cluster, one can extract it from the $SLURM_JOB_NODELIST variable (see the submit_haplin_cluster_rmpi.sh script in this folder).
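As an illustration of how such a file could be built, here is a minimal sketch. It is not the submit_haplin_cluster_rmpi.sh script, and it rests on several assumptions: that scontrol is available inside the job, that the MPI implementation is Open MPI (whose hostfiles use the "host slots=N" format), and that every node offers the same number of slots. The function name write_hostfile is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: read one host name per line on stdin and print
# an Open MPI style hostfile with the same number of slots for each host.
write_hostfile() {
    local slots="$1"
    local host
    while IFS= read -r host; do
        if [ -n "$host" ]; then
            echo "$host slots=$slots"
        fi
    done
}

myhostfile="cur_nodes.dat"

# Inside a SLURM job, expand the compact node list (e.g. "node[01-03]")
# into one host name per line and write the hostfile.
if command -v scontrol >/dev/null 2>&1 && [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    scontrol show hostnames "$SLURM_JOB_NODELIST" \
        | write_hostfile "${SLURM_NTASKS_PER_NODE:-1}" > "$myhostfile"
    echo "wrote $myhostfile:"
    cat "$myhostfile"
fi
```

In the job script above, these lines would go before the mpiexec call. Other MPI implementations use a different hostfile syntax, so check the documentation of the one installed on your cluster.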

For a more detailed explanation of the #SBATCH commands, see e.g., the official documentation.
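Submitting the job script and checking on it could look like the sketch below. The file name submit_haplin.sh is only an assumption for this example; use whatever name you saved the script under. The commands sbatch and squeue are standard SLURM tools, but the options enabled on your cluster may differ.

```shell
#!/bin/bash
# Hypothetical file name for the job script shown above.
job_script="submit_haplin.sh"

if command -v sbatch >/dev/null 2>&1 && [ -f "$job_script" ]; then
    # --parsable makes sbatch print only the job ID
    # (followed by ";cluster" on multi-cluster setups, removed by cut).
    job_id=$(sbatch --parsable "$job_script" | cut -d';' -f1)
    echo "submitted job $job_id"
    # Show the state of this job in the queue.
    squeue -j "$job_id"
else
    echo "sbatch or $job_script not found; nothing submitted"
fi
```

A job that is queued or running can be cancelled with scancel followed by the job ID.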

Running parallel Haplin analysis on a cluster

The most effective way of using Haplin on a cluster is to run haplinSlide on a large GWAS dataset. The data preparation and the call to haplinSlide are the same as for a single run; see the section above. However, before calling any parallel function, one needs to set up the cluster with the function:

initParallelRun()

This will make use of the maximum number of available cores. If one wants to limit the run to a specific number of CPUs, the cpus argument needs to be specified.

Then, when invoking the analysis, one needs to specify that the Rmpi package will be used:

haplinSlide( trial.data2.prep, use.missing = TRUE, ccvar = 2, design =
  "cc.triad", reference = "ref.cat", response = "mult", para.env = "Rmpi" )

Finally, right before the script finishes, we need to close all the threads created by initParallelRun:

finishParallelRun()

CAUTION: If the user forgets to call this function before exiting R, all the work will still be saved; however, mpirun will end with an error.

To sum up, an example R script to run on a cluster would look like this:

library( Haplin )

initParallelRun()

chosen.markers <- 3:55

data.in <- genDataLoad( filename = "mynicedata" )
# analysis without maternal risks calculated
results1 <- haplinSlide( data = data.in, markers = chosen.markers, winlength = 2, 
    design = "triad", use.missing = TRUE, maternal = FALSE, response = "free",
    cpus = 2, verbose = FALSE, printout = FALSE, para.env = "Rmpi" )

# analysis with maternal risks calculated
results2 <- haplinSlide( data = data.in, markers = chosen.markers, winlength = 2, 
    design = "triad", use.missing = TRUE, maternal = TRUE, response = "mult",
    cpus = 2, verbose = FALSE, printout = FALSE, para.env = "Rmpi" )

finishParallelRun()

IMPORTANT: To run in parallel, both the cpus and para.env arguments must be specified. However, the actual number of CPUs used is set within initParallelRun and not by the cpus argument.