Gaëlle Lefort
Alyssa Imbert
Julien Henry
Nathalie Vialaneix
January 18, 2024
DO NOT run computations on the front-end (login) servers; always use sbatch or srun.
Before contacting support, READ THE FAQ: http://bioinfo.genotoul.fr/index.php/faq/
This tutorial describes how to run R scripts and compile RMarkdown files on the Toulouse Bioinformatics cluster. To do so, you need an account (ask for one at http://bioinfo.genotoul.fr/index.php/ask-for/create-an-account/). You can then connect to the cluster using the ssh command on Linux and macOS, or MobaXterm on Windows. Similarly, you can copy files between the cluster and your computer using the scp command on Linux and macOS, or OpenSSH on Windows. The login address is genobioinfo.toulouse.inrae.fr. Once connected, you have two ways to run a script: batch mode or an interactive session. The script must never be run on the first server you connect to. Also, be aware that the programs available on the cluster cannot be used until you have loaded the corresponding module; how to manage modules is explained in Section 1.
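For instance, on Linux or macOS (mylogin is a placeholder for your user name):

```bash
# connect to the cluster
ssh mylogin@genobioinfo.toulouse.inrae.fr

# copy a script from your computer to your home directory on the cluster
scp HelloWorld.R mylogin@genobioinfo.toulouse.inrae.fr:~/
```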
All programs are made available by loading the corresponding modules. These are the main useful commands to work with modules:
module avail: list all available modules
search_module [TEXT]: search for a module by keyword
module load [MODULE_NAME]: load a module (for instance, to load R: module load statistics/R/4.3.0). This command is either used directly (in interactive mode) or included in the file used to run your R script in batch mode (see below)
module purge: unload all previously loaded modules
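For instance, to find and load an R module in an interactive session:

```bash
search_module R                  # list modules matching "R"
module load statistics/R/4.3.0   # load the chosen version
R --version                      # check that R is now available
module purge                     # unload everything when done
```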
To launch an R script on the Slurm cluster:
Write an R script:
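For instance, HelloWorld.R:

```r
print("Hello world!")
```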
Write a bash script:
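A minimal sketch of myscript.sh, assuming the R module mentioned in Section 1:

```bash
#!/bin/bash
# myscript.sh -- submission script for HelloWorld.R

module purge
module load statistics/R/4.3.0   # make R available on the node

Rscript HelloWorld.R
```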
Launch the script with the sbatch command:
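From the directory containing both files:

```bash
sbatch myscript.sh
```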
The scripts myscript.sh and HelloWorld.R are assumed to be located in the directory from which the sbatch command is launched. For Rmd files (see Section 7), note that you cannot compile a document if the Rmd file is not in a writable directory.
Jobs can be launched with customized options (more memory, for instance). There are two ways to handle sbatch options:
[RECOMMENDED] at the beginning of the bash script with lines of the form: #SBATCH [OPTION] [VALUE]
in the sbatch command: sbatch [OPTION] [VALUE] [OPTION] [VALUE] […] myscript.sh
Many options are available. To see all of them, use sbatch --help. Useful options:
-J, --job-name=jobname: name of the job
-e, --error=err: file for the batch script's standard error
-o, --output=out: file for the batch script's standard output
--mail-type=BEGIN,END,FAIL: send an email when the job begins, ends, or fails (the default address is your user email and can be changed with --mail-user=truc@bidule.fr; use with care)
-t, --time=HH:MM:SS: time limit (defaults to 04:00:00)
--mem=XG: memory reservation (defaults to 4G)
-c, --cpus-per-task=ncpus: number of CPUs required per task (defaults to 1)
--mem-per-cpu=XG: maximum amount of real memory per allocated CPU required by the job
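For example, a script header combining several of these options (values are illustrative):

```bash
#!/bin/bash
#SBATCH -J myjob                 # job name
#SBATCH -o myjob.out             # standard output file
#SBATCH -e myjob.err             # standard error file
#SBATCH -t 08:00:00              # time limit
#SBATCH --mem=8G                 # memory reservation
#SBATCH -c 2                     # CPUs per task
#SBATCH --mail-type=END,FAIL     # email at the end or on failure

module load statistics/R/4.3.0
Rscript HelloWorld.R
```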
After a job has been launched, you can monitor it with squeue -u [USERNAME] or squeue -j [JOB_ID] and also cancel it with scancel [JOB_ID].
To use R in console mode, use srun --pty bash to get connected to a node. Then run module load statistics/R/4.3.0 (for the latest R version) and R to launch R.
srun can be run with the same options as sbatch (CPU and memory reservations, see Section 2.1).
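For instance (resource values are illustrative):

```bash
srun --mem=8G --cpus-per-task=2 --pty bash   # open a shell on a compute node with custom reservations
module load statistics/R/4.3.0
R                                            # start an interactive R session
```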
X11 sessions are useful to display plots directly during an interactive session. Before using them, generate an SSH key with ssh-keygen and add it to the authorized keys file with: cat .ssh/id_rsa.pub >> .ssh/authorized_keys
The interactive session is then launched by:
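A hedged sketch (--x11 is the standard Slurm flag for X11 forwarding; the exact options may differ on the cluster):

```bash
# connect with X forwarding enabled, then request an X11-enabled interactive shell
ssh -X mylogin@genobioinfo.toulouse.inrae.fr
srun --x11 --pty bash
```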
To use R with a parallel environment, the -c (or --cpus-per-task) option of sbatch or srun is needed. In the R script, the number of cores must be set to the SAME value.
Several packages can be used to parallelize computations in R: doParallel, BiocParallel, and future (examples are sketched below for 2 parallel workers and the first two packages).
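A minimal sketch for the first two packages, with 2 workers and a toy computation (the worker count must match the -c reservation):

```r
## doParallel
library(doParallel)
cl <- makeCluster(2)                   # 2 workers, matching "-c 2"
registerDoParallel(cl)
res1 <- foreach(i = 1:4, .combine = c) %dopar% sqrt(i)
stopCluster(cl)

## BiocParallel
library(BiocParallel)
param <- MulticoreParam(workers = 2)   # again 2 workers
res2 <- unlist(bplapply(1:4, sqrt, BPPARAM = param))
```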
External arguments can be passed to an R script. The basic method, relying on commandArgs(), is sketched below; the optparse package provides more elaborate ways to handle external arguments, à la Python.
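A minimal sketch of the basic method, based on commandArgs() (script and argument names are illustrative):

```r
# args.R -- retrieve the arguments given after the script name
args <- commandArgs(trailingOnly = TRUE)
input_file <- args[1]
n_iter <- as.integer(args[2])
cat("Input file:", input_file, "- iterations:", n_iter, "\n")
```

The script would then be called with, e.g., Rscript args.R mydata.csv 100, either in an interactive session or inside the bash script submitted with sbatch.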
Once in an interactive R session, R packages are installed (in a personal library) using the standard install.packages() command.
Your personal library is usually located at the root of your home directory, whose allocated space is very limited. A simple solution consists in relocating this library to a storage space with more room, for instance:
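A hedged sketch, assuming a larger work space mounted at ~/work (a hypothetical path; adapt it to the cluster layout):

```bash
# move the existing personal library to the work space and replace it with a symbolic link
mv ~/R ~/work/R
ln -s ~/work/R ~/R
```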
As a note, a few R packages are already installed with each R version. For example, the package dplyr is already installed with the module statistics/R/4.3.0. You can check whether a package is pre-installed with the command search_R_package [PACKAGE_NAME]. If you would like an additional package to be pre-installed in a given R version, you can request it here: https://bioinfo.genotoul.fr/index.php/ask-for/install-soft/.
To compile a .Rmd file, two packages are needed: rmarkdown and knitr. You also need to load the module tools/Pandoc/3.1.2.
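A minimal sketch of a batch script compiling a report (the file name report.Rmd is illustrative; the file must sit in a writable directory):

```bash
#!/bin/bash
# load R and pandoc before rendering
module load statistics/R/4.3.0
module load tools/Pandoc/3.1.2

# render report.Rmd (the output format is taken from its YAML header)
Rscript -e 'rmarkdown::render("report.Rmd")'
```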
As for an R script, you can pass external arguments to a .Rmd document.
---
title: My Document
output: html_document
params:
text: "Hi!"
---
What is your text?
```{r}
print(params$text)
```
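To compile the document with a custom parameter value, the params argument of rmarkdown::render() can be used, for instance (report.Rmd is an illustrative file name):

```bash
Rscript -e 'rmarkdown::render("report.Rmd", params = list(text = "Hello from the cluster!"))'
```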
conda is available as a module on the cluster and gives access to additional R versions. The following script creates a conda environment with R 4.2.0.
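A hedged sketch of such a script (the conda module name, channel, and environment name are assumptions; check search_module conda for the exact module):

```bash
#!/bin/bash
# load the conda module (hypothetical name; adapt to the cluster)
module load Miniconda3

# create an environment providing R 4.2.0 from the conda-forge channel
conda create -y -n R-4.2.0 -c conda-forge r-base=4.2.0

# activate it (depending on the setup, 'conda init bash' may be needed once beforehand)
conda activate R-4.2.0
R --version
```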