Welcome to the fourth day of the FAS Informatics Bioinformatics Tips and Tricks Workshop!
If you’re viewing this file on the website, you are viewing the final, formatted version of the workshop. The workshop itself will take place in the RStudio program and you will edit and execute the code in this file. Please download the raw file here
Today you’ll learn how to put the concepts learned on day 3 into practice by installing the required software with conda and creating SLURM job scripts to run jobs on FAS RC cluster compute nodes.
Conda is an open-source, cross-platform (Linux, macOS, Windows) command-line software utility that provides:
“Package, dependency and environment management for any language–Python, R, Ruby, Lua, Scala, Java, JavaScript, C/ C++, FORTRAN, and more.” https://docs.conda.io/
Conda allows an unprivileged user on a High-Performance Computing (HPC) cluster (e.g., the FAS RC Cannon cluster) to install software in any file system directory they have write access to.
In this context, a package is a general term for a particular program that is maintained and hosted at a particular location. For instance, samtools, which we learned about on Day 1, is a package on bioconda. So are bedtools and bcftools. The term package is one we’ll use a lot today.
For Pythonistas: Conda resembles pip + venv, but (1) only installs binary packages (doesn’t compile from source code), and (2) can be used to install software written in programming languages other than Python.
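To make the analogy concrete, here is a rough side-by-side sketch (the environment names below are made up, and the conda/mamba commands are covered in detail in the sections that follow):

```shell
# pip + venv: Python-only isolation (runnable with a stock python3)
python3 -m venv --without-pip ./pyenv   # create an isolated Python environment
. ./pyenv/bin/activate                  # activate it
deactivate                              # leave it again

# conda/mamba equivalent: environments hold pre-built binaries in any language
#   mamba create -n myenv
#   mamba activate myenv
#   mamba install samtools   # a C program; pip alone could not install this
```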
Scientific Python distribution that bundles conda + over 250 software libraries/packages
Popular choice for the “desktop data scientist”
Anaconda Navigator - GUI for installing & launching conda packages
“Miniconda is a free minimal installer for conda. It is a small, bootstrap version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages, including pip, zlib and a few others.”
Drop-in replacement for conda for faster package installation
Replace the conda command with mamba; subcommands/options are about the same
Installed packages/environments are compatible with conda
Can be installed using conda (conda install mamba), or with a standalone installer (Mambaforge)
For technical expedience, for this workshop we’ll be using a minimal, single-executable (no Python distribution) version of mamba called micromamba. Micromamba only supports a subset of conda/mamba subcommands & options, but enough for this workshop. Moreover, conda environments (covered shortly) created by micromamba are compatible with/usable by conda and mamba.
To install micromamba for this session, and for typing convenience, we will define a mamba command (in this case, a shell function) that calls micromamba.
Execute the following code chunk (which is very specific to this environment—don’t worry too much about trying to understand exactly what it does):
if ! command -v micromamba
then
curl -qL https://micro.mamba.pm/api/micromamba/linux-64/1.4.1 | tar -C /usr -xj bin/micromamba
mv /etc/profile /etc/profile.orig
cp /etc/profile.orig /etc/profile
mkdir -p /etc/conda
echo "repodata_use_zst: true" > /etc/conda/.condarc
touch ~/.condarc
echo 'eval "$(micromamba shell hook --shell=bash)"; mamba() { micromamba "$@"; }; export MAMBA_ROOT_PREFIX=/tmp/mamba' >> /etc/profile
fi
## /usr/bin/micromamba
Subsequent commands illustrated in this tutorial are largely compatible with conda as well: just replace the mamba command with conda.
Invoke mamba with the -h or --help option to display a list of subcommands, as well as global options that apply to all subcommands.
Run the code block below to see the help menu for mamba:
mamba --help
# mamba: A variant on the conda package manager
# --help: This option tells mamba to display a help menu
## Version: 1.4.1
##
## Usage: /usr/bin/micromamba [OPTIONS] [SUBCOMMAND]
##
## Options:
## -h,--help Print this help message and exit
## --version
##
##
## Configuration options:
## --rc-file TEXT ... Paths to the configuration files to use
## --no-rc Disable the use of configuration files
## --no-env Disable the use of environment variables
##
##
## Global options:
## -v,--verbose Set verbosity (higher verbosity with multiple -v, e.g. -vvv)
## --log-level ENUM:value in {critical->5,debug->1,error->4,info->2,off->6,trace->0,warning->3} OR {5,1,4,2,6,0,3}
## Set the log level
## -q,--quiet Set quiet mode (print less output)
## -y,--yes Automatically answer yes on prompted questions
## --json Report all output as json
## --offline Force use cached repodata
## --dry-run Only display what would have been done
## --download-only Only download and extract packages, do not link them into environment.
## --experimental Enable experimental features
##
##
## Prefix options:
## -r,--root-prefix TEXT Path to the root prefix
## -p,--prefix TEXT Path to the target prefix
## --relocate-prefix TEXT Path to the relocation prefix
## -n,--name TEXT Name of the target prefix
##
## Subcommands:
## shell Generate shell init scripts
## create Create new environment
## install Install packages in active environment
## update Update packages in active environment
## self-update Update micromamba
## repoquery Find and analyze packages in active environment or channels
## remove Remove packages from active environment
## list List packages in active environment
## package Extract a package or bundle files into an archive
## clean Clean package cache
## config Configuration of micromamba
## info Information about micromamba
## constructor Commands to support using micromamba in constructor
## env List environments
## activate Activate an environment
## run Run an executable in an environment
## ps Show, inspect or kill running processes
## auth Login or logout of a given host
## search Find packages in active environment or channels
To display the usage for an individual subcommand, add the -h or --help option.
Run the code block below to display the help menu for the mamba env command:
mamba env -h
# mamba: A variant on the conda package manager
# env: The sub-command of mamba we want to run
# -h: This option tells mamba env to display a help menu
## List environments
## Usage: /usr/bin/micromamba env [OPTIONS] [SUBCOMMAND]
##
## Options:
## -h,--help Print this help message and exit
##
##
## Configuration options:
## --rc-file TEXT ... Paths to the configuration files to use
## --no-rc Disable the use of configuration files
## --no-env Disable the use of environment variables
##
##
## Global options:
## -v,--verbose Set verbosity (higher verbosity with multiple -v, e.g. -vvv)
## --log-level ENUM:value in {critical->5,debug->1,error->4,info->2,off->6,trace->0,warning->3} OR {5,1,4,2,6,0,3}
## Set the log level
## -q,--quiet Set quiet mode (print less output)
## -y,--yes Automatically answer yes on prompted questions
## --json Report all output as json
## --offline Force use cached repodata
## --dry-run Only display what would have been done
## --download-only Only download and extract packages, do not link them into environment.
## --experimental Enable experimental features
##
##
## Prefix options:
## -r,--root-prefix TEXT Path to the root prefix
## -p,--prefix TEXT Path to the target prefix
## --relocate-prefix TEXT Path to the relocation prefix
## -n,--name TEXT Name of the target prefix
##
## Subcommands:
## list List known environments
## create Create new environment (pre-commit.com compatibility alias for 'micromamba create')
## export Export environment
## remove Remove an environment
A conda channel is the URL of a directory that contains a set of conda packages. A few popular channels we’ll use today include:
defaults - “meta-channel” maintained by Anaconda, Inc. (the company behind the Anaconda Distribution)
conda-forge - community-curated set of high-quality conda packages
bioconda - channel focused on bioinformatics; thousands of packages available
The software tools we will install and use in subsequent sections are available in the bioconda channel. bioconda packages may have dependencies on packages in the conda-forge and defaults channels. We can specify channels to use as command-line arguments to conda operations (where applicable).
Run the code block below to search a couple channels for a package called bedtools:
mamba search -c conda-forge -c bioconda bedtools
# mamba: A variant on the conda package manager
# search: The sub-command of mamba we want to run
# -c: This option tells mamba search to search this channel for the provided package name; multiple -c options can be provided
## Getting repodata from channels...
##
##
##
## Name Version Build Channel
## ────────────────────────────────────────────────
## bedtools 2.30.0 hc088bd4_0 bioconda/linux-64
## bedtools 2.30.0 h7d7f7ad_2 bioconda/linux-64
## bedtools 2.30.0 h7d7f7ad_1 bioconda/linux-64
## bedtools 2.30.0 h468198e_3 bioconda/linux-64
## bedtools 2.29.2 hc088bd4_0 bioconda/linux-64
## bedtools 2.29.1 hc088bd4_1 bioconda/linux-64
## bedtools 2.29.1 hc088bd4_0 bioconda/linux-64
## bedtools 2.29.0 hc088bd4_3 bioconda/linux-64
## bedtools 2.29.0 hc088bd4_2 bioconda/linux-64
## bedtools 2.29.0 h6ed99ea_1 bioconda/linux-64
## bedtools 2.29.0 h0da2602_0 bioconda/linux-64
## bedtools 2.28.0 hdf88d34_0 bioconda/linux-64
## bedtools 2.27.1 he941832_2 bioconda/linux-64
## bedtools 2.27.1 he860b03_3 bioconda/linux-64
## bedtools 2.27.1 he513fc3_4 bioconda/linux-64
## bedtools 2.27.1 hd03093a_6 bioconda/linux-64
## bedtools 2.27.1 h9a82719_5 bioconda/linux-64
## bedtools 2.27.1 1 bioconda/linux-64
## bedtools 2.27.1 0 bioconda/linux-64
## bedtools 2.27.0 he941832_2 bioconda/linux-64
## bedtools 2.27.0 he860b03_3 bioconda/linux-64
## bedtools 2.27.0 he513fc3_4 bioconda/linux-64
## bedtools 2.27.0 1 bioconda/linux-64
## bedtools 2.27.0 0 bioconda/linux-64
## bedtools 2.26.0 0 bioconda/linux-64
## bedtools 2.26.0gx 0 bioconda/linux-64
## bedtools 2.26.0gx 1 bioconda/linux-64
## bedtools 2.26.0gx he513fc3_4 bioconda/linux-64
## bedtools 2.26.0gx he860b03_3 bioconda/linux-64
## bedtools 2.26.0gx he941832_2 bioconda/linux-64
## bedtools 2.25.0 3 bioconda/linux-64
## bedtools 2.25.0 he860b03_5 bioconda/linux-64
## bedtools 2.25.0 he941832_4 bioconda/linux-64
## bedtools 2.25.0 1 bioconda/linux-64
## bedtools 2.25.0 0 bioconda/linux-64
## bedtools 2.25.0 2 bioconda/linux-64
## bedtools 2.24.0 0 bioconda/linux-64
## bedtools 2.23.0 h5b5514e_6 bioconda/linux-64
## bedtools 2.23.0 0 bioconda/linux-64
## bedtools 2.23.0 h2e03b76_5 bioconda/linux-64
## bedtools 2.23.0 h8b12597_4 bioconda/linux-64
## bedtools 2.23.0 he860b03_2 bioconda/linux-64
## bedtools 2.23.0 he941832_1 bioconda/linux-64
## bedtools 2.23.0 hdbcaa40_3 bioconda/linux-64
## bedtools 2.22 h2e03b76_5 bioconda/linux-64
## bedtools 2.22 0 bioconda/linux-64
## bedtools 2.22 h5b5514e_6 bioconda/linux-64
## bedtools 2.22 h8b12597_4 bioconda/linux-64
## bedtools 2.22 hdbcaa40_3 bioconda/linux-64
## bedtools 2.22 he860b03_2 bioconda/linux-64
## bedtools 2.22 he941832_1 bioconda/linux-64
## bedtools 2.20.1 he941832_1 bioconda/linux-64
## bedtools 2.20.1 he860b03_2 bioconda/linux-64
## bedtools 2.20.1 0 bioconda/linux-64
## bedtools 2.19.1 he941832_1 bioconda/linux-64
## bedtools 2.19.1 he860b03_2 bioconda/linux-64
## bedtools 2.19.1 0 bioconda/linux-64
## bedtools 2.17.0 0 bioconda/linux-64
## bedtools 2.16.2 0 bioconda/linux-64
It can be convenient to configure a default list of channels so the channel list doesn’t need to be explicitly specified for mamba commands.
The following channel setup (adapted from the bioconda documentation) updates the user conda/mamba configuration file (~/.condarc).
Run the code block below to add the specified channels to your ~/.condarc file:
mamba config remove-key channels # reset channels in ~/.condarc (if set)
mamba config append channels conda-forge
mamba config append channels bioconda
mamba config append channels defaults
mamba config set channel_priority strict
# mamba: A variant on the conda package manager
# config: The sub-command of mamba we want to run
These commands write (or update) our ~/.condarc file.
Run the code block below to view the contents of your ~/.condarc file:
cat ~/.condarc
# cat: A Unix command to display the contents of a file to the screen
## channel_priority: strict
## channels:
## - conda-forge
## - bioconda
## - defaults
Some conda packages exist in all three channels. Strict channel priority (channel_priority: strict) tells mamba (or conda) to search for specified packages in higher-priority channels first and, if found, ignore packages with the same name that exist in lower-priority channels. In this example, we specify that packages should first be searched for in conda-forge, then (if not found) in bioconda, and finally in the defaults channel.
We can verify the list of channels the mamba command uses.
Run the code block below to display some information about mamba:
mamba info
# mamba: A variant on the conda package manager
# info: The sub-command of mamba we want to run
##
## __
## __ ______ ___ ____ _____ ___ / /_ ____ _
## / / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/
## / /_/ / / / / / / /_/ / / / / / / /_/ / /_/ /
## / .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/
## /_/
##
##
## environment : None (not found)
## env location : -
## user config files : /n/home/user/.mambarc
## populated config files : /n/home/user/.condarc
## /etc/conda/.condarc
## libmamba version : 1.4.1
## micromamba version : 1.4.1
## curl version : libcurl/7.88.1 OpenSSL/3.1.0 zlib/1.2.13 zstd/1.5.2 libssh2/1.10.0 nghttp2/1.52.0
## libarchive version : libarchive 3.6.2 zlib/1.2.13 bz2lib/1.0.8 libzstd/1.5.2
## virtual packages : __unix=0=0
## __linux=3.10.0=0
## __glibc=2.35=0
## __archspec=1=x86_64
## channels : https://conda.anaconda.org/conda-forge/linux-64
## https://conda.anaconda.org/conda-forge/noarch
## https://conda.anaconda.org/bioconda/linux-64
## https://conda.anaconda.org/bioconda/noarch
## https://repo.anaconda.com/pkgs/main/linux-64
## https://repo.anaconda.com/pkgs/main/noarch
## https://repo.anaconda.com/pkgs/r/linux-64
## https://repo.anaconda.com/pkgs/r/noarch
## base environment : /tmp/mamba
## platform : linux-64
Now that we have our channels set up, we can begin installing packages (aka software). First, though, we need to find the packages we want, and we can do this in a few ways, one of which is directly from the command line with mamba search.
Run the code block below to use mamba search to search for packages by name, in this case bcftools (exact match):
mamba search bcftools
# mamba: A variant on the conda package manager
# search: The sub-command of mamba we want to run
## Getting repodata from channels...
##
## conda-forge/linux-64 Using cache
## conda-forge/noarch Using cache
## bioconda/linux-64 Using cache
## bioconda/noarch Using cache
## pkgs/main/linux-64 Using cache
## pkgs/main/noarch Using cache
## pkgs/r/linux-64 Using cache
## pkgs/r/noarch Using cache
##
##
## Name Version Build Channel
## ───────────────────────────────────────────────
## bcftools 1.16 hfe4b78e_1 bioconda/linux-64
## bcftools 1.16 hfe4b78e_0 bioconda/linux-64
## bcftools 1.16 haef29d1_2 bioconda/linux-64
## bcftools 1.15.1 h0ea216a_0 bioconda/linux-64
## bcftools 1.15.1 hfe4b78e_1 bioconda/linux-64
## bcftools 1.15 h0ea216a_1 bioconda/linux-64
## bcftools 1.15 h0ea216a_2 bioconda/linux-64
## bcftools 1.15 haf5b3da_0 bioconda/linux-64
## bcftools 1.14 h88f3f91_0 bioconda/linux-64
## bcftools 1.14 hde04aa1_1 bioconda/linux-64
## bcftools 1.13 h3a49de5_0 bioconda/linux-64
## bcftools 1.12 h3f113a9_0 bioconda/linux-64
## bcftools 1.12 h45bccc9_1 bioconda/linux-64
## bcftools 1.11 h7c999a4_0 bioconda/linux-64
## bcftools 1.10.2 h4f4756c_2 bioconda/linux-64
## bcftools 1.10.2 h4f4756c_3 bioconda/linux-64
## bcftools 1.10.2 hd2cd319_0 bioconda/linux-64
## bcftools 1.10.2 h4f4756c_1 bioconda/linux-64
## bcftools 1.10.1 hd2cd319_0 bioconda/linux-64
## bcftools 1.10 h5d15f04_0 bioconda/linux-64
## bcftools 1.9 ha228f0b_4 bioconda/linux-64
## bcftools 1.9 ha228f0b_3 bioconda/linux-64
## bcftools 1.9 h68d8f2e_9 bioconda/linux-64
## bcftools 1.9 h68d8f2e_8 bioconda/linux-64
## bcftools 1.9 h68d8f2e_7 bioconda/linux-64
## bcftools 1.9 h5c2b69b_6 bioconda/linux-64
## bcftools 1.9 h5c2b69b_5 bioconda/linux-64
## bcftools 1.9 h47928c2_2 bioconda/linux-64
## bcftools 1.9 h47928c2_1 bioconda/linux-64
## bcftools 1.8 2 bioconda/linux-64
## bcftools 1.8 h4da6232_3 bioconda/linux-64
## bcftools 1.8 1 bioconda/linux-64
## bcftools 1.8 0 bioconda/linux-64
## bcftools 1.7 0 bioconda/linux-64
## bcftools 1.6 1 bioconda/linux-64
## bcftools 1.6 0 bioconda/linux-64
## bcftools 1.5 h1ff2904_4 bioconda/linux-64
## bcftools 1.5 3 bioconda/linux-64
## bcftools 1.5 2 bioconda/linux-64
## bcftools 1.5 1 bioconda/linux-64
## bcftools 1.5 0 bioconda/linux-64
## bcftools 1.4.1 0 bioconda/linux-64
## bcftools 1.4 0 bioconda/linux-64
## bcftools 1.3.1 hed695b0_6 bioconda/linux-64
## bcftools 1.3.1 ha92aebf_3 bioconda/linux-64
## bcftools 1.3.1 h84994c4_5 bioconda/linux-64
## bcftools 1.3.1 h84994c4_4 bioconda/linux-64
## bcftools 1.3.1 h5bf99c6_7 bioconda/linux-64
## bcftools 1.3.1 2 bioconda/linux-64
## bcftools 1.3.1 1 bioconda/linux-64
## bcftools 1.3.1 0 bioconda/linux-64
## bcftools 1.3 h5bf99c6_6 bioconda/linux-64
## bcftools 1.3 ha92aebf_2 bioconda/linux-64
## bcftools 1.3 1 bioconda/linux-64
## bcftools 1.3 0 bioconda/linux-64
## bcftools 1.3 h7132678_7 bioconda/linux-64
## bcftools 1.3 hed695b0_5 bioconda/linux-64
## bcftools 1.3 h84994c4_3 bioconda/linux-64
## bcftools 1.2 h4da6232_3 bioconda/linux-64
## bcftools 1.2 h02bfda8_4 bioconda/linux-64
## bcftools 1.2 2 bioconda/linux-64
## bcftools 1.2 1 bioconda/linux-64
## bcftools 1.2 0 bioconda/linux-64
In this case, we want to see if there is any package called bcftools. mamba will look at all of the URLs of the channels in our config file for the specified package and return anything that matches what we searched for. Here we see that it found many matches, all exactly for the string “bcftools” and all on the bioconda channel. The difference between them is their versions, so depending on whether you want to perform an analysis with the latest version of the software, or replicate an analysis from a paper that used a specific version, you should be able to find what you need (at least for a well-maintained package).
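For example, conda match specs let you request one of those versions directly; a sketch (the version below is just one picked from the list above):

```bash
# Pin a version with a match spec (the version shown is illustrative):
mamba search 'bcftools=1.15.1'              # list only builds of that version
mamba install -y -n day4 'bcftools=1.15.1'  # install that exact version
```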
Well, maybe we have an idea of what the name of the package is, but don’t remember it exactly. The mamba search command allows wildcards for inexact matches. For instance, the * character can be used as a wildcard.
Run the code block below to search for all packages beginning with the string “bcf”:
mamba search 'bcf*'
# mamba: A variant on the conda package manager
# search: The sub-command of mamba we want to run
## Getting repodata from channels...
##
## conda-forge/linux-64 Using cache
## conda-forge/noarch Using cache
## bioconda/linux-64 Using cache
## bioconda/noarch Using cache
## pkgs/main/linux-64 Using cache
## pkgs/main/noarch Using cache
## pkgs/r/linux-64 Using cache
## pkgs/r/noarch Using cache
##
##
## Name Version Build Channel
## ──────────────────────────────────────────────────────────────
## bcftools 1.16 hfe4b78e_1 bioconda/linux-64
## bcftools 1.16 hfe4b78e_0 bioconda/linux-64
## bcftools 1.16 haef29d1_2 bioconda/linux-64
## bcftools 1.15.1 h0ea216a_0 bioconda/linux-64
## bcftools 1.15.1 hfe4b78e_1 bioconda/linux-64
## bcftools 1.15 h0ea216a_1 bioconda/linux-64
## bcftools 1.15 h0ea216a_2 bioconda/linux-64
## bcftools 1.15 haf5b3da_0 bioconda/linux-64
## bcftools 1.14 h88f3f91_0 bioconda/linux-64
## bcftools 1.14 hde04aa1_1 bioconda/linux-64
## bcftools 1.13 h3a49de5_0 bioconda/linux-64
## bcftools 1.12 h3f113a9_0 bioconda/linux-64
## bcftools 1.12 h45bccc9_1 bioconda/linux-64
## bcftools 1.11 h7c999a4_0 bioconda/linux-64
## bcftools 1.10.2 h4f4756c_3 bioconda/linux-64
## bcftools 1.10.2 hd2cd319_0 bioconda/linux-64
## bcftools 1.10.2 h4f4756c_2 bioconda/linux-64
## bcftools 1.10.2 h4f4756c_1 bioconda/linux-64
## bcftools 1.10.1 hd2cd319_0 bioconda/linux-64
## bcftools 1.10 h5d15f04_0 bioconda/linux-64
## bcftools 1.9 h47928c2_1 bioconda/linux-64
## bcftools 1.9 h47928c2_2 bioconda/linux-64
## bcftools 1.9 h5c2b69b_5 bioconda/linux-64
## bcftools 1.9 h5c2b69b_6 bioconda/linux-64
## bcftools 1.9 h68d8f2e_7 bioconda/linux-64
## bcftools 1.9 h68d8f2e_8 bioconda/linux-64
## bcftools 1.9 h68d8f2e_9 bioconda/linux-64
## bcftools 1.9 ha228f0b_3 bioconda/linux-64
## bcftools 1.9 ha228f0b_4 bioconda/linux-64
## bcftools 1.8 h4da6232_3 bioconda/linux-64
## bcftools 1.8 2 bioconda/linux-64
## bcftools 1.8 1 bioconda/linux-64
## bcftools 1.8 0 bioconda/linux-64
## bcftools 1.7 0 bioconda/linux-64
## bcftools 1.6 1 bioconda/linux-64
## bcftools 1.6 0 bioconda/linux-64
## bcftools 1.5 h1ff2904_4 bioconda/linux-64
## bcftools 1.5 3 bioconda/linux-64
## bcftools 1.5 2 bioconda/linux-64
## bcftools 1.5 1 bioconda/linux-64
## bcftools 1.5 0 bioconda/linux-64
## bcftools 1.4.1 0 bioconda/linux-64
## bcftools 1.4 0 bioconda/linux-64
## bcftools 1.3.1 2 bioconda/linux-64
## bcftools 1.3.1 0 bioconda/linux-64
## bcftools 1.3.1 1 bioconda/linux-64
## bcftools 1.3.1 hed695b0_6 bioconda/linux-64
## bcftools 1.3.1 h5bf99c6_7 bioconda/linux-64
## bcftools 1.3.1 h84994c4_4 bioconda/linux-64
## bcftools 1.3.1 h84994c4_5 bioconda/linux-64
## bcftools 1.3.1 ha92aebf_3 bioconda/linux-64
## bcftools 1.3 hed695b0_5 bioconda/linux-64
## bcftools 1.3 0 bioconda/linux-64
## bcftools 1.3 1 bioconda/linux-64
## bcftools 1.3 h5bf99c6_6 bioconda/linux-64
## bcftools 1.3 h7132678_7 bioconda/linux-64
## bcftools 1.3 h84994c4_3 bioconda/linux-64
## bcftools 1.3 ha92aebf_2 bioconda/linux-64
## bcftools 1.2 h4da6232_3 bioconda/linux-64
## bcftools 1.2 h02bfda8_4 bioconda/linux-64
## bcftools 1.2 2 bioconda/linux-64
## bcftools 1.2 1 bioconda/linux-64
## bcftools 1.2 0 bioconda/linux-64
## bcftools-gtc2vcf-plugin 1.16 h0fdf51a_0 bioconda/linux-64
## bcftools-gtc2vcf-plugin 1.9 hedc5323_0 bioconda/linux-64
## bcftools-snvphyl-plugin 1.9 he673b24_1 bioconda/linux-64
## bcftools-snvphyl-plugin 1.9 h2559242_7 bioconda/linux-64
## bcftools-snvphyl-plugin 1.9 h34584cc_4 bioconda/linux-64
## bcftools-snvphyl-plugin 1.9 h4da6232_0 bioconda/linux-64
## bcftools-snvphyl-plugin 1.9 h80657d4_3 bioconda/linux-64
## bcftools-snvphyl-plugin 1.9 ha13ca6a_2 bioconda/linux-64
## bcftools-snvphyl-plugin 1.9 hc0af00e_5 bioconda/linux-64
## bcftools-snvphyl-plugin 1.9 hdd6bb30_6 bioconda/linux-64
## bcftools-snvphyl-plugin 1.8 0 bioconda/linux-64
## bcftools-snvphyl-plugin 1.8 h4da6232_2 bioconda/linux-64
## bcftools-snvphyl-plugin 1.6 0 bioconda/linux-64
## bcftools-snvphyl-plugin 1.6 1 bioconda/linux-64
## bcftools-snvphyl-plugin 1.5 0 bioconda/linux-64
Exercise: Search for an open-source software package that you currently use or would like to use in your workflows. Try using wildcards if an exact match isn’t found. Note that conda packages don’t exist for much bioinformatics software (especially lesser-used tools).
## Write a command to search for a conda-package version of your chosen software.
It can sometimes be more expedient (and, for bioconda packages, informative) to search for packages using a web browser. This also may be more intuitive for many people, and search results on the web often display the exact command needed to install a given package.
The complete list of bioconda packages is available at:
https://bioconda.github.io/conda-package_index.html
A more comprehensive package search interface, containing packages from many channels (but with less information about bioconda packages than the above bioconda package list), is available at https://anaconda.org.
Conda packages are installed into an environment, which is a directory structure containing a set of conda packages. Environments are managed separately and are isolated from each other: changes made to one environment will not impact the software in another environment. The main benefit of this is that the user has complete write access to everything inside an environment’s directory tree. In essence, each environment is a filesystem within the main filesystem. This simplifies many aspects of installing software and its dependencies.
For instance, when compiling a package from source (that is, downloading the code files and building them into an executable), the code may rely on functions that exist in external libraries of code. By default, during compilation, the program looks in certain locations for these dependencies, and it can be complicated to change where it looks. If these dependencies need to be installed where the user doesn’t have access, this can be an almost impossible bottleneck for most users to get past.
However, when working in an environment, the user has complete access to all paths within the environment file system. This means that dependencies can easily be installed within the environment. Packages on conda are also pre-compiled, meaning that one doesn’t have to make them executable from the raw code: the code and its dependencies simply need to be placed in certain locations in the environment to work, and conda/mamba keeps track of all of this in almost all cases.
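As a contrast, a typical from-source build versus the conda route might look like this (sometool and the paths are hypothetical, for illustration only):

```bash
# From source (hypothetical package): compile it yourself and manage
# dependencies and install locations by hand
tar -xzf sometool-1.0.tar.gz && cd sometool-1.0
./configure --prefix=$HOME/apps   # must point somewhere you can write...
make && make install              # ...and fails if dependencies are missing

# With conda/mamba: pre-compiled binaries and their dependencies are simply
# placed inside the environment's directory tree
mamba install -n day4 sometool
```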
mamba env create
The basic syntax for creating a conda environment is as follows:
mamba env create -n/--name ENVIRONMENT
This creates a “named” conda environment in your envs directory (by default ${HOME}/.conda/envs).
Note: mamba create is a synonym for mamba env create
Run the following code block to create a named environment called day4:
mamba env create -y -n day4
# mamba: A variant on the conda package manager
# create: The sub-command of mamba env we want to run
# -y: don't prompt y/n to create the environment; assume "y"
# -n: This option tells mamba env create to call the environment the provided string (e.g. day4, in this case)
##
## __
## __ ______ ___ ____ _____ ___ / /_ ____ _
## / / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/
## / /_/ / / / / / / /_/ / / / / / / /_/ / /_/ /
## / .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/
## /_/
##
## Empty environment created at prefix: /tmp/mamba/envs/day4
mamba env list
In the course of your work, you may end up creating a lot of environments. Some may be for specific packages or projects. In general, environments are pretty robust, but the more packages you install in one, the higher the chance you run into an incompatibility between them that has unexpected consequences (e.g. downgrading one package because another depends on a specific version, or even breaking the environment, which may just start acting weird, for lack of a better description). In these cases it is generally OK to just create a new environment, though this can be time consuming.
One useful thing you may want to do is look at the names and locations of all the environments you have created.
Run the code block below to use the mamba env list command to list all environments you have created:
mamba env list
# mamba: A variant on the conda package manager
# env: The sub-command of mamba we want to run
# list: The sub-command of mamba env we want to run
##
## __
## __ ______ ___ ____ _____ ___ / /_ ____ _
## / / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/
## / /_/ / / / / / / /_/ / / / / / / /_/ / /_/ /
## / .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/
## /_/
##
## Name Active Path
## ──────────────────────────────────────
## base /tmp/mamba
## day4 /tmp/mamba/envs/day4
You should at least see a base and day4 environment.
A few comments regarding the base environment:
The base environment is the default environment, in which Python and conda/mamba itself are installed
On the FAS RC cluster (Cannon), the base environment is global (shared by all users) and read-only
On a local (user-installed) conda installation, the base environment may be writable. However, never install packages into the base environment, to avoid breaking your conda installation.
mamba activate
Creating an environment does not mean you can begin using it. You must first activate it.
Activating a conda environment sets environment variables (such as $PATH, which is a colon-separated list of directories the shell searches for commands) to allow software in the environment to be used, and makes the environment the default target for relevant mamba commands that operate on environments (such as mamba install).
Run the code block below to see how a system variable, $PATH, changes when you activate an environment:
echo "PATH before mamba activate: ${PATH}"
echo
mamba activate day4
# mamba: A variant on the conda package manager
# activate: The sub-command of mamba we want to run
echo "PATH after mamba activate: ${PATH}"
## PATH before mamba activate: /n/home/user/bin:/condabin:/usr/bin:/n/home/user/R/ifxrstudio/RELEASE_3_16/python-user-base/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/bin:/usr/local/texlive/bin/x86_64-linux:/usr/lib/rstudio-server/bin/quarto/bin:/usr/lib/rstudio-server/bin/postback/postback
##
## PATH after mamba activate: /tmp/mamba/envs/day4/bin:/n/home/user/bin:/condabin:/usr/bin:/n/home/user/R/ifxrstudio/RELEASE_3_16/python-user-base/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/texlive/bin/x86_64-linux:/usr/lib/rstudio-server/bin/quarto/bin:/usr/lib/rstudio-server/bin/postback/postback
In this example, an envs/day4/bin directory has been prepended to the $PATH variable.
In an interactive shell environment (i.e., if using the Terminal), the shell prompt is prefixed with the environment name in parentheses: (day4).
Note that every time you log on and want to use an environment you will have to activate it. Also, if you are in one environment and want to use another one, you must activate the new one!
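For example, a typical session after logging back in might look like this (other-project is a hypothetical second environment):

```bash
mamba activate day4            # re-activate before using its software
samtools --version             # runs the day4 installation

mamba activate other-project   # switching projects? activate the other environment
```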
The main reason to create and activate an environment is to install and use packages (i.e. programs) within it, so one of the most basic things we want to know is what packages are currently installed in our environment. We can see this using the mamba list command.
Run the code block below to list the packages currently installed in our current environment (day4):
mamba activate day4
# mamba: A variant on the conda package manager
# activate: The sub-command of mamba we want to run
mamba list
# mamba: A variant on the conda package manager
# list: The sub-command of mamba we want to run
## List of packages in environment: "/tmp/mamba/envs/day4"
NOTE: We use mamba activate again in this code chunk because each R Markdown code chunk is a separate shell environment. In an interactive shell (Terminal) or shell script, the environment changes from mamba activate will persist in that shell/script until explicitly deactivated with mamba deactivate (described below).
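In a standalone shell script, for instance, one activation at the top is enough; a sketch, assuming the micromamba setup from earlier in this session:

```bash
#!/bin/bash
# Enable 'activate' in this non-interactive shell, then activate once;
# everything below runs inside the day4 environment.
eval "$(micromamba shell hook --shell=bash)"
micromamba activate day4
samtools --version
bedtools --version
micromamba deactivate   # restore the previous environment
```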
Ok, nothing in our environment so far. That makes sense since we just created it. Let’s install some packages in it now.
mamba install
The mamba install command installs the listed package(s), including dependencies, in the specified environment, or the current (activated) environment if an environment isn’t specified.
Run the code block below to install the bedtools, samtools, and grampa packages in our day4 environment. This may take some time:
mamba install -y -n day4 bedtools samtools grampa
# mamba: A variant on the conda package manager
# install: The sub-command of mamba we want to run
# -y : don't prompt y/n to install packages; assume "y"
# -n: The name of the environment in which we want to install the specified packages
##
## __
## __ ______ ___ ____ _____ ___ / /_ ____ _
## / / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/
## / /_/ / / / / / / /_/ / / / / / / /_/ / /_/ /
## / .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/
## /_/
##
## conda-forge/linux-64 Using cache
## conda-forge/noarch Using cache
## bioconda/linux-64 Using cache
## bioconda/noarch Using cache
## pkgs/main/linux-64 Using cache
## pkgs/main/noarch Using cache
## pkgs/r/linux-64 Using cache
## pkgs/r/noarch Using cache
##
## Transaction
##
## Prefix: /tmp/mamba/envs/day4
##
## Updating specs:
##
## - bedtools
## - samtools
## - grampa
##
##
## Package Version Build Channel Size
## ────────────────────────────────────────────────────────────────────────────────────────
## Install:
## ────────────────────────────────────────────────────────────────────────────────────────
##
## + _libgcc_mutex 0.1 conda_forge conda-forge/linux-64 3kB
## + _openmp_mutex 4.5 2_gnu conda-forge/linux-64 24kB
## + bedtools 2.30.0 h468198e_3 bioconda/linux-64 16MB
## + bzip2 1.0.8 h7f98852_4 conda-forge/linux-64 496kB
## + c-ares 1.18.1 h7f98852_0 conda-forge/linux-64 115kB
## + ca-certificates 2022.12.7 ha878542_0 conda-forge/linux-64 146kB
## + grampa 1.4.0 pyhdfd78af_0 bioconda/noarch 46kB
## + htslib 1.17 h6bc39ce_0 bioconda/linux-64 2MB
## + keyutils 1.6.1 h166bdaf_0 conda-forge/linux-64 118kB
## + krb5 1.20.1 hf9c8cef_0 conda-forge/linux-64 1MB
## + ld_impl_linux-64 2.40 h41732ed_0 conda-forge/linux-64 705kB
## + libcurl 7.87.0 h6312ad2_0 conda-forge/linux-64 347kB
## + libdeflate 1.13 h166bdaf_0 conda-forge/linux-64 80kB
## + libedit 3.1.20191231 he28a2e2_2 conda-forge/linux-64 124kB
## + libev 4.33 h516909a_1 conda-forge/linux-64 106kB
## + libffi 3.4.2 h7f98852_5 conda-forge/linux-64 58kB
## + libgcc-ng 12.2.0 h65d4601_19 conda-forge/linux-64 954kB
## + libgomp 12.2.0 h65d4601_19 conda-forge/linux-64 466kB
## + libnghttp2 1.51.0 hdcd2b5c_0 conda-forge/linux-64 623kB
## + libnsl 2.0.0 h7f98852_0 conda-forge/linux-64 31kB
## + libsqlite 3.40.0 h753d276_0 conda-forge/linux-64 810kB
## + libssh2 1.10.0 haa6b8db_3 conda-forge/linux-64 239kB
## + libstdcxx-ng 12.2.0 h46fd767_19 conda-forge/linux-64 4MB
## + libuuid 2.38.1 h0b41bf4_0 conda-forge/linux-64 34kB
## + libzlib 1.2.13 h166bdaf_4 conda-forge/linux-64 66kB
## + ncurses 6.3 h27087fc_1 conda-forge/linux-64 1MB
## + openssl 1.1.1t h0b41bf4_0 conda-forge/linux-64 2MB
## + pip 23.0.1 pyhd8ed1ab_0 conda-forge/noarch 1MB
## + python 3.11.0 h10a6764_1_cpython conda-forge/linux-64 31MB
## + readline 8.2 h8228510_1 conda-forge/linux-64 281kB
## + samtools 1.16.1 h00cdaf9_2 bioconda/linux-64 420kB
## + setuptools 67.6.1 pyhd8ed1ab_0 conda-forge/noarch 580kB
## + tk 8.6.12 h27826a3_0 conda-forge/linux-64 3MB
## + tzdata 2023c h71feb2d_0 conda-forge/noarch 118kB
## + wheel 0.40.0 pyhd8ed1ab_0 conda-forge/noarch 56kB
## + xz 5.2.6 h166bdaf_0 conda-forge/linux-64 418kB
## + zlib 1.2.13 h166bdaf_4 conda-forge/linux-64 94kB
##
## Summary:
##
## Install: 37 packages
##
## Total download: 71MB
##
## ────────────────────────────────────────────────────────────────────────────────────────
##
##
##
## Transaction starting
## Linking _libgcc_mutex-0.1-conda_forge
## Linking libstdcxx-ng-12.2.0-h46fd767_19
## Linking ld_impl_linux-64-2.40-h41732ed_0
## Linking ca-certificates-2022.12.7-ha878542_0
## Linking libgomp-12.2.0-h65d4601_19
## Linking _openmp_mutex-4.5-2_gnu
## Linking libgcc-ng-12.2.0-h65d4601_19
## Linking libev-4.33-h516909a_1
## Linking c-ares-1.18.1-h7f98852_0
## Linking libuuid-2.38.1-h0b41bf4_0
## Linking libffi-3.4.2-h7f98852_5
## Linking bzip2-1.0.8-h7f98852_4
## Linking ncurses-6.3-h27087fc_1
## Linking libnsl-2.0.0-h7f98852_0
## Linking keyutils-1.6.1-h166bdaf_0
## Linking openssl-1.1.1t-h0b41bf4_0
## Linking xz-5.2.6-h166bdaf_0
## Linking libdeflate-1.13-h166bdaf_0
## Linking libzlib-1.2.13-h166bdaf_4
## Linking libedit-3.1.20191231-he28a2e2_2
## Linking readline-8.2-h8228510_1
## Linking libssh2-1.10.0-haa6b8db_3
## Linking libnghttp2-1.51.0-hdcd2b5c_0
## Linking tk-8.6.12-h27826a3_0
## Linking libsqlite-3.40.0-h753d276_0
## Linking zlib-1.2.13-h166bdaf_4
## Linking krb5-1.20.1-hf9c8cef_0
## Linking libcurl-7.87.0-h6312ad2_0
## Linking tzdata-2023c-h71feb2d_0
## Linking bedtools-2.30.0-h468198e_3
## Linking htslib-1.17-h6bc39ce_0
## Linking samtools-1.16.1-h00cdaf9_2
## Linking python-3.11.0-h10a6764_1_cpython
## Linking wheel-0.40.0-pyhd8ed1ab_0
## Linking setuptools-67.6.1-pyhd8ed1ab_0
## Linking pip-23.0.1-pyhd8ed1ab_0
## Linking grampa-1.4.0-pyhdfd78af_0
## Transaction finished
mamba installs the latest available versions of the listed packages unless versions are specified (we'll see an example below).
Important: it's a best practice to install all packages needed in a conda environment at the same time (i.e., in the same mamba install command) rather than installing packages one at a time (with separate mamba install commands). This allows mamba to "solve" a mutually compatible set of packages and dependencies. Otherwise, mamba may have to upgrade/downgrade existing packages in the environment, potentially "breaking" software already installed there.
Pro tip: packages can be installed during environment creation by appending the list of packages to the mamba create command; e.g.:
mamba create -y -n day4 bedtools samtools
or by activating a conda environment before issuing a mamba install command; e.g.:
mamba activate day4
mamba install -y bedtools samtools
Now that we've installed some packages in our day4 environment, let's activate it, run grampa.py --version to verify the software is available, and list the contents of the activated environment.
Run the code block below to view the packages installed in our day4 environment:
echo "Running GRAMPA outside of the day4 environment:"
grampa.py --version
# grampa.py: A program for inferring WGDs in a phylogeny
# --version: This tells grampa to just display the current version of the software (useful for seeing if it is installed)
## NOTE: This should produce an error since we have not installed grampa outside of our day4 environment, and we have not yet activated the environment
echo "------"
echo "Activating day4 environment"
mamba activate day4
# mamba: A variant on the conda package manager
# activate: The sub-command of mamba we want to run
echo "Running GRAMPA inside of the day4 environment:"
grampa.py --version
# grampa.py: A program for inferring WGDs in a phylogeny
# --version: This tells grampa to just display the current version of the software (useful for seeing if it is installed)
echo
echo "Listing packages installed in day4:"
mamba list
# mamba: A variant on the conda package manager
# list: The sub-command of mamba we want to run
## Running GRAMPA outside of the day4 environment:
## bash: line 3: grampa.py: command not found
## ------
## Activating day4 environment
## Running GRAMPA inside of the day4 environment:
##
## /tmp/mamba/envs/day4/bin/grampa.py --version
##
## # GRAMPA version 1.4.0 released on March 2023
##
## Listing packages installed in day4:
## List of packages in environment: "/tmp/mamba/envs/day4"
##
## Name Version Build Channel
## ───────────────────────────────────────────────────────────────────
## _libgcc_mutex 0.1 conda_forge conda-forge
## _openmp_mutex 4.5 2_gnu conda-forge
## bedtools 2.30.0 h468198e_3 bioconda
## bzip2 1.0.8 h7f98852_4 conda-forge
## c-ares 1.18.1 h7f98852_0 conda-forge
## ca-certificates 2022.12.7 ha878542_0 conda-forge
## grampa 1.4.0 pyhdfd78af_0 bioconda
## htslib 1.17 h6bc39ce_0 bioconda
## keyutils 1.6.1 h166bdaf_0 conda-forge
## krb5 1.20.1 hf9c8cef_0 conda-forge
## ld_impl_linux-64 2.40 h41732ed_0 conda-forge
## libcurl 7.87.0 h6312ad2_0 conda-forge
## libdeflate 1.13 h166bdaf_0 conda-forge
## libedit 3.1.20191231 he28a2e2_2 conda-forge
## libev 4.33 h516909a_1 conda-forge
## libffi 3.4.2 h7f98852_5 conda-forge
## libgcc-ng 12.2.0 h65d4601_19 conda-forge
## libgomp 12.2.0 h65d4601_19 conda-forge
## libnghttp2 1.51.0 hdcd2b5c_0 conda-forge
## libnsl 2.0.0 h7f98852_0 conda-forge
## libsqlite 3.40.0 h753d276_0 conda-forge
## libssh2 1.10.0 haa6b8db_3 conda-forge
## libstdcxx-ng 12.2.0 h46fd767_19 conda-forge
## libuuid 2.38.1 h0b41bf4_0 conda-forge
## libzlib 1.2.13 h166bdaf_4 conda-forge
## ncurses 6.3 h27087fc_1 conda-forge
## openssl 1.1.1t h0b41bf4_0 conda-forge
## pip 23.0.1 pyhd8ed1ab_0 conda-forge
## python 3.11.0 h10a6764_1_cpython conda-forge
## readline 8.2 h8228510_1 conda-forge
## samtools 1.16.1 h00cdaf9_2 bioconda
## setuptools 67.6.1 pyhd8ed1ab_0 conda-forge
## tk 8.6.12 h27826a3_0 conda-forge
## tzdata 2023c h71feb2d_0 conda-forge
## wheel 0.40.0 pyhd8ed1ab_0 conda-forge
## xz 5.2.6 h166bdaf_0 conda-forge
## zlib 1.2.13 h166bdaf_4 conda-forge
mamba deactivate
The mamba deactivate command exits your current environment and returns you to the base environment. (If you run mamba deactivate while already in the base environment, you will exit conda completely and need to restart it.) We will run mamba deactivate later on, but first we want to run some commands in the environment as well.
Deactivating a conda environment effectively undoes the environment changes made by mamba activate, restoring the previous environment.
Run the following code block to see that software installed inside of the environment is not found when it is deactivated:
mamba activate day4
# mamba: A variant on the conda package manager
# activate: The sub-command of mamba we want to run
type grampa.py
# type: A shell command that displays the path of the specified command
# grampa.py: A program for inferring WGDs in a phylogeny
mamba deactivate
# mamba: A variant on the conda package manager
# deactivate: The sub-command of mamba we want to run
type grampa.py
# type: A shell command that displays the path of the specified command
# grampa.py: A program for inferring WGDs in a phylogeny
# NOTE: This should produce an error since we have deactivated the environment with grampa.py installed in it
## grampa.py is /tmp/mamba/envs/day4/bin/grampa.py
## bash: line 13: type: grampa.py: not found
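What makes this work? A large part of activating an environment is simply prepending the environment's bin directory to your PATH, so its programs are found first; deactivating removes it again (the real activation logic also sets variables such as CONDA_PREFIX). Here is a rough, self-contained sketch of that mechanism; mytool and the temporary directory are made up for illustration.

```shell
# Sketch of the mechanism behind activate/deactivate:
# activate ~ prepend the env's bin/ to PATH; deactivate ~ remove it.
envbin=$(mktemp -d)                      # stand-in for an env's bin/ directory
printf '#!/bin/sh\necho mytool 1.0\n' > "$envbin/mytool"
chmod +x "$envbin/mytool"

type mytool || echo "not found (as expected before 'activation')"

OLDPATH=$PATH
PATH="$envbin:$PATH"                     # "activate": env bin searched first
type mytool                              # now resolves inside the env

PATH=$OLDPATH                            # "deactivate": original PATH restored
type mytool || echo "not found again"
```

This is why the same command name can resolve to different programs (or nothing at all) depending on which environment is active.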
mamba run
Instead of activating and deactivating a conda environment, the mamba run command can be used to run a single command in a specified conda environment.
Run the code block below to run a grampa.py command in our day4 environment without activating that environment:
mamba run -n day4 grampa.py --version
# mamba: A variant on the conda package manager
# run: The sub-command of mamba we want to run
# -n: The name of the environment to run in
# grampa.py: A program for inferring WGDs in a phylogeny
# --version: This tells grampa to just display the current version of the software (useful for seeing if it is installed)
##
## /tmp/mamba/envs/day4/bin/grampa.py --version
##
## # GRAMPA version 1.4.0 released on March 2023
Unlike mamba activate, mamba run does not alter the shell's environment, so mamba deactivate is not needed afterwards to restore the original environment.
By default, environments are created in the envs folder of your Anaconda/Miniconda installation (usually ${HOME}/.conda/envs). However, you can create an environment at a specific path (via mamba create -p /path/to/environment). This provides more flexibility than named environments, allowing the environment to be created in any directory the user has write access to, and facilitating sharing of conda environments among members of the same lab/group.
Suppose we want to install samtools in an environment located at /tmp/samtools-env (note: a temporary directory, used here just for illustration). Furthermore, suppose we need an old version of samtools (0.1.19). We'll select the version using the = operator.
Run the code block below to create a conda environment at a specific path and install a specific version of samtools within it:
mamba create -y -p /tmp/samtools-env samtools=0.1.19
# mamba: A variant on the conda package manager
# create: The sub-command of mamba we want to run
# -y: don't prompt y/n to create the environment; assume "y"
# -p: This option tells mamba create to put the environment folder at the provided path
##
## __
## __ ______ ___ ____ _____ ___ / /_ ____ _
## / / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/
## / /_/ / / / / / / /_/ / / / / / / /_/ / /_/ /
## / .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/
## /_/
##
## conda-forge/linux-64 Using cache
## conda-forge/noarch Using cache
## bioconda/linux-64 Using cache
## bioconda/noarch Using cache
## pkgs/main/linux-64 Using cache
## pkgs/main/noarch Using cache
## pkgs/r/linux-64 Using cache
## pkgs/r/noarch Using cache
##
## Transaction
##
## Prefix: /tmp/samtools-env
##
## Updating specs:
##
## - samtools=0.1.19
##
##
## Package Version Build Channel Size
## ──────────────────────────────────────────────────────────────────────────
## Install:
## ──────────────────────────────────────────────────────────────────────────
##
## + _libgcc_mutex 0.1 conda_forge conda-forge/linux-64 Cached
## + _openmp_mutex 4.5 2_gnu conda-forge/linux-64 Cached
## + libgcc-ng 12.2.0 h65d4601_19 conda-forge/linux-64 Cached
## + libgomp 12.2.0 h65d4601_19 conda-forge/linux-64 Cached
## + libzlib 1.2.13 h166bdaf_4 conda-forge/linux-64 Cached
## + ncurses 6.3 h27087fc_1 conda-forge/linux-64 Cached
## + samtools 0.1.19 h20b1175_10 bioconda/linux-64 4MB
## + zlib 1.2.13 h166bdaf_4 conda-forge/linux-64 Cached
##
## Summary:
##
## Install: 8 packages
##
## Total download: 4MB
##
## ──────────────────────────────────────────────────────────────────────────
##
##
##
## Transaction starting
## Linking _libgcc_mutex-0.1-conda_forge
## Linking libgomp-12.2.0-h65d4601_19
## Linking _openmp_mutex-4.5-2_gnu
## Linking libgcc-ng-12.2.0-h65d4601_19
## Linking ncurses-6.3-h27087fc_1
## Linking libzlib-1.2.13-h166bdaf_4
## Linking zlib-1.2.13-h166bdaf_4
## Linking samtools-0.1.19-h20b1175_10
## Transaction finished
Run the code block below to see the directory structure in the environment folder, /tmp/samtools-env:
ls /tmp/samtools-env
# ls: The Unix command to list the contents of a directory
## bin
## conda-meta
## include
## lib
## share
We can run many of the same commands, with slight modification, on environments located at custom paths. For instance, mamba list -p PATH treats the directory at PATH as a conda environment and lists its installed packages.
Run the code block below to list the packages installed in the environment at that path:
mamba list -p /tmp/samtools-env
# mamba: A variant on the conda package manager
# list: The sub-command of mamba we want to run
# -p: This option tells mamba to look for an environment folder at the provided path
## List of packages in environment: "/tmp/samtools-env"
##
## Name Version Build Channel
## ────────────────────────────────────────────────────
## _libgcc_mutex 0.1 conda_forge conda-forge
## _openmp_mutex 4.5 2_gnu conda-forge
## libgcc-ng 12.2.0 h65d4601_19 conda-forge
## libgomp 12.2.0 h65d4601_19 conda-forge
## libzlib 1.2.13 h166bdaf_4 conda-forge
## ncurses 6.3 h27087fc_1 conda-forge
## samtools 0.1.19 h20b1175_10 bioconda
## zlib 1.2.13 h166bdaf_4 conda-forge
Similarly, mamba run accepts a -p PATH option, and mamba activate can be given the environment's path in place of a name.
Run the code block below to activate our environment based on its path:
mamba activate /tmp/samtools-env
# mamba: A variant on the conda package manager
# activate: The sub-command of mamba we want to run
# /tmp/samtools-env: The path of the environment to activate
type samtools
# type: A shell command that displays the path of the specified command
## samtools is /tmp/samtools-env/bin/samtools
Important: conda environment directories are not relocatable; e.g., the above conda environment may not work if moved to a different directory. We’ll see how to re-create a conda environment in a different location in the next section.
Exercise: Use mamba to create a conda environment called samtools-env that contains the samtools package (default/latest version) in your current working directory. Note that the -p PATH option can be an absolute path (e.g., -p $PWD/samtools-env) or a relative path (e.g., -p ./samtools-env).
## Create an environment called samtools-env in the current working directory
mamba create -y -p ./samtools-env samtools
##
## __
## __ ______ ___ ____ _____ ___ / /_ ____ _
## / / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/
## / /_/ / / / / / / /_/ / / / / / / /_/ / /_/ /
## / .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/
## /_/
##
## conda-forge/linux-64 Using cache
## conda-forge/noarch Using cache
## bioconda/linux-64 Using cache
## bioconda/noarch Using cache
## pkgs/main/linux-64 Using cache
## pkgs/main/noarch Using cache
## pkgs/r/linux-64 Using cache
## pkgs/r/noarch Using cache
##
## Transaction
##
## Prefix: /n/home/user/repos/harvardinformatics/workshops/2023-spring/biotips/samtools-env
##
## Updating specs:
##
## - samtools
##
##
## Package Version Build Channel Size
## ─────────────────────────────────────────────────────────────────────────────────
## Install:
## ─────────────────────────────────────────────────────────────────────────────────
##
## + _libgcc_mutex 0.1 conda_forge conda-forge/linux-64 Cached
## + _openmp_mutex 4.5 2_gnu conda-forge/linux-64 Cached
## + bzip2 1.0.8 h7f98852_4 conda-forge/linux-64 Cached
## + c-ares 1.18.1 h7f98852_0 conda-forge/linux-64 Cached
## + ca-certificates 2022.12.7 ha878542_0 conda-forge/linux-64 Cached
## + htslib 1.17 h6bc39ce_0 bioconda/linux-64 Cached
## + keyutils 1.6.1 h166bdaf_0 conda-forge/linux-64 Cached
## + krb5 1.20.1 hf9c8cef_0 conda-forge/linux-64 Cached
## + libcurl 7.87.0 h6312ad2_0 conda-forge/linux-64 Cached
## + libdeflate 1.13 h166bdaf_0 conda-forge/linux-64 Cached
## + libedit 3.1.20191231 he28a2e2_2 conda-forge/linux-64 Cached
## + libev 4.33 h516909a_1 conda-forge/linux-64 Cached
## + libgcc-ng 12.2.0 h65d4601_19 conda-forge/linux-64 Cached
## + libgomp 12.2.0 h65d4601_19 conda-forge/linux-64 Cached
## + libnghttp2 1.51.0 hdcd2b5c_0 conda-forge/linux-64 Cached
## + libssh2 1.10.0 haa6b8db_3 conda-forge/linux-64 Cached
## + libstdcxx-ng 12.2.0 h46fd767_19 conda-forge/linux-64 Cached
## + libzlib 1.2.13 h166bdaf_4 conda-forge/linux-64 Cached
## + ncurses 6.3 h27087fc_1 conda-forge/linux-64 Cached
## + openssl 1.1.1t h0b41bf4_0 conda-forge/linux-64 Cached
## + samtools 1.16.1 h00cdaf9_2 bioconda/linux-64 Cached
## + xz 5.2.6 h166bdaf_0 conda-forge/linux-64 Cached
## + zlib 1.2.13 h166bdaf_4 conda-forge/linux-64 Cached
##
## Summary:
##
## Install: 23 packages
##
## Total download: 0 B
##
## ─────────────────────────────────────────────────────────────────────────────────
##
##
##
## Transaction starting
## Linking _libgcc_mutex-0.1-conda_forge
## Linking ca-certificates-2022.12.7-ha878542_0
## Linking libstdcxx-ng-12.2.0-h46fd767_19
## Linking libgomp-12.2.0-h65d4601_19
## Linking _openmp_mutex-4.5-2_gnu
## Linking libgcc-ng-12.2.0-h65d4601_19
## Linking libev-4.33-h516909a_1
## Linking c-ares-1.18.1-h7f98852_0
## Linking bzip2-1.0.8-h7f98852_4
## Linking ncurses-6.3-h27087fc_1
## Linking keyutils-1.6.1-h166bdaf_0
## Linking openssl-1.1.1t-h0b41bf4_0
## Linking xz-5.2.6-h166bdaf_0
## Linking libdeflate-1.13-h166bdaf_0
## Linking libzlib-1.2.13-h166bdaf_4
## Linking libedit-3.1.20191231-he28a2e2_2
## Linking libssh2-1.10.0-haa6b8db_3
## Linking libnghttp2-1.51.0-hdcd2b5c_0
## Linking zlib-1.2.13-h166bdaf_4
## Linking krb5-1.20.1-hf9c8cef_0
## Linking libcurl-7.87.0-h6312ad2_0
## Linking htslib-1.17-h6bc39ce_0
## Linking samtools-1.16.1-h00cdaf9_2
## Transaction finished
We’ll use this conda environment in the next section of the workshop, so be sure it was created correctly.
Verify that samtools was installed into a conda environment at ./samtools-env by executing the following code chunk:
mamba run -p ./samtools-env which samtools
# mamba: A variant on the conda package manager
# run: The sub-command of mamba we want to run
# -p: This option tells mamba to look for an environment folder at the provided path
# which: A command that displays the path of the specified command
# samtools: A suite of programs to process SAM/BAM files
## /n/home/user/repos/harvardinformatics/workshops/2023-spring/biotips/samtools-env/bin/samtools
Run the following code chunk to create symbolic links in your current working directory to the data files used for the exercises in this section:
mkdir -p data4
ln -s -f /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/* data4
# ln: The Unix link command, which creates a link (shortcut) at the second path pointing to the file(s) at the first path
# -s: This option tells ln to create a symbolic link rather than a hard link (original files are not changed)
# -f: This option tells ln to overwrite an existing link, if one is present
ls -l data4
# Show the details of the files in the new linked directory
## total 224
## lrwxrwxrwx 1 user informatics 109 Mar 31 14:04 Biotips-workshop-2023-Day4-student.Rmd -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/Biotips-workshop-2023-Day4-student.Rmd
## lrwxrwxrwx 1 user informatics 93 Mar 31 14:04 SAMEA3532870_final.bam -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/SAMEA3532870_final.bam
## lrwxrwxrwx 1 user informatics 93 Mar 31 14:04 SAMEA3532871_final.bam -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/SAMEA3532871_final.bam
## lrwxrwxrwx 1 user informatics 93 Mar 31 14:04 SAMEA3532872_final.bam -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/SAMEA3532872_final.bam
## lrwxrwxrwx 1 user informatics 93 Mar 31 14:04 SAMEA3532873_final.bam -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/SAMEA3532873_final.bam
## lrwxrwxrwx 1 user informatics 93 Mar 31 14:04 SAMEA3532874_final.bam -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/SAMEA3532874_final.bam
## lrwxrwxrwx 1 user informatics 93 Mar 31 14:04 SAMEA3532875_final.bam -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/SAMEA3532875_final.bam
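If symbolic links are new to you, here is a tiny self-contained illustration (using a temporary directory and a made-up file, so it can run anywhere): the link is just a pointer, and reading through it reads the original file.

```shell
# Self-contained illustration of symbolic links:
tmp=$(mktemp -d)
echo "chr1 100 200" > "$tmp/regions.bed"    # the "original" data file
mkdir -p "$tmp/data"
ln -s -f "$tmp/regions.bed" "$tmp/data/"    # link it into data/
cat "$tmp/data/regions.bed"                 # reads the original through the link
readlink "$tmp/data/regions.bed"            # shows the path the link points to
```

This is why the data4 links above occupy almost no space: only the pointer is stored, not a copy of each BAM file.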
Today we'll be using the Harvard FAS Research Computing cluster, which is known as Cannon. A cluster is a set of computers that are networked together and whose main purpose is to run computationally intensive jobs. A cluster may also be referred to as a high-performance computing (HPC) system. We use compute clusters because the size of biological data, and the analyses we may want to perform on it, are not tractable on our own personal computers or lab-owned servers. In short, without dedicated compute clusters, modern biological research would probably grind to a halt.
Clusters are great; however, they are also a community resource. Multiple groups across a given institution, or even across institutions, may need the cluster's resources at the same time. This introduces a problem: how do you decide which user gets to use which resources, and when? That's where job-scheduling software comes in.
When you login to a cluster, as we have done now with our RStudio app, you connect to a login node. Login nodes are usually where people interact with the file system in order to submit jobs to the more resource heavy compute nodes. Running commands and doing some light file processing is generally ok on login nodes, but their main use is to submit jobs. Job submission consists of running a command that executes a script with the job scheduling software that also has some information about the resources requested for the job. Once a job is submitted, the scheduling software will look at the resources requested, the resources available, and the user’s and their group’s recent usage of the cluster to decide which node the commands in the script should be executed on and when it should start.
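To make this concrete, here is a minimal sketch of what such a job script might look like; the job name, partition, and resource values are placeholders to adapt to your own job, and the #SBATCH lines are the "special directives" describing requested resources. Because #SBATCH directives are ordinary shell comments, the file is still just a bash script; Slurm reads the directives when the script is submitted:

```shell
#!/bin/bash
#SBATCH --job-name=my-job         # a short name for the job (placeholder)
#SBATCH --partition=shared        # partition (queue) to submit to
#SBATCH --ntasks=1                # run a single task
#SBATCH --cpus-per-task=4         # CPU cores for that task
#SBATCH --mem=8G                  # total memory requested
#SBATCH --time=0-02:00            # time limit (D-HH:MM)
#SBATCH -o my-job_%j.out          # stdout file (%j expands to the job ID)
#SBATCH -e my-job_%j.err          # stderr file

# The commands below run on the allocated compute node.
echo "Job running on $(hostname)"
# ... your actual analysis commands would go here ...
```

Saved as, say, my-job.sh, this would be submitted to the scheduler with sbatch my-job.sh.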
Slurm is a cluster workload manager. Slurm allows shell scripts to be augmented with special directives that specify the needed resources (such as CPUs, memory, GPUs, and nodes) to allocate. Slurm schedules submitted jobs for execution on the requested resources. If the resources are not immediately available, the job is queued and prioritized for execution based on the resources requested, as well as the fairshare associated with the Slurm account.
By default, on Cannon each lab group is associated with a single Slurm account.
As introduced above, the job-scheduling software used on Cannon is Slurm. Slurm has many commands that allow users to monitor cluster resources and submit jobs. Let's take a look at one of them, squeue.
Run the code block below to use squeue to see who is running jobs on the cluster right now:
squeue | head
# squeue: SLURMs queue information command
# | : The Unix pipe operator to pass output from one command as input to another command
# head: The Unix command to only display the first few lines of the input
## JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
## 48082789 bigmem submit_b lsepulve PD 0:00 8 (Resources)
## 48082115 bigmem metfYes_ flwang PD 0:00 1 (Priority)
## 48083143 bigmem interact wenjun R 13:11 1 holy7c26507
## 47936048 bigmem fasrc/sy thuang1 R 1-00:08:13 1 holy7c26403
## 47844330 bigmem fasrc/sy tharding R 1-15:59:56 1 holy7c26405
## 47844428 bigmem fasrc/sy tharding R 1-15:52:36 1 holy7c26502
## 47941194 bigmem fasrc/sy tharding R 23:52:34 1 holy7c26408
## 47941192 bigmem fasrc/sy tharding R 23:52:41 1 holy7c26406
## 47941191 bigmem fasrc/sy tharding R 23:52:53 1 holy7c26404
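Since squeue output is plain text, the Unix tools from earlier in the workshop apply directly. As a self-contained sketch, we can tally jobs by their state (the ST column); a here-document with a few lines of the output above stands in for a live squeue call so the example runs anywhere:

```shell
# Tally jobs per state (ST, column 5), skipping the header line.
# On the cluster you would pipe live output instead:
#   squeue | awk 'NR > 1 { n[$5]++ } END { for (s in n) print s, n[s] }'
awk 'NR > 1 { n[$5]++ } END { for (s in n) print s, n[s] }' <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
48082789 bigmem submit_b lsepulve PD 0:00 8 (Resources)
48082115 bigmem metfYes_ flwang PD 0:00 1 (Priority)
48083143 bigmem interact wenjun R 13:11 1 holy7c26507
EOF
```

For this sample input, the tally reports 2 pending (PD) jobs and 1 running (R) job.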
Here we can see several pieces of information about jobs currently being run on the cluster, including their JOBID, the PARTITION they're being run on, and the USER running them. Here are a few definitions of important terms when interacting with the SLURM software:
PARTITION is especially important because you will need to specify it when you submit a job, and different partitions have different resources. For instance, if you have a job (set of commands) that you know will use several hundred gigabytes of RAM, you will have to send your job to the bigmem partition. If you know your job needs GPU resources, then you will have to use the gpu partition.
The following link has a lot of useful information about the cluster, including the different partitions available:
https://docs.rc.fas.harvard.edu/kb/running-jobs/
The sinfo command can be used to see all partitions you have access to, as well as the nodes in each partition and their state:
sinfo
## PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
## holy-cow up infinite 1 mix holy7c0907
## holy-info up infinite 1 mix holy2c0529
## holy-smokes up infinite 1 mix holy7c0908
## holy-smokes up infinite 6 alloc holy7c[0909-0914]
## holy-smokes-priority up infinite 1 mix holy7c0908
## holy-smokes-priority up infinite 6 alloc holy7c[0909-0914]
## bigmem up 7-00:00:00 1 plnd holy7c26409
## bigmem up 7-00:00:00 15 mix holy7c[26403-26406,26408,26411-26412,26501-26502,26507-26508,26510-26511,26604-26605]
## bigmem up 7-00:00:00 14 alloc holy7c[26401-26402,26407,26410,26503-26506,26509,26512,26601-26603,26606]
## gpu up 7-00:00:00 1 drng holygpu7c26105
## gpu up 7-00:00:00 14 mix holygpu7c[26103-26104,26106,26201-26206,26301-26305]
## gpu_mig up 7-00:00:00 1 mix holygpu7c26306
## gpu_requeue up 7-00:00:00 1 inval holygpu8a29402
## gpu_requeue up 7-00:00:00 2 drng holygpu7c[1315,26105]
## gpu_requeue up 7-00:00:00 98 mix holy2c1030,holygpu2a605,holygpu2c[0901-0903,0913,0917,0921,0923,0931,1121,1125],holygpu7c[0915,0920,1305,1307,1311,1313,1317,1323,1701,1706,1711,1716,1721,1726,26103-26104,26106,26201-26206,26301-26306],holygpu8a[25104-25106,25204-25206,25305-25306,25404-25406,25606,27101-27103,27201-27203,27301-27303,27401-27403,27501,27601,27605,29104-29106,29201-29203,29304-29306,29401,29605,31104-31106,31201-31203,31304-31306,31401-31402,31406,31506,31606],holyolveczkygpu01,holysabetigpu01,meade[03-05]
## gpu_requeue up 7-00:00:00 16 alloc holygpu2a609,holygpu8a[25605,27503-27504,27606,29406,29504-29506,29604,29606,31504-31505,31604-31605],holyzicklergpu01
## gpu_requeue up 7-00:00:00 2 idle holy7b[0909-0910]
## gpu_test up 8:00:00 4 mix holygpu2c[0701-0702,0709-0710]
## gpu_test up 8:00:00 1 alloc holygpu2c0704
## gpu_test up 8:00:00 5 idle holygpu2c[0703,0705-0708]
## remoteviz up 7-00:00:00 1 idle holygpu2c0711
## serial_requeue* up 7-00:00:00 5 inval holy7c[16605,20411,21206,21208],holygpu8a29402
## serial_requeue* up 7-00:00:00 1 plnd holy7c26409
## serial_requeue* up 7-00:00:00 4 drain* holy7c[04302,04402,12403-12404]
## serial_requeue* up 7-00:00:00 1 down* holy2c24214
## serial_requeue* up 7-00:00:00 7 comp holy7c[06111,10404,18308,21201-21204]
## serial_requeue* up 7-00:00:00 19 drng holy2a[20301,20305],holy7c[04208,04301,04401,06508,08405,16606,21308],holy8a[25606,27509,31209,31305,31405-31406],holygpu7c[1315,26105],holyzhuang01,huce-r940
## serial_requeue* up 7-00:00:00 4 drain holy7c[15106,20412],holy8a[31211-31212]
## serial_requeue* up 7-00:00:00 596 mix bloxham-r940,holy2a[01301-01302,01304,01306,01310-01311,01313,01315-01316,02301,02303-02306,02309-02311,02316,05302-05308,05310-05313,05316,15301,15304-15307,15309,15313,20302-20303,20311,23310],holy2c[0529,1029-1030,1129,01208,01211,01213-01214,02101-02102,02108-02109,02111-02113,02115-02116,02211,12205,12208,12210-12215,12302-12304,12306-12309,12311,12313-12316,14201-14211,14401,14403,14405,14409,16107-16108,16115-16116,16206,18101,18104,18106-18116,18201-18211,18213-18215,18302,18305-18316,24102,24201-24211,24213,24215-24216,24301-24308,24310-24313,24315-24316,093401],holy7c[0907-0908,0919,02306,02403,02405,02411-02412,02507,02610-02612,04101,04106,04307,04311,04403,04511,06105-06106,06302-06303,06405,06411,06601,06606,08211,08303,08307,08411,08502,08602,10212,10304-10305,10510,10512,12610,15101-15105,15107-15116,15201-15203,15210-15216,15301-15305,15307-15308,15310-15315,16204,16301-16302,16304,16406,16408-16410,16412,16501,16503-16504,16506-16512,16611,18101-18102,18201,18203-18204,18405,18501-18504,18506,18603-18605,18607-18609,18611-18612,19101,19106,19111,19115,20104-20106,20112,20201-20207,20209-20212,20302-20312,20401,20403-20410,20501-20512,20602,20604-20609,20611-20612,21311,21314,22605,23101-23102,23105,23108-23109,23201-23205,23207,23213-23216,23301,23303-23305,23308,24105-24110,24205-24208,24210,24306,24308-24312,24405,24408,24410,24508,24510-24511,24608-24609,24611-24612,26403-26406,26408,26411-26412,26501-26502,26507-26508,26510-26511,26604-26605,092602],holy8a[25101-25106,25201-25206,25302-25306,25401-25406,25502-25506,25601-25602,25604,27107,27111-27112,27207-27208,27307,27407-27409,27510-27511,29101,29103-29104,29106,29207,29209-29211,29304,29503-29506,29603-29606,31103-31106,31207-31208,31210,31301-31304,31306,31407-31410,31501-31506,31603-31606],holydsouza[01-04],holygpu2a605,holygpu2c[0901-0903,0913,0917,0921,0923,1121,1125],holygpu7c[0915,0920,1305,1307,1311,1313,1317,1323,1701,1706,1711,1716,1721,1726
,26103-26104,26106,26201-26206,26301-26306],holygpu8a[25104-25106,25204-25206,25305-25306,25404-25406,25606,27101-27103,27201-27203,27301-27303,27401-27403,27501,27601,27605,29104-29106,29201-29203,29304-29306,29401,29605,31104-31106,31201-31203,31304-31306,31401-31402,31406,31506,31606],holyolveczkygpu01,holysabetigpu01,meade[03-05]
## serial_requeue* up 7-00:00:00 560 alloc holy2a[01308,02312-02314,05301,05314,16301-16316,20304,20306-20310,20312-20313,23301-23303,23305-23309],holy2c[01201-01205,01207,01209,02201-02210,02212-02215,12201-12204,12206,12216,12312,18216,18301,18304],holy7c[0909-0914,02103-02105,02107-02112,02201-02202,02204-02206,02209-02212,02301-02305,02307-02309,02409-02410,02505-02506,02508,02511-02512,02609,04102-04105,04107-04112,04201-04207,04209-04212,04303-04306,04308-04310,04312,04404-04412,04501-04510,04512,04603-04612,06101-06104,06107-06110,06112,06201-06212,06301,06304-06312,06401-06404,06406-06410,06412,06501-06507,06509-06512,06602-06605,06607-06612,08101-08112,08201-08210,08212,08301-08302,08304-08306,08308-08312,08401-08404,08406-08410,08412,08501,08503-08512,08601,08603-08612,10101-10112,10201-10211,10301-10303,10306-10312,10401-10403,10405-10412,10501-10509,10511,10601-10612,12401-12402,12405-12412,12501-12512,12601-12609,12611-12612,15204-15209,15306,15316,16101-16112,16201,16207,16303,16407,16411,16502,16505,16601-16604,16607-16610,18103-18112,18202,18205,18207-18209,18301-18307,18309-18311,18407-18411,18505,18610,20101-20103,20610,21301-21307,21309-21310,21312-21313,21315-21316,23206,23208-23209,23211,23302,23306-23307,24111-24112,24209,24211-24212,24305,24307,24406-24407,24409,24411-24412,24505-24507,24509,24512,24605-24607,24610,26401-26402,26407,26410,26503-26506,26509,26512,26601-26603,26606],holy8a[25501,29105,29208,29212,29301-29303,29305-29306,29405-29406,29501-29502,29601-29602,31101-31102,31601-31602],holygpu2a609,holygpu8a[25605,27503-27504,27606,29406,29504-29506,29604,29606,31504-31505,31604-31605],holyjacob[02-03,05-06],holyvulis01,holyzicklergpu01
## serial_requeue* up 7-00:00:00 195 idle holy2c[02103-02106,02110,02114,14212-14216,14402,14404,14406,14408,16201-16205,16208,24103-24106,092901-092902,093001-093002,093301-093302,093402],holy7b[0909-0910],holy7c[02106,02203,02207-02208,02310-02312,02401-02402,02404,02406-02408,02501-02504,02509-02510,02601-02608,04601-04602,16202-16203,16205-16206,16208-16212,16305-16312,16401-16405,16612,18206,18210-18212,18312,18401-18404,18406,18412,18507-18512,18601-18602,18606,19102-19105,19107-19109,19112-19114,19116,20107-20111,20208,20301,20402,20601,20603,21102-21104,21108-21116,22501-22512,22601-22604,22606-22612,23103,23310-23314,23316,092401-092402,092501-092502,092601,092701-092702],holy8a[25301,25603,25605,27108-27110,27209-27212,27308-27312,27410-27412,29102]
## shared up 7-00:00:00 1 inval holy7c16605
## shared up 7-00:00:00 4 drain* holy7c[04302,04402,12403-12404]
## shared up 7-00:00:00 3 comp holy7c[06111,10404,18308]
## shared up 7-00:00:00 6 drng holy7c[04208,04301,04401,06508,08405,16606]
## shared up 7-00:00:00 28 mix holy7c[02412,04101,04106,04307,04311,04403,04511,06105-06106,06302-06303,06405,06411,06601,06606,08211,08303,08307,08411,08502,08602,10212,10304-10305,10510,10512,12610,20104]
## shared up 7-00:00:00 320 alloc holy7c[02409-02410,02511-02512,04102-04105,04107-04112,04201-04207,04209-04212,04303-04306,04308-04310,04312,04404-04412,04501-04510,04512,06101-06104,06107-06110,06112,06201-06212,06301,06304-06312,06401-06404,06406-06410,06412,06501-06507,06509-06512,06602-06605,06607-06612,08101-08112,08201-08210,08212,08301-08302,08304-08306,08308-08312,08401-08404,08406-08410,08412,08501,08503-08512,08601,08603-08612,10101-10112,10201-10211,10301-10303,10306-10312,10401-10403,10405-10412,10501-10509,10511,10601-10612,12401-12402,12405-12412,12501-12512,12601-12609,12611-12612,16101-16110,16601-16604,16607-16610,18103-18112,18301-18307,18309-18311,20101-20103]
## test up 8:00:00 9 mix holy7c[24102-24104,24201,24504,24601-24604]
## test up 8:00:00 1 alloc holy7c24101
## test up 8:00:00 12 idle holy7c[24202-24204,24301-24304,24403-24404,24501-24503]
## ultramem up 7-00:00:00 1 drng holy8a27509
## ultramem up 7-00:00:00 2 mix holy8a[27510-27511]
## unrestricted up infinite 6 mix holy7c[18605,18607-18609,18611-18612]
## unrestricted up infinite 1 alloc holy7c18610
## unrestricted up infinite 1 idle holy7c18606
Alright, let’s say you’ve gotten a BAM file from your colleague and she asks you to summarize its coverage. Well, we know how to do that from Day 1 of the workshop. But the BAM file she gave you is several hundred gigabytes in size! There’s no way you can run this on your computer that has 8 gigabytes of RAM, and you know you shouldn’t process such a large file on the login node since that will slow things down for everyone.
We will have to create a job script and submit it to the cluster.
A job script is just like a bash script that we learned about yesterday, except it has some extra information in it for SLURM, and it is executed in a different way. We’ll go over all of this.
First, let’s decide on the commands we want to run. From Day 1, we know we can summarize coverage of a BAM file with `samtools coverage`. We also know that this command requires the input file to be sorted:
samtools sort /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/SAMEA3532870_final.bam | samtools coverage > coverage-results.txt
This is great, but would take forever if we just executed it in the Terminal. Let’s turn this into a script. Remember the only requirement for a bash script is a shebang line that tells the shell how to interpret the commands in the file:
#!/bin/bash
samtools sort /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/SAMEA3532870_final.bam | samtools coverage > coverage-results.txt
Ok, now if we had this in a file and executed it with `./script_name.sh`, it would still run on the login node. To make sure the script is run by SLURM, we instead execute it with the `sbatch` command. We’ll get to that in a second, but SLURM also requires a bit more information about the resources needed. These are provided in lines at the top of the script beginning with the string `#SBATCH`:
#!/bin/bash
#SBATCH --job-name=bam-coverage
#SBATCH --partition=test
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6
#SBATCH --mem=24g
#SBATCH --time=1:00:00
##SBATCH --mail-type=END,FAIL # uncomment to send an email when the job ends or fails
##SBATCH --mail-user=<email-address> # send the email to an alternate email address
# on Cannon today (CentOS 7), loads the Anaconda3 (conda) environment module
# on Cannon after the Rocky 8 upgrade, loads the Mambaforge environment module
module load python
# On Cannon (CentOS 7), `conda activate` and `mamba activate` do not work; the older form
# (`source activate`) is required
# On Cannon after the Rocky 8 upgrade, `conda activate` and `mamba activate` will work as well
source activate ./samtools-env
samtools sort --threads ${SLURM_CPUS_PER_TASK} -T $(mktemp -d) /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/SAMEA3532870_final.bam | samtools coverage > coverage-results.txt
# --threads ${SLURM_CPUS_PER_TASK} : tells samtools sort to use a number of threads as
# specified by the SLURM_CPUS_PER_TASK environment variable,
# which is set to the value from #SBATCH --cpus-per-task=...
# -T $(mktemp -d) : have samtools sort write intermediate temporary sorted BAM files to
# the directory created/printed by the `mktemp -d` command, which creates
# a new directory in /tmp (node-local temporary storage) instead of the
# current working directory
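If you ever dry-run a job script outside of SLURM (e.g., directly in a Terminal), note that `SLURM_CPUS_PER_TASK` is only set inside a job. A minimal sketch of a fallback pattern using bash parameter expansion (the `THREADS` variable name is our own illustration, not part of the script above):

```shell
#!/bin/bash
# Fall back to 1 thread when SLURM_CPUS_PER_TASK is unset
# (i.e., when the script is run outside of a SLURM job).
THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "Using ${THREADS} thread(s)"
```

Inside a job submitted with `#SBATCH --cpus-per-task=6`, `THREADS` would be 6; run directly in a Terminal, it defaults to 1.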
Now we’ve told Slurm what resources we want and where we want to run the job! Let’s break down these options, including their defaults (on Cannon) if they’re omitted:
`--job-name`: A string identifier for the current job. Default: the name of the batch job script file.
`--partition`: The partition on which you want to run the current job. In this case, we’re just doing a test, so we’ll run it on the test partition! Remember the list of partitions is available here. Default: `--partition=serial_requeue`
`--ntasks`: The number of tasks (processes), each of which will be allocated `--cpus-per-task` processor cores. Unless you’re running a distributed-memory parallel job that uses multiple nodes, this option can be omitted or explicitly set to 1. Default: `--ntasks=1`
`--cpus-per-task`: The number of CPUs each task requires. For commands that use multiple threads or processes, request multiple CPUs here AND remember to set those options in the command as well! Default: `--cpus-per-task=1`
`--mem`: The amount of memory the job requires. Default: `--mem=100m` (megabytes)
`--time`: The amount of time the job requires. If the job isn’t finished at the end of this amount of time, it will time out and be incomplete. Most jobs cannot be resumed, so make sure you give your job enough time!! Per the sbatch manual page:

Acceptable time formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”.

Default: `--time=10` (10 minutes)
`--mail-type`: (commented out in the above script) send an email if/when the specified event occurs (in the above example, the job ENDs or FAILs). A list of supported events is listed in the sbatch manual. By default, this email goes to the email address associated with your FAS RC account (unless `--mail-user` is set; see below).
`--mail-user`: (commented out in the above script) if `--mail-type` is set, send any email to the specified email address instead of the default email associated with your FAS RC account.
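Converting between the `--time` formats above is just arithmetic. As a quick, hypothetical sanity check (not part of the job script), the days-hours form maps to total hours like so:

```shell
# SLURM's "days-hours" format: 1-12 means 1 day + 12 hours.
DAYS=1
HOURS=12
TOTAL_HOURS=$(( DAYS * 24 + HOURS ))
echo "--time=${DAYS}-${HOURS} requests ${TOTAL_HOURS} hours"
# prints: --time=1-12 requests 36 hours
```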
Let’s convert this block into an actual script.

Exercise:

1. Using the file menu, go to New file –> Shell script and create a new file. Call it `coverage-submit.sh`. This is going to be our job script that we will modify as we go, and will appear in a new tab of our text editor.
2. Copy the text from the code block above into your new script and save the file.
3. Run the following command in the Terminal below to submit your script to the cluster as a job with `sbatch`:
sbatch coverage-submit.sh
4. Run the code block below to check on the status of your job with `squeue`:
squeue --me
# squeue: SLURM's job status command
# --me: This option tells squeue to only show jobs for the current user
You will see one job that is the current RStudio session, and hopefully another that is the bam-coverage job we just submitted. Periodically rerun the command to check until the job is done (i.e. it disappears from the list of jobs output by `squeue --me`).
The sacct command queries the Slurm accounting database for job information.
Invoked with no arguments, `sacct` displays information on the user’s jobs that ran in the last 24 hours:
sacct
The recently-completed bam-coverage job (truncated to `bam-cover+` in the JobName column) should be listed in a COMPLETED state. Copy this job ID for the next exercise.
Information on a specific job (including older jobs that haven’t been purged from the Slurm accounting database) can be obtained via `sacct -j <jobid>`.
Instead of submitting a batch job that runs asynchronously, one can submit an interactive job using the salloc command. The result is similar to SSH’ing into a compute node and using the allocated resources.
Exercise: From the Terminal, execute the following command:
salloc -p test --time=60 --mem=1g
Notice that the host name in your shell prompt changes, indicating that you are now on a different host (a compute node).
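Once the interactive shell starts, a couple of commands can confirm you are inside the allocation (the `SLURM_JOB_ID` environment variable is set by SLURM inside a job and is empty otherwise):

```shell
hostname              # prints the name of the host you are on (a compute node, inside the allocation)
echo "$SLURM_JOB_ID"  # the job ID of the interactive allocation (empty outside a job)
```

When you’re finished, type `exit` to end the interactive job and return to the login node.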
It can be difficult to determine the resources (primarily CPUs, memory, and time) to allocate to a job. Allocate too little, and your job will run slowly (if allocated too few CPUs) or be terminated (if allocated too little memory or wall time). Allocate too much, and your job may wait longer in the queue, prevent other jobs from running, and/or squander your lab group’s fairshare.
The resource utilization of previous similar jobs can help inform resource allocation for future jobs. The `sacct` command can be used to query the Slurm accounting database for historical job resource utilization.
Exercise: From the Terminal, execute the following command, substituting `<jobid>` with the job ID of your bam-coverage job. The left/right arrows can be used to scroll and view the contents. When done, press `q`.
sacct -lj <jobid> | less -S
The above output is a fairly comprehensive—but not user-friendly—summary of resources used by the job. The seff utility provides a more human-readable summary of relevant resources used by the job.
Exercise: From the Terminal, in an interactive job (see the `salloc` command above), execute the following command:
seff <jobid>
While we didn’t cover these concepts with hands-on examples today, a couple additional Slurm commands/concepts to be aware of:
We’ve all submitted a job, and then realized that we wanted to change something about it after the fact. To cancel a job that’s queued or currently running (instead of letting it run to completion), use the `scancel` command, supplying the Slurm job ID(s) as arguments; e.g.:
scancel 12345678
A job array is useful for launching many jobs that run the same job script on multiple different input files. The jobs that are launched using this mechanism can be monitored and managed almost as easily as a single job.
The Submitting Large Numbers of Jobs to the FASRC cluster guide provides an overview of job arrays.
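As a sketch of what a job array might look like for our coverage task, here is a hypothetical script. It assumes a file `bam-files.txt` listing one BAM path per line (that file is our illustration, not something provided by the workshop); each array task uses its `SLURM_ARRAY_TASK_ID` to pick one line from the list:

```shell
#!/bin/bash
#SBATCH --job-name=coverage-array
#SBATCH --partition=test
#SBATCH --cpus-per-task=6
#SBATCH --mem=24g
#SBATCH --time=1:00:00
#SBATCH --array=1-3    # launches 3 array tasks; SLURM_ARRAY_TASK_ID is 1, 2, or 3

# Pull the Nth line (one BAM path) from the hypothetical list file,
# where N is this task's array index.
BAM=$(sed -n "${SLURM_ARRAY_TASK_ID}p" bam-files.txt)

# Same sort-and-summarize pipeline as before, with a per-file output name.
samtools sort --threads ${SLURM_CPUS_PER_TASK} -T $(mktemp -d) "$BAM" \
    | samtools coverage > "$(basename "$BAM" .bam)-coverage.txt"
```

Submitting this single script with `sbatch` launches all three tasks, which SLURM schedules independently.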