Welcome to the fourth day of the FAS Informatics Bioinformatics Tips and Tricks Workshop!
If you’re viewing this file on the website, you are viewing the final, formatted version of the workshop. The workshop itself will take place in the RStudio program and you will edit and execute the code in this file. Please download the raw file here
Today you’ll learn how to put the concepts learned on day 3 into practice by installing the required software with conda and creating SLURM job scripts to run jobs on FAS RC cluster compute nodes.
Conda is an open-source, cross-platform (Linux, macOS, Windows) command-line software utility that provides:
“Package, dependency and environment management for any language–Python, R, Ruby, Lua, Scala, Java, JavaScript, C/ C++, FORTRAN, and more.” https://docs.conda.io/
Conda allows an unprivileged user on a High-Performance Computing (HPC) cluster (e.g., the FAS RC Cannon cluster) to install software in any file system directory they have write access to.
In this context, a package is a general term for a particular program that is maintained and hosted at a particular location. For instance, samtools, which we learned about on Day 1, is a package on bioconda. So are bedtools and bcftools. The term package is one we’ll use a lot today.
For Pythonistas: Conda resembles pip + venv, but (1) only installs binary packages (doesn’t compile from source code), and (2) can be used to install software written in programming languages other than Python.
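To make the analogy concrete, here is a rough side-by-side sketch (the environment names below are made up, and the conda/mamba commands are covered in detail in the sections that follow):

```shell
# pip + venv: Python-only isolation (runnable with a stock python3)
python3 -m venv --without-pip ./pyenv   # create an isolated Python environment
. ./pyenv/bin/activate                  # activate it
deactivate                              # leave it again

# conda/mamba equivalent: environments hold pre-built binaries in any language
#   mamba create -n myenv
#   mamba activate myenv
#   mamba install samtools   # a C program; pip alone could not install this
```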
Scientific Python distribution that bundles conda + over 250 software libraries/packages
Popular choice for the “desktop data scientist”
Anaconda Navigator - GUI for installing & launching conda packages
“Miniconda is a free minimal installer for conda. It is a small, bootstrap version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages, including pip, zlib and a few others.”
Drop-in replacement for conda for faster package installation
Replace the conda command with mamba; subcommands/options are about the same
Installed packages/environments are compatible with conda
Can be installed using conda (conda install mamba), or with a standalone installer (Mambaforge)
For technical expedience, for this workshop we’ll be using a minimal, single-executable (no Python distribution) version of mamba called micromamba. Micromamba only supports a subset of conda/mamba subcommands & options, but enough for this workshop. Moreover, conda environments (covered shortly) created by micromamba are compatible with/usable by conda and mamba.
To install micromamba for this session, and for typing convenience, we will define a mamba command (in this case, a shell function) that calls micromamba.
Execute the following code chunk (which is very specific to this environment—don’t worry too much about trying to understand exactly what it does):
if ! command -v micromamba
then
curl -qL https://micro.mamba.pm/api/micromamba/linux-64/1.4.1 | tar -C /usr -xj bin/micromamba
mv /etc/profile /etc/profile.orig
cp /etc/profile.orig /etc/profile
mkdir -p /etc/conda
echo "repodata_use_zst: true" > /etc/conda/.condarc
touch ~/.condarc
echo 'eval "$(micromamba shell hook --shell=bash)"; mamba() { micromamba "$@"; }; export MAMBA_ROOT_PREFIX=/tmp/mamba' >> /etc/profile
fi
## /usr/bin/micromamba
Subsequent commands illustrated in this tutorial are largely compatible with conda as well: just replace the mamba command with conda.
Invoke mamba with the -h or --help option to display a list of subcommands, as well as global options that apply to all subcommands.
Run the code block below to see the help menu for mamba:
mamba --help
# mamba: A variant on the conda package manager
# --help: This option tells mamba to display a help menu
## Version: 1.4.1
##
## Usage: /usr/bin/micromamba [OPTIONS] [SUBCOMMAND]
##
## Options:
## -h,--help Print this help message and exit
## --version
##
##
## Configuration options:
## --rc-file TEXT ... Paths to the configuration files to use
## --no-rc Disable the use of configuration files
## --no-env Disable the use of environment variables
##
##
## Global options:
## -v,--verbose Set verbosity (higher verbosity with multiple -v, e.g. -vvv)
## --log-level ENUM:value in {critical->5,debug->1,error->4,info->2,off->6,trace->0,warning->3} OR {5,1,4,2,6,0,3}
## Set the log level
## -q,--quiet Set quiet mode (print less output)
## -y,--yes Automatically answer yes on prompted questions
## --json Report all output as json
## --offline Force use cached repodata
## --dry-run Only display what would have been done
## --download-only Only download and extract packages, do not link them into environment.
## --experimental Enable experimental features
##
##
## Prefix options:
## -r,--root-prefix TEXT Path to the root prefix
## -p,--prefix TEXT Path to the target prefix
## --relocate-prefix TEXT Path to the relocation prefix
## -n,--name TEXT Name of the target prefix
##
## Subcommands:
## shell Generate shell init scripts
## create Create new environment
## install Install packages in active environment
## update Update packages in active environment
## self-update Update micromamba
## repoquery Find and analyze packages in active environment or channels
## remove Remove packages from active environment
## list List packages in active environment
## package Extract a package or bundle files into an archive
## clean Clean package cache
## config Configuration of micromamba
## info Information about micromamba
## constructor Commands to support using micromamba in constructor
## env List environments
## activate Activate an environment
## run Run an executable in an environment
## ps Show, inspect or kill running processes
## auth Login or logout of a given host
## search Find packages in active environment or channels
To display the usage for an individual subcommand, add the -h or --help option.
Run the code block below to display the help menu for the mamba env command:
mamba env -h
# mamba: A variant on the conda package manager
# env: The sub-command of mamba we want to run
# -h: This option tells mamba env to display a help menu
## List environments
## Usage: /usr/bin/micromamba env [OPTIONS] [SUBCOMMAND]
##
## Options:
## -h,--help Print this help message and exit
##
##
## Configuration options:
## --rc-file TEXT ... Paths to the configuration files to use
## --no-rc Disable the use of configuration files
## --no-env Disable the use of environment variables
##
##
## Global options:
## -v,--verbose Set verbosity (higher verbosity with multiple -v, e.g. -vvv)
## --log-level ENUM:value in {critical->5,debug->1,error->4,info->2,off->6,trace->0,warning->3} OR {5,1,4,2,6,0,3}
## Set the log level
## -q,--quiet Set quiet mode (print less output)
## -y,--yes Automatically answer yes on prompted questions
## --json Report all output as json
## --offline Force use cached repodata
## --dry-run Only display what would have been done
## --download-only Only download and extract packages, do not link them into environment.
## --experimental Enable experimental features
##
##
## Prefix options:
## -r,--root-prefix TEXT Path to the root prefix
## -p,--prefix TEXT Path to the target prefix
## --relocate-prefix TEXT Path to the relocation prefix
## -n,--name TEXT Name of the target prefix
##
## Subcommands:
## list List known environments
## create Create new environment (pre-commit.com compatibility alias for 'micromamba create')
## export Export environment
## remove Remove an environment
A conda channel is the URL of a directory that contains a set of conda packages. A few popular channels we’ll use today include:
defaults - “meta-channel” maintained by Anaconda, Inc. (the company behind the Anaconda Distribution)
conda-forge - community-curated set of high-quality conda packages
bioconda - channel focused on bioinformatics; thousands of packages available
The software tools we will install and use in subsequent sections are available in the bioconda channel. bioconda packages may have dependencies on packages in the conda-forge and defaults channels. We can specify channels to use as command-line arguments to conda operations (where applicable).
Run the code block below to search a couple channels for a package called bedtools:
mamba search -c conda-forge -c bioconda bedtools
# mamba: A variant on the conda package manager
# search: The sub-command of mamba we want to run
# -c: This option tells mamba search to search this channel for the provided package name; multiple -c options can be provided
## Getting repodata from channels...
##
##
##
## Name Version Build Channel
## ────────────────────────────────────────────────
## bedtools 2.30.0 hc088bd4_0 bioconda/linux-64
## bedtools 2.30.0 h7d7f7ad_2 bioconda/linux-64
## bedtools 2.30.0 h7d7f7ad_1 bioconda/linux-64
## bedtools 2.30.0 h468198e_3 bioconda/linux-64
## bedtools 2.29.2 hc088bd4_0 bioconda/linux-64
## bedtools 2.29.1 hc088bd4_1 bioconda/linux-64
## bedtools 2.29.1 hc088bd4_0 bioconda/linux-64
## bedtools 2.29.0 hc088bd4_3 bioconda/linux-64
## bedtools 2.29.0 hc088bd4_2 bioconda/linux-64
## bedtools 2.29.0 h6ed99ea_1 bioconda/linux-64
## bedtools 2.29.0 h0da2602_0 bioconda/linux-64
## bedtools 2.28.0 hdf88d34_0 bioconda/linux-64
## bedtools 2.27.1 he941832_2 bioconda/linux-64
## bedtools 2.27.1 he860b03_3 bioconda/linux-64
## bedtools 2.27.1 he513fc3_4 bioconda/linux-64
## bedtools 2.27.1 hd03093a_6 bioconda/linux-64
## bedtools 2.27.1 h9a82719_5 bioconda/linux-64
## bedtools 2.27.1 1 bioconda/linux-64
## bedtools 2.27.1 0 bioconda/linux-64
## bedtools 2.27.0 he941832_2 bioconda/linux-64
## bedtools 2.27.0 he860b03_3 bioconda/linux-64
## bedtools 2.27.0 he513fc3_4 bioconda/linux-64
## bedtools 2.27.0 1 bioconda/linux-64
## bedtools 2.27.0 0 bioconda/linux-64
## bedtools 2.26.0 0 bioconda/linux-64
## bedtools 2.26.0gx 0 bioconda/linux-64
## bedtools 2.26.0gx 1 bioconda/linux-64
## bedtools 2.26.0gx he513fc3_4 bioconda/linux-64
## bedtools 2.26.0gx he860b03_3 bioconda/linux-64
## bedtools 2.26.0gx he941832_2 bioconda/linux-64
## bedtools 2.25.0 3 bioconda/linux-64
## bedtools 2.25.0 he860b03_5 bioconda/linux-64
## bedtools 2.25.0 he941832_4 bioconda/linux-64
## bedtools 2.25.0 1 bioconda/linux-64
## bedtools 2.25.0 0 bioconda/linux-64
## bedtools 2.25.0 2 bioconda/linux-64
## bedtools 2.24.0 0 bioconda/linux-64
## bedtools 2.23.0 h5b5514e_6 bioconda/linux-64
## bedtools 2.23.0 0 bioconda/linux-64
## bedtools 2.23.0 h2e03b76_5 bioconda/linux-64
## bedtools 2.23.0 h8b12597_4 bioconda/linux-64
## bedtools 2.23.0 he860b03_2 bioconda/linux-64
## bedtools 2.23.0 he941832_1 bioconda/linux-64
## bedtools 2.23.0 hdbcaa40_3 bioconda/linux-64
## bedtools 2.22 h2e03b76_5 bioconda/linux-64
## bedtools 2.22 0 bioconda/linux-64
## bedtools 2.22 h5b5514e_6 bioconda/linux-64
## bedtools 2.22 h8b12597_4 bioconda/linux-64
## bedtools 2.22 hdbcaa40_3 bioconda/linux-64
## bedtools 2.22 he860b03_2 bioconda/linux-64
## bedtools 2.22 he941832_1 bioconda/linux-64
## bedtools 2.20.1 he941832_1 bioconda/linux-64
## bedtools 2.20.1 he860b03_2 bioconda/linux-64
## bedtools 2.20.1 0 bioconda/linux-64
## bedtools 2.19.1 he941832_1 bioconda/linux-64
## bedtools 2.19.1 he860b03_2 bioconda/linux-64
## bedtools 2.19.1 0 bioconda/linux-64
## bedtools 2.17.0 0 bioconda/linux-64
## bedtools 2.16.2 0 bioconda/linux-64
It can be convenient to configure a default list of channels so the channel list doesn’t need to be explicitly specified for mamba commands.
The following channel setup (adapted from the bioconda documentation) updates the user conda/mamba configuration file (~/.condarc).
Run the code block below to add the specified channels to your ~/.condarc file:
mamba config remove-key channels # reset channels in ~/.condarc (if set)
mamba config append channels conda-forge
mamba config append channels bioconda
mamba config append channels defaults
mamba config set channel_priority strict
# mamba: A variant on the conda package manager
# config: The sub-command of mamba we want to run
These commands write (or update) our ~/.condarc file.
Run the code block below to view the contents of your ~/.condarc file:
cat ~/.condarc
# cat: A Unix command to display the contents of a file to the screen
## channel_priority: strict
## channels:
## - conda-forge
## - bioconda
## - defaults
Some conda packages exist in all three channels. Strict channel priority (channel_priority: strict) tells mamba (or conda) to search for specified packages in higher-priority channels first and, if found, ignore packages with the same name that exist in lower-priority channels. In this example, we specify that packages should first be searched for in conda-forge, then (if not found) in bioconda, and finally in the defaults channel.
We can verify the list of channels the mamba command uses.
Run the code block below to display some information about mamba:
mamba info
# mamba: A variant on the conda package manager
# info: The sub-command of mamba we want to run
##
## __
## __ ______ ___ ____ _____ ___ / /_ ____ _
## / / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/
## / /_/ / / / / / / /_/ / / / / / / /_/ / /_/ /
## / .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/
## /_/
##
##
## environment : None (not found)
## env location : -
## user config files : /n/home/user/.mambarc
## populated config files : /n/home/user/.condarc
## /etc/conda/.condarc
## libmamba version : 1.4.1
## micromamba version : 1.4.1
## curl version : libcurl/7.88.1 OpenSSL/3.1.0 zlib/1.2.13 zstd/1.5.2 libssh2/1.10.0 nghttp2/1.52.0
## libarchive version : libarchive 3.6.2 zlib/1.2.13 bz2lib/1.0.8 libzstd/1.5.2
## virtual packages : __unix=0=0
## __linux=3.10.0=0
## __glibc=2.35=0
## __archspec=1=x86_64
## channels : https://conda.anaconda.org/conda-forge/linux-64
## https://conda.anaconda.org/conda-forge/noarch
## https://conda.anaconda.org/bioconda/linux-64
## https://conda.anaconda.org/bioconda/noarch
## https://repo.anaconda.com/pkgs/main/linux-64
## https://repo.anaconda.com/pkgs/main/noarch
## https://repo.anaconda.com/pkgs/r/linux-64
## https://repo.anaconda.com/pkgs/r/noarch
## base environment : /tmp/mamba
## platform : linux-64
Now that we have our channels set up, we can begin installing packages (aka software). First, though, we need to find the packages we want, and we can do this in a few ways, one of which is directly from the command line with mamba search.
Run the code block below to use mamba search to search for packages by name, in this case bcftools (exact match):
mamba search bcftools
# mamba: A variant on the conda package manager
# search: The sub-command of mamba we want to run
## Getting repodata from channels...
##
## conda-forge/linux-64 Using cache
## conda-forge/noarch Using cache
## bioconda/linux-64 Using cache
## bioconda/noarch Using cache
## pkgs/main/linux-64 Using cache
## pkgs/main/noarch Using cache
## pkgs/r/linux-64 Using cache
## pkgs/r/noarch Using cache
##
##
## Name Version Build Channel
## ───────────────────────────────────────────────
## bcftools 1.16 hfe4b78e_1 bioconda/linux-64
## bcftools 1.16 hfe4b78e_0 bioconda/linux-64
## bcftools 1.16 haef29d1_2 bioconda/linux-64
## bcftools 1.15.1 h0ea216a_0 bioconda/linux-64
## bcftools 1.15.1 hfe4b78e_1 bioconda/linux-64
## bcftools 1.15 h0ea216a_1 bioconda/linux-64
## bcftools 1.15 h0ea216a_2 bioconda/linux-64
## bcftools 1.15 haf5b3da_0 bioconda/linux-64
## bcftools 1.14 h88f3f91_0 bioconda/linux-64
## bcftools 1.14 hde04aa1_1 bioconda/linux-64
## bcftools 1.13 h3a49de5_0 bioconda/linux-64
## bcftools 1.12 h3f113a9_0 bioconda/linux-64
## bcftools 1.12 h45bccc9_1 bioconda/linux-64
## bcftools 1.11 h7c999a4_0 bioconda/linux-64
## bcftools 1.10.2 h4f4756c_2 bioconda/linux-64
## bcftools 1.10.2 h4f4756c_3 bioconda/linux-64
## bcftools 1.10.2 hd2cd319_0 bioconda/linux-64
## bcftools 1.10.2 h4f4756c_1 bioconda/linux-64
## bcftools 1.10.1 hd2cd319_0 bioconda/linux-64
## bcftools 1.10 h5d15f04_0 bioconda/linux-64
## bcftools 1.9 ha228f0b_4 bioconda/linux-64
## bcftools 1.9 ha228f0b_3 bioconda/linux-64
## bcftools 1.9 h68d8f2e_9 bioconda/linux-64
## bcftools 1.9 h68d8f2e_8 bioconda/linux-64
## bcftools 1.9 h68d8f2e_7 bioconda/linux-64
## bcftools 1.9 h5c2b69b_6 bioconda/linux-64
## bcftools 1.9 h5c2b69b_5 bioconda/linux-64
## bcftools 1.9 h47928c2_2 bioconda/linux-64
## bcftools 1.9 h47928c2_1 bioconda/linux-64
## bcftools 1.8 2 bioconda/linux-64
## bcftools 1.8 h4da6232_3 bioconda/linux-64
## bcftools 1.8 1 bioconda/linux-64
## bcftools 1.8 0 bioconda/linux-64
## bcftools 1.7 0 bioconda/linux-64
## bcftools 1.6 1 bioconda/linux-64
## bcftools 1.6 0 bioconda/linux-64
## bcftools 1.5 h1ff2904_4 bioconda/linux-64
## bcftools 1.5 3 bioconda/linux-64
## bcftools 1.5 2 bioconda/linux-64
## bcftools 1.5 1 bioconda/linux-64
## bcftools 1.5 0 bioconda/linux-64
## bcftools 1.4.1 0 bioconda/linux-64
## bcftools 1.4 0 bioconda/linux-64
## bcftools 1.3.1 hed695b0_6 bioconda/linux-64
## bcftools 1.3.1 ha92aebf_3 bioconda/linux-64
## bcftools 1.3.1 h84994c4_5 bioconda/linux-64
## bcftools 1.3.1 h84994c4_4 bioconda/linux-64
## bcftools 1.3.1 h5bf99c6_7 bioconda/linux-64
## bcftools 1.3.1 2 bioconda/linux-64
## bcftools 1.3.1 1 bioconda/linux-64
## bcftools 1.3.1 0 bioconda/linux-64
## bcftools 1.3 h5bf99c6_6 bioconda/linux-64
## bcftools 1.3 ha92aebf_2 bioconda/linux-64
## bcftools 1.3 1 bioconda/linux-64
## bcftools 1.3 0 bioconda/linux-64
## bcftools 1.3 h7132678_7 bioconda/linux-64
## bcftools 1.3 hed695b0_5 bioconda/linux-64
## bcftools 1.3 h84994c4_3 bioconda/linux-64
## bcftools 1.2 h4da6232_3 bioconda/linux-64
## bcftools 1.2 h02bfda8_4 bioconda/linux-64
## bcftools 1.2 2 bioconda/linux-64
## bcftools 1.2 1 bioconda/linux-64
## bcftools 1.2 0 bioconda/linux-64
In this case, we want to see if there is any package called bcftools. mamba will look at all of the URLs of the channels in our config file for the specified package and return anything that matches what we searched for. Here we see that it found many matches, all exactly for the string “bcftools” and all on the bioconda channel. The difference between them is their versions, so depending on whether you want to perform an analysis with the latest version of the software, or replicate an analysis from a paper that used a specific version, you should be able to find what you need (at least for a well-maintained package).
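For example, conda match specs let you request one of those versions directly; a sketch (the version below is just one picked from the list above):

```bash
# Pin a version with a match spec (the version shown is illustrative):
mamba search 'bcftools=1.15.1'              # list only builds of that version
mamba install -y -n day4 'bcftools=1.15.1'  # install that exact version
```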
Well, maybe we have an idea of what the name of the package is, but don’t remember it exactly. The mamba search command allows wildcards for inexact matches. For instance, the * character can be used as a wildcard.
Run the code block below to search for all packages beginning with the string “bcf”:
mamba search 'bcf*'
# mamba: A variant on the conda package manager
# search: The sub-command of mamba we want to run
## Getting repodata from channels...
##
## conda-forge/linux-64 Using cache
## conda-forge/noarch Using cache
## bioconda/linux-64 Using cache
## bioconda/noarch Using cache
## pkgs/main/linux-64 Using cache
## pkgs/main/noarch Using cache
## pkgs/r/linux-64 Using cache
## pkgs/r/noarch Using cache
##
##
## Name Version Build Channel
## ──────────────────────────────────────────────────────────────
## bcftools 1.16 hfe4b78e_1 bioconda/linux-64
## bcftools 1.16 hfe4b78e_0 bioconda/linux-64
## bcftools 1.16 haef29d1_2 bioconda/linux-64
## bcftools 1.15.1 h0ea216a_0 bioconda/linux-64
## bcftools 1.15.1 hfe4b78e_1 bioconda/linux-64
## bcftools 1.15 h0ea216a_1 bioconda/linux-64
## bcftools 1.15 h0ea216a_2 bioconda/linux-64
## bcftools 1.15 haf5b3da_0 bioconda/linux-64
## bcftools 1.14 h88f3f91_0 bioconda/linux-64
## bcftools 1.14 hde04aa1_1 bioconda/linux-64
## bcftools 1.13 h3a49de5_0 bioconda/linux-64
## bcftools 1.12 h3f113a9_0 bioconda/linux-64
## bcftools 1.12 h45bccc9_1 bioconda/linux-64
## bcftools 1.11 h7c999a4_0 bioconda/linux-64
## bcftools 1.10.2 h4f4756c_3 bioconda/linux-64
## bcftools 1.10.2 hd2cd319_0 bioconda/linux-64
## bcftools 1.10.2 h4f4756c_2 bioconda/linux-64
## bcftools 1.10.2 h4f4756c_1 bioconda/linux-64
## bcftools 1.10.1 hd2cd319_0 bioconda/linux-64
## bcftools 1.10 h5d15f04_0 bioconda/linux-64
## bcftools 1.9 h47928c2_1 bioconda/linux-64
## bcftools 1.9 h47928c2_2 bioconda/linux-64
## bcftools 1.9 h5c2b69b_5 bioconda/linux-64
## bcftools 1.9 h5c2b69b_6 bioconda/linux-64
## bcftools 1.9 h68d8f2e_7 bioconda/linux-64
## bcftools 1.9 h68d8f2e_8 bioconda/linux-64
## bcftools 1.9 h68d8f2e_9 bioconda/linux-64
## bcftools 1.9 ha228f0b_3 bioconda/linux-64
## bcftools 1.9 ha228f0b_4 bioconda/linux-64
## bcftools 1.8 h4da6232_3 bioconda/linux-64
## bcftools 1.8 2 bioconda/linux-64
## bcftools 1.8 1 bioconda/linux-64
## bcftools 1.8 0 bioconda/linux-64
## bcftools 1.7 0 bioconda/linux-64
## bcftools 1.6 1 bioconda/linux-64
## bcftools 1.6 0 bioconda/linux-64
## bcftools 1.5 h1ff2904_4 bioconda/linux-64
## bcftools 1.5 3 bioconda/linux-64
## bcftools 1.5 2 bioconda/linux-64
## bcftools 1.5 1 bioconda/linux-64
## bcftools 1.5 0 bioconda/linux-64
## bcftools 1.4.1 0 bioconda/linux-64
## bcftools 1.4 0 bioconda/linux-64
## bcftools 1.3.1 2 bioconda/linux-64
## bcftools 1.3.1 0 bioconda/linux-64
## bcftools 1.3.1 1 bioconda/linux-64
## bcftools 1.3.1 hed695b0_6 bioconda/linux-64
## bcftools 1.3.1 h5bf99c6_7 bioconda/linux-64
## bcftools 1.3.1 h84994c4_4 bioconda/linux-64
## bcftools 1.3.1 h84994c4_5 bioconda/linux-64
## bcftools 1.3.1 ha92aebf_3 bioconda/linux-64
## bcftools 1.3 hed695b0_5 bioconda/linux-64
## bcftools 1.3 0 bioconda/linux-64
## bcftools 1.3 1 bioconda/linux-64
## bcftools 1.3 h5bf99c6_6 bioconda/linux-64
## bcftools 1.3 h7132678_7 bioconda/linux-64
## bcftools 1.3 h84994c4_3 bioconda/linux-64
## bcftools 1.3 ha92aebf_2 bioconda/linux-64
## bcftools 1.2 h4da6232_3 bioconda/linux-64
## bcftools 1.2 h02bfda8_4 bioconda/linux-64
## bcftools 1.2 2 bioconda/linux-64
## bcftools 1.2 1 bioconda/linux-64
## bcftools 1.2 0 bioconda/linux-64
## bcftools-gtc2vcf-plugin 1.16 h0fdf51a_0 bioconda/linux-64
## bcftools-gtc2vcf-plugin 1.9 hedc5323_0 bioconda/linux-64
## bcftools-snvphyl-plugin 1.9 he673b24_1 bioconda/linux-64
## bcftools-snvphyl-plugin 1.9 h2559242_7 bioconda/linux-64
## bcftools-snvphyl-plugin 1.9 h34584cc_4 bioconda/linux-64
## bcftools-snvphyl-plugin 1.9 h4da6232_0 bioconda/linux-64
## bcftools-snvphyl-plugin 1.9 h80657d4_3 bioconda/linux-64
## bcftools-snvphyl-plugin 1.9 ha13ca6a_2 bioconda/linux-64
## bcftools-snvphyl-plugin 1.9 hc0af00e_5 bioconda/linux-64
## bcftools-snvphyl-plugin 1.9 hdd6bb30_6 bioconda/linux-64
## bcftools-snvphyl-plugin 1.8 0 bioconda/linux-64
## bcftools-snvphyl-plugin 1.8 h4da6232_2 bioconda/linux-64
## bcftools-snvphyl-plugin 1.6 0 bioconda/linux-64
## bcftools-snvphyl-plugin 1.6 1 bioconda/linux-64
## bcftools-snvphyl-plugin 1.5 0 bioconda/linux-64
Exercise: Search for an open-source software package that you currently use or would like to use in your workflows. Try using wildcards if an exact match isn’t found. Note that conda packages don’t exist for much bioinformatics software (especially lesser-used tools).
## Write a command to search for a conda-package version of your chosen software.
It can sometimes be more expedient (and, for bioconda packages, informative) to search for packages using a web browser. This also may be more intuitive for many people, and search results on the web often display the exact command needed to install a given package.
The complete list of bioconda packages is available at:
https://bioconda.github.io/conda-package_index.html
A more comprehensive package search interface, containing packages from many channels (but with less information about bioconda packages than the above bioconda package list), is available at https://anaconda.org.
Conda packages are installed into an environment, which is a directory structure containing a set of conda packages. Environments are managed separately and are isolated from each other: changes made to one environment will not impact the software in another environment. The main benefit of this is that the user has complete write access to everything inside an environment’s directory tree. In essence, each environment is a filesystem within the main filesystem. This simplifies many aspects of installing software and its dependencies.
For instance, when compiling a package from source (that is, downloading the code files and building them into an executable), the code may rely on functions that exist in external libraries of code. By default, during compilation, the program looks in certain locations for these dependencies, and it can be complicated to change where it looks. If these dependencies need to be installed where the user doesn’t have access, this can be an almost impossible bottleneck for most users to get past.
However, when working in an environment, the user has complete access to all paths within the environment file system. This means that dependencies can easily be installed within the environment. Packages on conda are also pre-compiled, meaning that one doesn’t have to make them executable from the raw code: the code and its dependencies simply need to be placed in certain locations in the environment to work, and conda/mamba keeps track of all of this in almost all cases.
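As a contrast, a typical from-source build versus the conda route might look like this (sometool and the paths are hypothetical, for illustration only):

```bash
# From source (hypothetical package): compile it yourself and manage
# dependencies and install locations by hand
tar -xzf sometool-1.0.tar.gz && cd sometool-1.0
./configure --prefix=$HOME/apps   # must point somewhere you can write...
make && make install              # ...and fails if dependencies are missing

# With conda/mamba: pre-compiled binaries and their dependencies are simply
# placed inside the environment's directory tree
mamba install -n day4 sometool
```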
mamba env create
The basic syntax for creating a conda environment is as follows:
mamba env create -n/--name ENVIRONMENT
This creates a “named” conda environment in your envs directory (by default ${HOME}/.conda/envs).
Note: mamba create is a synonym for mamba env create
Run the following code block to create a named environment called day4:
mamba env create -y -n day4
# mamba: A variant on the conda package manager
# create: The sub-command of mamba env we want to run
# -y: don't prompt y/n to create the environment; assume "y"
# -n: This option tells mamba env create to call the environment the provided string (e.g. day4, in this case)
##
## __
## __ ______ ___ ____ _____ ___ / /_ ____ _
## / / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/
## / /_/ / / / / / / /_/ / / / / / / /_/ / /_/ /
## / .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/
## /_/
##
## Empty environment created at prefix: /tmp/mamba/envs/day4
mamba env list
In the course of your work, you may end up creating a lot of environments. Some may be for specific packages or projects. In general, environments are pretty robust, but the more packages you install in one, the higher the chance you run into an incompatibility between them that has unexpected consequences (e.g. downgrading one package because another depends on a specific version, or even breaking the environment, which may just start acting weird, for lack of a better description). In these cases it is generally OK to just create a new environment, though this can be time consuming.
One useful thing you may want to do is look at the names and locations of all the environments you have created.
Run the code block below to use the mamba env list command to list all environments you have created:
mamba env list
# mamba: A variant on the conda package manager
# env: The sub-command of mamba we want to run
# list: The sub-command of mamba env we want to run
##
## __
## __ ______ ___ ____ _____ ___ / /_ ____ _
## / / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/
## / /_/ / / / / / / /_/ / / / / / / /_/ / /_/ /
## / .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/
## /_/
##
## Name Active Path
## ──────────────────────────────────────
## base /tmp/mamba
## day4 /tmp/mamba/envs/day4
You should at least see a base and day4 environment.
A few comments regarding the base environment:
The base environment is the default environment, in which Python and conda/mamba itself are installed
On the FAS RC cluster (Cannon), the base environment is global (shared by all users) and read-only
On a local (user-installed) conda installation, the base environment may be writable. However, never install packages into the base environment, to avoid breaking your conda installation.
mamba activate
Creating an environment does not mean you can begin using it. You must first activate it.
Activating a conda environment sets environment variables (such as $PATH, which is a colon-separated list of directories the shell searches for commands) to allow software in the environment to be used, and makes the environment the default target for relevant mamba commands that operate on environments (such as mamba install).
Run the code block below to see how a system variable, $PATH, changes when you activate an environment:
echo "PATH before mamba activate: ${PATH}"
echo
mamba activate day4
# mamba: A variant on the conda package manager
# activate: The sub-command of mamba we want to run
echo "PATH after mamba activate: ${PATH}"
## PATH before mamba activate: /n/home/user/bin:/condabin:/usr/bin:/n/home/user/R/ifxrstudio/RELEASE_3_16/python-user-base/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/bin:/usr/local/texlive/bin/x86_64-linux:/usr/lib/rstudio-server/bin/quarto/bin:/usr/lib/rstudio-server/bin/postback/postback
##
## PATH after mamba activate: /tmp/mamba/envs/day4/bin:/n/home/user/bin:/condabin:/usr/bin:/n/home/user/R/ifxrstudio/RELEASE_3_16/python-user-base/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/texlive/bin/x86_64-linux:/usr/lib/rstudio-server/bin/quarto/bin:/usr/lib/rstudio-server/bin/postback/postback
In this example, an envs/day4/bin directory has been prepended to the $PATH variable.
In an interactive shell environment (i.e., if using the Terminal), the shell prompt is prefixed with the environment name in parentheses: (day4).
Note that every time you log on and want to use an environment you will have to activate it. Also, if you are in one environment and want to use another one, you must activate the new one!
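For example, a typical session after logging back in might look like this (other-project is a hypothetical second environment):

```bash
mamba activate day4            # re-activate before using its software
samtools --version             # runs the day4 installation

mamba activate other-project   # switching projects? activate the other environment
```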
The main reason to create and activate an environment is to install and use packages (i.e. programs) within it, so one of the most basic things we want to know is what packages are currently installed in our environment. We can see this using the mamba list command.
Run the code block below to list the packages currently installed in our current environment (day4):
mamba activate day4
# mamba: A variant on the conda package manager
# activate: The sub-command of mamba we want to run
mamba list
# mamba: A variant on the conda package manager
# list: The sub-command of mamba we want to run
## List of packages in environment: "/tmp/mamba/envs/day4"
NOTE: We use mamba activate again in this code chunk because each R Markdown code chunk is a separate shell environment. In an interactive shell (Terminal) or shell script, the environment changes from mamba activate will persist in that shell/script until explicitly deactivated with mamba deactivate (described below).
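In a standalone shell script, for instance, one activation at the top is enough; a sketch, assuming the micromamba setup from earlier in this session:

```bash
#!/bin/bash
# Enable 'activate' in this non-interactive shell, then activate once;
# everything below runs inside the day4 environment.
eval "$(micromamba shell hook --shell=bash)"
micromamba activate day4
samtools --version
bedtools --version
micromamba deactivate   # restore the previous environment
```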
Ok, nothing in our environment so far. That makes sense since we just created it. Let’s install some packages in it now.
mamba install
The mamba install command installs the listed package(s), including dependencies, in the specified environment, or the current (activated) environment if an environment isn’t specified.
Run the code block below to install the bedtools, samtools, and grampa packages in our day4 environment. This may take some time:
mamba install -y -n day4 bedtools samtools grampa
# mamba: A variant on the conda package manager
# install: The sub-command of mamba we want to run
# -y : don't prompt y/n to install packages; assume "y"
# -n: The name of the environment in which we want to install the specified packages
##
## __
## __ ______ ___ ____ _____ ___ / /_ ____ _
## / / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/
## / /_/ / / / / / / /_/ / / / / / / /_/ / /_/ /
## / .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/
## /_/
##
## conda-forge/linux-64 Using cache
## conda-forge/noarch Using cache
## bioconda/linux-64 Using cache
## bioconda/noarch Using cache
## pkgs/main/linux-64 Using cache
## pkgs/main/noarch Using cache
## pkgs/r/linux-64 Using cache
## pkgs/r/noarch Using cache
##
## Transaction
##
## Prefix: /tmp/mamba/envs/day4
##
## Updating specs:
##
## - bedtools
## - samtools
## - grampa
##
##
## Package Version Build Channel Size
## ────────────────────────────────────────────────────────────────────────────────────────
## Install:
## ────────────────────────────────────────────────────────────────────────────────────────
##
## + _libgcc_mutex 0.1 conda_forge conda-forge/linux-64 3kB
## + _openmp_mutex 4.5 2_gnu conda-forge/linux-64 24kB
## + bedtools 2.30.0 h468198e_3 bioconda/linux-64 16MB
## + bzip2 1.0.8 h7f98852_4 conda-forge/linux-64 496kB
## + c-ares 1.18.1 h7f98852_0 conda-forge/linux-64 115kB
## + ca-certificates 2022.12.7 ha878542_0 conda-forge/linux-64 146kB
## + grampa 1.4.0 pyhdfd78af_0 bioconda/noarch 46kB
## + htslib 1.17 h6bc39ce_0 bioconda/linux-64 2MB
## + keyutils 1.6.1 h166bdaf_0 conda-forge/linux-64 118kB
## + krb5 1.20.1 hf9c8cef_0 conda-forge/linux-64 1MB
## + ld_impl_linux-64 2.40 h41732ed_0 conda-forge/linux-64 705kB
## + libcurl 7.87.0 h6312ad2_0 conda-forge/linux-64 347kB
## + libdeflate 1.13 h166bdaf_0 conda-forge/linux-64 80kB
## + libedit 3.1.20191231 he28a2e2_2 conda-forge/linux-64 124kB
## + libev 4.33 h516909a_1 conda-forge/linux-64 106kB
## + libffi 3.4.2 h7f98852_5 conda-forge/linux-64 58kB
## + libgcc-ng 12.2.0 h65d4601_19 conda-forge/linux-64 954kB
## + libgomp 12.2.0 h65d4601_19 conda-forge/linux-64 466kB
## + libnghttp2 1.51.0 hdcd2b5c_0 conda-forge/linux-64 623kB
## + libnsl 2.0.0 h7f98852_0 conda-forge/linux-64 31kB
## + libsqlite 3.40.0 h753d276_0 conda-forge/linux-64 810kB
## + libssh2 1.10.0 haa6b8db_3 conda-forge/linux-64 239kB
## + libstdcxx-ng 12.2.0 h46fd767_19 conda-forge/linux-64 4MB
## + libuuid 2.38.1 h0b41bf4_0 conda-forge/linux-64 34kB
## + libzlib 1.2.13 h166bdaf_4 conda-forge/linux-64 66kB
## + ncurses 6.3 h27087fc_1 conda-forge/linux-64 1MB
## + openssl 1.1.1t h0b41bf4_0 conda-forge/linux-64 2MB
## + pip 23.0.1 pyhd8ed1ab_0 conda-forge/noarch 1MB
## + python 3.11.0 h10a6764_1_cpython conda-forge/linux-64 31MB
## + readline 8.2 h8228510_1 conda-forge/linux-64 281kB
## + samtools 1.16.1 h00cdaf9_2 bioconda/linux-64 420kB
## + setuptools 67.6.1 pyhd8ed1ab_0 conda-forge/noarch 580kB
## + tk 8.6.12 h27826a3_0 conda-forge/linux-64 3MB
## + tzdata 2023c h71feb2d_0 conda-forge/noarch 118kB
## + wheel 0.40.0 pyhd8ed1ab_0 conda-forge/noarch 56kB
## + xz 5.2.6 h166bdaf_0 conda-forge/linux-64 418kB
## + zlib 1.2.13 h166bdaf_4 conda-forge/linux-64 94kB
##
## Summary:
##
## Install: 37 packages
##
## Total download: 71MB
##
## ────────────────────────────────────────────────────────────────────────────────────────
##
##
##
## Transaction starting
## Linking _libgcc_mutex-0.1-conda_forge
## Linking libstdcxx-ng-12.2.0-h46fd767_19
## Linking ld_impl_linux-64-2.40-h41732ed_0
## Linking ca-certificates-2022.12.7-ha878542_0
## Linking libgomp-12.2.0-h65d4601_19
## Linking _openmp_mutex-4.5-2_gnu
## Linking libgcc-ng-12.2.0-h65d4601_19
## Linking libev-4.33-h516909a_1
## Linking c-ares-1.18.1-h7f98852_0
## Linking libuuid-2.38.1-h0b41bf4_0
## Linking libffi-3.4.2-h7f98852_5
## Linking bzip2-1.0.8-h7f98852_4
## Linking ncurses-6.3-h27087fc_1
## Linking libnsl-2.0.0-h7f98852_0
## Linking keyutils-1.6.1-h166bdaf_0
## Linking openssl-1.1.1t-h0b41bf4_0
## Linking xz-5.2.6-h166bdaf_0
## Linking libdeflate-1.13-h166bdaf_0
## Linking libzlib-1.2.13-h166bdaf_4
## Linking libedit-3.1.20191231-he28a2e2_2
## Linking readline-8.2-h8228510_1
## Linking libssh2-1.10.0-haa6b8db_3
## Linking libnghttp2-1.51.0-hdcd2b5c_0
## Linking tk-8.6.12-h27826a3_0
## Linking libsqlite-3.40.0-h753d276_0
## Linking zlib-1.2.13-h166bdaf_4
## Linking krb5-1.20.1-hf9c8cef_0
## Linking libcurl-7.87.0-h6312ad2_0
## Linking tzdata-2023c-h71feb2d_0
## Linking bedtools-2.30.0-h468198e_3
## Linking htslib-1.17-h6bc39ce_0
## Linking samtools-1.16.1-h00cdaf9_2
## Linking python-3.11.0-h10a6764_1_cpython
## Linking wheel-0.40.0-pyhd8ed1ab_0
## Linking setuptools-67.6.1-pyhd8ed1ab_0
## Linking pip-23.0.1-pyhd8ed1ab_0
## Linking grampa-1.4.0-pyhdfd78af_0
## Transaction finished
mamba installs the latest available versions of the listed packages unless versions are specified (we'll see an example below).
Important: it's a best practice to install all packages needed in a conda environment at the same time (i.e., in the same mamba install command) rather than installing packages one at a time (with separate mamba install commands). This allows mamba to "solve" a mutually compatible set of packages and dependencies. Otherwise, mamba may have to upgrade/downgrade existing packages in the environment, potentially "breaking" software already installed there.
Pro tip: packages can be installed during environment creation by appending the list of packages to the mamba create command; e.g.:
mamba create -y -n day4 bedtools samtools
or by activating a conda environment before issuing a mamba install command; e.g.:
mamba activate day4
mamba install -y bedtools samtools
Now that we've installed some packages in our day4 environment, let's activate it, run grampa.py --version to verify the software is available, and list the contents of the activated environment.
Run the code block below to view the packages installed in our day4 environment:
echo "Running GRAMPA outside of the day4 environment:"
grampa.py --version
# grampa.py: A program for inferring WGDs in a phylogeny
# --version: This tells grampa to just display the current version of the software (useful for seeing if it is installed)
## NOTE: This should produce an error since we have not installed grampa outside of our day4 environment, and we have not yet activated the environment
echo "------"
echo "Activating day4 environment"
mamba activate day4
# mamba: A variant on the conda package manager
# activate: The sub-command of mamba we want to run
echo "Running GRAMPA inside of the day4 environment:"
grampa.py --version
# grampa.py: A program for inferring WGDs in a phylogeny
# --version: This tells grampa to just display the current version of the software (useful for seeing if it is installed)
echo
echo "Listing packages installed in day4:"
mamba list
# mamba: A variant on the conda package manager
# list: The sub-command of mamba we want to run
## Running GRAMPA outside of the day4 environment:
## bash: line 3: grampa.py: command not found
## ------
## Activating day4 environment
## Running GRAMPA inside of the day4 environment:
##
## /tmp/mamba/envs/day4/bin/grampa.py --version
##
## # GRAMPA version 1.4.0 released on March 2023
##
## Listing packages installed in day4:
## List of packages in environment: "/tmp/mamba/envs/day4"
##
## Name Version Build Channel
## ───────────────────────────────────────────────────────────────────
## _libgcc_mutex 0.1 conda_forge conda-forge
## _openmp_mutex 4.5 2_gnu conda-forge
## bedtools 2.30.0 h468198e_3 bioconda
## bzip2 1.0.8 h7f98852_4 conda-forge
## c-ares 1.18.1 h7f98852_0 conda-forge
## ca-certificates 2022.12.7 ha878542_0 conda-forge
## grampa 1.4.0 pyhdfd78af_0 bioconda
## htslib 1.17 h6bc39ce_0 bioconda
## keyutils 1.6.1 h166bdaf_0 conda-forge
## krb5 1.20.1 hf9c8cef_0 conda-forge
## ld_impl_linux-64 2.40 h41732ed_0 conda-forge
## libcurl 7.87.0 h6312ad2_0 conda-forge
## libdeflate 1.13 h166bdaf_0 conda-forge
## libedit 3.1.20191231 he28a2e2_2 conda-forge
## libev 4.33 h516909a_1 conda-forge
## libffi 3.4.2 h7f98852_5 conda-forge
## libgcc-ng 12.2.0 h65d4601_19 conda-forge
## libgomp 12.2.0 h65d4601_19 conda-forge
## libnghttp2 1.51.0 hdcd2b5c_0 conda-forge
## libnsl 2.0.0 h7f98852_0 conda-forge
## libsqlite 3.40.0 h753d276_0 conda-forge
## libssh2 1.10.0 haa6b8db_3 conda-forge
## libstdcxx-ng 12.2.0 h46fd767_19 conda-forge
## libuuid 2.38.1 h0b41bf4_0 conda-forge
## libzlib 1.2.13 h166bdaf_4 conda-forge
## ncurses 6.3 h27087fc_1 conda-forge
## openssl 1.1.1t h0b41bf4_0 conda-forge
## pip 23.0.1 pyhd8ed1ab_0 conda-forge
## python 3.11.0 h10a6764_1_cpython conda-forge
## readline 8.2 h8228510_1 conda-forge
## samtools 1.16.1 h00cdaf9_2 bioconda
## setuptools 67.6.1 pyhd8ed1ab_0 conda-forge
## tk 8.6.12 h27826a3_0 conda-forge
## tzdata 2023c h71feb2d_0 conda-forge
## wheel 0.40.0 pyhd8ed1ab_0 conda-forge
## xz 5.2.6 h166bdaf_0 conda-forge
## zlib 1.2.13 h166bdaf_4 conda-forge
mamba deactivate
The mamba deactivate command exits your current environment and returns you to the base environment. (If you run mamba deactivate while already in the base environment, you will exit conda completely and need to restart it.) We will run mamba deactivate later on, but first we want to run some commands in the environment as well.
Deactivating a conda environment effectively undoes the environment changes made by mamba activate, restoring the previous environment.
Run the following code block to see that software installed inside of the environment is not found when it is deactivated:
mamba activate day4
# mamba: A variant on the conda package manager
# activate: The sub-command of mamba we want to run
type grampa.py
# type: A shell command that displays the path of the specified command
# grampa.py: A program for inferring WGDs in a phylogeny
mamba deactivate
# mamba: A variant on the conda package manager
# deactivate: The sub-command of mamba we want to run
type grampa.py
# type: A shell command that displays the path of the specified command
# grampa.py: A program for inferring WGDs in a phylogeny
# NOTE: This should produce an error since we have deactivated the environment with grampa.py installed in it
## grampa.py is /tmp/mamba/envs/day4/bin/grampa.py
## bash: line 13: type: grampa.py: not found
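What makes this work? A large part of activating an environment is simply prepending the environment's bin directory to your PATH, so its programs are found first; deactivating removes it again (the real activation logic also sets variables such as CONDA_PREFIX). Here is a rough, self-contained sketch of that mechanism; mytool and the temporary directory are made up for illustration.

```shell
# Sketch of the mechanism behind activate/deactivate:
# activate ~ prepend the env's bin/ to PATH; deactivate ~ remove it.
envbin=$(mktemp -d)                      # stand-in for an env's bin/ directory
printf '#!/bin/sh\necho mytool 1.0\n' > "$envbin/mytool"
chmod +x "$envbin/mytool"

type mytool || echo "not found (as expected before 'activation')"

OLDPATH=$PATH
PATH="$envbin:$PATH"                     # "activate": env bin searched first
type mytool                              # now resolves inside the env

PATH=$OLDPATH                            # "deactivate": original PATH restored
type mytool || echo "not found again"
```

This is why the same command name can resolve to different programs (or nothing at all) depending on which environment is active.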
mamba run
Instead of activating and deactivating a conda environment, the mamba run command can be used to run a single command in a specified conda environment.
Run the code block below to run a grampa.py command in our day4 environment without activating that environment:
mamba run -n day4 grampa.py --version
# mamba: A variant on the conda package manager
# run: The sub-command of mamba we want to run
# -n: The name of the environment to run in
# grampa.py: A program for inferring WGDs in a phylogeny
# --version: This tells grampa to just display the current version of the software (useful for seeing if it is installed)
##
## /tmp/mamba/envs/day4/bin/grampa.py --version
##
## # GRAMPA version 1.4.0 released on March 2023
Unlike mamba activate, mamba run does not alter the shell's environment, so mamba deactivate is not needed afterwards to restore the original environment.
By default, environments are created in the envs folder of your Anaconda/Miniconda installation (usually ${HOME}/.conda/envs). However, you can create an environment at a specific path (via mamba create -p /path/to/environment). This provides more flexibility than named environments, allowing the environment to be created in any directory the user has write access to, and facilitating sharing of conda environments among members of the same lab/group.
Suppose we want to install samtools in an environment located at /tmp/samtools-env (note: a temporary directory, used here just for illustration). Furthermore, suppose we need an old version of samtools (0.1.19). We'll select the version using the = operator.
Run the code block below to create a conda environment at a specific path and install a specific version of samtools within it:
mamba create -y -p /tmp/samtools-env samtools=0.1.19
# mamba: A variant on the conda package manager
# create: The sub-command of mamba we want to run
# -y: don't prompt y/n to create the environment; assume "y"
# -p: This option tells mamba create to put the environment folder at the provided path
##
## __
## __ ______ ___ ____ _____ ___ / /_ ____ _
## / / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/
## / /_/ / / / / / / /_/ / / / / / / /_/ / /_/ /
## / .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/
## /_/
##
## conda-forge/linux-64 Using cache
## conda-forge/noarch Using cache
## bioconda/linux-64 Using cache
## bioconda/noarch Using cache
## pkgs/main/linux-64 Using cache
## pkgs/main/noarch Using cache
## pkgs/r/linux-64 Using cache
## pkgs/r/noarch Using cache
##
## Transaction
##
## Prefix: /tmp/samtools-env
##
## Updating specs:
##
## - samtools=0.1.19
##
##
## Package Version Build Channel Size
## ──────────────────────────────────────────────────────────────────────────
## Install:
## ──────────────────────────────────────────────────────────────────────────
##
## + _libgcc_mutex 0.1 conda_forge conda-forge/linux-64 Cached
## + _openmp_mutex 4.5 2_gnu conda-forge/linux-64 Cached
## + libgcc-ng 12.2.0 h65d4601_19 conda-forge/linux-64 Cached
## + libgomp 12.2.0 h65d4601_19 conda-forge/linux-64 Cached
## + libzlib 1.2.13 h166bdaf_4 conda-forge/linux-64 Cached
## + ncurses 6.3 h27087fc_1 conda-forge/linux-64 Cached
## + samtools 0.1.19 h20b1175_10 bioconda/linux-64 4MB
## + zlib 1.2.13 h166bdaf_4 conda-forge/linux-64 Cached
##
## Summary:
##
## Install: 8 packages
##
## Total download: 4MB
##
## ──────────────────────────────────────────────────────────────────────────
##
##
##
## Transaction starting
## Linking _libgcc_mutex-0.1-conda_forge
## Linking libgomp-12.2.0-h65d4601_19
## Linking _openmp_mutex-4.5-2_gnu
## Linking libgcc-ng-12.2.0-h65d4601_19
## Linking ncurses-6.3-h27087fc_1
## Linking libzlib-1.2.13-h166bdaf_4
## Linking zlib-1.2.13-h166bdaf_4
## Linking samtools-0.1.19-h20b1175_10
## Transaction finished
Run the code block below to see the directory structure in the environment folder, /tmp/samtools-env:
ls /tmp/samtools-env
# ls: The Unix command to list the contents of a directory
## bin
## conda-meta
## include
## lib
## share
We can run many of the same commands, with slight modification, on environments located at custom paths. For instance, mamba list -p PATH treats the directory at PATH as a conda environment and lists its installed packages.
Run the code block below to list the packages installed in the environment at that path:
mamba list -p /tmp/samtools-env
# mamba: A variant on the conda package manager
# list: The sub-command of mamba we want to run
# -p: This option tells mamba to look for an environment folder at the provided path
## List of packages in environment: "/tmp/samtools-env"
##
## Name Version Build Channel
## ────────────────────────────────────────────────────
## _libgcc_mutex 0.1 conda_forge conda-forge
## _openmp_mutex 4.5 2_gnu conda-forge
## libgcc-ng 12.2.0 h65d4601_19 conda-forge
## libgomp 12.2.0 h65d4601_19 conda-forge
## libzlib 1.2.13 h166bdaf_4 conda-forge
## ncurses 6.3 h27087fc_1 conda-forge
## samtools 0.1.19 h20b1175_10 bioconda
## zlib 1.2.13 h166bdaf_4 conda-forge
Similarly, mamba run accepts a -p PATH option, and mamba activate can be given the environment's path in place of a name.
Run the code block below to activate our environment based on its path:
mamba activate /tmp/samtools-env
# mamba: A variant on the conda package manager
# activate: The sub-command of mamba we want to run
# /tmp/samtools-env: The path of the environment to activate
type samtools
# type: A shell command that displays the path of the specified command
## samtools is /tmp/samtools-env/bin/samtools
Important: conda environment directories are not relocatable; e.g., the above conda environment may not work if moved to a different directory. We’ll see how to re-create a conda environment in a different location in the next section.
Exercise: Use mamba to create a conda environment called samtools-env that contains the samtools package (default/latest version) in your current working directory. Note that the -p PATH option can be an absolute path (e.g., -p $PWD/samtools-env) or a relative path (e.g., -p ./samtools-env).
## Create an environment called samtools-env in the current working directory
mamba create -y -p ./samtools-env samtools
##
## __
## __ ______ ___ ____ _____ ___ / /_ ____ _
## / / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/
## / /_/ / / / / / / /_/ / / / / / / /_/ / /_/ /
## / .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/
## /_/
##
## conda-forge/linux-64 Using cache
## conda-forge/noarch Using cache
## bioconda/linux-64 Using cache
## bioconda/noarch Using cache
## pkgs/main/linux-64 Using cache
## pkgs/main/noarch Using cache
## pkgs/r/linux-64 Using cache
## pkgs/r/noarch Using cache
##
## Transaction
##
## Prefix: /n/home/user/repos/harvardinformatics/workshops/2023-spring/biotips/samtools-env
##
## Updating specs:
##
## - samtools
##
##
## Package Version Build Channel Size
## ─────────────────────────────────────────────────────────────────────────────────
## Install:
## ─────────────────────────────────────────────────────────────────────────────────
##
## + _libgcc_mutex 0.1 conda_forge conda-forge/linux-64 Cached
## + _openmp_mutex 4.5 2_gnu conda-forge/linux-64 Cached
## + bzip2 1.0.8 h7f98852_4 conda-forge/linux-64 Cached
## + c-ares 1.18.1 h7f98852_0 conda-forge/linux-64 Cached
## + ca-certificates 2022.12.7 ha878542_0 conda-forge/linux-64 Cached
## + htslib 1.17 h6bc39ce_0 bioconda/linux-64 Cached
## + keyutils 1.6.1 h166bdaf_0 conda-forge/linux-64 Cached
## + krb5 1.20.1 hf9c8cef_0 conda-forge/linux-64 Cached
## + libcurl 7.87.0 h6312ad2_0 conda-forge/linux-64 Cached
## + libdeflate 1.13 h166bdaf_0 conda-forge/linux-64 Cached
## + libedit 3.1.20191231 he28a2e2_2 conda-forge/linux-64 Cached
## + libev 4.33 h516909a_1 conda-forge/linux-64 Cached
## + libgcc-ng 12.2.0 h65d4601_19 conda-forge/linux-64 Cached
## + libgomp 12.2.0 h65d4601_19 conda-forge/linux-64 Cached
## + libnghttp2 1.51.0 hdcd2b5c_0 conda-forge/linux-64 Cached
## + libssh2 1.10.0 haa6b8db_3 conda-forge/linux-64 Cached
## + libstdcxx-ng 12.2.0 h46fd767_19 conda-forge/linux-64 Cached
## + libzlib 1.2.13 h166bdaf_4 conda-forge/linux-64 Cached
## + ncurses 6.3 h27087fc_1 conda-forge/linux-64 Cached
## + openssl 1.1.1t h0b41bf4_0 conda-forge/linux-64 Cached
## + samtools 1.16.1 h00cdaf9_2 bioconda/linux-64 Cached
## + xz 5.2.6 h166bdaf_0 conda-forge/linux-64 Cached
## + zlib 1.2.13 h166bdaf_4 conda-forge/linux-64 Cached
##
## Summary:
##
## Install: 23 packages
##
## Total download: 0 B
##
## ─────────────────────────────────────────────────────────────────────────────────
##
##
##
## Transaction starting
## Linking _libgcc_mutex-0.1-conda_forge
## Linking ca-certificates-2022.12.7-ha878542_0
## Linking libstdcxx-ng-12.2.0-h46fd767_19
## Linking libgomp-12.2.0-h65d4601_19
## Linking _openmp_mutex-4.5-2_gnu
## Linking libgcc-ng-12.2.0-h65d4601_19
## Linking libev-4.33-h516909a_1
## Linking c-ares-1.18.1-h7f98852_0
## Linking bzip2-1.0.8-h7f98852_4
## Linking ncurses-6.3-h27087fc_1
## Linking keyutils-1.6.1-h166bdaf_0
## Linking openssl-1.1.1t-h0b41bf4_0
## Linking xz-5.2.6-h166bdaf_0
## Linking libdeflate-1.13-h166bdaf_0
## Linking libzlib-1.2.13-h166bdaf_4
## Linking libedit-3.1.20191231-he28a2e2_2
## Linking libssh2-1.10.0-haa6b8db_3
## Linking libnghttp2-1.51.0-hdcd2b5c_0
## Linking zlib-1.2.13-h166bdaf_4
## Linking krb5-1.20.1-hf9c8cef_0
## Linking libcurl-7.87.0-h6312ad2_0
## Linking htslib-1.17-h6bc39ce_0
## Linking samtools-1.16.1-h00cdaf9_2
## Transaction finished
We’ll use this conda environment in the next section of the workshop, so be sure it was created correctly.
Verify that samtools was installed into a conda environment at ./samtools-env by executing the following code chunk:
mamba run -p ./samtools-env which samtools
# mamba: A variant on the conda package manager
# run: The sub-command of mamba we want to run
# -p: This option tells mamba to look for an environment folder at the provided path
# which: A command that displays the path of the specified command
# samtools: A suite of programs to process SAM/BAM files
## /n/home/user/repos/harvardinformatics/workshops/2023-spring/biotips/samtools-env/bin/samtools
Run the following code chunk to create symbolic links in your current working directory to the data files used for the exercises in this section:
mkdir -p data4
ln -s -f /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/* data4
# ln: The Unix link command, which creates a link (shortcut) at the second path pointing to the file(s) at the first path
# -s: This option tells ln to create a symbolic link rather than a hard link (original files are not changed)
# -f: This option tells ln to overwrite an existing link, if one is present
ls -l data4
# Show the details of the files in the new linked directory
## total 224
## lrwxrwxrwx 1 user informatics 109 Mar 31 14:04 Biotips-workshop-2023-Day4-student.Rmd -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/Biotips-workshop-2023-Day4-student.Rmd
## lrwxrwxrwx 1 user informatics 93 Mar 31 14:04 SAMEA3532870_final.bam -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/SAMEA3532870_final.bam
## lrwxrwxrwx 1 user informatics 93 Mar 31 14:04 SAMEA3532871_final.bam -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/SAMEA3532871_final.bam
## lrwxrwxrwx 1 user informatics 93 Mar 31 14:04 SAMEA3532872_final.bam -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/SAMEA3532872_final.bam
## lrwxrwxrwx 1 user informatics 93 Mar 31 14:04 SAMEA3532873_final.bam -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/SAMEA3532873_final.bam
## lrwxrwxrwx 1 user informatics 93 Mar 31 14:04 SAMEA3532874_final.bam -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/SAMEA3532874_final.bam
## lrwxrwxrwx 1 user informatics 93 Mar 31 14:04 SAMEA3532875_final.bam -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/SAMEA3532875_final.bam
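If symbolic links are new to you, here is a tiny self-contained illustration (using a temporary directory and a made-up file, so it can run anywhere): the link is just a pointer, and reading through it reads the original file.

```shell
# Self-contained illustration of symbolic links:
tmp=$(mktemp -d)
echo "chr1 100 200" > "$tmp/regions.bed"    # the "original" data file
mkdir -p "$tmp/data"
ln -s -f "$tmp/regions.bed" "$tmp/data/"    # link it into data/
cat "$tmp/data/regions.bed"                 # reads the original through the link
readlink "$tmp/data/regions.bed"            # shows the path the link points to
```

This is why the data4 links above occupy almost no space: only the pointer is stored, not a copy of each BAM file.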
Today we'll be using the Harvard FAS Research Computing cluster, which is known as Cannon. A cluster is a set of computers that are networked together and whose main purpose is to run computationally intensive jobs. A cluster may also be referred to as a high-performance computing (HPC) system. We use compute clusters because the size of biological data, and the analyses we may want to perform on it, are not tractable on our own personal computers or lab-owned servers. In short, without dedicated compute clusters, modern biological research would probably grind to a halt.
Clusters are great; however, they are also a community resource. Multiple groups across a given institution, or even across institutions, may need the cluster's resources at the same time. This introduces a problem: how do you decide which user gets to use which resources, and when? That's where job-scheduling software comes in.
When you login to a cluster, as we have done now with our RStudio app, you connect to a login node. Login nodes are usually where people interact with the file system in order to submit jobs to the more resource heavy compute nodes. Running commands and doing some light file processing is generally ok on login nodes, but their main use is to submit jobs. Job submission consists of running a command that executes a script with the job scheduling software that also has some information about the resources requested for the job. Once a job is submitted, the scheduling software will look at the resources requested, the resources available, and the user’s and their group’s recent usage of the cluster to decide which node the commands in the script should be executed on and when it should start.
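To make this concrete, here is a minimal sketch of what such a job script might look like; the job name, partition, and resource values are placeholders to adapt to your own job, and the #SBATCH lines are the "special directives" describing requested resources. Because #SBATCH directives are ordinary shell comments, the file is still just a bash script; Slurm reads the directives when the script is submitted:

```shell
#!/bin/bash
#SBATCH --job-name=my-job         # a short name for the job (placeholder)
#SBATCH --partition=shared        # partition (queue) to submit to
#SBATCH --ntasks=1                # run a single task
#SBATCH --cpus-per-task=4         # CPU cores for that task
#SBATCH --mem=8G                  # total memory requested
#SBATCH --time=0-02:00            # time limit (D-HH:MM)
#SBATCH -o my-job_%j.out          # stdout file (%j expands to the job ID)
#SBATCH -e my-job_%j.err          # stderr file

# The commands below run on the allocated compute node.
echo "Job running on $(hostname)"
# ... your actual analysis commands would go here ...
```

Saved as, say, my-job.sh, this would be submitted to the scheduler with sbatch my-job.sh.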
Slurm is a cluster workload manager. Slurm allows shell scripts to be augmented with special directives that specify the needed resources (such as CPUs, memory, GPUs, and nodes) to allocate. Slurm schedules submitted jobs for execution on the requested resources. If the resources are not immediately available, the job is queued and prioritized for execution based on the resources requested, as well as the fairshare associated with the Slurm account.
By default, on Cannon each lab group is associated with a single Slurm account.
As introduced above, the job-scheduling software used on Cannon is Slurm. Slurm has many commands that allow users to monitor cluster resources and submit jobs. Let's take a look at one of them, squeue.
Run the code block below to use squeue to see who is running jobs on the cluster right now:
squeue | head
# squeue: SLURMs queue information command
# | : The Unix pipe operator to pass output from one command as input to another command
# head: The Unix command to only display the first few lines of the input
## JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
## 48082789 bigmem submit_b lsepulve PD 0:00 8 (Resources)
## 48082115 bigmem metfYes_ flwang PD 0:00 1 (Priority)
## 48083143 bigmem interact wenjun R 13:11 1 holy7c26507
## 47936048 bigmem fasrc/sy thuang1 R 1-00:08:13 1 holy7c26403
## 47844330 bigmem fasrc/sy tharding R 1-15:59:56 1 holy7c26405
## 47844428 bigmem fasrc/sy tharding R 1-15:52:36 1 holy7c26502
## 47941194 bigmem fasrc/sy tharding R 23:52:34 1 holy7c26408
## 47941192 bigmem fasrc/sy tharding R 23:52:41 1 holy7c26406
## 47941191 bigmem fasrc/sy tharding R 23:52:53 1 holy7c26404
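Since squeue output is plain text, the Unix tools from earlier in the workshop apply directly. As a self-contained sketch, we can tally jobs by their state (the ST column); a here-document with a few lines of the output above stands in for a live squeue call so the example runs anywhere:

```shell
# Tally jobs per state (ST, column 5), skipping the header line.
# On the cluster you would pipe live output instead:
#   squeue | awk 'NR > 1 { n[$5]++ } END { for (s in n) print s, n[s] }'
awk 'NR > 1 { n[$5]++ } END { for (s in n) print s, n[s] }' <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
48082789 bigmem submit_b lsepulve PD 0:00 8 (Resources)
48082115 bigmem metfYes_ flwang PD 0:00 1 (Priority)
48083143 bigmem interact wenjun R 13:11 1 holy7c26507
EOF
```

For this sample input, the tally reports 2 pending (PD) jobs and 1 running (R) job.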
Here we can see several pieces of information about jobs currently being run on the cluster, including their JOBID, the PARTITION they're being run on, and the USER running them. Here are a few definitions of important terms when interacting with the SLURM software:
PARTITION is especially important because you will need to specify it when you submit a job, and different partitions have different resources. For instance, if you have a job (set of commands) that you know will use several hundred gigabytes of RAM, you will have to send your job to the bigmem partition. If you know your job needs GPU resources, then you will have to use the gpu partition.
The following link has a lot of useful information about the cluster, including the different partitions available:
https://docs.rc.fas.harvard.edu/kb/running-jobs/
The sinfo command can be used to see all partitions you have access to, as well as the nodes in each partition and their state:
sinfo
## PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
## holy-cow up infinite 1 mix holy7c0907
## holy-info up infinite 1 mix holy2c0529
## holy-smokes up infinite 1 mix holy7c0908
## holy-smokes up infinite 6 alloc holy7c[0909-0914]
## holy-smokes-priority up infinite 1 mix holy7c0908
## holy-smokes-priority up infinite 6 alloc holy7c[0909-0914]
## bigmem up 7-00:00:00 1 plnd holy7c26409
## bigmem up 7-00:00:00 15 mix holy7c[26403-26406,26408,26411-26412,26501-26502,26507-26508,26510-26511,26604-26605]
## bigmem up 7-00:00:00 14 alloc holy7c[26401-26402,26407,26410,26503-26506,26509,26512,26601-26603,26606]
## gpu up 7-00:00:00 1 drng holygpu7c26105
## gpu up 7-00:00:00 14 mix holygpu7c[26103-26104,26106,26201-26206,26301-26305]
## gpu_mig up 7-00:00:00 1 mix holygpu7c26306
## gpu_requeue up 7-00:00:00 1 inval holygpu8a29402
## gpu_requeue up 7-00:00:00 2 drng holygpu7c[1315,26105]
## gpu_requeue up 7-00:00:00 98 mix holy2c1030,holygpu2a605,holygpu2c[0901-0903,0913,0917,0921,0923,0931,1121,1125],holygpu7c[0915,0920,1305,1307,1311,1313,1317,1323,1701,1706,1711,1716,1721,1726,26103-26104,26106,26201-26206,26301-26306],holygpu8a[25104-25106,25204-25206,25305-25306,25404-25406,25606,27101-27103,27201-27203,27301-27303,27401-27403,27501,27601,27605,29104-29106,29201-29203,29304-29306,29401,29605,31104-31106,31201-31203,31304-31306,31401-31402,31406,31506,31606],holyolveczkygpu01,holysabetigpu01,meade[03-05]
## gpu_requeue up 7-00:00:00 16 alloc holygpu2a609,holygpu8a[25605,27503-27504,27606,29406,29504-29506,29604,29606,31504-31505,31604-31605],holyzicklergpu01
## gpu_requeue up 7-00:00:00 2 idle holy7b[0909-0910]
## gpu_test up 8:00:00 4 mix holygpu2c[0701-0702,0709-0710]
## gpu_test up 8:00:00 1 alloc holygpu2c0704
## gpu_test up 8:00:00 5 idle holygpu2c[0703,0705-0708]
## remoteviz up 7-00:00:00 1 idle holygpu2c0711
## serial_requeue* up 7-00:00:00 5 inval holy7c[16605,20411,21206,21208],holygpu8a29402
## serial_requeue* up 7-00:00:00 1 plnd holy7c26409
## serial_requeue* up 7-00:00:00 4 drain* holy7c[04302,04402,12403-12404]
## serial_requeue* up 7-00:00:00 1 down* holy2c24214
## serial_requeue* up 7-00:00:00 7 comp holy7c[06111,10404,18308,21201-21204]
## serial_requeue* up 7-00:00:00 19 drng holy2a[20301,20305],holy7c[04208,04301,04401,06508,08405,16606,21308],holy8a[25606,27509,31209,31305,31405-31406],holygpu7c[1315,26105],holyzhuang01,huce-r940
## serial_requeue* up 7-00:00:00 4 drain holy7c[15106,20412],holy8a[31211-31212]
## serial_requeue* up 7-00:00:00 596 mix bloxham-r940,holy2a[01301-01302,01304,01306,01310-01311,01313,01315-01316,02301,02303-02306,02309-02311,02316,05302-05308,05310-05313,05316,15301,15304-15307,15309,15313,20302-20303,20311,23310],holy2c[0529,1029-1030,1129,01208,01211,01213-01214,02101-02102,02108-02109,02111-02113,02115-02116,02211,12205,12208,12210-12215,12302-12304,12306-12309,12311,12313-12316,14201-14211,14401,14403,14405,14409,16107-16108,16115-16116,16206,18101,18104,18106-18116,18201-18211,18213-18215,18302,18305-18316,24102,24201-24211,24213,24215-24216,24301-24308,24310-24313,24315-24316,093401],holy7c[0907-0908,0919,02306,02403,02405,02411-02412,02507,02610-02612,04101,04106,04307,04311,04403,04511,06105-06106,06302-06303,06405,06411,06601,06606,08211,08303,08307,08411,08502,08602,10212,10304-10305,10510,10512,12610,15101-15105,15107-15116,15201-15203,15210-15216,15301-15305,15307-15308,15310-15315,16204,16301-16302,16304,16406,16408-16410,16412,16501,16503-16504,16506-16512,16611,18101-18102,18201,18203-18204,18405,18501-18504,18506,18603-18605,18607-18609,18611-18612,19101,19106,19111,19115,20104-20106,20112,20201-20207,20209-20212,20302-20312,20401,20403-20410,20501-20512,20602,20604-20609,20611-20612,21311,21314,22605,23101-23102,23105,23108-23109,23201-23205,23207,23213-23216,23301,23303-23305,23308,24105-24110,24205-24208,24210,24306,24308-24312,24405,24408,24410,24508,24510-24511,24608-24609,24611-24612,26403-26406,26408,26411-26412,26501-26502,26507-26508,26510-26511,26604-26605,092602],holy8a[25101-25106,25201-25206,25302-25306,25401-25406,25502-25506,25601-25602,25604,27107,27111-27112,27207-27208,27307,27407-27409,27510-27511,29101,29103-29104,29106,29207,29209-29211,29304,29503-29506,29603-29606,31103-31106,31207-31208,31210,31301-31304,31306,31407-31410,31501-31506,31603-31606],holydsouza[01-04],holygpu2a605,holygpu2c[0901-0903,0913,0917,0921,0923,1121,1125],holygpu7c[0915,0920,1305,1307,1311,1313,1317,1323,1701,1706,1711,1716,1721,1726
,26103-26104,26106,26201-26206,26301-26306],holygpu8a[25104-25106,25204-25206,25305-25306,25404-25406,25606,27101-27103,27201-27203,27301-27303,27401-27403,27501,27601,27605,29104-29106,29201-29203,29304-29306,29401,29605,31104-31106,31201-31203,31304-31306,31401-31402,31406,31506,31606],holyolveczkygpu01,holysabetigpu01,meade[03-05]
## serial_requeue* up 7-00:00:00 560 alloc holy2a[01308,02312-02314,05301,05314,16301-16316,20304,20306-20310,20312-20313,23301-23303,23305-23309],holy2c[01201-01205,01207,01209,02201-02210,02212-02215,12201-12204,12206,12216,12312,18216,18301,18304],holy7c[0909-0914,02103-02105,02107-02112,02201-02202,02204-02206,02209-02212,02301-02305,02307-02309,02409-02410,02505-02506,02508,02511-02512,02609,04102-04105,04107-04112,04201-04207,04209-04212,04303-04306,04308-04310,04312,04404-04412,04501-04510,04512,04603-04612,06101-06104,06107-06110,06112,06201-06212,06301,06304-06312,06401-06404,06406-06410,06412,06501-06507,06509-06512,06602-06605,06607-06612,08101-08112,08201-08210,08212,08301-08302,08304-08306,08308-08312,08401-08404,08406-08410,08412,08501,08503-08512,08601,08603-08612,10101-10112,10201-10211,10301-10303,10306-10312,10401-10403,10405-10412,10501-10509,10511,10601-10612,12401-12402,12405-12412,12501-12512,12601-12609,12611-12612,15204-15209,15306,15316,16101-16112,16201,16207,16303,16407,16411,16502,16505,16601-16604,16607-16610,18103-18112,18202,18205,18207-18209,18301-18307,18309-18311,18407-18411,18505,18610,20101-20103,20610,21301-21307,21309-21310,21312-21313,21315-21316,23206,23208-23209,23211,23302,23306-23307,24111-24112,24209,24211-24212,24305,24307,24406-24407,24409,24411-24412,24505-24507,24509,24512,24605-24607,24610,26401-26402,26407,26410,26503-26506,26509,26512,26601-26603,26606],holy8a[25501,29105,29208,29212,29301-29303,29305-29306,29405-29406,29501-29502,29601-29602,31101-31102,31601-31602],holygpu2a609,holygpu8a[25605,27503-27504,27606,29406,29504-29506,29604,29606,31504-31505,31604-31605],holyjacob[02-03,05-06],holyvulis01,holyzicklergpu01
## serial_requeue* up 7-00:00:00 195 idle holy2c[02103-02106,02110,02114,14212-14216,14402,14404,14406,14408,16201-16205,16208,24103-24106,092901-092902,093001-093002,093301-093302,093402],holy7b[0909-0910],holy7c[02106,02203,02207-02208,02310-02312,02401-02402,02404,02406-02408,02501-02504,02509-02510,02601-02608,04601-04602,16202-16203,16205-16206,16208-16212,16305-16312,16401-16405,16612,18206,18210-18212,18312,18401-18404,18406,18412,18507-18512,18601-18602,18606,19102-19105,19107-19109,19112-19114,19116,20107-20111,20208,20301,20402,20601,20603,21102-21104,21108-21116,22501-22512,22601-22604,22606-22612,23103,23310-23314,23316,092401-092402,092501-092502,092601,092701-092702],holy8a[25301,25603,25605,27108-27110,27209-27212,27308-27312,27410-27412,29102]
## shared up 7-00:00:00 1 inval holy7c16605
## shared up 7-00:00:00 4 drain* holy7c[04302,04402,12403-12404]
## shared up 7-00:00:00 3 comp holy7c[06111,10404,18308]
## shared up 7-00:00:00 6 drng holy7c[04208,04301,04401,06508,08405,16606]
## shared up 7-00:00:00 28 mix holy7c[02412,04101,04106,04307,04311,04403,04511,06105-06106,06302-06303,06405,06411,06601,06606,08211,08303,08307,08411,08502,08602,10212,10304-10305,10510,10512,12610,20104]
## shared up 7-00:00:00 320 alloc holy7c[02409-02410,02511-02512,04102-04105,04107-04112,04201-04207,04209-04212,04303-04306,04308-04310,04312,04404-04412,04501-04510,04512,06101-06104,06107-06110,06112,06201-06212,06301,06304-06312,06401-06404,06406-06410,06412,06501-06507,06509-06512,06602-06605,06607-06612,08101-08112,08201-08210,08212,08301-08302,08304-08306,08308-08312,08401-08404,08406-08410,08412,08501,08503-08512,08601,08603-08612,10101-10112,10201-10211,10301-10303,10306-10312,10401-10403,10405-10412,10501-10509,10511,10601-10612,12401-12402,12405-12412,12501-12512,12601-12609,12611-12612,16101-16110,16601-16604,16607-16610,18103-18112,18301-18307,18309-18311,20101-20103]
## test up 8:00:00 9 mix holy7c[24102-24104,24201,24504,24601-24604]
## test up 8:00:00 1 alloc holy7c24101
## test up 8:00:00 12 idle holy7c[24202-24204,24301-24304,24403-24404,24501-24503]
## ultramem up 7-00:00:00 1 drng holy8a27509
## ultramem up 7-00:00:00 2 mix holy8a[27510-27511]
## unrestricted up infinite 6 mix holy7c[18605,18607-18609,18611-18612]
## unrestricted up infinite 1 alloc holy7c18610
## unrestricted up infinite 1 idle holy7c18606
Alright, let’s say you’ve gotten a BAM file from your colleague and she asks you to summarize its coverage. Well, we know how to do that from Day 1 of the workshop. But the BAM file she gave you is several hundred gigabytes in size! There’s no way you can run this on your computer that has 8 gigabytes of RAM, and you know you shouldn’t process such a large file on the login node since that will slow things down for everyone.
We will have to create a job script and submit it to the cluster.
A job script is just like a bash script that we learned about yesterday, except it has some extra information in it for SLURM, and it is executed in a different way. We’ll go over all of this.
First, let’s decide on the commands we want to run. From Day 1, we know we can summarize coverage of a BAM file with `samtools coverage`. We also know that this command requires the input file to be sorted:
samtools sort /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/SAMEA3532870_final.bam | samtools coverage > coverage-results.txt
This is great, but would take forever if we just executed it in the Terminal. Let’s turn this into a script. Remember the only requirement for a bash script is a shebang line that tells the shell how to interpret the commands in the file:
#!/bin/bash
samtools sort /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/SAMEA3532870_final.bam | samtools coverage > coverage-results.txt
Ok, now if we had this in a file and executed it with `./script_name.sh`, it would still run on the login node. To make sure the script is run by SLURM, we instead execute it with the `sbatch` command. We’ll get to that in a second, but SLURM also requires a bit more information about the resources needed. These are provided in lines at the top of the script beginning with the string `#SBATCH`:
#!/bin/bash
#SBATCH --job-name=bam-coverage
#SBATCH --partition=test
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6
#SBATCH --mem=24g
#SBATCH --time=1:00:00
##SBATCH --mail-type=END,FAIL # uncomment to send an email when the job ends or fails
##SBATCH --mail-user=<email-address> # send the email to an alternate email address
# on Cannon today (CentOS 7), loads the Anaconda3 (conda) environment module
# on Cannon after the Rocky 8 upgrade, loads the Mambaforge environment module
module load python
# On Cannon (CentOS 7), `conda activate` and `mamba activate` do not work; the older form
# (`source activate`) is required
# On Cannon after the Rocky 8 upgrade, `conda activate` and `mamba activate` will work as well
source activate ./samtools-env
samtools sort --threads ${SLURM_CPUS_PER_TASK} -T $(mktemp -d) /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day4/SAMEA3532870_final.bam | samtools coverage > coverage-results.txt
# --threads ${SLURM_CPUS_PER_TASK} : tells samtools sort to use a number of threads as
# specified by the SLURM_CPUS_PER_TASK environment variable,
# which is set to the value from #SBATCH --cpus-per-task=...
# -T $(mktemp -d) : have samtools sort write intermediate temporary sorted BAM files to
# the directory created/printed by the `mktemp -d` command, which creates
# a new directory in /tmp (node-local temporary storage) instead of the
# current working directory
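If you ever dry-run a job script outside of SLURM (e.g., directly in a Terminal), note that `SLURM_CPUS_PER_TASK` is only set inside a job. A minimal sketch of a fallback pattern using bash parameter expansion (the `THREADS` variable name is our own illustration, not part of the script above):

```shell
#!/bin/bash
# Fall back to 1 thread when SLURM_CPUS_PER_TASK is unset
# (i.e., when the script is run outside of a SLURM job).
THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "Using ${THREADS} thread(s)"
```

Inside a job submitted with `#SBATCH --cpus-per-task=6`, `THREADS` would be 6; run directly in a Terminal, it defaults to 1.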
Now we’ve told Slurm what resources we want and where we want to run the job! Let’s break down these options, including their defaults (on Cannon) if they’re omitted:
`--job-name`: A string identifier for the current job. Default: the name of the batch job script file.
`--partition`: The partition on which you want to run the current job. In this case, we’re just doing a test, so we’ll run it on the test partition! Remember the list of partitions is available here. Default: `--partition=serial_requeue`
`--ntasks`: The number of tasks (processes), each of which will be allocated `--cpus-per-task` processor cores. Unless you’re running a distributed-memory parallel job that uses multiple nodes, this option can be omitted or explicitly set to 1. Default: `--ntasks=1`
`--cpus-per-task`: The number of CPUs each task requires. For commands that use multiple threads or processes, request multiple CPUs here AND remember to set those options in the command as well! Default: `--cpus-per-task=1`
`--mem`: The amount of memory the job requires. Default: `--mem=100m` (megabytes)
`--time`: The amount of time the job requires. If the job isn’t finished at the end of this amount of time, it will time out and be incomplete. Most jobs cannot be resumed, so make sure you give your job enough time!! Per the sbatch manual page:

Acceptable time formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”.

Default: `--time=10` (10 minutes)
`--mail-type`: (commented out in the above script) send an email if/when the specified event occurs (in the above example, the job ENDs or FAILs). A list of supported events is listed in the sbatch manual. By default, this email goes to the email address associated with your FAS RC account (unless `--mail-user` is set; see below).
`--mail-user`: (commented out in the above script) if `--mail-type` is set, send any email to the specified email address instead of the default email associated with your FAS RC account.
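Converting between the `--time` formats above is just arithmetic. As a quick, hypothetical sanity check (not part of the job script), the days-hours form maps to total hours like so:

```shell
# SLURM's "days-hours" format: 1-12 means 1 day + 12 hours.
DAYS=1
HOURS=12
TOTAL_HOURS=$(( DAYS * 24 + HOURS ))
echo "--time=${DAYS}-${HOURS} requests ${TOTAL_HOURS} hours"
# prints: --time=1-12 requests 36 hours
```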
Let’s convert this block into an actual script.

Exercise:

1. Using the file menu, go to New file –> Shell script and create a new file. Call it `coverage-submit.sh`. This is going to be our job script that we will modify as we go, and will appear in a new tab of our text editor.
2. Copy the text from the code block above into your new script and save the file.
3. Run the following command in the Terminal below to submit your script to the cluster as a job with `sbatch`:
sbatch coverage-submit.sh
4. Run the code block below to check on the status of your job with `squeue`:
squeue --me
# squeue: SLURM's job status command
# --me: This option tells squeue to only show jobs for the current user
You will see one job that is the current RStudio session, and hopefully another that is the bam-coverage job we just submitted. Periodically rerun the command to check until the job is done (i.e. it disappears from the list of jobs output by `squeue --me`).
The sacct command queries the Slurm accounting database for job information.
Invoked with no arguments, `sacct` displays information on the user’s jobs that ran in the last 24 hours:
sacct
The recently-completed bam-coverage job (truncated to `bam-cover+` in the JobName column) should be listed in a COMPLETED state. Copy this job ID for the next exercise.
Information on a specific job (including older jobs that haven’t been purged from the Slurm accounting database) can be obtained via `sacct -j <jobid>`.
Instead of submitting a batch job that runs asynchronously, one can submit an interactive job using the salloc command. The result is similar to SSH’ing into a compute node and using the allocated resources.
Exercise: From the Terminal, execute the following command:
salloc -p test --time=60 --mem=1g
Notice that the host name in your shell prompt changes, indicating that you are now on a different host (a compute node).
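Once the interactive shell starts, a couple of commands can confirm you are inside the allocation (the `SLURM_JOB_ID` environment variable is set by SLURM inside a job and is empty otherwise):

```shell
hostname              # prints the name of the host you are on (a compute node, inside the allocation)
echo "$SLURM_JOB_ID"  # the job ID of the interactive allocation (empty outside a job)
```

When you’re finished, type `exit` to end the interactive job and return to the login node.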
It can be difficult to determine the resources (primarily CPUs, memory, and time) to allocate to a job. Allocate too little, and your job will run slowly (if allocated too few CPUs) or be terminated (if allocated too little memory or wall time). Allocate too much, and your job may wait longer in the queue, prevent other jobs from running, and/or squander your lab group’s fairshare.
The resource utilization of previous similar jobs can help inform resource allocation for future jobs. The `sacct` command can be used to query the Slurm accounting database for historical job resource utilization.
Exercise: From the Terminal, execute the following command, substituting `<jobid>` with the job ID of your bam-coverage job. The left/right arrows can be used to scroll and view the contents. When done, press `q`.
sacct -lj <jobid> | less -S
The above output is a fairly comprehensive—but not user-friendly—summary of resources used by the job. The seff utility provides a more human-readable summary of relevant resources used by the job.
Exercise: From the Terminal, in an interactive job (see the `salloc` command above), execute the following command:
seff <jobid>
While we didn’t cover these concepts with hands-on examples today, a couple additional Slurm commands/concepts to be aware of:
We’ve all submitted a job, and then realized that we wanted to change something about it after the fact. To cancel a job that’s queued or currently running (instead of letting it run to completion), use the `scancel` command, supplying the Slurm job ID(s) as arguments; e.g.:
scancel 12345678
A job array is useful for launching many jobs that run the same job script on multiple different input files. The jobs that are launched using this mechanism can be monitored and managed almost as easily as a single job.
The Submitting Large Numbers of Jobs to the FASRC cluster guide provides an overview of job arrays.
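As a sketch of what a job array might look like for our coverage task, here is a hypothetical script. It assumes a file `bam-files.txt` listing one BAM path per line (that file is our illustration, not something provided by the workshop); each array task uses its `SLURM_ARRAY_TASK_ID` to pick one line from the list:

```shell
#!/bin/bash
#SBATCH --job-name=coverage-array
#SBATCH --partition=test
#SBATCH --cpus-per-task=6
#SBATCH --mem=24g
#SBATCH --time=1:00:00
#SBATCH --array=1-3    # launches 3 array tasks; SLURM_ARRAY_TASK_ID is 1, 2, or 3

# Pull the Nth line (one BAM path) from the hypothetical list file,
# where N is this task's array index.
BAM=$(sed -n "${SLURM_ARRAY_TASK_ID}p" bam-files.txt)

# Same sort-and-summarize pipeline as before, with a per-file output name.
samtools sort --threads ${SLURM_CPUS_PER_TASK} -T $(mktemp -d) "$BAM" \
    | samtools coverage > "$(basename "$BAM" .bam)-coverage.txt"
```

Submitting this single script with `sbatch` launches all three tasks, which SLURM schedules independently.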