Welcome to the first day of the FAS Informatics Bioinformatics Tips & Tricks workshop!
This is the workshop file that should be opened in RStudio. This is an RMarkdown file, meaning that both formatted text and code blocks can be added to it, and the code blocks can be executed from the RStudio interface. Today we will be going through this document and executing the code blocks within it to demonstrate some basic concepts.
The first two days of this workshop are designed to get you familiar with the different types of common file formats you will often come across in many bioinformatics pipelines, and are designed to mimic the flow of a typical pipeline from raw sequence files to variants called against a reference genome. In this first session we will discuss sequence files and alignment files, as well as some tools and commands used to manipulate them!
We should ground ourselves a bit before we get too into the weeds,
simply by defining what we mean when we say the word
command. A command is basically just a
program or an app. It is a chunk of code that someone has written that
takes input, processes that input, and produces output. The really
common and useful chunks of code (e.g. ls
or
cd
) have become mainstays in modern operating systems to
the extent that we don’t even need to think about the underlying code,
but it is there.
Seeing a bunch of these commands with weird names (like
grep
and awk
and samtools
) and
lots of options (e.g. -l
, -c
,
-A10
, etc.) can make things really confusing for someone
that’s not used to dealing with them. Many commands, and a lot of
bioinformatics in general, simply rely on plain-text formatted files.
When it comes down to it, most of the commands we’ll use over the next
few days, no matter how complicated they look, really just does the
following:
formatted text -> command -> processed text
(Technically, this kind of command is called a ‘filter’. Some
commands, such as ls
, don’t operate on text, and others,
such as cd
, don’t even produce output. But these details
aren’t that important for our class today.)
This is central to the Unix philosophy. And this means that two things are really important:
Some common file formats in data science simply encode tables
of data with rows being observations and columns being features
of each observation. Columns are usually are usually designated by a
separator character, commonly comma (,
) with files that
have .csv
extensions or a tab (often
encoded as \t
) with files that have .tsv
or
.tab
extensions. File
extensions are a nice way to give us a clue to how the data in
the file is formatted. But note that the extension does
not necessarily define how the data is formatted – the agreement between
the file’s extension and its format is not enforced by the machine and
is only meant to be descriptive. You may sometimes find a tab delimited
file with the .txt
extension, or even with
.csv
! Best practices dictate that the extension match the
formatting, but best practices are not always followed.
In this workshop we will also see file formats specific for genomic
data: sequences are encoded in .fasta
or
.fastq
files, alignments of sequenced reads are encoded in
.sam
files, intervals in a genome are encoded in
.bed
files, and genomic variation relative to a reference
genome is encoded in .vcf
files. Though I will note that,
with the exception of .fasta
and .fastq
files,
these are all simply tab delimited files with columns
specific to genomic data!
For mor information on some common genomics file formats, see here: https://harvardinformatics.github.io/workshops/2023-spring/biotips/terms.html#formats
One of the most important things I can tell you about bioinformatics is to always remember what file formats your data are in!
In the section above and throughout this workshop we use a lot of context dependent terms. For instance, you probably know what the word “command” means in general usage, but in the context of computer science it has a specific meaning that may not be obvious. This brings up an important point I try to remember when teaching. When learning a new skill or set of skills there’s usually a whole new vocabulary to learn that goes along with it. What makes it difficult is that those teaching the new skill will use this vocabulary, and since they are so familiar with it they oftentimes won’t even realize people don’t know what the words they’re using mean in the context of the new skillset.
In an attempt to offset this unintentional language barrier we provide tables with some contextual definitions of terms we may use throughout the workshop at the following link:
https://harvardinformatics.github.io/workshops/2023-spring/biotips/terms.html
If you see hear any terms you think should be added to these tables, please let us know.
To avoid having to either type out out long paths or copy workshop data to your local directory, we’ll create a symbolic link (analogous to a Windows shortcut) in your current working directory called “data” that points to the workshop data directory.
Run the following command to create a link to our data directory in your current directory:
Note that whenever you see the > character followed by green text, this is an exercise or action to be done by you!
ln -s -f /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day1 data
# ln: The Unix link command, which can create shortcuts to folders and files at the provided path to the second provided path
# -s: This option tells ln to create a symbolic link rather than a hard link (original files are not changed)
# -f: This option forces ln to create the link
ls -l data
# Show the details of the new linked directory
## lrwxrwxrwx 1 gthomas informatics 70 Mar 22 18:33 data -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day1
Now you can access a given file in that directory by simply
referencing data/<filename>
in your code.
Probably the most common file format used to represent biological sequences is the FASTA format (whose name is derived from a software package written in the 1980s). A FASTA file is composed of a number of entries, and can be either protein or nucleotide sequences. An “entry” can represent a number of things; for instance, if you were working with a FASTA file that you got back from the sequencing center, each entry would correspond to a single read off the sequencer. Alternatively, if you downloaded a whole genome assembly off NCBI, each entry would correspond to a scaffold or contig in the assembly.
Each entry is comprised of two lines:
- Header line, which starts with a >
symbol.
This contains the unique name for each entry (e.g. >chr1), optionally
followed by additional metadata.
- Sequence line, which follows a header line. This is a string
of either nucleotides or amino acids and represents the actual
sequence
Note: for some FASTA files, the sequence line can be set to wrap every certain number of characters (e.g. every 50 nucleotides). These are referred to as multi-line FASTA, and can make your life difficult…make sure you know whether your FASTA is multi-line or single-line! For example,
>sequence1
ATGGACGCTAGTCAGTAGATGCATGCTGACCCAACATAACG
vs.
>sequence1
ATGGACGCTAG
TCAGTAGATGC
ATGCTGACCCA
ACATAACG
The sequences in these two examples are identical, but one is broken up by lines. Most programs can automatically handle both types of FASTA files, but as we stated above, if you parse your own FASTA files, you need to be aware of the difference!
Now, let’s look at a small test FASTA file as an example. There are many different ways to view the contents of a file at the command line. Here are a few tools that are included in a typical Unix-like system (e.g. the Cannon computing cluster here at Harvard) that we will be using today:
cat
: prints entire file to screen (short for
conCATenate)head
: prints the first 10 lines of a file by default.
We can change how many lines printed by adding the -n #
argument, e.g. head -n 5
prints only the first five
linestail
: the opposite of head
, it prints the
last 10 lines of a fileless
: rather than printing the entire file contents to
screen at once, less
allows you to scroll through using the
arrow keys (close less
by pressing ‘q’). This isn’t useful
with this .Rmd
environment that we have for the workshop,
but at the command line it is very useful for when working with large
fileswc -l
: technically this is not a way to view a file’s
contents, but is still good to know! wc
stands for ‘word
count’, which as the name suggests counts the number of words in the
file, and the -l
argument instead makes it count the number
of lines in the fileOK, now let’s actually look at the file! Since it is only a small
test file, we will use cat
.
Run the following command in the code block below to view an example FASTA formatted file:
cat data/test.fa
# cat: A Unix command to display the contents of a file (or multiple files) to the screen
## >chr1
## AAAAAAAAAAAAAAAAAA
## >chr1a
## GGGGGTCGTCGTCCCCCC
## >chr2
## TTTTTTTTTTTTTTTTTT
## >Chr3
## AAAAAAAAAAAAAATTT
## >chromosome4
## AAAAATTTTTTGGGGCC
## >contig1
## GGGGGGGGGGGGGGGGG
## >contig2
## cCCCCCCCCCCCcccccc
## >unplaced
## CGCGCGCGCGCGCGCGCG
Let’s also see how many lines are in the file.
Run the code block below to count the lines in the FASTA file:
wc -l data/test.fa
# wc: the Unix word count command
# -l: tells wc to only return the line count
## 16 data/test.fa
Exercise: In the code block below, write a command using one of the tools listed above to display only the first 4 lines of the FASTA file.
## Write a command to print the first four lines of the FASTA file
# data/test.fa
head -n 4 data/test.fa
## Write a command to print the first four lines of the FASTA file
## >chr1
## AAAAAAAAAAAAAAAAAA
## >chr1a
## GGGGGTCGTCGTCCCCCC
We can see with the wc
command that we are using the
-l
option to tell the command to count lines in our file.
How do we find out about what other options this command might have?
Well, searching the internet is fine, but most command line tools also
have built in help.
Run the following command in the Terminal below to see the help menu for the
wc
command:
man wc
man
is itself a command, short for
manual. Most command line tools that come built in with
Linux have a manual page that you can read with the man
command.
Tools that you install yourself, such as samtools
that
we will discuss later, usually don’t have man pages. However, many
programs have a command line help that can be accessed with the
-h
or --help
options, or alternatively will
display their help output if run with no arguments or options.
Because they are so universal to bioinformatics, it is worthwhile to get comfortable with FASTA files and all the useful information we can pull out of them! To do so, let’s look at our first command line tool:
grep
grep
is a powerful command-line search tools that is
included as part of Unix-like systems. At the most basic level,
grep
searches for a string of characters that match a
pattern and will print lines containing a match. The basic syntax is as
follow:
grep 'pattern' file_to_search.txt
This may seem simple, but grep
is one of the most useful
tools in bioinformatics! For example, if we wanted the sequence headers
for every entry in our FASTA file, we could do the
following.
Run the code block below to print all the sequence headers in the FASTA file to the screen:
grep '>' data/test.fa
# grep: The Unix string search command
# '>': The string to search for in the provided file
## >chr1
## >chr1a
## >chr2
## >Chr3
## >chromosome4
## >contig1
## >contig2
## >unplaced
By default, grep
will return a match if any part of
the string matches your pattern. For instance, say we wanted to
pull out the headers that correspond only to
chromosomes. If you attempt to match pattern
c
what would happen?
Exercise: In the code block below, write a grep command to print all lines that contain the ‘c’ character in the FASTA file:
## Write a command to display all lines with the 'c' character
# data/test.fa
grep 'c' data/test.fa
## Write a command to display all lines with the 'c' character
## >chr1
## >chr1a
## >chr2
## >chromosome4
## >contig1
## >contig2
## cCCCCCCCCCCCcccccc
## >unplaced
You can see that not only are we pulling out headers that do not correspond to chromosomes, we are even getting a sequence line that contains a lowercase ‘c’! We would instead need be more specific with the string we are trying to search for.
Run the code block below to print all the lines that contain the ‘>chr’ string in the FASTA file:
grep '>chr' data/test.fa
# grep: The Unix string search command
# '>chr': The string to search for in the provided file
## >chr1
## >chr1a
## >chr2
## >chromosome4
This is getting better…notice that by matching >chr
we are correctly getting the line ‘>chromosome4’, as it is still a
partial match. However, we are still missing a sequence, ‘>Chr3’!
This is because by default grep
is case-sensitive.
Thankfully, we can fix that.
grep
grep
can take a huge number of arguments that modify how
it behaves (which you can always check by typing man grep
),
but here we will highlight just a few that are especially useful.
grep -i
allows case-insensitive matches. So to return to
our above problem, we can specify to ignore the case.
Run the code block below to print all lines that contain the ‘>chr’ string in the FASTA file, ignoring the case of the letters in the string:
grep -i '>chr' data/test.fa
# grep: The Unix string search command
# -i: An option the tells grep to ignore the case of the matches, e.g. >chr will match >CHr and >Chr, etc., as well as >chr
# '>chr': The string to search for in the provided file
## >chr1
## >chr1a
## >chr2
## >Chr3
## >chromosome4
grep -c
counts the number of times a match occurs. One
of the most useful applications of this is to determine how many entries
there are in a FASTA file.
Run the code block below to use grep to count the number of sequences in a FASTA file:
grep -c '>' data/test.fa
# grep: The Unix string search command
# -c: An option the tells grep to simply count the number of lines that contain the provided string
# '>': The string to search for in the provided file
## 8
grep -v
inverts grep, printing every line that
does NOT match the pattern. E.g. we want to pull out just the sequences
from a FASTA file and not the headers.
Exercise: In the code block below, write a command to use grep to print out only the lines that contain sequence and not the headers in the FASTA file:
## Use grep to display only sequence lines (EXCLUDE header lines)
# data/test.fa
grep -v '>' data/test.fa
## Use grep to display only sequence lines (EXCLUDE header lines)
## AAAAAAAAAAAAAAAAAA
## GGGGGTCGTCGTCCCCCC
## TTTTTTTTTTTTTTTTTT
## AAAAAAAAAAAAAATTT
## AAAAATTTTTTGGGGCC
## GGGGGGGGGGGGGGGGG
## cCCCCCCCCCCCcccccc
## CGCGCGCGCGCGCGCGCG
There are also several options that display not only the line that contains the matching string, but the lines before and/or after it:
grep -A [n]
returns matching line and n lines
after matchgrep -B [n]
returns matching line and n lines
before matchgrep -C [n]
returns matching line and n lines
before and after matchWe can use grep -A
to pull out both the header and the
sequence for a particular entry of interest (assuming that the FASTA
file is single-line and not multi-line!).
Run the code block below to print both the headers that contain a certain string as well as the sequences (since this is not a multi-line FASTA file):
grep -A 1 '>chr1' data/test.fa
# grep: The Unix string search command
# -A 1: An option the tells grep to display the line right after each line that contains the provided string as well as the line with the match
# '>chr1': The string to search for in the provided file
## >chr1
## AAAAAAAAAAAAAAAAAA
## >chr1a
## GGGGGTCGTCGTCCCCCC
Note that this is actually pulling out two entries,
due to the partial matching of the pattern we used. To get around this
problem, we can use grep -w
, which forces grep
to match entire words, in combination with the -A argument.
Run the code block below to print both the headers that contain an exact match of a certain string as well as the sequences (since this is not a multi-line FASTA file):
grep -A 1 -w '>chr1' data/test.fa
# grep: The Unix string search command
# -A 1: An option the tells grep to display the line right after each line that contains the provided string as well as the line with the match
# -w: This option tells grep to only print lines that EXACTLY match the provided string
# '>chr1': The string to search for in the provided file
## >chr1
## AAAAAAAAAAAAAAAAAA
Exercise: In the code block below, write a grep command that searches for a particular sequence motif in our FASTA file and prints the whole line containing that sequence as well as the sequence header associated with that sequence. Remember that the FASTA sequence is not multi-line. Search for the following sequence motif: GGGTCGTCGT
## Write a grep command to search for a sequence motif and display the matched sequence and the header
# data/test.fa
grep -B 1 'GGGTCGTCGT' data/test.fa
## Write a grep command to search for a sequence motif and display the matched sequence and the header
## >chr1a
## GGGGGTCGTCGTCCCCCC
The last argument we will discuss is
grep -f patterns.txt
, which has slightly different syntax.
This takes a text file of patterns (with a single pattern per line) and
prints every line that matches each pattern in the file, which is useful
if you have multiple patterns!
There are numerous ways to make the text file, such as a text editor
like Vim or Nano, but for purposes of demonstration we will just use the
printf
command.
Run the code block below to generate a file containing a set of strings to search for:
printf '>%s\n' chr2 Chr3 > matches.txt
# printf: A Unix command to print formatted text to the screen
# > : The Unix redirect operator to write the output of the command to the following file
cat matches.txt
# Display the contents of the new file to the screen
## >chr2
## >Chr3
We can see that the printf
command prints a
>
character followed by a specified string, separated by
a newline character (\n
). Now we can use this file of
matches with the -f
argument.
Run the code block below to search for lines in our FASTA file that contain any of a set of strings in a provided file:
grep -f matches.txt data/test.fa
# grep: The Unix string search command
# -f: This option tells grep to read each line in the following file and search for lines in the provided file that contain any of the strings
## >chr2
## >Chr3
One last way that we can modify how grep
behaves is with
regular expressions, a.k.a regex. Regex are patterns
that describe a set of strings. In other words, they allow you
to match complex patterns with grep
(and other Unix
commands), not just exact matches! Regex are extremely powerful and
customizable, but can get very complicated, so we will just go over a
few that are especially useful.
^
matches pattern at start of string$
matches pattern at end of string.
matches any character (except a newline)|
‘or’ operator[ ]
matches any of enclosed characters. Can use in
conjunction with ‘A-Z’ or ‘0-9’, i.e.:
[A-Z|a-z]
matches any alphabetical character, upper or
lower case[0-9]
matches any numeric characterSo a more careful way to count the number of entries in a FASTA file
would be by matching all the lines that start with a >
character.
Run the code block below to search for FASTA sequence headers:
grep -c '^>' data/test.fa
# grep: The Unix string search command
# '^>': The string to search for in the provided file with ^ being the regular expression that matches the beginning of a line in a file
## 8
(Technically there shouldn’t be any >
characters
outside the start of the headers so you wouldn’t need the
^
, but it is good to be thorough!)
Exercise: In the code block below, write a grep command with regular expression to find all of the lines in our FASTA file that end with a number:
## Use grep and regex to display lines in the FASTA file that end with a number
# data/test.fa
grep '[0-9]$' data/test.fa
## Use grep and regex to display lines in the FASTA file that end with a number
## >chr1
## >chr2
## >Chr3
## >chromosome4
## >contig1
## >contig2
grep
practiceLet’s take a look at a FASTA file that more closely matches one that
you might encounter in the wild. The file
data/dmel-all-chromosome-r6.50.simple.fasta
is the
Drosophila melanogaster genome assembly, though a simplified
version in order to save space in our github repository :). We’ll use
this for the following exercises.
Run the following code block just to get a look at the top of this file:
head data/dmel-all-chromosome-r6.50.simple.fasta
# head: a Unix command to display the first 10 lines of the provided file
## >2L type=golden_path; loc=2L:1..23513712; ID=2L; dbxref=GB:AE014134,GB:AE014134,REFSEQ:NT_033779; MD5=b6a98b7c676bdaa11ec9521ed15aff2b; length=23513712; release=r6.50; species=Dmel;
## CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATAT
## >2R type=golden_path; loc=2R:1..25286936; ID=2R; dbxref=GB:AE013599,GB:AE013599,REFSEQ:NT_033778; MD5=2ecce4ca1c0106bb7c63a78b93ab49ba; length=25286936; release=r6.50; species=Dmel;
## CTCAAGATACCTTCTACAGATTATTTAAAGCTAGTGCACAACAACAATAAATTGACTAAGTTATGTCATTTTAAGCGGTCAAAATGGGTGATTTTTCGATTTCAAGTACCAGGCGAACAGAAGACACCTTCTAGAGATTCTGTTCAACTGGTAAGCAAAAACAGTAAATTGCCTAAGTTTAACAATTTAAGCGGTCAAAATGGGTGATTTTCCGATTTCAAGTACCAGACAAACAGAAGATACCTTATACAGATTATTTAAACCTGGTACACAAAAACAATAAATTGACTAAGTTATGTCATTTTAAGCGGTCAAAATGGGTGAATTTCCGATTTCAAGTAATAGGCGAACTCAAGATACCTTCTACAGATTATTTAAAGCTAGTGCACAACAACAATAA
## >3L type=golden_path; loc=3L:1..28110227; ID=3L; dbxref=GB:AE014296,GB:AE014296,REFSEQ:NT_037436; MD5=3c3ea06b22af8cc59809dbf8d154791e; length=28110227; release=r6.50; species=Dmel;
## TAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTCTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATG
## >3R type=golden_path; loc=3R:1..32079331; ID=3R; dbxref=GB:AE014297,GB:AE014297,REFSEQ:NT_033777; MD5=420540d26d86fbf14185d2f2d68af9c4; length=32079331; release=r6.50; species=Dmel;
## ACGGGACCGAGTATAGTACCAGTACGCGACCAGTACGGGAGCAGTACGGAACCAATACGGGTCCAGTACGGGTCCAGTACGGGTCCAGTACGGGTCCAGTACGGGACCAGTACGGGACCGAGTATAGTACCAGTACGCGACCAGTACGGGACCAGTACGGGTCCAGTACGGGTCCAGTACGGGACCAGTACGGGACCGAGTATAGTACCAGTACGCGACCAGTACGGGACCAGTACGGGACCGAGTACGGGACCGAGTACGGAACCGAGTACGGGATCGAGTACGGACCAGAACGGGACCAGTACGGGACCGAGTATAGTACCAGTACGCGACCAGTACGGGACCAGTACGGGACCAGTACGGGACCAGTACGGGACCAGTACGGAACCAATACGGGTCC
## >4 type=golden_path; loc=4:1..1348131; ID=4; dbxref=GB:AE014135,GB:AE014135,REFSEQ:NC_004353; MD5=8dbdf26f38049dd9df4076b6d480ae61; length=1348131; release=r6.50; species=Dmel;
## TTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTATTATATTAGAATTTACTAAGTACTTCTATGAATGGAATTATTATTGGAAACTC
So you can see an actual FASTA file may be a lot more complicated than our test one. The header contains a lot of information, and the sequence is very long. Importantly, note that the sequences are on a single line, which is important if we do something like counting sequences.
Exercise: Use
grep
to pull out headers corresponding to ’Unmapped_Scaffold’s:
## Use grep to display all "Unmapped_Scaffold"s
# data/dmel-all-chromosome-r6.50.simple.fasta
grep '>Unmapped_Scaffold' data/dmel-all-chromosome-r6.50.simple.fasta
## Use grep to display all "Unmapped_Scaffold"s
## >Unmapped_Scaffold_4_D1555_D1692 type=golden_path; loc=Unmapped_Scaffold_4_D1555_D1692:1..86267; ID=Unmapped_Scaffold_4_D1555_D1692; dbxref=GB:CP007080,GB:CP007080,REFSEQ:NW_007931082; MD5=8ef571135a75b0dc25064dedecf6374f; length=86267; release=r6.50; species=Dmel;
## >Unmapped_Scaffold_8_D1580_D1567 type=golden_path; loc=Unmapped_Scaffold_8_D1580_D1567:1..88768; ID=Unmapped_Scaffold_8_D1580_D1567; dbxref=GB:CP007081,GB:CP007081,REFSEQ:NW_007931083; MD5=6123986fdc2d2f167c97ad3c4a840b4c; length=88768; release=r6.50; species=Dmel;
## >Unmapped_Scaffold_11_D1754 type=golden_path; loc=Unmapped_Scaffold_11_D1754:1..36482; ID=Unmapped_Scaffold_11_D1754; dbxref=GB:CP007082,GB:CP007082,REFSEQ:NW_007931084; MD5=b2b01601309614a94c2718d7331cd570; length=36482; release=r6.50; species=Dmel;
## >Unmapped_Scaffold_13_D1782 type=golden_path; loc=Unmapped_Scaffold_13_D1782:1..25537; ID=Unmapped_Scaffold_13_D1782; dbxref=GB:CP007083,GB:CP007083,REFSEQ:NW_007931085; MD5=b4e9a4d8c5308778e256385e64e90b10; length=25537; release=r6.50; species=Dmel;
## >Unmapped_Scaffold_17_D1756_D1775 type=golden_path; loc=Unmapped_Scaffold_17_D1756_D1775:1..62570; ID=Unmapped_Scaffold_17_D1756_D1775; dbxref=GB:CP007084,GB:CP007084,REFSEQ:NW_007931086; MD5=3c435d15e6aa003b4faba4b731dabb6f; length=62570; release=r6.50; species=Dmel;
## >Unmapped_Scaffold_22_D1753 type=golden_path; loc=Unmapped_Scaffold_22_D1753:1..45120; ID=Unmapped_Scaffold_22_D1753; dbxref=GB:CP007085,GB:CP007085,REFSEQ:NW_007931087; MD5=ddc3132781042dc4c34a3e7ebce0deb4; length=45120; release=r6.50; species=Dmel;
## >Unmapped_Scaffold_24_D1707 type=golden_path; loc=Unmapped_Scaffold_24_D1707:1..22882; ID=Unmapped_Scaffold_24_D1707; dbxref=GB:CP007086,GB:CP007086,REFSEQ:NW_007931088; MD5=7402e75a48397dbf5b878987f07cf1c2; length=22882; release=r6.50; species=Dmel;
## >Unmapped_Scaffold_28_D1723 type=golden_path; loc=Unmapped_Scaffold_28_D1723:1..46986; ID=Unmapped_Scaffold_28_D1723; dbxref=GB:CP007087,GB:CP007087,REFSEQ:NW_007931089; MD5=54bdef0a8d3da6fee5f3a386d2430bf9; length=46986; release=r6.50; species=Dmel;
## >Unmapped_Scaffold_29_D1705 type=golden_path; loc=Unmapped_Scaffold_29_D1705:1..37106; ID=Unmapped_Scaffold_29_D1705; dbxref=GB:CP007088,GB:CP007088,REFSEQ:NW_007931090; MD5=f2872702d44416497a5fc932321a1203; length=37106; release=r6.50; species=Dmel;
## >Unmapped_Scaffold_32_D1773 type=golden_path; loc=Unmapped_Scaffold_32_D1773:1..16157; ID=Unmapped_Scaffold_32_D1773; dbxref=GB:CP007089,GB:CP007089,REFSEQ:NW_007931091; MD5=3674f616046729ef8bbd7c2fb971be0f; length=16157; release=r6.50; species=Dmel;
## >Unmapped_Scaffold_35_D1599 type=golden_path; loc=Unmapped_Scaffold_35_D1599:1..57785; ID=Unmapped_Scaffold_35_D1599; dbxref=GB:CP007090,GB:CP007090,REFSEQ:NW_007931092; MD5=56f8271d4e13de120d8b608284b0b56b; length=57785; release=r6.50; species=Dmel;
## >Unmapped_Scaffold_37_D1608 type=golden_path; loc=Unmapped_Scaffold_37_D1608:1..20763; ID=Unmapped_Scaffold_37_D1608; dbxref=GB:CP007091,GB:CP007091,REFSEQ:NW_007931093; MD5=e7da198ad9d0dedfe1fa84645e1e8f32; length=20763; release=r6.50; species=Dmel;
## >Unmapped_Scaffold_38_D1625 type=golden_path; loc=Unmapped_Scaffold_38_D1625:1..28305; ID=Unmapped_Scaffold_38_D1625; dbxref=GB:CP007092,GB:CP007092,REFSEQ:NW_007931094; MD5=947b3aa3ad5f662c276ed880082584b3; length=28305; release=r6.50; species=Dmel;
## >Unmapped_Scaffold_44_D1670 type=golden_path; loc=Unmapped_Scaffold_44_D1670:1..25698; ID=Unmapped_Scaffold_44_D1670; dbxref=GB:CP007093,GB:CP007093,REFSEQ:NW_007931095; MD5=758396f7ab8349b998ee4ffdb2528459; length=25698; release=r6.50; species=Dmel;
## >Unmapped_Scaffold_45_D1673 type=golden_path; loc=Unmapped_Scaffold_45_D1673:1..29583; ID=Unmapped_Scaffold_45_D1673; dbxref=GB:CP007094,GB:CP007094,REFSEQ:NW_007931096; MD5=d2e324b31aa444e08b055ed51756811b; length=29583; release=r6.50; species=Dmel;
## >Unmapped_Scaffold_46_D1675 type=golden_path; loc=Unmapped_Scaffold_46_D1675:1..25560; ID=Unmapped_Scaffold_46_D1675; dbxref=GB:CP007095,GB:CP007095,REFSEQ:NW_007931097; MD5=6342bae92fcdcba94586fec487ea54a5; length=25560; release=r6.50; species=Dmel;
## >Unmapped_Scaffold_48_D1678 type=golden_path; loc=Unmapped_Scaffold_48_D1678:1..26115; ID=Unmapped_Scaffold_48_D1678; dbxref=GB:CP007096,GB:CP007096,REFSEQ:NW_007931098; MD5=1a4b2f69d52b30f2a2588bd28f83ee53; length=26115; release=r6.50; species=Dmel;
## >Unmapped_Scaffold_51_D1697 type=golden_path; loc=Unmapped_Scaffold_51_D1697:1..13455; ID=Unmapped_Scaffold_51_D1697; dbxref=GB:CP007097,GB:CP007097,REFSEQ:NW_007931099; MD5=9963b5137ee8f9f8f7311fe4d437d592; length=13455; release=r6.50; species=Dmel;
## >Unmapped_Scaffold_52_D1739 type=golden_path; loc=Unmapped_Scaffold_52_D1739:1..43383; ID=Unmapped_Scaffold_52_D1739; dbxref=GB:CP007098,GB:CP007098,REFSEQ:NW_007931100; MD5=13c1a31bf71416951dcddd64b35e693d; length=43383; release=r6.50; species=Dmel;
## >Unmapped_Scaffold_54_D1776 type=golden_path; loc=Unmapped_Scaffold_54_D1776:1..12632; ID=Unmapped_Scaffold_54_D1776; dbxref=GB:CP007099,GB:CP007099,REFSEQ:NW_007931101; MD5=4b5bf2a313fad4e969ddf7de7aa5517f; length=12632; release=r6.50; species=Dmel;
## >Unmapped_Scaffold_58_D1862 type=golden_path; loc=Unmapped_Scaffold_58_D1862:1..10091; ID=Unmapped_Scaffold_58_D1862; dbxref=GB:CP007100,GB:CP007100,REFSEQ:NW_007931102; MD5=ea131d20c1f51ef8ed77eeb80fac9b1e; length=10091; release=r6.50; species=Dmel;
## >Unmapped_Scaffold_60_D1601 type=golden_path; loc=Unmapped_Scaffold_60_D1601:1..24503; ID=Unmapped_Scaffold_60_D1601; dbxref=GB:CP007101,GB:CP007101,REFSEQ:NW_007931103; MD5=f3773f568c4ebc9af9625b12f231324e; length=24503; release=r6.50; species=Dmel;
Exercise: Use
grep
to make sure that the FASTA file contains the same amount of headers and sequences lines. There a few ways you could do this, but it will take more than one command!
## Count the number of lines that contain sequences and lines that are headers
# data/dmel-all-chromosome-r6.50.simple.fasta
grep -c '^>' data/dmel-all-chromosome-r6.50.simple.fasta
grep -c -v '^>' data/dmel-all-chromosome-r6.50.simple.fasta
grep -c '^[A-Z|a-z]' data/dmel-all-chromosome-r6.50.simple.fasta
## Count the number of lines that contain sequences and lines that are headers
## 1870
## 1870
## 1870
Exercise: Use
grep
to pull out the header and sequence line for the chromosome2R
and only 2R
## Write a command to display the header and sequence for chromosome 2R
# data/dmel-all-chromosome-r6.50.simple.fasta
grep -A1 -w '2R' data/dmel-all-chromosome-r6.50.simple.fasta
## Write a command to display the header and sequence for chromosome 2R
## >2R type=golden_path; loc=2R:1..25286936; ID=2R; dbxref=GB:AE013599,GB:AE013599,REFSEQ:NT_033778; MD5=2ecce4ca1c0106bb7c63a78b93ab49ba; length=25286936; release=r6.50; species=Dmel;
## CTCAAGATACCTTCTACAGATTATTTAAAGCTAGTGCACAACAACAATAAATTGACTAAGTTATGTCATTTTAAGCGGTCAAAATGGGTGATTTTTCGATTTCAAGTACCAGGCGAACAGAAGACACCTTCTAGAGATTCTGTTCAACTGGTAAGCAAAAACAGTAAATTGCCTAAGTTTAACAATTTAAGCGGTCAAAATGGGTGATTTTCCGATTTCAAGTACCAGACAAACAGAAGATACCTTATACAGATTATTTAAACCTGGTACACAAAAACAATAAATTGACTAAGTTATGTCATTTTAAGCGGTCAAAATGGGTGAATTTCCGATTTCAAGTAATAGGCGAACTCAAGATACCTTCTACAGATTATTTAAAGCTAGTGCACAACAACAATAA
The other type of sequence file format we will discuss is FASTQ, which as the name suggests, is similar to FASTA format but also contains quality information about the sequence. You will commonly encounter FASTQ files when working with sequencing pipelines, as second generation technologies like Illumina as well as 3rd generation technologies like PacBio and Nanopore output their data in this format. Each entry is comprised of four lines, as opposed to the two of FASTA:
@
symbol and
contains the sequence ID+
character (an
optionally the sequence ID again)Let’s look at an example file.
Run the code block below to view the first few lines of a FASTQ file:
head -n 8 data/test.fq
# head: a Unix command to display the first few lines of the provided file
# -n 8: This option tells head to only display the first 8 lines of the file
## @ERR1013163.116808442/1
## TTTCCCACAGTGAAAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAA
## +
## 77577899:99999899988988988:7989998998:7988998:888889988889::88:::8::;9999::99;:;8699<<<::8:<<:997688
## @ERR1013163.116808442/2
## TTTCCCACAGTGAAAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAA
## +
## 77577899:99999899988988988:7989998998:7988998:888889988889::88:::8::;9999::99;:;8699<<<::8:<<:997688
Here we see two sequences, @ERR1013163.116808442/1
and
@ERR1013163.116808442/2
, which based on these IDs are
probably to ends of a paired-end sequence read. The line directly
following the header contains the familiar ATCG nucleotide symbols, and
then there is a line with a +
, simply meant to provide some
space. Next, we see a line with seemingly random characters.
These characters correspond to the Phred quality score of
the base at the same position in the sequence string. So the first
quality score of the first sequence, 7
, matches with the
first base in the sequence string, T
, and so on. The
quality score is actually a numerical value encoded as
a single character and is expressed by a Q score. These
reflecte the probability that that base is accurate. Q scores are
calculated by:
Q = -10 * log10(P)
where P = the error calling probability for that base. In short,
a higher Q score == a higher confidence call, e.g. Q10 = 90%
base call accuracy, Q30 = 99.9% accuracy, etc. To get its corresponding
symbol, add +33 to the Q score to get the ASCII code and take the symbol
that corresponds to that code; e.g. Q10 == +
, Q30 ==
?
. For reference, here
is a table that lists Q scores and their corresponding ASCII codes and
values. Most programs that handle FASTQ files translate these symbols
for you, so don’t worry about having to convert back and forth
manually!
Let’s take a look at parsing FASTQ files for
information. We can try to use grep
to count the number of
entries in the file, remembering that header lines start with
@
and not >
.
Run the code block below to count the number of times the
@
symbol appears in the FASTQ file:
grep -c '^@' data/test.fq
# grep: The Unix string search command
# -c: This option tells grep to simply count the number of lines that contain the provided string, rather than display the lines
# '^@': The string to search for in the provided file with ^ being the regular expression that matches the beginning of a line in a file
## 638
However, this is technically not a reliable way to do it! If we were
to look at the table of ASCII quality scores, we can see that
@
is actually a permitted value in the quality line
(corresponding to a Q score of 31). Granted, we would have to be unlucky
enough to have the first base in the sequence to have a Q31 score, but
with large datasets it is possible that it would skew our results.
To reliably count the entries in our file, let’s look at our next
bioinformatic tool, awk
.
Invented in the 1970’s, awk
is a scripting language
included in Unix-like operating systems. It specializes in one-liner
programs and manipulating text files, and like most scripting languages
it is also capable of various mathematical and logical operations.
In many cases, if you’re parsing information from a text file, you
could write a Python script… or you could do it with awk
in
a single line! This is intended just as a quick introduction to
awk
, we will go into more details in later workshops when
we talk about interval files, where the power of awk
really
shines.
awk
scripts are organized as:
awk 'pattern { action; other action }'
Meaning that every time that the pattern is true, awk will execute the action in the brackets. By default, if no pattern is specified it matches every line, so the action will be taken every line, e.g. the following command:
awk '{print}' data/test.fa
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
## >chr1
## AAAAAAAAAAAAAAAAAA
## >chr1a
## GGGGGTCGTCGTCCCCCC
## >chr2
## TTTTTTTTTTTTTTTTTT
## >Chr3
## AAAAAAAAAAAAAATTT
## >chromosome4
## AAAAATTTTTTGGGGCC
## >contig1
## GGGGGGGGGGGGGGGGG
## >contig2
## cCCCCCCCCCCCcccccc
## >unplaced
## CGCGCGCGCGCGCGCGCG
Because we did not specify any pattern, the print
action
will execute every line of the file.
Similarly, if a pattern is specified without an
action, the default action is {print}
. This is
useful to concisely select lines from a file where the pattern
expression evaluates to true, and is similar in function to
grep
.
For our purposes today, the two most important patterns for our
purposes are BEGIN
and END
, which tell the
action to take place before any lines are read and after the last
line.
awk
has several built-in variables that are very useful
for parsing text. We’ll use more of these variables in a later workshop
when we talk about interval files, but for now let’s focus one variable
called NR
. NR
refers to ‘Number of
Records’ in our file. By default, a record refers to a single
line of the file, so NR
is the number of lines seen so
far in the program, or a count of the total number of lines in the file
if evaluated in an action that matches the END
pattern.
Going back to our original question, we know that each entry in a
FASTQ file has four lines! To count how many sequences are in our
dataset, we can combine the NR
variable and the
END
pattern with some basic division. Try running the
following:
awk 'END{print NR / 4}' data/test.fq
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
## 638
Let’s move on to the next step of our “pipeline.” Once we have our sequence files, a next step would typically be to map them against a reference genome. This could be in the form of mapping RNAseq reads to calculate differential expression, mapping genomic DNA from a population to call variants, or numerous other applications. There are a number of different programs for mapping (e.g. BWA, STAR, minimap2) and which you choose will vary based on your data type and experimental design, but the alignment file created will likely be interchangeable.
SAM (Sequence Alignment/Map) format is one of the most common file
formats produced by many different pieces of alignment software, both
for long and short read sequence data. It is a tab delineated text file,
with 11 mandatory fields, or columns, (listed below), plus header lines
denoted with an @
at the start of the line.
SAM files are human readable, but can be quite large. An alternate format is the Binary Alignment/Map (BAM) file, which is binary compressed and not human readable, but is more compact and efficient to work with. Most pipelines will use BAM format over SAM.
Let’s take a look at a SAM file. We could use the typical bash
commands like cat
or less
to view it, but
there is a better way. Namely using our next bioinformatic tool,
samtools
!
SAMtools is a
suite of programs that are extremely useful for processing mapped reads
and for downstream analysis. As stated above, SAM/BAM files from
different programs are (mostly) interchangeable, so
samtools
will work with a file SAM/BAM file no matter what
program produced it. It has a ton of functions (which you can check out
on the manual
page), but we will go through several of the most common uses.
As the name suggests, this command lets you view the content of a SAM
or BAM file (whereas if you tried opening a BAM file
with something like less
, it would be unreadable). Let’s
take a look at a file.
Run the code block below to display a few lines of a SAM file to the screen:
samtools view data/file.sam | head -n 5
# samtools: A suite of programs to process SAM/BAM files
# view: The sub-program of samtools to execute
# | : The Unix pipe operator to pass output from one command as input to another command
# head: a Unix command to display the first few lines of the provided file
# -n 5: This option tells head to only display the first 5 lines of the file
## NOTE: ignore the "Broken pipe" and "error closing standard output" errors! This is just an artifact of our setup for the workshop today.
## ERR1013163.116808442 65 1 13 60 37S63M = 13 0 TTTCCCACAGTGAAAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAA 77577899:99999899988988988:7989998998:7988998:888889988889::88:::8::;9999::99;:;8699<<<::8:<<:997688 NM:i:1 MD:Z:10T52 MC:Z:37S63M AS:i:58 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808442 129 1 13 60 37S63M = 13 0 TTTCCCACAGTGAAAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAA 77577899:99999899988988988:7989998998:7988998:888889988889::88:::8::;9999::99;:;8699<<<::8:<<:997688 NM:i:1 MD:Z:10T52 MC:Z:37S63M AS:i:58 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808448 65 1 13 60 24S76M = 13 0 AAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAACATGTAAAACCGC 7757999:98:8799:89997999:78989999898789989788997898989999:88899:9:988:8:9;;:98:<;<;<==<<<9;9::9:7517 NM:i:2 MD:Z:10T63A1 MC:Z:24S76M AS:i:69 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808448 129 1 13 60 24S76M = 13 0 AAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAACATGTAAAACCGC 7757999:98:8799:89997999:78989999898789989788997898989999:88899:9:988:8:9;;:98:<;<;<==<<<9;9::9:7517 NM:i:2 MD:Z:10T63A1 MC:Z:24S76M AS:i:69 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808440 113 1 13 60 26S74M = 13 0 GAAAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAACATGTAAAACC 876679:<<<<=;<==;9;<::;;8;89::::9:9988:9:89887997898989998988997989877977789987897979897899:9:987566 NM:i:1 MD:Z:10T63 MC:Z:26S74M AS:i:69 XS:i:0 RG:Z:ERR1013163
A quick note on pipes: in the above command you’ll
see the |
character, which is a pipe. This is a
part of the bash command line, and allows output from one command to be
passed directly into a subsequent command as input. These can be chained
together, passing input between a whole string of commands! This is
especially useful with samtools
, as the commands are
designed to feed into one another and pipes will cut down on the number
of intermediate files created.
We will look at creating efficient pipelines more in depth in a later
workshop. In this case however, we are just passing the output to the
head
command to view the first several lines.
As stated above, our file is tab-separated with 11 columns, plus a series of optional tags containing information about the sequence.
Column | Description |
---|---|
1 | Read name |
2 | Bitwise flag |
3 | Reference name |
4 | Leftmost mapping position |
5 | MAPQ quality score |
6 | CIGAR string |
7 | Name of 2nd read in pair |
8 | Position of 2nd read in pair |
9 | Length of mapping segment |
10 | Sequence of segment |
11 | Phred33 quality score at each position |
SAM/BAM files can be intimidating to look at as they are very dense in information, so let’s focus in on a few important parts.
The second column in a BAM/SAM file is the bitwise flag. The flag value is an integer, which is the sum of a series of decimal values that give information about how a read is mapped.
Integer | Description |
---|---|
1 | read is paired |
2 | read mapped in proper pair |
4 | read unmapped |
8 | mate is unmapped |
16 | read on reverse strand |
32 | mate on reverse strand |
64 | first read in pair |
128 | second read in pair |
256 | not primary alignment |
512 | alignment fails quality checks |
1024 | PCR or optical duplicate |
2048 | supplementary alignment |
So e.g., for a paired-end mapping data set, a flag = 99 (1+2+32+64) means the read is mapped along with its mate (1 and 2) and in the proper orientation (32 and 64). Don’t worry about memorizing these, there are plenty of tools online that decode these flags for you, such as right here.
While you don’t need to know all the SAM flags, if there is one flag
that is useful to have memorized it is 4, which means
the read is unmapped. Unmapped reads are most often
filtered out, as many programs used in downstream analysis of SAM/BAM
files only want mapped reads (and also to save space on disk!). You can
filter reads containing a given flag using the -f
(only
take reads that match given flags) and -F
(only take reads
that do NOT match given flag) options in
samtools view
.
Run the code block below to remove unmapped reads from the SAM file and display the first few reads retained:
samtools view -F 4 data/file.sam | head
# samtools: A suite of programs to process SAM/BAM files
# view: The sub-program of samtools to execute
# -F 4: This option tells samtools to filter and reads with the "4" flag
# | : The Unix pipe operator to pass output from one command as input to another command
# head: a Unix command to display the first few lines of the provided file
## NOTE: ignore the "Broken pipe" and "error closing standard output" errors! This is just an artifact of our setup for the workshop today.
## ERR1013163.116808442 65 1 13 60 37S63M = 13 0 TTTCCCACAGTGAAAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAA 77577899:99999899988988988:7989998998:7988998:888889988889::88:::8::;9999::99;:;8699<<<::8:<<:997688 NM:i:1 MD:Z:10T52 MC:Z:37S63M AS:i:58 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808442 129 1 13 60 37S63M = 13 0 TTTCCCACAGTGAAAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAA 77577899:99999899988988988:7989998998:7988998:888889988889::88:::8::;9999::99;:;8699<<<::8:<<:997688 NM:i:1 MD:Z:10T52 MC:Z:37S63M AS:i:58 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808448 65 1 13 60 24S76M = 13 0 AAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAACATGTAAAACCGC 7757999:98:8799:89997999:78989999898789989788997898989999:88899:9:988:8:9;;:98:<;<;<==<<<9;9::9:7517 NM:i:2 MD:Z:10T63A1 MC:Z:24S76M AS:i:69 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808448 129 1 13 60 24S76M = 13 0 AAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAACATGTAAAACCGC 7757999:98:8799:89997999:78989999898789989788997898989999:88899:9:988:8:9;;:98:<;<;<==<<<9;9::9:7517 NM:i:2 MD:Z:10T63A1 MC:Z:24S76M AS:i:69 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808440 113 1 13 60 26S74M = 13 0 GAAAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAACATGTAAAACC 876679:<<<<=;<==;9;<::;;8;89::::9:9988:9:89887997898989998988997989877977789987897979897899:9:987566 NM:i:1 MD:Z:10T63 MC:Z:26S74M AS:i:69 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808440 177 1 13 60 26S74M = 13 0 GAAAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAACATGTAAAACC 876679:<<<<=;<==;9;<::;;8;89::::9:9988:9:89887997898989998988997989877977789987897979897899:9:987566 NM:i:1 MD:Z:10T63 MC:Z:26S74M AS:i:69 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808443 113 1 69 60 100M = 69 0 TAACTAACATGTAAAACCGCATGCTGGAAATGGAGGGGCAGTTACCAACACATGCTGAAAATTTACTCACACCCTCCTAAAATCAGCTTCTGAAAGAGCT 87689::;99:::99982;::98::8:::9;7::8879;:79:98;8:99:99999978888899988894879979987888999:9899999:88587 NM:i:2 MD:Z:18A74G6 MC:Z:100M AS:i:90 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808443 177 1 69 60 100M = 69 0 TAACTAACATGTAAAACCGCATGCTGGAAATGGAGGGGCAGTTACCAACACATGCTGAAAATTTACTCACACCCTCCTAAAATCAGCTTCTGAAAGAGCT 87689::;99:::99982;::98::8:::9;7::8879;:79:98;8:99:99999978888899988894879979987888999:9899999:88587 NM:i:2 MD:Z:18A74G6 MC:Z:100M AS:i:90 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808445 65 1 70 60 100M = 70 0 AACTAACATGTAAAACCGCATGCTGGAAATGGAGGGGCAGTTACCAACACATGCTGAAAATTTACTCACACCCTCCTAAAATCAGCTTCTGAAAGAGCTA 7646::9::99:98:87019698:987:889486&732.97887)896577784996997788889.87847%889969:997706999::9:####### NM:i:2 MD:Z:17A74G7 MC:Z:100M AS:i:90 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808445 129 1 70 60 100M = 70 0 AACTAACATGTAAAACCGCATGCTGGAAATGGAGGGGCAGTTACCAACACATGCTGAAAATTTACTCACACCCTCCTAAAATCAGCTTCTGAAAGAGCTA 7646::9::99:98:87019698:987:889486&732.97887)896577784996997788889.87847%889969:997706999::9:####### NM:i:2 MD:Z:17A74G7 MC:Z:100M AS:i:90 XS:i:0 RG:Z:ERR1013163
This removes any read that contains the 4 flag (e.g. 77, 141, etc.). You can filter on any other criteria using flags as well, e.g. only gets reads that map in proper pair:
samtools view -f 2 -h data/file.sam`
Note this uses -f
, not -F
, which
RETAINS reads with those flags rather than filtering
them!
Exercise: In the code block below, count how many reads in the SAM file are mapped in their proper pairs vs not proper pairs? Hint: remember, we can
pipe
towc -l
with|
to count the number of lines of a file!
## Write two commands to count the number of reads in the file that are properly paired and those that are not properly paired
# data/file.sam
samtools view -f 2 data/file.sam | wc -l
samtools view -F 2 data/file.sam | wc -l
## Write two commands to count the number of reads in the file that are properly paired and those that are not properly paired
## 0
## 644
Remember that SAM files are plain-text tab-delimited files that can be easily read. However, they can be huge and take up a lot of space on our servers, so a compressed version is used, called BAM files.
By default samtools view
outputs in SAM format, so
converting from BAM to SAM is as easy
as running
samtools view -h -o outfile.sam file.bam
For converting SAM to BAM, we can still use
samtools view
but also include the -b
option.
Run the code block below to convert samtools view output to BAM format and save it to a file:
samtools view -b -h -o file.bam data/file.sam
# samtools: A suite of programs to process SAM/BAM files
# view: The sub-program of samtools to execute
# -b: This option tells samtools view to convert the output to BAM format
# -h: Retain the header lines in the input SAM/BAM file in the output
# -o: This option tells samtools view to print the output to the provided file rather than to the screen
ls -l file.bam
# List the info for the BAM file we created to confirm it exists
## -rw-r--r-- 1 gthomas informatics 27274 Mar 22 18:42 file.bam
Note the use of the -h
argument in the command above. As
a reminder, in addition to the 11 tab separated fields, SAM files also
contain header lines that start with @
and
describe information about the sequences found in the file. The
-h
argument adds a header, which many programs require to
recognize a SAM/BAM file as properly formatted. Remember to include
it!
Let’s see what adding the -h
argument does, looking at
the BAM file we just created.
Run the code block below to display the first few lines of the BAM file to the screen
samtools view -h file.bam | head -n15
# samtools: A suite of programs to process SAM/BAM files
# view: The sub-program of samtools to execute
# -h: Retain the header lines in the input SAM/BAM file in the output
# | : The Unix pipe operator to pass output from one command as input to another command
# head: a Unix command to display the first few lines of the provided file
# -n 15: This option tells head to only display the first 15 lines of the file
## NOTE: ignore the "Broken pipe" and "error closing standard output" errors! This is just an artifact of our setup for the workshop today.
## @HD VN:1.6 SO:coordinate
## @SQ SN:1 LN:200000
## @SQ SN:2 LN:200000
## @SQ SN:3 LN:200000
## @RG ID:ERR1013163 SM:ERR1013163 PL:ILLUMINA
## @PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -M -t 3 -R @RG\tID:ERR1013163\tSM:ERR1013163\tPL:ILLUMINA data/genome/Tgut_subseg_renamed.fa 00_fastqFiltered/ERR1013163_fastp_1.fastq.gz 00_fastqFiltered/ERR1013163_fastp_2.fastq.gz
## @PG ID:samtools PN:samtools PP:bwa VN:1.10 CL:samtools view -Sb -
## @PG ID:samtools.1 PN:samtools PP:samtools VN:1.10 CL:samtools sort -o ERR1013163.sorted.bam ERR1013163.bam
## @PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.10 CL:samtools view -h -o ERR1013163.sorted.subset.bam -L subset.bed ERR1013163.sorted.bam
## @PG ID:samtools.3 PN:samtools PP:samtools.2 VN:1.13 CL:samtools view -h -o file.sam file.sorted.bam
## @PG ID:samtools.4 PN:samtools PP:samtools.3 VN:1.13 CL:samtools view -b -h -o file.bam data/file.sam
## @PG ID:samtools.5 PN:samtools PP:samtools.4 VN:1.13 CL:samtools view -h file.bam
## ERR1013163.116808442 65 1 13 60 37S63M = 13 0 TTTCCCACAGTGAAAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAA 77577899:99999899988988988:7989998998:7988998:888889988889::88:::8::;9999::99;:;8699<<<::8:<<:997688 NM:i:1 MD:Z:10T52 MC:Z:37S63M AS:i:58 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808442 129 1 13 60 37S63M = 13 0 TTTCCCACAGTGAAAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAA 77577899:99999899988988988:7989998998:7988998:888889988889::88:::8::;9999::99;:;8699<<<::8:<<:997688 NM:i:1 MD:Z:10T52 MC:Z:37S63M AS:i:58 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808448 65 1 13 60 24S76M = 13 0 AAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAACATGTAAAACCGC 7757999:98:8799:89997999:78989999898789989788997898989999:88899:9:988:8:9;;:98:<;<;<==<<<9;9::9:7517 NM:i:2 MD:Z:10T63A1 MC:Z:24S76M AS:i:69 XS:i:0 RG:Z:ERR1013163
We can see the files starts with a series of lines starting with
@
, meaning the header was properly added. Now let’s omit
the -h
and see what happens.
Run the code block below to view the first few lines of the BAM file without the header:
samtools view file.bam | head -n15
# samtools: A suite of programs to process SAM/BAM files
# view: The sub-program of samtools to execute
# | : The Unix pipe operator to pass output from one command as input to another command
# head: a Unix command to display the first few lines of the provided file
# -n 15: This option tells head to only display the first 15 lines of the file
## NOTE: ignore the "Broken pipe" and "error closing standard output" errors! This is just an artifact of our setup for the workshop today.
## ERR1013163.116808442 65 1 13 60 37S63M = 13 0 TTTCCCACAGTGAAAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAA 77577899:99999899988988988:7989998998:7988998:888889988889::88:::8::;9999::99;:;8699<<<::8:<<:997688 NM:i:1 MD:Z:10T52 MC:Z:37S63M AS:i:58 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808442 129 1 13 60 37S63M = 13 0 TTTCCCACAGTGAAAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAA 77577899:99999899988988988:7989998998:7988998:888889988889::88:::8::;9999::99;:;8699<<<::8:<<:997688 NM:i:1 MD:Z:10T52 MC:Z:37S63M AS:i:58 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808448 65 1 13 60 24S76M = 13 0 AAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAACATGTAAAACCGC 7757999:98:8799:89997999:78989999898789989788997898989999:88899:9:988:8:9;;:98:<;<;<==<<<9;9::9:7517 NM:i:2 MD:Z:10T63A1 MC:Z:24S76M AS:i:69 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808448 129 1 13 60 24S76M = 13 0 AAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAACATGTAAAACCGC 7757999:98:8799:89997999:78989999898789989788997898989999:88899:9:988:8:9;;:98:<;<;<==<<<9;9::9:7517 NM:i:2 MD:Z:10T63A1 MC:Z:24S76M AS:i:69 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808440 113 1 13 60 26S74M = 13 0 GAAAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAACATGTAAAACC 876679:<<<<=;<==;9;<::;;8;89::::9:9988:9:89887997898989998988997989877977789987897979897899:9:987566 NM:i:1 MD:Z:10T63 MC:Z:26S74M AS:i:69 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808440 177 1 13 60 26S74M = 13 0 GAAAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAACATGTAAAACC 876679:<<<<=;<==;9;<::;;8;89::::9:9988:9:89887997898989998988997989877977789987897979897899:9:987566 NM:i:1 MD:Z:10T63 MC:Z:26S74M AS:i:69 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808443 113 1 69 60 100M = 69 0 TAACTAACATGTAAAACCGCATGCTGGAAATGGAGGGGCAGTTACCAACACATGCTGAAAATTTACTCACACCCTCCTAAAATCAGCTTCTGAAAGAGCT 87689::;99:::99982;::98::8:::9;7::8879;:79:98;8:99:99999978888899988894879979987888999:9899999:88587 NM:i:2 MD:Z:18A74G6 MC:Z:100M AS:i:90 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808443 177 1 69 60 100M = 69 0 TAACTAACATGTAAAACCGCATGCTGGAAATGGAGGGGCAGTTACCAACACATGCTGAAAATTTACTCACACCCTCCTAAAATCAGCTTCTGAAAGAGCT 87689::;99:::99982;::98::8:::9;7::8879;:79:98;8:99:99999978888899988894879979987888999:9899999:88587 NM:i:2 MD:Z:18A74G6 MC:Z:100M AS:i:90 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808445 65 1 70 60 100M = 70 0 AACTAACATGTAAAACCGCATGCTGGAAATGGAGGGGCAGTTACCAACACATGCTGAAAATTTACTCACACCCTCCTAAAATCAGCTTCTGAAAGAGCTA 7646::9::99:98:87019698:987:889486&732.97887)896577784996997788889.87847%889969:997706999::9:####### NM:i:2 MD:Z:17A74G7 MC:Z:100M AS:i:90 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808445 129 1 70 60 100M = 70 0 AACTAACATGTAAAACCGCATGCTGGAAATGGAGGGGCAGTTACCAACACATGCTGAAAATTTACTCACACCCTCCTAAAATCAGCTTCTGAAAGAGCTA 7646::9::99:98:87019698:987:889486&732.97887)896577784996997788889.87847%889969:997706999::9:####### NM:i:2 MD:Z:17A74G7 MC:Z:100M AS:i:90 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808449 65 1 74 60 100M = 74 0 AACATGTAAAACCGCATGCTGGAAATGGAGGGGCAGTTACCAACACATGCTGAAAATTTACTCACACCCTCCTAAAATCAGCTTCTGAAAGAGCTATTAA 78588:8:998872999999977788978977688987:769789898989:8989889:8:988877898799:::999:99:::;9;;6778997773 NM:i:2 MD:Z:13A74G11 MC:Z:100M AS:i:90 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808449 129 1 74 60 100M = 74 0 AACATGTAAAACCGCATGCTGGAAATGGAGGGGCAGTTACCAACACATGCTGAAAATTTACTCACACCCTCCTAAAATCAGCTTCTGAAAGAGCTATTAA 78588:8:998872999999977788978977688987:769789898989:8989889:8:988877898799:::999:99:::;9;;6778997773 NM:i:2 MD:Z:13A74G11 MC:Z:100M AS:i:90 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808453 65 1 117 60 100M = 117 0 CACATGCTGAAAATTTACTCACACCCTCCTAAAATCAGCTTCTGAAAGAGCTATTAATGAGTTAGTAGAGATGAACATGGAAGAAGACTGCATAACCTGC 77489:9::9777977989898987798799979989889789987898999997979:9:99::9:99:9::99:;:;9;:=<<=<;<;;<<;996787 NM:i:2 MD:Z:45G42A11 MC:Z:100M AS:i:90 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808453 129 1 117 60 100M = 117 0 CACATGCTGAAAATTTACTCACACCCTCCTAAAATCAGCTTCTGAAAGAGCTATTAATGAGTTAGTAGAGATGAACATGGAAGAAGACTGCATAACCTGC 77489:9::9777977989898987798799979989889789987898999997979:9:99::9:99:9::99:;:;9;:=<<=<;<;;<<;996787 NM:i:2 MD:Z:45G42A11 MC:Z:100M AS:i:90 XS:i:0 RG:Z:ERR1013163
## ERR1013163.116808447 113 1 130 60 100M = 130 0 TTTACTCACACCCTCCTAAAATCAGCTTCTGAAAGAGCTATTAATGAGTTAGTAGAGATGAACATGGAAGAAGACTGCATAACCTGCATTGTAGCCAGCT 87689999:75,948;<;;;;::99957:9;88:9::::97:8:::888999::99898787:89798998:889989987979879:899::6898577 NM:i:3 MD:Z:32G42A13C10 MC:Z:100M AS:i:85 XS:i:0 RG:Z:ERR1013163
Note that if we don’t specify that we want to print the header,
samtools view
will omit it!
Exercise: In the code block below, use
samtools view
to display ONLY the header of the BAM file (Hint: Read the help menu or man page ofsamtools view
!).
## Use samtools view to display ONLY the header of the BAM file
# file.bam
samtools view -H file.bam
## Use samtools view to display ONLY the header of the BAM file
## @HD VN:1.6 SO:coordinate
## @SQ SN:1 LN:200000
## @SQ SN:2 LN:200000
## @SQ SN:3 LN:200000
## @RG ID:ERR1013163 SM:ERR1013163 PL:ILLUMINA
## @PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -M -t 3 -R @RG\tID:ERR1013163\tSM:ERR1013163\tPL:ILLUMINA data/genome/Tgut_subseg_renamed.fa 00_fastqFiltered/ERR1013163_fastp_1.fastq.gz 00_fastqFiltered/ERR1013163_fastp_2.fastq.gz
## @PG ID:samtools PN:samtools PP:bwa VN:1.10 CL:samtools view -Sb -
## @PG ID:samtools.1 PN:samtools PP:samtools VN:1.10 CL:samtools sort -o ERR1013163.sorted.bam ERR1013163.bam
## @PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.10 CL:samtools view -h -o ERR1013163.sorted.subset.bam -L subset.bed ERR1013163.sorted.bam
## @PG ID:samtools.3 PN:samtools PP:samtools.2 VN:1.13 CL:samtools view -h -o file.sam file.sorted.bam
## @PG ID:samtools.4 PN:samtools PP:samtools.3 VN:1.13 CL:samtools view -b -h -o file.bam data/file.sam
## @PG ID:samtools.5 PN:samtools PP:samtools.4 VN:1.13 CL:samtools view -H file.bam
Many functions of samtools
, as well as many programs
that do downstream analysis on BAM files, require that your BAM file be
sorted by sequence (e.g. chromosome, if mapping to an assembly) and
position, and also indexed to be searchable. We can accomplish both of
these using two other functions of samtools
,
sort
and index
. Their syntax is as
follows.
Run the code block below to create a sorted BAM file from our original, and then indexes the sorted file.
samtools sort -o file.sorted.bam file.bam
# samtools: A suite of programs to process SAM/BAM files
# sort: The sub-program of samtools to execute
# -o: This option tells samtools view to print the output to the provided file rather than to the screen
ls -l file.sorted.bam
# List the info for the BAM file we created to confirm it exists
samtools index file.sorted.bam
# samtools: A suite of programs to process SAM/BAM files
# index: The sub-program of samtools to execute
ls -l file.sorted.bam.bai
# List the info for the index file we created to confirm it exists
## -rw-r--r-- 1 gthomas informatics 27290 Mar 22 18:42 file.sorted.bam
## -rw-r--r-- 1 gthomas informatics 256 Mar 22 18:42 file.sorted.bam.bai
For the sort
, the -o
argument gives the
name of the desired output file. For index
, we only have to
provide the name of the BAM file we want to index and it will
automatically create an output file with the same name as the input plus
the .bai
suffix, indicating that it is the corresponding
index.
PRACTICE: Putting it all together!
That was a lot of info! Let’s take everything that we have learned and organize it into what a typical workflow might look like. Assume that we have already aligned the data and the aligner we used outputs the alignment file in SAM format. We want to go from our initial SAM file and end up with a sorted, indexed BAM file with only the mapped reads retained. Try inputting the commands yourself, then we will walk through it together.
Exercise: 1. Convert the
data/file.sam
SAM file to BAM format while retaining the header and removing unmapped reads, then sort the file. Call the new filefile.mapped.sorted.bam
. Hint:samtools
commands can also make use of pipes (|
) to avoid writing intermediate files! 2. Index the newly created sorted BAM file.
## Convert to BAM with header and without unmapped reads and then sort
# data/file.sam
samtools view -h -b -F 4 data/file.sam | samtools sort -o file.mapped.sorted.bam
## Convert to BAM with header and without unmapped reads and then sort
ls -l file.mapped.sorted.bam
# List the info for the BAM file we created to confirm it exists
## Index the new BAM file
samtools index file.mapped.sorted.bam
##
ls -l file.mapped.sorted.bam.bai
# List the info for the index file we created to confirm it exists
## -rw-r--r-- 1 gthomas informatics 27295 Mar 22 18:42 file.mapped.sorted.bam
## -rw-r--r-- 1 gthomas informatics 256 Mar 22 18:42 file.mapped.sorted.bam.bai
You should see files named file.mapped.sorted.bam
and
file.mapped.sorted.bam.bai
!
Now that our BAM is sorted and indexed, we can begin pulling useful information out of it. Let’s look at several applications, using some of SAMtools (many) other functions.
The third and fourth columns of a SAM/BAM file denote the name of the
reference sequence and the starting position that the query sequence
(i.e. the sequence from your FASTA/FASTQ file) maps to. By default
samtools view
prints all alignments, but you can specify a
specific chromosome by adding the chromosome name at the end of the
command.
Run the following command to display only alignments from chromosome 2:
samtools view file.mapped.sorted.bam 2 | head
# samtools: A suite of programs to process SAM/BAM files
# view: The sub-program of samtools to execute
# 2: This tells samtools view to only display alignments from this region
# | : The Unix pipe operator to pass output from one command as input to another command
# head: a Unix command to display the first few lines of the provided file
## NOTE: ignore the "Broken pipe" and "error closing standard output" errors! This is just an artifact of our setup for the workshop today.
## ERR1013163.20585831 65 2 1 60 57S43M = 1 0 CAGAGAGAACAAGGAGAAGCTTTGTCCCACAGGGAAACCTTCCAGATGCACCACGAGGACTTGGGTGACTGAACAACTTCCCAAGCCCCAAATCCTTGTG 78689::89989959:9998986:9967988986889869797998:9999698165679:987647669:9:99:799988589*5698:8::797679 NM:i:1 MD:Z:11A31 MC:Z:57S43M AS:i:38 XS:i:0 RG:Z:ERR1013163
## ERR1013163.20585831 129 2 1 60 57S43M = 1 0 CAGAGAGAACAAGGAGAAGCTTTGTCCCACAGGGAAACCTTCCAGATGCACCACGAGGACTTGGGTGACTGAACAACTTCCCAAGCCCCAAATCCTTGTG 78689::89989959:9998986:9967988986889869797998:9999698165679:987647669:9:99:799988589*5698:8::797679 NM:i:1 MD:Z:11A31 MC:Z:57S43M AS:i:38 XS:i:0 RG:Z:ERR1013163
## ERR1013163.20585832 65 2 1 60 28S72M = 1 0 CAGGGAAACCTTCCAGATGCACCACGAGGACTTGGGTGACTGAACAACTTCCCAAGCCCCAAATCCTTGTGGTGCCTTTGGAAGTTTTCTCACTTCAGAC 7767799997978788899898798289788979778:8899878978978779999777999:98:9:8:79987988:8:9:9:<:;;;<:;9:8778 NM:i:2 MD:Z:11A40A19 MC:Z:28S72M AS:i:62 XS:i:0 RG:Z:ERR1013163
## ERR1013163.20585832 129 2 1 60 28S72M = 1 0 CAGGGAAACCTTCCAGATGCACCACGAGGACTTGGGTGACTGAACAACTTCCCAAGCCCCAAATCCTTGTGGTGCCTTTGGAAGTTTTCTCACTTCAGAC 7767799997978788899898798289788979778:8899878978978779999777999:98:9:8:79987988:8:9:9:<:;;;<:;9:8778 NM:i:2 MD:Z:11A40A19 MC:Z:28S72M AS:i:62 XS:i:0 RG:Z:ERR1013163
## ERR1013163.20585833 65 2 1 60 55S45M = 1 0 GAGAGAACAAGGAGAAGCTTTGTCCCACAGGGAAACCTTCCAGATGCACCACGAGGACTTGGGTGACTGAACAACTTCCCAAGCCCCAAATCCTTGTGGT 776799999897898899977:987798997787787979799899888688299899:8;889:99::8:8::8:9:78:9:9888:98;;9;9:7767 NM:i:1 MD:Z:11A33 MC:Z:55S45M AS:i:40 XS:i:0 RG:Z:ERR1013163
## ERR1013163.20585833 129 2 1 60 55S45M = 1 0 GAGAGAACAAGGAGAAGCTTTGTCCCACAGGGAAACCTTCCAGATGCACCACGAGGACTTGGGTGACTGAACAACTTCCCAAGCCCCAAATCCTTGTGGT 776799999897898899977:987798997787787979799899888688299899:8;889:99::8:8::8:9:78:9:9888:98;;9;9:7767 NM:i:1 MD:Z:11A33 MC:Z:55S45M AS:i:40 XS:i:0 RG:Z:ERR1013163
## ERR1013163.20585838 65 2 1 60 6S92M2S = 1 0 CACGAGGACTTGGGTGACTGAACAACTTCCCAAGCCCCAAATCCTTGTGGTGCCTTTGAAAGTTTTCTCACTTCAGACTCAGGGCTGGAATCTGACATGA 77508:8:9:99877:889:89897988967985876688799799:7:77996:899877:859999999;989:996999779:699;:;9:98869# NM:i:1 MD:Z:11A80 MC:Z:6S92M2S AS:i:87 XS:i:0 RG:Z:ERR1013163
## ERR1013163.20585838 129 2 1 60 6S92M2S = 1 0 CACGAGGACTTGGGTGACTGAACAACTTCCCAAGCCCCAAATCCTTGTGGTGCCTTTGAAAGTTTTCTCACTTCAGACTCAGGGCTGGAATCTGACATGA 77508:8:9:99877:889:89897988967985876688799799:7:77996:899877:859999999;989:996999779:699;:;9:98869# NM:i:1 MD:Z:11A80 MC:Z:6S92M2S AS:i:87 XS:i:0 RG:Z:ERR1013163
## ERR1013163.20585843 65 2 1 60 62S38M = 1 0 TTCACCAGAGAGAACAAGGAGAAGCTTTGTCCCACAGGGAAACCTTCCAGATGCACCACGAGGACTTGGGTGACTGAACAACTTCCCAAGCCCCAAATCC 775988::99888789797898799977998779899778788797879989989879838:799:9:8678888:988889:9:8899;:999977686 NM:i:1 MD:Z:11A26 MC:Z:62S38M AS:i:33 XS:i:0 RG:Z:ERR1013163
## ERR1013163.20585843 129 2 1 60 62S38M = 1 0 TTCACCAGAGAGAACAAGGAGAAGCTTTGTCCCACAGGGAAACCTTCCAGATGCACCACGAGGACTTGGGTGACTGAACAACTTCCCAAGCCCCAAATCC 775988::99888789797898799977998779899778788797879989989879838:799:9:8678888:988889:9:8899;:999977686 NM:i:1 MD:Z:11A26 MC:Z:62S38M AS:i:33 XS:i:0 RG:Z:ERR1013163
This pulls out reads from the file that map to the chromosome named
‘2’. If we wanted to be even more specific, we can add coordinates of a
sub-region on the chromosome, the syntax for which looks like this:
name:start-end
.
Exercise: In the code block below, write a command to display only regions that align to chromosome 2 from position 200 to position 1000. Pipe the output to head to cut down on screen output
## View only regions mapped to chromosome 2, position 200 to 1000. Pipe to head.
# file.mapped.sorted.bam
samtools view file.mapped.sorted.bam 2:200-1000 | head
## View only regions mapped to chromosome 2, position 200 to 1000. Pipe to head.
## NOTE: ignore the "Broken pipe" and "error closing standard output" errors! This is just an artifact of our setup for the workshop today.
## ERR1013163.20585841 65 2 102 60 100M = 102 0 CAAGGTGCACTTTTAAAAAGCATTTCAAGGTTTTCTGCACAATCCTTGAATATTTGCTTTGGCTTCAGAACTTAACCAGGAGCAAACCAAACCTACCTCC 776979;9:997779777798997789897989889999897987979879997799977:89999::989:8;::8;<:;<;=9<;8;:::9;:96787 NM:i:3 MD:Z:5C8T33G51 MC:Z:100M AS:i:85 XS:i:0 RG:Z:ERR1013163
## ERR1013163.20585841 129 2 102 60 100M = 102 0 CAAGGTGCACTTTTAAAAAGCATTTCAAGGTTTTCTGCACAATCCTTGAATATTTGCTTTGGCTTCAGAACTTAACCAGGAGCAAACCAAACCTACCTCC 776979;9:997779777798997789897989889999897987979879997799977:89999::989:8;::8;<:;<;=9<;8;:::9;:96787 NM:i:3 MD:Z:5C8T33G51 MC:Z:100M AS:i:85 XS:i:0 RG:Z:ERR1013163
## ERR1013163.20585839 113 2 117 60 100M = 117 0 AAAAGCATTTCAAGGTTTTCTGCACAATCCTTGAATATTTGCTTTGGCTTCAGAACTTAACCAGGAGCAAACCAAACCTCCCTCCTGTCCTGGCAGTAAG #############################79887885:87*7/90(48988995705.898::09899888797887:849988998,7:989;988587 NM:i:2 MD:Z:33G45A20 MC:Z:100M AS:i:90 XS:i:0 RG:Z:ERR1013163
## ERR1013163.20585839 177 2 117 60 100M = 117 0 AAAAGCATTTCAAGGTTTTCTGCACAATCCTTGAATATTTGCTTTGGCTTCAGAACTTAACCAGGAGCAAACCAAACCTCCCTCCTGTCCTGGCAGTAAG #############################79887885:87*7/90(48988995705.898::09899888797887:849988998,7:989;988587 NM:i:2 MD:Z:33G45A20 MC:Z:100M AS:i:90 XS:i:0 RG:Z:ERR1013163
## ERR1013163.20585840 113 2 131 60 100M = 131 0 GTTTTCTGCACAATCCTTGAATATTTGCTTTGGCTTCAGAACTTAACCAGGAGCAAACCAAACCTACCTCCTGTCCTGGCAGTAAGGAAACCTAAAGTCA 887789<::::::9:;;9:;9<<8:::;::979998::98:99:897:97999:7877:87879897998998979979998988698888::9997587 NM:i:1 MD:Z:19G80 MC:Z:100M AS:i:95 XS:i:0 RG:Z:ERR1013163
## ERR1013163.20585840 177 2 131 60 100M = 131 0 GTTTTCTGCACAATCCTTGAATATTTGCTTTGGCTTCAGAACTTAACCAGGAGCAAACCAAACCTACCTCCTGTCCTGGCAGTAAGGAAACCTAAAGTCA 887789<::::::9:;;9:;9<<8:::;::979998::98:99:897:97999:7877:87879897998998979979998988698888::9997587 NM:i:1 MD:Z:19G80 MC:Z:100M AS:i:95 XS:i:0 RG:Z:ERR1013163
## ERR1013163.118319370 65 2 132 60 100M = 132 0 TTTTCTGCACAATCCTTGAATATTTGCTTTGGCTTCAGAACTTAACCAGGAGCAAACCAAACCTACCTCCTGTCCTGGCAGTAAGGAAACCTAAAGTCAT 77679:9898:::9798:8798988:898897997969:889788879:78989871788797:99798):99889:87;8999:7:::77::9996584 NM:i:1 MD:Z:18G81 MC:Z:100M AS:i:95 XS:i:0 RG:Z:ERR1013163
## ERR1013163.118319370 129 2 132 60 100M = 132 0 TTTTCTGCACAATCCTTGAATATTTGCTTTGGCTTCAGAACTTAACCAGGAGCAAACCAAACCTACCTCCTGTCCTGGCAGTAAGGAAACCTAAAGTCAT 77679:9898:::9798:8798988:898897997969:889788879:78989871788797:99798):99889:87;8999:7:::77::9996584 NM:i:1 MD:Z:18G81 MC:Z:100M AS:i:95 XS:i:0 RG:Z:ERR1013163
## ERR1013163.118319364 113 2 137 60 100M = 137 0 TGCACAATCCTTGAATATTTGCTTTGGCTTCAGAACTTAACCAGGAGTAAACCAAACCTACCTCCTGTCCTGGCAGTAAGGAAACCTAAAGTCATGATGA 9777:9;;9;<<<<<;;;;;;<;::8:;89::98::8:897997898977979878799:7997998879979:989797877979977889::989577 NM:i:2 MD:Z:13G33C52 MC:Z:100M AS:i:90 XS:i:0 RG:Z:ERR1013163
## ERR1013163.118319364 177 2 137 60 100M = 137 0 TGCACAATCCTTGAATATTTGCTTTGGCTTCAGAACTTAACCAGGAGTAAACCAAACCTACCTCCTGTCCTGGCAGTAAGGAAACCTAAAGTCATGATGA 9777:9;;9;<<<<<;;;;;;<;::8:;89::98::8:897997898977979878799:7997998879979:989797877979977889::989577 NM:i:2 MD:Z:13G33C52 MC:Z:100M AS:i:90 XS:i:0 RG:Z:ERR1013163
One of the most common things you will want to know about your mapped reads is their coverage and depth, as this can impact your confidence in the assembly, the validity of your SNP calls, etc. “Coverage” is defined as the percentage of positions that have at least one base aligned to it (think of it as how much sequence is covered by mapped reads), while “depth” can be thought of as the redundancy of coverage (i.e. how many bases are aligned to a particular sequence). There are many approaches you can take to calculate coverage and depth, several of which you can do with SAMtools.
samtools coverage
: for each contig/scaffold in the
BAM/SAM file, outputs several useful summary stats as a table.
Run the code block below to calculate coverage in the BAM file:
samtools coverage file.mapped.sorted.bam
# samtools: A suite of programs to process SAM/BAM files
# coverage: The sub-program of samtools to execute
## #rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
## 1 1 200000 214 1064 0.532 0.10194 23.6 59.4
## 2 1 200000 148 1071 0.5355 0.07122 23.2 59.9
## 3 1 200000 276 1098 0.549 0.13597 23.1 60
Like with samtools view
, can also specify coordinates,
although (annoyingly) the syntax is slightly different, in that you have
to specify the -r
option before the region.
Exercise: In the code block below, write a command calculate coverage only of regions that align to chromosome 2 from position 200 to position 1000. Don’t forget -r here!
## Use the -r option with samtools coverage to display coverage info only for chromosome 2, positions 200 to 1000
# file.mapped.sorted.bam
samtools coverage file.mapped.sorted.bam -r 2:200-1000
## Use the -r option with samtools coverage to display coverage info only for chromosome 2, positions 200 to 1000
## #rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
## 2 200 1000 116 801 100 12.4694 23.2 59.8
As a quick way to visualize coverage, you can use the -m
option create a histogram of coverage over a contig.
Run the code block below to calculate coverage for a region and display a text-based histogram:
samtools coverage -m file.mapped.sorted.bam -r 1:1-1000
# samtools: A suite of programs to process SAM/BAM files
# coverage: The sub-program of samtools to execute
# -m: This option tells samtools coverage to display a text histogram, rather than a table
## 1 (200.0Kbp)
## > 90.00% │ ████████▅█████████████████████████ ████│ Number of reads: 214
## > 80.00% │ ██████████████████████████████████ ████│ (6 filtered)
## > 70.00% │ ██████████████████████████████████ ████│ Covered bases: 969bp
## > 60.00% │ ██████████████████████████████████ ████│ Percent covered: 96.9%
## > 50.00% │▂██████████████████████████████████ ████│ Mean coverage: 19.7x
## > 40.00% │███████████████████████████████████ ████│ Mean baseQ: 23.6
## > 30.00% │███████████████████████████████████ ████│ Mean mapQ: 59.4
## > 20.00% │███████████████████████████████████▆████│
## > 10.00% │████████████████████████████████████████│ Histo bin width: 25bp
## > 0.00% │████████████████████████████████████████│ Histo max bin: 100%
## 1 250 500 750 1.0K
This is a useful first-pass analysis or sanity check. However, a more
thorough way to evaluate coverage is to look at per-base coverage rather
than the average. For this you can use samtools depth
.
Run the code block below to calculate per-base coverage for a given region of the BAM file:
samtools depth -a file.mapped.sorted.bam -r 1:1000-2000 | head
# samtools: A suite of programs to process SAM/BAM files
# depth: The sub-program of samtools to execute
# -a: This tells samtools depth to display information for every position, even if coverage is 0
# | : The Unix pipe operator to pass output from one command as input to another command
# head: a Unix command to display the first few lines of the provided file
## NOTE: ignore the "Broken pipe" and "error closing standard output" errors! This is just an artifact of our setup for the workshop today.
## 1 1000 10
## 1 1001 10
## 1 1002 10
## 1 1003 10
## 1 1004 10
## 1 1005 10
## 1 1006 10
## 1 1007 10
## 1 1008 10
## 1 1009 10
This outputs a three column table, where the 1st column is the contig/scaffold/chromosome name, the 2nd is the position on that scaffold, and the 3rd is the depth (number of reads) over that base. This list is convenient for importing to programs like R, where you can plot e.g. a histogram showing the distribution of per-base depth, or distribution of depth over a contig.
Another useful function built into SAMtools is
samtools stats
, which gives some quick summary statistics
about your mapping reads. The amount of information it generates is
somewhat overkill in most cases, so we will just look at the summary,
with the help of our old friend grep
.
Run the code block below to calculate stats on our BAM file and view only certain lines:
samtools stats file.mapped.sorted.bam | grep '^SN' | cut -f 2-
# samtools: A suite of programs to process SAM/BAM files
# stats: The sub-program of samtools to execute
# | : The Unix pipe operator to pass output from one command as input to another command
# grep: The Unix string search command
# '^SN': The string to search for in the provided file with ^ being the regular expression that matches the beginning of a line in a file
## raw total sequences: 638 # excluding supplementary and secondary reads
## filtered sequences: 0
## sequences: 638
## is sorted: 1
## 1st fragments: 319
## last fragments: 319
## reads mapped: 638
## reads mapped and paired: 638 # paired-end technology bit set + both mates mapped
## reads unmapped: 0
## reads properly paired: 0 # proper-pair bit set
## reads paired: 638 # paired-end technology bit set
## reads duplicated: 0 # PCR or optical duplicate bit set
## reads MQ0: 0 # mapped and MQ=0
## reads QC failed: 0
## non-primary alignments: 6
## supplementary alignments: 0
## total length: 63800 # ignores clipping
## total first fragment length: 31900 # ignores clipping
## total last fragment length: 31900 # ignores clipping
## bases mapped: 63800 # ignores clipping
## bases mapped (cigar): 61918 # more accurate
## bases trimmed: 0
## bases duplicated: 0
## mismatches: 1052 # from NM fields
## error rate: 1.699021e-02 # mismatches / bases mapped (cigar)
## average length: 100
## average first fragment length: 100
## average last fragment length: 100
## maximum length: 100
## maximum first fragment length: 100
## maximum last fragment length: 100
## average quality: 23.3
## insert size average: 0.0
## insert size standard deviation: 0.0
## inward oriented pairs: 0
## outward oriented pairs: 0
## pairs with other orientation: 319
## pairs on different chromosomes: 0
## percentage of properly paired reads (%): 0.0
(cut
is another command line tool. Don’t worry too much
about what it is doing here, but in short it is just trimming off a
useless column at the start of the output)
A common task you might want to do is to extract a single sequence
from a FASTA file, e.g. you just want to work on a
small test chromosome for a pilot analysis. Although we learned how to
do this with grep
earlier today, samtools
also
has a way to do this, using the faidx
command. This command
also creates an index of a FASTA file
with the .fai
extension. This index is required for many
downstream analyses. However, if you provided with a region in the same
format we’ve seen before (scaffold:start-end), it will also extract that
sequence.
Exercise: Use samtools faidx to extract the sequence from chromsome 2R of the Drosophila melanogaster genome, positions 200 to 300.
## Extract chromsome 2R, positions 200 to 300 from D.mel genome
# data/dmel-all-chromosome-r6.50.simple.fasta
samtools faidx data/dmel-all-chromosome-r6.50.simple.fasta 2R:200-300
## Extract chromsome 2R, positions 200 to 300 from D.mel genome
## >2R:200-300
## ATGGGTGATTTTCCGATTTCAAGTACCAGACAAACAGAAGATACCTTATACAGATTATTT
## AAACCTGGTACACAAAAACAATAAATTGACTAAGTTATGTC
This is a very efficient way to extract sequences from a
FASTA file (because samtools
is very
efficient!).
As SAM/BAM files contain both the sequence and the quality
information for the aligned reads, it is very easy to convert back to
sequence files! Just use samtools fastq
or
samtools fasta
.
E.g. the following command will create a FASTQ file of only the mapped reads (as we are pulling from the BAM file where we filtered out the unmapped ones).
Run the following code block to convert the sequences in our BAM to FASTQ format and view only the first few lines of the output:
samtools fastq file.mapped.sorted.bam | head -n 12
# samtools: A suite of programs to process SAM/BAM files
# fastq: The sub-program of samtools to execute
# | : The Unix pipe operator to pass output from one command as input to another command
# head: a Unix command to display the first few lines of the provided file
# -n 12: This option tells head to only display the first 12 lines of the file
## NOTE: ignore the "Broken pipe" and "error closing standard output" errors! This is just an artifact of our setup for the workshop today.
## @ERR1013163.116808442/1
## TTTCCCACAGTGAAAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAA
## +
## 77577899:99999899988988988:7989998998:7988998:888889988889::88:::8::;9999::99;:;8699<<<::8:<<:997688
## @ERR1013163.116808442/2
## TTTCCCACAGTGAAAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAA
## +
## 77577899:99999899988988988:7989998998:7988998:888889988889::88:::8::;9999::99;:;8699<<<::8:<<:997688
## @ERR1013163.116808448/1
## AAGCTTTGAAGTTATGCTATTAGTGGAGCAGTGAAAACTGAGGACAGGATCTTATGTGAAGAATCTCCCTTTTCAGTTCTTAACTAACATGTAAAACCGC
## +
## 7757999:98:8799:89997999:78989999898789989788997898989999:88899:9:988:8:9;;:98:<;<;<==<<<9;9::9:7517
You can combine multiple sorted BAM/SAM files, which can be useful if you have done multiple rounds of mapping:
samtools merge file.bam file2.bam ...
Unless otherwise specified, the headers will also be merged.
That’s it for day 1! Join us tomorrow to learn about bed files and awk!