Welcome to the third day of the FAS Informatics Bioinformatics Tips and Tricks Workshop!
If you’re viewing this file on the website, you are viewing the final, formatted version of the workshop. The workshop itself will take place in the RStudio program and you will edit and execute the code in this file. Please download the raw file here
Today we’re going to continue our tour and explanation of common genomics file formats and their associated tools, starting with GFF files, which are typically used to store gene annotations. We’ll then talk about VCF files, which are used to store variants.
In the second half of the workshop, we’ll introduce some basic concepts about bash scripting, in preparation for day 4 which will be all about scripting.
Like previous days, we’ll make a link to the source data files for
the day. Though for the beginning of the workshop we’ll still be working
with day 2 data in your data2
folder.
Run the code block below to create your link for the day 3 data:
mkdir -p data3
ln -s -f /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2024/day3/* data3
# ln: The Unix link command, which can create shortcuts to folders and files at the provided path to the second provided path
# -s: This option tells ln to create a symbolic link rather than a hard link (original files are not changed)
# -f: This option forces ln to create the link
ls -l data3
# Show the details of the files in the new linked directory
The format for encoding information about genic regions (commonly called a genome annotation) is the GFF format. GFF stands for General Feature Format. The specification has undergone two major revisions, with the most recent (and commonly-used) being GFF3. There is a related format, the GTF format, which stands for General Transfer Format but it is very similar to GFF (being an extension to GFF2) and not as commonly used, so we will only talk about GFF files today. While GFF/GTF files are commonly used to store information about genes, they can store any location information, and share some similarities with bed files (which we discussed last week).
GFF files are also tab delimited files, with each row in the file referencing a particular region in the genome and each column a piece of information about that feature This probably sounds similar to the bed format, but contains more required columns. GFF files by definition have the following columns:
.
+
(forward strand) or -
(reverse strand), but not occasionally one will see strand-ambiguous
annotations coded as .
or somewhat rarely
?
.0
, 1
, or 2
For more detailed information on GFF files, see the following links:
Let’s take a look at a GFF file and talk about it a bit.
Run the code block below to view the first few lines of a GFF file. Note that this is in Day 2’s data folder.
grep -v "biological_region" -m50 data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3
# grep: The Unix string search command
# -v: This option tells grep to print lines that DO NOT contain the following string
# "biological_region": The string to search for in the provided file - we just don't want to display these for this demonstration
# -m50: This option tells grep to only display the first 50 matches
We’ll just point out a couple of things. First, this file also has a
header, like a BAM file, though this
is not required for GFF files (note that the
##gff-version
directive is required by the
GFF3 specification, though in practice many parsers
accept GFF3 files without it). In general, the GFF
format is less standardized than others we’ve gone over in the workshop.
Next you’ll note that columns 1, 3, and 4 are the same three columns
(ALTHOUGH WITH DIFFERENT INTERVAL ENCODING) that define a
bed file, so GFF files are (sort of)
easy to convert to bed files, though with loss of
information. This also means some bedtools
programs can
process GFF files as well.
Features in a GFF file are generally
nested: genes are comprised of transcripts and transcripts are
comprised of exons, and CDS features have intervals that are a subset of
the exon intervals. One may also see features for stop and start codons,
UTRs and other non-coding sequence intervals of interest. All of these
features are encoded in this file and are usually linked to each other
by IDs in the last column, though this is not always standardized. This
can make the strand
column slightly confusing to work with
for features nested under the same parental feature. For features on the
positive strand (+
), it is straightforward: they are
ordered by start coordinate. For features nested under the same parental
feature on the negative strand (-
) though, the correct
order is the reverse sorting by the end coordinate. Many of the tools we
work with will consider and correct for strand, but it is always a good
thing to consider if you ever parse GFF files on your
own.
Because of all the quirks with GFF files, there are many tools out there to help process and analyze them, such as gffread and AGAT. We won’t be demonstrating these today though.
In the previous workshop, we demonstrated how to use
awk
, a scripting language that is very useful for
processing text. Similarly to how we used awk
for
BED files, we can also use it for GFF
files.
Run the following piece of code in the terminal:
awk 'BEGIN{FS="\t"} $3=="gene"{print}' data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3
Let’s go through each piece of the above command to refresh our
memories: - We use the BEGIN
pattern to tell
awk
to execute that command before any lines are read in
the file
- We set the field separator (i.e. FS
) as a tab
character, as GFF files are tab-delineated
- We provide the pattern $3=="gene"
, meaning that the
following code will only execute when the third column (indicated by
$3
) exactly matched the string gene
-
print
simply prints the line (only when the preceding
pattern is true)
In other words, this code simply prints all lines in the GFF file that are a gene annotation! Now let’s get a little more advanced:
Exercise: In the code block below, write an
awk
command that counts the number of genes in the macaque annotation. Be sure to only check the feature name column (third column) because any feature that has a gene as a parent will also have a “gene id” in the last column that would return that line if it was searched for the string “gene”. Also be sure to only get exact matches for the word “gene”, else pseudogenes might be included in the count:
## Write awk command to count the number of genes in the macaque annotation
# data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3
awk 'BEGIN{g=0} $3=="gene"{g++}; END{print g}' data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3
## Write awk command to count the number of genes in the macaque annotation
BONUS Exercise: In the code block below, write an
awk
command that calculates the average number of transcripts per gene in the macaque annotation. This requires initializing 2 counter variables at the beginning and searching for 2 patterns separately within yourawk
script, and then doing some math at the end.
## Write awk command to calculate average number of transcripts per gene
# data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3
awk 'BEGIN{g=0;t=0} {if($3=="gene"){g++};if($3=="mRNA"){t++}} END{print t, g, t/g}' data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3
awk 'BEGIN{g=0;t=0} $3=="gene"{g++}; $3=="mRNA"{t++} END{print t, g, t/g}' data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3
## Write awk command to calculate average number of transcripts per gene
Last week, we talked about bed files, and some tools to work with
them, especially the program bedtools
. The
bedtools
program can also work with GFF files natively.
We’ll return to the macaque SV dataset from last week for an
example.
In the context of our macaque SVs, a natural question would be how
many of the mutations affect genic regions, and may therefore affect
some cellular function. To know this, we’ll compare the SVs in our
.bed
file to the gene annotations in our .gff
file.
bedtools intersect
So, how many of our SVs in our macaque population overlap with genes?
For this we can use bedtools intersect
, which takes two
interval files (either bed or GFF) and
calculates how many of the features overlap. Even though it takes
GFF as input, we need to parse out the gene
coordinates only.
Run the code block below to retrieve only the genes from the macaque annotation GFF file:
awk 'BEGIN{OFS="\t"} $3=="gene"{print "chr"$0}' data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3 > macaque-genes.gff
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
# > : The Unix redirect operator to write the output of the command to the following file
head macaque-genes.gff
# Display the first few lines of the new file with head
Note that we had to add the “chr” prefix to the beginning of the line. This is because the chromosome names are different between the two files. ID mismatches like this are a common cause of problems when working with different files from the same assembly.
Now we can get the overlaps between genes and SVs in our sample of macaques.
Run the code block below to use
bedtools intersect
to get the overlapping regions between two interval files:
bedtools intersect -a data2/macaque-svs-filtered.bed -b macaque-genes.gff > macaque-svs-genes-intersect.bed
# bedtools: A suite of programs to process bed files
# intersect: The sub-program of bedtools to execute
# -a : The first interval file to check for overlaps
# -b : The second interval file to check overlaps
wc -l data2/macaque-svs-filtered.bed
wc -l macaque-svs-genes-intersect.bed
# Use wc -l to count the number of lines in the original bed file and those in the bed file that overlaps with genes
Ok great, we’ve got only the SVs that overlap with genes in the macaque genome. Let’s take a look at this file.
Run the code block below to view the first few lines of the bed file with SVs that overlap with genes:
head macaque-svs-genes-intersect.bed
# Display the first few lines of the bed file containing SVs that overlap with genes
Exactly the same format as the input bed file, just with fewer lines.
bedtools intersect
can add additional columns with more
information about the overlap and overlaps can be defined more clearly.
Let’s try it out.
Exercise: *Read the documenation of
bedtools intersect
and do the following. Don’t save the output to a file, just pipe it towc -l
:
Count only the SVs that DO NOT overlap with any genes.
Count only SVs that have at least 90% of their sequence overlapping a gene.
Count only SVs that have at least 90% of their sequence overlapping a gene AND for which that overlap also encompasses at least 90% of the gene.
# data2/macaque-svs-filtered.bed
## Count SVs that DO NOT overlap with genes
bedtools intersect -v -a data2/macaque-svs-filtered.bed -b macaque-genes.gff | wc -l
## Count SVs that have at least 90% of their sequence overlap with a gene
bedtools intersect -f 0.9 -a data2/macaque-svs-filtered.bed -b macaque-genes.gff | wc -l
## Count SVs that have at least 90% of their sequence overlap with 90% of a gene's sequence (reciprocal overlap)
bedtools intersect -f 0.9 -r -a data2/macaque-svs-filtered.bed -b macaque-genes.gff | wc -l
bedtools intersect
can also output the actual features
that are overlapped with the amount of overlap with the -wo
option.
Run the code block below to perform an intersect between macaque SVs and genes with the
-wo
option:
bedtools intersect -wo -a data2/macaque-svs-filtered.bed -b macaque-genes.gff | head
# bedtools: A suite of programs to process bed files
# intersect: The sub-program of bedtools to execute
# -wo : A bedtools intersect option that specifies to write both features and the number of overlapping bases to the output file
# -a : The first interval file to check for overlaps
# -b : The second interval file to check overlaps
# | : The Unix pipe operator to pass output from one command as input to another command
Another important type of genomic coordinate file is the VCF file, or variant call format file. This is another tab-delimited file that contains variants between a single genome or a sample of genomes and a reference genome. These variants can include single nucleotide changes, small insertions and deletions, or even larger structural variants. The most common type of variation you’ll see in a VCF file are the single nucleotide ones, so the content of a VCF file is commonly referred to as SNP calls.
We’ll now be working with a VCF file that contains SNP data from some Amazon molly (Poecilia formosa), a freshwater fish. The original data consists of hundreds of thousands of SNPs across the molly genome among 16 individuals, but we’ll be looking only at SNPs on a single scaffold today (about 100,000).
Let’s take a look at our file and explain a bit about the format.
Run the code below to view the beginning of our VCF file:
head -n 43 data3/poeFor_NW_006799939.vcf
# head: A Unix command to display the first few lines of a file
# -n 43: This tells head to display exactly the first 43 lines of the provided file
So what we see here at the top of the file is the
header of the VCF. Much like headers in
BAM and SAM files, headers in
VCF files contain information about how this file was
created and the contents of the file. For example, the lines that begin
with ##FORMAT
describe the contents of a particular column
of the file, while the lines begin with ##FILTER
describe
what the filter labels in the Filter column mean. Unlike the
headers in BAM and SAM files, which
begin with the @
character, the headers in a
VCF file begin with the ##
string (like
GFF files).
Now that we know a bit about the header, let’s look at the actual content of the file, the tab-delimited rows containing SNP calls.
Run the code below to view the beginning of our VCF file while skipping header lines:
grep -v "##" data3/poeFor_NW_006799939.vcf | head -n 4 | less -S
# grep: The Unix string search command
# "##": The string to search for in the provided file - we just don't want to display these for this demonstration
# head: A Unix command to display the first few lines of a file
# | : The Unix pipe operator to pass output from one command as input to another command
# -n 2: This tells head to display exactly the first 2 lines of the provided file
# less -S: This brings up a "reader" that you can scroll horizontall. Press "q" to exit
The first line here, beginning with a single #
shows us
the names of the columns, which may be referred to in the file header
like we saw above. Each subsequent line represents one
variant relative to the reference genome, with the
following column definitions:
.
character.PASS
(or sometimes
.
).;
) delimited list of information.:
) delimited list of the keys for the
information in the subsequent columns. These keys are defined in the
file header. 10+. SAMPLE GENOTYPES: Each subsequent column contains a
colon (:
) delimited list that follows the keys in the
FORMAT column for the genotypes of the samples. Each column corresponds
to one sample with the name being the column header.The genotypes are encoded in the sample columns with the
GT
key. In this file, it is the first entry. For example,
with the keys defined in column 9, the genotype of the first sample:
./.:0,0:0:.:.:.:0,0,0:.
means that the GT (GENOTYPE from the header) is ./.
, the
AD (ALLELIC DEPTH) is 0,0, the DP (DEPTH) is 0, and so on.
Genotypes use the following syntax, with 0
being the
reference allele and 1
being the alternate allele:
0/0
means this sample is homozygous for the reference
allele.0/1
or 1/0
means this sample is
heterozygous for the reference allele (0) and the alternate allele
(1).1/1
means this sample is homozygous for the alternate
allele..
is used for missing; this site was not genotyped for
this individual, which happens to be the case for the sample outlined
above.For VCF files with multiple samples (like this one),
sites can have more than two alleles, in which case the genotypes
0/2
or 0/3
or 2/2
may be
seen.
There are many quirks to VCF files, but they are also highly standardized, so if you find yourself working with them be sure to read the very detailed specifications here:
.bcf
file
extension. BCF files are more efficient to read than VCF files, and are
typically compressed.There are a myriad of ways to use and analyze a VCF,
but for general processing of them, we will use bcftools
.
bcftools
springs under the same development umbrella as samtools
and
thus shares much of the same usage and underlying philosophy to act like
a native Unix command. It was preceded by vcftools, which still retains
some useful functionality, but is no longer developed.
Much like bedtools
, bcftools
has a wide
range of functions so we won’t be able to get to everything today. Be
sure to read the
documentation to find out if it can accomplish the task you need
with your VCF or BCF file.
One of the most basic things you can do to a file is to view it. Indeed, looking at your data is an important part of each step of a bioinformatics workflow.
Like samtools
, bcftools
has a dedicated
view
command.
Exercise:
Run the code below to
view
the top of the Amazon molly VCF file.Edit the
bcftools view
command to exclude the header from the output.
## Use the view command to display the VCF file, then edit it to exclude the header
## data3/poeFor_NW_006799939.vcf
bcftools view -H data3/poeFor_NW_006799939.vcf | head -n 10 | less -S
## NOTE: ignore the "cannot write" error! This is just an artifact of our setup for the workshop today.
So view
acts much the same way as the Unix command
cat
, but has specific functionality for the
VCF format, and also handles compressed
BCF files natively.
One useful function when view
ing files is to be able to
subset or filter them. We’ve provided a file that lists the sample IDs
for a made-up population and can use bcftools view
to
extract genotypes from specific samples.
Run the code below to extract SNPs only from a subset of samples:
cat data3/pop1.txt
# cat: A Unix command that displays the contents of a file in the Terminal
# This is just to see the pop1.txt file
bcftools view -S data3/pop1.txt -H data3/poeFor_NW_006799939.vcf | head | less -S
# bcftools: A suite of programs to process VCF files
# view: The sub-program of bcftools to execute
# -S : A bcftools view option that specifies to only extract genotype columns from the samples listed in the provided file
# -H : A bcftools view option that specifies to exclude the VCF header from the output
## NOTE: ignore the "cannot write" error! This is just an artifact of our setup for the workshop today.
This can be very handy, but be warned - the INFO fields are not automatically recalculated, and will no longer be correct once you remove some individuals. In general, the INFO column and its information displays information about the site while the FORMAT fields display information about each sample.
bcftools
and
bedtools
Many of the tools we’ve demonstrated, samtools
,
bedtools
, bcftools
, are general purpose tools
for basic processing and manipulation of specific file formats used in
genomics. Many of the tasks they do could be done with native
bash
commands, but that would require a lot of piping and
probably some custom scripting on uncompressed files. However, if you
really start to get to know these tools and their versatility, you can
do some more advanced analyses purely with them. Let’s demonstrate by
calculating SNP density across 100k windows the specific 7 Mb scaffold
we are working with from the Amazon molly genome.
To do this, we’ll need:
Since we’ve already downloaded the genome and generated the index
file, let’s start with the second one, getting the window coordinates.
We can do this easily with the bedtools getwindows
command:
Run the code below to generate a bed file with 100kb windows in the Amazon molly genome:
bedtools makewindows -g data3/poeFor_NW_006799939.fasta.fai -w 100000 > poeFor-windows-100k.bed
# bedtools: A suite of programs to process bed files
# makewindows: The sub-program of bedtools to execute
# -g : A genome size file, which needs two columns, the scaffold/chromosome name and the size of that feature
# -w : The size of the windows to partition each scaffold in -g into
# -b : The second interval file to check overlaps
# > : The Unix redirect operator to write the output of the command to the following file
head poeFor-windows-100k.bed
# Display the first few lines of the bed file containing windows
Now, we can use bedtools intersect
, which also works
with VCF files, to count how many SNPs appear in each
window.
Exercise: Use
bedtools intersect
to count the number of SNPs in each 100kb window in the Amazon molly genome. You will need to look in the help menu ofbedtools intersect
to find the right option to output a column with the number of overlaps per region! Save this file aspoeFor-windows-100k-snps.bed
.
## Write a bedtools intersect command to count how many SNPs overlap each 100kb window in the molly genome
## data3/poeFor_NW_006799939.vcf
## poeFor-windows-100k.bed
bedtools intersect -c -a poeFor-windows-100k.bed -b data3/poeFor_NW_006799939.vcf > poeFor-windows-100k-snps.bed
## Write a bedtools intersect command to count how many SNPs overlap each 100kb window in the molly genome
head poeFor-windows-100k-snps.bed
# Display the first few lines of the bed file containing windows and overlapping SNPs
BONUS Exercise: If you attended our R workshops this past Fall or know R already, use the code block below to read in the file we just generated with SNPs per 100k window and plot the density along this scaffold.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the tidyverse
## Write code in R to plot SNP density across the scaffold in our bed file
# poeFor-windows-100k-snps.bed
am_snps = read_tsv("poeFor-windows-100k-snps.bed", col_names=c("scaffold", "start", "end", "num.snps"))
## Rows: 74 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): scaffold
## dbl (3): start, end, num.snps
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Read the bed file containing 1Mb windows and number of SNPs per window, and name the columns
snp_plot = ggplot(am_snps, aes(x=start, y=num.snps)) +
# Initialize a ggplot grob and load the data and aesthetics into it and store it as the snp_plot object
geom_line() +
# Plot the provided data as a line graph
#facet_wrap(~scaffold, scales="free") +
# Split the plot by scaffold
theme_classic()
# Set a different theme
## Write code in R to plot SNP density across the scaffold in our bed file
print(snp_plot)
# Display the plot
So far, we’ve been focused on interactive analysis of small files. However, in real bioinformatics work, you will likely find yourself with many samples that you want to run the same commands on. And often these commands can require significant compute time, such that you want to be able to run these commands non-interactively. As nice as these markdown notebooks are for teaching and data exploration, they are not the ideal format for this task. Instead, what we want are scripts.
A bash script (like an R script, if you attended the R workshop earlier), is just a text file that executes a series of commands in order. Bash scripts can get exceedingly complicated, but today we are just going to introduce a few basic concepts.
We are ultimately going to make a script that takes as input a VCF file and one or more bed files, and reports as output the average SNP density in regions of the genome inside bed intervals. Let’s start simple.
Note: To follow every step for this instructor version, we
provide a separate script for each step titled
snp-density-N.sh
. In the student version you will update a
single script, snp-density.sh
. The intermediate scripts can
be downloaded individually at https://harvardinformatics.github.io/workshops/2024-spring/biotips/snp-density-0.sh
and by incrementing the number in the filename, (e.g. -0.sh, -1.sh,
etc)
Using the file menu, go to New file –> Shell script and create a new file. Call it
snp-density.sh
.This is going to be our script in a new tab of our text editor, that we will modify as we go.
The first thing we need to do is a little bit of shell magic to get
the shell to treat the script we are writing as a command. This is
called a shebang (#!
) and basically says “run the script
with the following interpreter”. Type the following
line at the top of your script:
#!/bin/bash
This means that when we run snp-density.sh
on the
command line, the commands in the script will be interpreted as shell
commands.
We need to do one more thing, which is make our script
executable. We’ll use the chmod
command
for this.
In the Terminal below, make sure you are in the same directory as your script file, and then type the following command.
chmod +x snp-density-0.sh
We now have a script we can run.
Try running your (empty) script. Note that the
./
specifies the relative path to the script, and means “execute the script snp-density.sh in the current working directory”. If omitted, you’ll get a “command not found” error.
chmod +x snp-density.sh
# Make the script executable
./snp-density.sh
# ./ : Execute the following file as a shell script
However, nothing will happen if we do, since there are currently no commands in the file to run.
Exercise: In order to have our shell script do anything, we need to put some commands in it. 1. Copy the
bedtools intersect
command from above into the script, but don’t write the output of the command to a file
Now, try running your script again, piping the output to head.
Try running your script.
./snp-density.sh | head
# ./ : Execute the following file as a shell script
If you’re lost, this is what your snp-density script should look like right now:
#!/bin/bash
bedtools intersect -c -a poeFor-windows-100k.bed -b data3/poeFor_NW_006799939.vcf
Now, all the commands to create our snp density windows are in the
script. Let’s say we wanted to calculate the average snp density across
all windows, with awk
. Here is an awk
command
that can do that:
./snp-density.sh | awk 'BEGIN{snps=0; lens=0} {snps+=$4; lens+=$3-$2} END{print snps/lens}'
If we run this, we should get a number now.
Let’s add the awk
command to our script now, so that
when we run snp-density.sh
the output will be a number with
the average snp density across windows.
#!/bin/bash
bedtools intersect -c -a poeFor-windows-100k.bed -b data3/poeFor_NW_006799939.vcf | awk 'BEGIN{snps=0; lens=0} {snps+=$4; lens+=$3-$2} END{print snps/lens}'
However, we have a problem. Ideally, we’d like to be able to do this calculation on any VCF file and any bed file. Right now those files are hard-coded, which means to change what file the script operates on requires changing the script.
To make our script flexible, we need to introduce the concept of
variables. We’ve seen this in awk
commands; we can also
create variables in the shell. We do this just by assigning something
(text or the output of a command, usually) to a string.
Run the commands in the block below in the Terminal to create a variable called TEXT, and print it out with the
echo
command.
TEXT="Hello World!"
# = : Assign the variable name on the left to the value on the right
echo $TEXT
# echo: A Unix command that simply prints the provided input to the screen
# $ : A symbol that the shell understands to mean that the following is the name of a variable
Note that when we access what is stored in the variable we use the
$
notation, but we don’t use this when we assign to a
variable. Note also we cannot have spaces before or after the equal sign
or the shell will thing we are trying to run the TEXT
command, not create a variable called TEXT
.
Exercise: 1. Modify our shell script,
snp-density.sh
, so that it includes two variables, one calledVCF
and one calledBED
. For now, define these in the second and third lines of the script with the paths to the BED file and the VCF file we currently have. 2. Replace the hard-coded file names in thebedtools intersect
command options-a
and-b
with theBED
andVCF
variables. 3. Copy the result into the code block below and run it to make sure it still works.
#!/bin/bash
BED="poeFor-windows-100k.bed"
VCF="data3/poeFor_NW_006799939.vcf"
bedtools intersect -c -a $BED -b $VCF | awk 'BEGIN{snps=0; lens=0} {snps+=$4; lens+=$3-$2} END{print snps/lens}'
You should get the same answer as before.
Note that right now, we aren’t actually changing anything, we’ve just moved the files we are going to operate on to the top of the script, but we still would need to change the script each time we run it on different files. To make our script more general (able to take various inputs more easily), we should have a way to change the input file without having to edit the script. The way to do this is with command line options. You’ve seen this a lot in the tools we’ve run up to this point. For instance, when we run almost any shell command, it requires a filename, e.g.:
head poeFor-windows-100k.bed
In this case, the file name poeFor-windows-100k.bed
, is
a command line option, also known as a command
line argument or just option,
argument, or parameter. Commands and
scripts can take multiple arguments. Some are
positional, meaning that the way they are read into the
program depends on their placement in the command. Others have specific
flags, meaning the way they are read into the program
depends on the flag they come after in the command. For instance:
head -n 20 poeFor-windows-100k.bed
This is a command that hopefully looks pretty understandable to you
by now. This command, head
has two command line
options as we’ve written it:
-n 20
, which changes the
number of lines the command displays. Internally, this sets the value of
some variable to a different number.poeFor-windows-100k.bed
, which is the last listed argument.
Again this internally sets the value of some variable to the path to
that file (actually, for most bash commands, the first un-flagged option
is usually reserved for the file name).Well, we can modify our script to take input from the command line as
well. Shell scripts have access to special variables that can be used to
create simple, positional, command line options.
Specifically, the variables $1
, $2
,
$3
, etc will contain the first, second, third, etc.
arguments specified after the script is called.
Exercise: 1. Modify your script to replace the path to the bed file with
$1
, and the path to the VCF file with$2
using positional command line arguments. 2. Run your script using./snp-density.sh poeFor-windows-100k.bed data3/poeFor_NW_006799939.vcf
3. Copy your working script to the code block below
#!/bin/bash
BED=$1
VCF=$2
bedtools intersect -c -a $BED -b $VCF | awk 'BEGIN{snps=0; lens=0} {snps+=$4; lens+=$3-$2} END{print snps/lens}'
If everything is working, you should get the same answer as before. But now we can give our script any bed file and any VCF file, right from the command line. Note that we don’t do any error checking - if you accidentally specify a VCF file that uses a different reference genome than your bed file, nothing in the script will prevent this.
This is better, but it still requires a lot of typing if we want to run this on 10 files. If, for example, we wanted to compute SNP density separately for each chromosome, or for different interval types (e.g., genes, introns, exons), we’d have to type out each bed file separately.
We can get around this by using loops, which we’ll cover next time.