Welcome to the fourth day of the FAS Informatics Bioinformatics Tips and Tricks Workshop!
If you’re viewing this file on the website, you are viewing the final, formatted version of the workshop. The workshop itself will take place in the RStudio program and you will edit the file while executing code in the terminal. Please download the raw file here
Today you’ll learn more about how to write scripts, control the behavior of your scripts using loops and conditional statements, and more!
As with yesterday, we’ll create a symbolic link (analogous to a Windows shortcut) in your current working directory called “data4” that points to the workshop data directory.
Run the following commands in the terminal window to create links to the files in our data directory in your own data directory:
mkdir -p data4
ln -s -f /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2024/day4/* data4
# ln: The Unix link command, which can create shortcuts to folders and files at the provided path to the second provided path
# -s: This option tells ln to create a symbolic link rather than a hard link (original files are not changed)
# -f: This option forces ln to create the link
ls -l data4
# Show the details of the files in the new linked directory
We introduced scripting yesterday. If you need a refresher on how we
developed our first script, the data4 folder contains
different iterations of “snp-density-*.sh” that step through the
process.
Loops are useful if you want to repeat the same command multiple
times, while changing something specific about that command each time.
For example, on day 3 we counted the number of genes in the macaque
annotation .gff file, but what if we wanted to know the
number of genes in all the gff files across our data folders? Or what if
we wanted to extract the different regions of the genome, like introns
or exons from a gff file? Today we’ll learn how to do those
things using the “For” loop, which executes a sequence of commands once
for each item in a list of items.
Run the following code block by copy-pasting it to your terminal to see a for loop in action. You can also write out the code in the terminal, and note how the prompt changes as you type a multi-line command.
# a simple loop
for i in 1 2 3
do
echo $i
done
# for: loops over a list of values
# i: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
Note: bash also has
whileloops that execute the loop body while a condition evaluates to true.
The components of a loop are:
for keywordin keyworddo keyworddone keywordThe for loop will iterate over the elements in the list,
assigning each element to the variable name in turn. The commands in the
loop will be run once for each element in the list. The “$” sign is used
to access the value of the variable. After the for loop,
the variable will subsequently retain its value from the last iteration
of the loop.
What is the thought process that goes into writing a loop? First, we start with a command that we want to repeat multiple times. Then, we think about what we want to change each time we run the command. This is the variable. Then, we think about what values we want to assign to the variable. This is the list of elements the loop will iterate over. Finally, we put it all together.
When first testing out a loop, it’s a good idea to make use of the
echo command to see what your loop will do before it
actually does it.
Run the below code to see how echo can be used to do a trial run of a loop.
for i in 1 2 3
do
echo "some_command $i | another_command > $i-result.txt"
done
# for: loops over a list of values
# i: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
Let’s find out how many exercises there are in each RMarkdown document (student version).
Exercise: Using the steps above, write a loop that will print out how many exercises there are in each Rmarkdown document
data4/Biotips-workshop-2023-Day1-student.Rmd,data4/Biotips-workshop-2023-Day2-student.Rmd, anddata4/Biotips-workshop-2023-Day3-student.Rmd. You can use thegrep -ccommand to count the number of matches to “Exercise” in a file. When iterating over file name, it is a good idea to encase the variable name in double quotes so that any weird characters such as spaces in the file name are parsed correctly. Write your solution in the code block below.
# The command we want to repeat is: grep -c 'Exercise'
# The variable we want to change each time we run the command is: filename
# The values we want to assign to the variable are:
# data3/Biotips-workshop-2024-Day1-student.Rmd
# data3/Biotips-workshop-2024-Day2-student.Rmd
# data3/Biotips-workshop-2024-Day3-student.Rmd
# If we're unsure how it'll go, we can put echo in front of the command
# Put your solution here:
for filename in data3/Biotips-workshop-2024-Day1-student.Rmd data3/Biotips-workshop-2024-Day2-student.Rmd data3/Biotips-workshop-2024-Day3-student.Rmd
do
grep -c 'Exercise' "$filename"
done
# for: loops over a list of values
# i: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
Note: A note on using echo to test your code. Run the following two code chunks and compare the result:
for filename in data3/Biotips-workshop-2024-Day2-student.Rmd
do
echo "grep -c 'Exercise' \"$filename\" > $filename-count.txt"
done
for filename in data3/Biotips-workshop-2024-Day2-student.Rmd
do
echo grep -c 'Exercise' "$filename" > $filename-count.txt
done
What happened here? In the first version, the full command was
enclosed in double quotes, and treated as a string to print to the
terminal (note the \" to escape the inner double
quotes, and treat them as literal " characters within the
string instead of terminating / starting a string). In the second
version, the echo was only applied to the
grep, the output of which was then redirected (with
>) into the file
data3/Biotips-workshop-2024-Day2-student.Rmd-count.txt. You
can go to your data4/ folder in the Files pane and view
that text file. When using echo to test your code, be mindful of which
part of your code you are echoing.
Now let’s figure out a command to get the exercises themselves. Sometimes it’s easier to build out a command outside the loop, working on a single file at a time, so let’s start with just finding the exercises for the second day’s file.
Exercise: Awk can be used to capture multiple lines based on a pattern. The syntax for that is
awk /start/,/end/ file-to-match. Write a command that will print out the full exercise prompts indata3/Biotips-workshop-2024-Day2-student.Rmd. What should be your “start” and what should be your “end” in the awk command? Hint: there is an empty line after every exercise prompt. Write your solution in the code block below.
## An awk command that will print out all the exercises in the file data4/Biotips-workshop-2023-Day2-student.Rmd
awk '/Exercise/,/^`/' data3/Biotips-workshop-2024-Day2-student.Rmd
# OR
awk '/Exercise/,/^$/' data3/Biotips-workshop-2024-Day2-student.Rmd
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
Exercise:
Incorporate the working
awkcommand into a for loop, iterating over the same.Rmdfiles as before.Use
>>to append the output to a file calledexercises.txt. Where should you add this code? Write your solution in the code block below.
## A for loop that extracts all the exercises from the RMarkdown files and appends them to a file called exercises.txt
for filename in data3/Biotips-workshop-2024-Day1-student.Rmd data3/Biotips-workshop-2024-Day2-student.Rmd data3/Biotips-workshop-2024-Day3-student.Rmd
do
awk '/^> \*\*Exercise/,/^$/' "$filename" >> exercises.txt
done
# for: loops over a list of values
# filename: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
for filename in data3/*.Rmd
do
awk '/^> \*\*Exercise/,/^$/' "$filename" >> exercises.txt
done
Now you can see how useful loops can be to process multiple files at once!
Enumerating all the files in your for loop can get tedious and messy. Since some of our files all share the same extension, we can use wildcards to specify all the files that match that extension. Wildcards get processed before the loop is run so it essentially generates the list of files the loop iterates over.
Run the following code to see how wildcards work in a for loop:
for filename in data2/*.vcf
do
echo $filename
done
# for: loops over a list of values
# filename: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
Note: if the wildcard pattern does not match any existing path (file or directory) name, the pattern will not expand into a list, resulting in a string containing a literal
*. E.g., the following results in “data2/*.bcf” being echoed:
for filename in data2/*.bcf
do
echo $filename
done
# for: loops over a list of values
# filename: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
Exercise: Create a for loop that prints out the path for all the
.Rmdfiles we processed earlier. Write your solution in the code block below.
for filename in data3/*student.Rmd
do
echo $filename
done
# for: loops over a list of values
# filename: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
Last time, we practiced using a for loop in the command line to iterate over three .Rmd files and find the number of exercises in each. Today, we will start writing loops within scripts for added reproducibility and portability. In this section we’ll walk you through how a script comes together conceptually. It starts with an idea for a single command that gradually gets too complicated to keep track of in your head.
We have previously used the bedtools makewindows
function to generate a bed file with the coordinates of
the 100 kilobase windows in the scaffold for the Amazon molly genome. We
can also generate bed files for different subsets of
the annotation we have for the Amazon molly: coding exons, introns,
intergenic regions, UTRs. For example, to extract the exons from a
gff file, we can use the following awk
command:
awk 'BEGIN{OFS="\t"}; $3=="exon"{print $1, $4-1, $5}' data3/poeFor_NW_006799939.gff > exon.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
# > : The Unix redirect operator to write the output of the command to the following file
We can then repeat the command and substitute “exon” in both the awk
and output bed for “CDS”, “mRNA”, or “gene”, for a total of 4 different
bed files. We could just copy and paste the command and change the file
name, but that’s not very efficient. This is a good candidate for a
loop! In other bash commands, we’ve been able to use $ to
denote a shell variable that we wanted to switch. However, this won’t
work inside an awk command. This is because awk by default
doesn’t have access to shell variables, but we can pass them to an
awk script with the -v option. The syntax
below…
SHELL_VALUE="query"
awk -v variable_inside_awk=$SHELL_VALUE 'variable_inside_awk{print}' file-to-awk
… will become the following when it’s parsed and executed:
awk 'query{print}' file-to-awk
Exercise: Write a loop to create the 4 different bed files, but
echothe command since we won’t be running it just yet. Write your solution in the code block below.NOTE: When using
echoto display anawkcommand, some things may not display exactly how you think they should. This is becauseawkandbash(the terminal program) share some of the same syntax. For example, bothawkandbashuse the$to indicate variable names. When evaluating anawkcommand, the terminal knows to use the variable withinawk. However, when usingechoto display theawkcommand, the$is interpreted bybashinstead. This can lead to unexpected behavior if, for instance, yourawkcommand uses$3. If$3does not exist in yourbashenvironment an empty string will be displayed byecho, even though the command would evaluate correctly inawk.We can get around this by escaping these shared syntactic characters using the
\symbol. For example,$3will try to be evaluated bybash, but\$3will tell bash to instead literally use the “$” character.Keep this in mind when you display your commands with
echo.
for TYPE in exon CDS mRNA gene
do
echo "awk -v type=$TYPE 'BEGIN{OFS=\"\t\"}; \$3==type {print \$1, \$4-1, \$5}' data3/poeFor_NW_006799939.gff > $TYPE.bed"
done
# for: loops over a list of values
# TYPE: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
Since we’re creating four bed files, let’s organize
them by putting them all in a folder called poeFor_beds. So
we want to add a command for making a folder before running this for
loop using mkdir -p poeFor_beds. Now that the command is
more than just one line, we should probably put it in a script. In
RStudio create a new script from File –> New File –> Shell script.
Name it make-beds.sh. Now copy and paste what we have so
far into the script. Also add on
| bedtools sort -i - | bedtools merge before the redirect
to the .bed file. This will sort the bed file and merge
overlapping intervals. You new script should look like the
following:
Exercise: Write a script called
make-beds.shthat redirects the output of the for loop into a new directory calledpoeFor_beds. Also add in the sorting and merging of overlapping intervals. Write your solution in the code block below (and also modify it in your bash script).
#!/bin/bash
mkdir -p poeFor_beds
for TYPE in exon CDS mRNA gene
do
awk -v type=$TYPE 'BEGIN{OFS="\t"}; $3==type {print $1, $4-1, $5}' data3/poeFor_NW_006799939.gff | bedtools sort -i - | bedtools merge > poeFor_beds/$TYPE.bed
done
# for: loops over a list of values
# TYPE: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
chmod +x make-beds.sh
./make-beds.sh
## Run the make-beds.sh script
Now we can run the script by typing bash make-beds.sh in
the terminal. You should see 4 new bed files in the
poeFor_beds folder. Great! But what if we want to run this
script on a different gff file? We can add a variable for the gff file
name and change the script to use that variable instead of the
hard-coded file name.
Exercise: Recall what we learned about running bash scripts with variables in the previous session. Add a variable for the gff file name and change the script to use that variable instead of the hard-coded file name. Modify your bash script with your code (and also write it in the code block below if you want).
#!/bin/bash
GFF=$1
mkdir -p poeFor_beds
for TYPE in exon CDS mRNA gene
do
awk -v type=$TYPE 'BEGIN{OFS="\t"}; $3==type {print $1, $4-1, $5}' $GFF | bedtools sort -i - | bedtools merge > poeFor_beds/$TYPE.bed
done
# for: loops over a list of values
# TYPE: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
Now we can run the script by typing
bash make-beds.sh data3/poeFor_NW_006799939.gff in the
terminal. Use ls poeFor_beds to confirm that there are 4
bed files in that folder. Note that the previous bed files were
overwritten! Always use caution when running your scripts. In this case,
we are fine with the previous files being overwritten.
We are almost done; let’s make a few more using
bedtools complement, which reports all regions of the
genome NOT in an interval, and bedtools subtract, which
removes overlaps. For example, we don’t have an intergenic
feature in our GFF, but we can create one by getting everything that is
not a gene. And we don’t have a UTR feature or an
intron feature, but we can get them by taking our exons and
removing coding sequences (CDS) or by taking genes and removing
exons.
Exercise: Add the following lines to the end of your
make-bed.shscript and re-run the script.
bedtools complement -i poeFor_beds/gene.bed -g data3/poeFor_NW_006799939.fasta.fai > poeFor_beds/intergenic.bed
# bedtools: A suite of programs to process bed files
# complement: The sub-program of bedtools to execute
# -i: The input bed file
# -g: A genome size file, which needs two columns, the scaffold/chromosome name and the size of that feature
# > : The Unix redirect operator to write the output of the command to the following file
bedtools subtract -a poeFor_beds/gene.bed -b poeFor_beds/exon.bed | bedtools sort -i - | bedtools merge > poeFor_beds/intron.bed
bedtools subtract -a poeFor_beds/exon.bed -b poeFor_beds/CDS.bed | bedtools sort -i - | bedtools merge > poeFor_beds/UTR.bed
# Two commands that subtract bed intervals in one file (-b) from another (-a) and pipe both with | and - into other bedtools commands and
# finally redirect to an output bed file.
Run wc -l poeFor_beds/*.bed to see how many lines are in
each bed file. Note that here we have run a command on multiple files
without an explicit loop, thanks to regex!
In this section we’ll work on another script to continue our analysis
of the Amazon molly genome. We’ll write a script that will calculate the
SNP density for each of the different bed files we created in the
previous section. We already have code for this from the previous
session, which is reproduced below. (If you don’t have the file already,
create a file named snp-density.sh and copy the below code
into it.):
#!/bin/bash
BED=$1
VCF=$2
bedtools intersect -c -a $BED -b $VCF | awk 'BEGIN{snps=0; lens=0} {snps+=$4; lens+=$3-$2} END{print snps/lens}'
In order to run this, we have to specify a bed file and a VCF file as command line arguments.
Try running this script in your terminal using the command:
bash snp-density.sh poeFor_beds/exon.bed data3/poeFor_NW_006799939.vcf
It’s not bad, but it still requires a lot of typing if we want to run
this on the all the .bed files. Plus, we want to run this
and essentially generate a report summarizing the results all in one
place. To better wrap my brain around what kind of code I need, I like
to write what I want in plain English first, then in
pseudocode, and then in real code.
Pseudocode is a programming word meaning a sketch of
code that. It should be structured like the code you intend the write,
but without worrying about grammar or syntax.
Pseudocode is useful for creating a template that gets
filled in with real code. It can also help identify logical flow errors
and prevent laborious bug-fixing.
Here’s my plain English version:
“I want to loop over the an arbitrary list of bed files and run the snp-density-4.sh script on each one, and then print out the results in a file called snp-densities.txt. The script should be run like bash script-name.sh list-of-bed-files.txt vcf-file.vcf > snp-densities.txt”
Here’s my pseudocode (I didn’t necessarily write this up linearly, but it’s presented here in a linear fashion):
bed_list=a list of bed files
vcf_file=a vcf file
for each bed_file in the bed_list:
print out "the snp density for <bed_file> is:"
run bedtools intersect on the bed_file and the vcf_file then pipe it to the awk statement to get the snp density
print out "----" as a spacer
Now let’s start translating the pseudocode into real code. We’ll start with the for loop. We already know how to write a for loop, but we need to figure out how to get the list of bed files. How would you get a list (in the form of a text file) of the bed files we need process?
Exercise: Write a shell command (not script, this is too simple for a script) to get a list of the bed files we need to process and write it to
bed_files.txt. Write your solution in the code block below. Usewc -l bed_files.txtto check the number of.bedfiles. Put your command in the code block below.
## A shell command to get a list of bed files and save it to a file
ls poeFor_beds/*.bed > bed_files.txt
We’ve got the list of bed files we want. Let’s write out the skeleton
of the loop. Before when we were assembling the list of things to loop
over, we simply typed them out
(e.g. for TYPE in exon CDS mRNA gene;). However, now we
want to get the output from a command, in this case cat to
get the lines of the file and then pipe that to the for loop. We can do
this by storing the output of that command in a shell variable and
looping over that with this syntax:
for VAR in $(cat FILENAME)
This says that we’re going to run the cat command on our
file, but instead of displaying the contents of the file to the screen,
we’re going to substitute its output in our script using the
$() (command substitution) syntax. In this case,
we’re not storing the output of cat in a named variable
(e.g. x=$(cat FILENAME)), but instead using it directly in
our for loop. Now, in our loop, each line in
FILENAME will be iterated over and stored as
VAR.
Exercise: Write a for loop that iterates over each bed file in the
bed_files.txtfile and prints out the name of the bed file. Write your solution in the code block below.
## A for loop that prints out the name of the files in bed_files.txt
for BED in $(cat bed_files.txt)
do
echo $BED
done
# for: loops over a list of values
# BED: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
# $(cat bed_files.txt): this runs the cat command on the bed_files.txt file and uses it as input to the loop
Now we have all the parts we need to create our script.
Exercise: Write a script called
snp-density-5.shthat takes as input a file of bed files and a vcf file and finds the snp density of each bed file within the vcf. Write your solution in the code block below (and also modify it in your bash script). I’ve copied over the pseudocode to help you out.
# bed_list=a list of bed files
# vcf_file=a vcf file
# for each bed_file in the bed_list:
# print out "the snp density for <bed_file> is:"
# run bedtools intersect on the bed_file and the vcf_file then pipe it to the awk statement to get the snp density
# print out "----" as a spacer
#!/bin/bash
BEDFILES=$1
VCF=$2
for BEDFILE in $(cat $BEDFILES)
do
echo "SNP density for $BEDFILE:"
bedtools intersect -c -a $BEDFILE -b $VCF | awk 'BEGIN{snps=0; lens=0} {snps+=$4; lens+=$3-$2} END{if(lens > 0){print snps/lens}}'
echo "---"
done
# for: loops over a list of values
# BEDFILE: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
# $(cat bed_files.txt): this runs the cat command on the bed_files.txt file and uses it as input to the loop
Use
chmod +x snp-density-5.shto make the script executable. Then let’s run the script in the terminal:
bash snp-density-5.sh bed_files.txt data3/poeFor_NW_006799939.vcf > snp-densities.txt
You should see a file called snp-densities.txt in your
data3 folder. Let’s take a look at it using
cat snp-densities.txt. Congratulations! You’ve made a
working script that can be reused or just kept as a record of how you
processed your data. Can you think of ways in which the script might be
improved?
In this section, we’ll get into some more advanced bash scripting concepts. We’ll learn how to use conditional statements to control the flow of our scripts. Conditional statements can allow your script to make decisions based on the results of a command or the value of a variable, making it more responsive. Here is a simple example of a conditional statement:
FILE=bed_files.txt
if [[ -f "$FILE" ]]
then
echo "File $FILE exists."
fi
# if: a conditional statement which evaluates the expression contained in [[]] as true or false and executes the subsequent lines only if true
This conditional statement is made up of the following components:
if keyword[[ ]]).then keywordif command’s exit
status is zerofi keywordThe if keyword starts the conditional. The conditional
expression you’re checking is enclosed in the double square brackets. In
this case, the -f checks if a file exists and the
expression evaluates to true if it does. The then keyword
starts the commands that need to be run if the statement is true. Here
we’re just printing to the terminal a string that confirms that the file
named $FILE exists. The fi keyword ends the
conditional. If the file doesn’t exist, nothing happens.
Note: you may see scripts that use
[instead of[[for conditional expressions. This older syntax is standardized and more portable to other (non-bash) shells, though it supports only a subset of conditional expressions supported by[[.
An exit status is a small integer value returned returned by
a command to its caller upon completion. The if statement
interprets an exit status of 0 for its command to mean “true”, and
non-zero to mean “false”. The [[ command exits with status
0 if its conditional expression evaluates to true; otherwise, it exits
with status 1.
The shell variable
$?contains the exit status of the last command executed. Try copying/pasting each of the following lines individually, noting the exit status echoed to the screen for each (note the use of semicolons instead of newlines to separate statements for brevity):
true; echo $?
false; echo $?
[[ -f bed_files.txt ]]; echo $?
[[ -f bed_files.txt.nonexistent ]]; echo $?
As observed in the above code block, the true and
false shell commands silently exit with status 0 and
non-zero, respectively. For other commands, a zero exit status indicates
“success”, and a non-zero exit status indicates an error occurred.
if ls
then
echo "ls succeeded (exit status 0)!"
fi
if ls some-nonexistent-file
then
echo "should not see this!"
fi
# Note the use of the ! ("not") operator to negate the exit status of the command:
if ! ls some-nonexistent-file
then
echo "uh oh, ls failed!"
fi
Some commands assign different exit statuses to different error
conditions; see the program’s documentation for details (e.g., issue
man ls and check the “Exit status” section near the
bottom).
Exercise: Write a conditional statement that checks if the file
data3/poeFor_NW_006799939.vcfexists and prints out the header. Write your solution in the code block below.
## Write a conditional statement that checks if a VCF file exists and prints out the header
FILE=data3/poeFor_NW_006799939.vcf
if [[ -f "$FILE" ]]
then
bcftools view -h "$FILE"
fi
# if: a conditional statement which evaluates the expression contained in [[]] as true or false and executes the subsequent lines only if true
There are some particularly useful conditional tests built in to bash that let you check file properties (see the complete list here):
-e filenamefilename exists as a file or
directory or symlink
-f filenamefilename exists and is a regular
file, not a directory or symlink
-s filenamefilename exists and has a size
> 0
-d filenamefilename is a directory
You can also check string properties:
-z stringstring is empty or null (0-length)
-n stringstring is not empty/null
You might want to process a file only if it exists, or you might want
to run different commands depending on the size or number of files. In a
larger workflow, you can imagine that sometimes, you’re cleaning a FASTQ
file and all the reads get thrown out leaving you without an output
cleaned FASTQ. You might want to be able to check for that condition and
have the script report it to you and skip the subsquent steps. For the
purposes of this workshops, we’re going to be using conditionals for
basic error checking in our snp-density-5.sh script. To do
that, we will need to introduce if-else statements.
The if-else statement is similar to the if statement, but it has an additional set of commands to run if the boolean value is false. Here’s an example:
FILE="data3/poeFor_NW_006799939.vcf"
if [[ -f "$FILE" ]]; then
echo "$FILE exists."
else
echo "$FILE does not exist. Skipping subsequent steps."
fi
# if: a conditional statement which evaluates the expression contained in [[]] as true or false and executes the subsequent lines only if true
# else: a conditional statement which follows an if statement and executes the subsequent lines only if that statement was false
Exercise: Modify your
snp-density-5.shscript to include an if-else statment so that it checks if the.bedfile exists. If the file exists, the script should continue. If the file does not exist, the script should print a message saying it doesn’t exist. Test your script by modifyingbed_files.txtto include a file that doesn’t exist. Write your solution in the code block below (and also modify it in your bash script).
#!/bin/bash
BEDFILES=$1
VCF=$2
for BEDFILE in $(cat $BEDFILES)
do
if [[ -f "$BEDFILE" ]]
then
echo "SNP density for $BEDFILE:"
bedtools intersect -c -a $BEDFILE -b $VCF | awk 'BEGIN{snps=0; lens=0} {snps+=$4; lens+=$3-$2} END{if(lens > 0){print snps/lens}}'
echo "---"
else
echo "$BEDFILE does not exist. Skipping this file."
fi
done
This is fine, but we aren’t checking if the VCF file exists as well.
We want to modify the script to check if the VCF file exists and then
exit the script if it doesn’t. We can do this using the
exit command. The exit command exits the
script and returns a status code. By convention, a status code of 0
means the script exited successfully and any other number means there
was an error. Here is an example of using an exit command:
Copy the below code into a file named
exit.sh, make it executable withchmod +x exit.shand run it withbash exit.shto see how the exit command works. (Do not copy and paste it directly into terminal or you will exit the current terminal session and respawn a new session):
#!/bin/bash
for i in 1 2 3 4 5
do
if [[ $i -eq 3 ]]
then
echo "Exiting because i is $i"
exit 1
else
echo "i is $i"
fi
done
Exercise: Modify your
snp-density-5.shscript to check if the VCF file exists and exit the script if it doesn’t. Test your script by giving running it likebash snp-density-5.sh bed_files.txt fake_vcf.vcfWrite your solution in the code block below (and also modify it in your bash script). Hint: you may want to check if the VCF does NOT exist, which will save you an else statement. Also think about where you want to place this if statement to reduce the number of times the script performs the check.
#!/bin/bash
BEDFILES=$1
VCF=$2
if [[ ! -f "$VCF" ]]
then
echo "$VCF does not exist. Exiting script."
exit 1
fi
for BEDFILE in $(cat $BEDFILES)
do
if [[ -f "$BEDFILE" ]]
then
echo "SNP density for $BEDFILE:"
bedtools intersect -c -a $BEDFILE -b $VCF | awk 'BEGIN{snps=0; lens=0} {snps+=$4; lens+=$3-$2} END{if(lens > 0){print snps/lens}}'
echo "---"
else
echo "$BEDFILE does not exist. Skipping this file."
fi
done
We’ve come a long way since the our first one-liner script
snp-density-1.sh! Let’s copy snp-density-5.sh
to a new version called snp-density-final.sh where we’ll
put on the finishing touches that make the script easier to use, read,
and monitor.
Imagine you come back to this script 6 months later, vaguely
recalling that by running it, you can generate a report on SNP
densities. However, you don’t remember exactly how it works so you just
naively run bash snp-density-final.sh. You just get the
uninformative error below:
does not exist. Exiting script.
To save future you some confusion, we can use an if statement to
print out how the script is to be used if no arguments are given to it.
The syntax for checking if a variable is empty is
-z $VARIABLE.
Exercise: Use and
ifstatement to check whether bothBEDFILEandVCFare given as arguments. Modify your script to print out a usage message if they are not and exit the script. If they are, then proceed with the rest of the script. Write your solution in the code block below (and also modify it in your bash script).
#!/bin/bash
BEDFILES=$1
VCF=$2
if [[ -z $BEDFILES || -z $VCF ]]
then
echo "Usage: ./snp-density-final.sh <file of bed files to process> <vcf file>"
exit 1
fi
if [[ ! -f "$VCF" ]]
then
echo "$VCF does not exist. Exiting script."
exit 1
else
for BEDFILE in $(cat $BEDFILES)
do
if [[ -e $BEDFILE ]]
then
echo "SNP density for $BEDFILE:"
bedtools intersect -c -a $BEDFILE -b $VCF | awk 'BEGIN{snps=0; lens=0} {snps+=$4; lens+=$3-$2} END{if(lens > 0){print snps/lens}}'
echo "---"
else
echo "File $BEDFILE does not exist. Skipping!"
fi
done
fi
Bonus note on elif statements.
Sometimes, you may want to check for multiple exclusive conditions
consecutively. In those cases, using only if else can get complicated
because each if statement needs to be ended with a
fi and those are hard to keep track of. Imagine a script
that extracts certain metadata from different file types. If the file is
a .vcf, you might want to use bcftools to
process it. If the file is a .bam, you might want to use
samtools, etc. If the file is neither, you might want to
print an error message. Here is an example of how elif
followed by an else to catch everything else can be used to
manage that scenario:
You do not need to run the following code, just observe how it is structured.
#!/bin/bash
FILE=$1
if [[ $FILE == *.vcf ]]
then
echo "Processing VCF file with bcftools..."
# bcftools commands here
elif [[ $FILE == *.bam ]]
then
echo "Processing BAM file with samtools..."
# samtools commands here
else
echo "Unknown file type. Please provide a .vcf or .bam file."
exit 1
fi
In the last section we made our script more user friendly by adding a usage message if not all of the variables were given as arguments. We can also make our script more readable by adding a detailed description of how the script works in the file itself using comments. We highly encourage everyong to annotate their code with comments. It makes it easier for others (and future you) to understand. In the case of a script, documentation typically means adding a description of what the script does, what the inputs are, what the outputs are, and the author of the script to the top of the file. Here’s an example of what that might look like:
Exercise: Add a description of what the script does, what the inputs are, what the outputs are, and the author of the script to the top of your
snp-density-final.shscript. Write your solution in the code block below (and also modify it in your bash script).
#
# This script takes a file of bed files and a VCF file and calculates average SNPs per base for each bed file
# Usage: ./snp-density-final.sh <file of bed files to process> <vcf file>
#