Welcome to the third day of the FAS Informatics Bioinformatics Tips and Tricks Workshop!
If you’re viewing this file on the website, you are viewing the final, formatted version of the workshop. The workshop itself will take place in the RStudio program and you will edit and execute the code in this file. Please download the raw file here
Today we’re going to continue our tour and explanation of common genomics file formats and their associated tools by talking about variant files, that is files which indicate where variants in some sample relative to a reference are.
We’ll also introduce some basic concepts about bash scripting today, in preparation for tomorrow when we’ll talk about using Slurm and submitting jobs to the cluster.
mkdir -p data3
ln -s -f /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day3/* data3
# ln: The Unix link command, which can create shortcuts to folders and files at the provided path to the second provided path
# -s: This option tells ln to create a symbolic link rather than a hard link (original files are not changed)
# -f: This option forces ln to create the link
ls -l data3
# Show the details of the files in the new linked directory
## total 36
## lrwxrwxrwx 1 gthomas informatics 109 Mar 29 18:24 Biotips-workshop-2023-Day3-student.Rmd -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day3/Biotips-workshop-2023-Day3-student.Rmd
## lrwxrwxrwx 1 gthomas informatics 121 Mar 29 18:24 GCF_000485575.1_Poecilia_formosa-5.1.2_genomic.fna -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day3/GCF_000485575.1_Poecilia_formosa-5.1.2_genomic.fna
## lrwxrwxrwx 1 gthomas informatics 125 Mar 29 18:24 GCF_000485575.1_Poecilia_formosa-5.1.2_genomic.fna.fai -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day3/GCF_000485575.1_Poecilia_formosa-5.1.2_genomic.fna.fai
## lrwxrwxrwx 1 gthomas informatics 82 Mar 29 18:24 genomic.gff -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day3/genomic.gff
## lrwxrwxrwx 1 gthomas informatics 96 Mar 29 18:24 poeFor_NW_006799939.fasta -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day3/poeFor_NW_006799939.fasta
## lrwxrwxrwx 1 gthomas informatics 100 Mar 29 18:24 poeFor_NW_006799939.fasta.fai -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day3/poeFor_NW_006799939.fasta.fai
## lrwxrwxrwx 1 gthomas informatics 94 Mar 29 18:24 poeFor_NW_006799939.gff -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day3/poeFor_NW_006799939.gff
## lrwxrwxrwx 1 gthomas informatics 94 Mar 29 18:24 poeFor_NW_006799939.vcf -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day3/poeFor_NW_006799939.vcf
## lrwxrwxrwx 1 gthomas informatics 79 Mar 29 18:24 pop1.txt -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day3/pop1.txt
Another important type of genomic coordinate file is the VCF file, or variant call format file. This is another tab-delimited file that contains variants between a single genome or a sample of genomes and a reference genome. These variants can include single nucleotide changes, small insertions and deletions, or even larger structural variants. The most common type of variation you’ll see in a VCF file are the single nucleotide ones, so the content of a VCF file is commonly referred to as SNP calls.
We’ll now be working with a VCF file that contains SNP data from some Amazon molly (Poecilia formosa), a freshwater fish. The original data consists of hundreds of thousands of SNPs across the molly genome among 16 individuals, but we’ll be looking only at SNPs on a single scaffold today (about 100,000).
Let’s take a look at our file and explain a bit about the format.
Run the code block below to view the beginning of our VCF file:
head -n 43 data3/poeFor_NW_006799939.vcf
# head: A Unix command to display the first few lines of a file
# -n 43: This tells head to display exactly the first 43 lines of the provided file
## ##fileformat=VCFv4.2
## ##FILTER=<ID=PASS,Description="All filters passed">
## ##ALT=<ID=NON_REF,Description="Represents any possible alternative allele not already represented at this location by REF and ALT">
## ##FILTER=<ID=FS_SOR_filter,Description="(vc.isSNP() && ((vc.hasAttribute('FS') && FS > 60.0) || (vc.hasAttribute('SOR') && SOR > 3.0))) || ((vc.isIndel() || vc.isMixed()) && ((vc.hasAttribute('FS') && FS > 200.0) || (vc.hasAttribute('SOR') && SOR > 10.0)))">
## ##FILTER=<ID=LowQual,Description="Low quality">
## ##FILTER=<ID=MQ_filter,Description="vc.isSNP() && ((vc.hasAttribute('MQ') && MQ < 40.0) || (vc.hasAttribute('MQRankSum') && MQRankSum < -12.5))">
## ##FILTER=<ID=QUAL_filter,Description="QUAL < 30.0">
## ##FILTER=<ID=RPRS_filter,Description="(vc.isSNP() && (vc.hasAttribute('ReadPosRankSum') && ReadPosRankSum < -8.0)) || ((vc.isIndel() || vc.isMixed()) && (vc.hasAttribute('ReadPosRankSum') && ReadPosRankSum < -20.0)) || (vc.hasAttribute('QD') && QD < 2.0)">
## ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
## ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
## ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
## ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
## ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
## ##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
## ##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
## ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
## ##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
## ##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
## ##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
## ##GATKCommandLine=<ID=GenomicsDBImport,CommandLine="GenomicsDBImport --genomicsdb-workspace-path results/Poecilia_formosa/GCF_000485575.1/04_genomicsDB/DB_L47 --batch-size 25 --sample-name-map results/Poecilia_formosa/GCF_000485575.1/04_genomicsDB/DB_mapfile_L47 --genomicsdb-shared-posixfs-optimizations true --intervals results/Poecilia_formosa/GCF_000485575.1/intervalFiles/list47.list --tmp-dir tmp/ --genomicsdb-segment-size 1048576 --genomicsdb-vcf-buffer-size 16384 --overwrite-existing-genomicsdb-workspace false --consolidate false --validate-sample-name-map false --merge-input-intervals false --reader-threads 1 --max-num-intervals-to-import-in-parallel 1 --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 0 --cloud-index-prefetch-buffer 0 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false",Version="4.1.8.0",Date="July 28, 2022 12:36:49 AM GMT-05:00">
## ##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
## ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
## ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
## ##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
## ##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
## ##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
## ##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
## ##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
## ##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
## ##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
## ##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
## ##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
## ##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
## ##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
## ##INFO=<ID=RAW_MQandDP,Number=2,Type=Integer,Description="Raw data (sum of squared MQ and total depth) for improved RMS Mapping Quality calculation. Incompatible with deprecated RAW_MQ formulation.">
## ##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
## ##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
## ##contig=<ID=NW_006799939.1,length=7371162>
## ##source=GenomicsDBImport
## ##bcftools_viewVersion=1.12+htslib-1.12
## ##bcftools_viewCommand=view -S results/GCF_000485575.1/QC/final.samps.txt -t ^mtDNA -v snps -m2 -M2 -f .,PASS -e 'AF==1 | AF==0 | AF<0.01 | ALT="*" | F_MISSING > 0.75 | TYPE~"indel" | ref="N"' -O z -o results/GCF_000485575.1/QC/final_filtered.vcf.gz results/GCF_000485575.1/final.vcf.gz; Date=Fri Dec 23 20:36:54 2022
## ##bcftools_viewVersion=1.9+htslib-1.9
## ##bcftools_viewCommand=view final_filtered.vcf.gz NW_006799939.1; Date=Thu Mar 23 09:13:07 2023
So what we see here at the top of the file is the
header of the VCF. Much like headers in
BAM and SAM files, headers in
VCF files contain information about how this file was
created and the contents of the file. For example, the lines that begin
with ##FORMAT
describe the contents of a particular column
of the file, while the lines begin with ##FILTER
describe
what the filter labels in the Filter column mean. Unlike the
headers in BAM and SAM files, which
begin with the @
character, the headers in a
VCF file begin with the ##
string (like
GFF files).
Now that we know a bit about the header, let’s look at the actual content of the file, the tab-delimited rows containing SNP calls.
Run the code block below to view the beginning of our VCF file while skipping header lines:
grep -v "##" data3/poeFor_NW_006799939.vcf | head -n 2
# grep: The Unix string search command
# "##": The string to search for in the provided file - we just don't want to display these for this demonstration
# head: A Unix command to display the first few lines of a file
# | : The Unix pipe operator to pass output from one command as input to another command
# -n 2: This tells head to display exactly the first 2 lines of the provided file
## NOTE: ignore the "Broken pipe" and "error closing standard output" errors! This is just an artifact of our setup for the workshop today.
## #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMN03568550 SAMN03568551 SAMN03568552 SAMN03568553 SAMN03568554 SAMN03568555 SAMN03568556 SAMN03568557 SAMN03568558 SAMN03568559 SAMN03568562 SAMN03568563 SAMN03568565 SAMN03568566 SAMN03568567 SAMN03568568
## NW_006799939.1 1985 . T C 31.46 . AC=2;AF=0.25;AN=8;DP=7;ExcessHet=0.3218;FS=0;InbreedingCoeff=0.3548;MLEAC=2;MLEAF=0.25;MQ=40;QD=31.46;SOR=1.609 GT:AD:DP:GQ:PGT:PID:PL:PS ./.:0,0:0:.:.:.:0,0,0:. 0/0:3,0:3:9:.:.:0,9,124:. 0/0:1,0:1:3:.:.:0,3,32:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:1,0:1:.:.:.:0,0,0:. 1|1:0,1:1:3:1|1:1977_A_G:45,3,0:1977 0/0:1,0:1:3:.:.:0,3,45:.
The first line here, beginning with a single #
shows us
the names of the columns, which may be referred to in the file header
like we saw above. Each subsequent line represents one
variant relative to the reference genome, with the
following column definitions:
.
character.PASS
(or sometimes
.
).;
) delimited list of information.:
) delimited list of the keys for the
information in the subsequent columns. These keys are defined in the
file header. 10+. SAMPLE GENOTYPES: Each subsequent column contains a
colon (:
) delimited list that follows the keys in the
FORMAT column for the genotypes of the samples. Each column corresponds
to one sample with the name being the column header.The genotypes are encoded in the sample columns with the
GT
key. In this file, it is the first entry. For example,
with the keys defined in column 9, the genotype of the first sample:
./.:0,0:0:.:.:.:0,0,0:.
means that the GT (GENOTYPE from the header) is ./.
, the
AD (ALLELIC DEPTH) is 0,0, the DP (DEPTH) is 0, and so on.
Genotypes use the following syntax, with 0
being the
reference allele and 1
being the alternate allele:
0/0
means this sample is homozygous for the reference
allele.0/1
or 1/0
means this sample is
heterozygous for the reference allele (0) and the alternate allele
(1).1/1
means this sample is homozygous for the alternate
allele..
is used for missing; this site was not genotyped for
this individual, which happens to be the case for the sample outlined
above.For VCF files with multiple samples (like this one),
sites can have more than two alleles, in which case the genotypes
0/2
or 0/3
or 2/2
may be
seen.
There are many quirks to VCF files, but they are also highly standardized, so if you find yourself working with them be sure to read the very detailed specifications here:
.bcf
file
extension. BCF files are more efficient to read than VCF files, and are
typically compressed.There are a myriad of ways to use and analyze a VCF,
but for general processing of them, we will use bcftools
.
bcftools
falls under the same development umbrella as samtools
and
thus shares much of the same usage and underlying philosophy to act like
a native Unix command. It was preceded by vcftools, which still retains
some useful functionality, but is no longer developed.
Much like bedtools
, bcftools
has a wide
range of functions so we won’t be able to get to everything today. Be
sure to read the
documentation to find out if it can accomplish the task you need
with your VCF or BCF file.
One of the most basic things you can do to a file is to view it. Indeed, looking at your data is an important part of each step of a bioinformatics workflow.
Like samtools
, bcftools
has a dedicated
view
command.
Exercise: 1. Run the code block below to
view
the top of the Amazon molly VCF file. 2. Edit thebcftools view
command to exclude the header from the output.
## Use the view command to display the VCF file, then edit it to exclude the header
## data3/poeFor_NW_006799939.vcf
bcftools view -H data3/poeFor_NW_006799939.vcf | head -n 10
## Use the view command to display the VCF file, then edit it to exclude the header
## NOTE: ignore the "cannot write" error! This is just an artifact of our setup for the workshop today.
## NW_006799939.1 1985 . T C 31.46 . AC=2;AF=0.25;AN=8;DP=7;ExcessHet=0.3218;FS=0;InbreedingCoeff=0.3548;MLEAC=2;MLEAF=0.25;MQ=40;QD=31.46;SOR=1.609 GT:AD:DP:GQ:PGT:PID:PL:PS ./.:0,0:0:.:.:.:0,0,0:. 0/0:3,0:3:9:.:.:0,9,124:. 0/0:1,0:1:3:.:.:0,3,32:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:1,0:1:.:.:.:0,0,0:. 1|1:0,1:1:3:1|1:1977_A_G:45,3,0:1977 0/0:1,0:1:3:.:.:0,3,45:.
## NW_006799939.1 18613 . G A 32.84 . AC=3;AF=0.107;AN=28;BaseQRankSum=0.967;DP=54;ExcessHet=3.5218;FS=5.869;InbreedingCoeff=-0.0251;MLEAC=3;MLEAF=0.107;MQ=46.57;MQRankSum=-1.15;QD=2.19;ReadPosRankSum=1.38;SOR=2.91 GT:AD:DP:GQ:PGT:PID:PL:PS 0/1:4,1:5:17:.:.:17,0,159:. 0/0:3,0:3:9:.:.:0,9,119:. 0/0:5,0:5:15:.:.:0,15,181:. 0/0:8,1:9:1:.:.:0,1,324:. 0/0:2,0:2:6:.:.:0,6,80:. 0/1:3,1:4:20:.:.:20,0,103:. 0|1:5,1:6:25:0|1:18613_G_A:25,0,158:18613 0/0:5,0:5:15:.:.:0,15,193:. 0/0:3,0:3:9:.:.:0,9,118:. 0/0:3,0:3:9:.:.:0,9,116:. 0/0:2,0:2:6:.:.:0,6,85:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. 0/0:2,0:2:6:.:.:0,6,88:. 0/0:3,0:3:9:.:.:0,9,123:. 0/0:1,0:1:3:.:.:0,3,39:.
## NW_006799939.1 18638 . G C 2025.74 . AC=30;AF=0.938;AN=32;BaseQRankSum=-0.14;DP=62;ExcessHet=3.1527;FS=6.816;InbreedingCoeff=-0.1453;MLEAC=25;MLEAF=0.781;MQ=54.98;MQRankSum=1.59;QD=29.2;ReadPosRankSum=0.589;SOR=0.617 GT:AD:DP:GQ:PGT:PID:PL:PS 1/1:0,4:4:12:.:.:140,12,0:. 1/1:0,4:4:12:.:.:150,12,0:. 1/1:0,4:4:12:.:.:159,12,0:. 0/1:1,8:9:18:.:.:286,0,18:. 1/1:0,2:2:6:.:.:80,6,0:. 1/1:0,5:5:15:.:.:202,15,0:. 0|1:1,5:6:27:1|0:18613_G_A:141,0,27:18613 1/1:0,5:5:15:.:.:198,15,0:. 1/1:0,4:4:12:.:.:170,12,0:. 1/1:0,3:3:9:.:.:124,9,0:. 1/1:0,3:3:9:.:.:108,9,0:. 1/1:0,1:1:3:.:.:34,3,0:. 1/1:0,1:1:3:.:.:35,3,0:. 1/1:0,1:1:3:.:.:32,3,0:. 1/1:0,5:5:15:.:.:198,15,0:. 1/1:0,1:1:3:.:.:37,3,0:.
## NW_006799939.1 18706 . C T 48.85 . AC=3;AF=0.1;AN=30;BaseQRankSum=0.967;DP=67;ExcessHet=0.2929;FS=0;InbreedingCoeff=0.0961;MLEAC=2;MLEAF=0.067;MQ=52.92;MQRankSum=-0.967;QD=12.21;ReadPosRankSum=-0.967;SOR=2.833 GT:AD:DP:GQ:PL 1/1:0,1:2:3:42,3,0 0/0:2,0:2:6:0,6,78 0/0:6,0:6:0:0,0,164 0/0:6,0:6:12:0,12,180 0/0:3,0:3:9:0,9,121 0/0:9,0:9:27:0,27,343 0/0:7,0:7:0:0,0,169 ./.:0,0:0:.:0,0,0 0/0:4,0:4:9:0,9,135 0/0:4,0:4:12:0,12,180 0/0:3,0:3:9:0,9,124 0/0:4,0:4:0:0,0,93 0/0:4,0:4:12:0,12,165 0/0:5,0:5:0:0,0,138 0/0:5,0:5:0:0,0,43 0/1:2,1:3:31:31,0,74
## NW_006799939.1 18860 . G C 306.82 . AC=6;AF=0.2;AN=30;BaseQRankSum=0.524;DP=50;ExcessHet=1.2974;FS=31.91;InbreedingCoeff=-0.0458;MLEAC=6;MLEAF=0.2;MQ=47.79;MQRankSum=-1.282;QD=25.57;ReadPosRankSum=-0.524;SOR=2.775 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:3,3:8:99:0|1:18860_G_C:117,0,112:18860 0/0:5,0:5:15:.:.:0,15,204:. 0/0:3,0:3:9:.:.:0,9,97:. 0|1:3,3:7:99:0|1:18860_G_C:117,0,109:18860 0/0:2,0:2:6:.:.:0,6,79:. 0/0:3,0:3:9:.:.:0,9,110:. 0/0:2,0:2:6:.:.:0,6,49:. 0/0:5,0:5:15:.:.:0,15,193:. 0/0:2,0:2:6:.:.:0,6,74:. 0|1:1,1:3:39:0|1:18860_G_C:39,0,39:18860 0/0:1,0:1:3:.:.:0,3,42:. 0/0:1,0:1:3:.:.:0,3,39:. 1|1:0,1:1:3:1|1:18860_G_C:45,3,0:18860 ./.:0,0:0:.:.:.:0,0,0:. 0|1:4,1:5:30:0|1:18860_G_C:30,0,165:18860 0/0:1,0:1:3:.:.:0,3,44:.
## NW_006799939.1 19009 . G A 51.6 . AC=1;AF=0.031;AN=32;BaseQRankSum=-0.992;DP=76;ExcessHet=3.0103;FS=3.01;InbreedingCoeff=-0.0974;MLEAC=1;MLEAF=0.031;MQ=56.02;MQRankSum=-1.732;QD=6.45;ReadPosRankSum=-1.668;SOR=1.911 GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:7,0:7:21:.:.:0,21,279:. 0/0:5,0:5:0:.:.:0,0,112:. 0|1:6,2:9:66:0|1:19009_G_A:66,0,246:19009 0/0:1,0:1:3:.:.:0,3,37:. 0/0:5,0:5:12:.:.:0,12,180:. 0/0:6,0:6:18:.:.:0,18,238:. 0/0:8,0:8:24:.:.:0,24,315:. 0/0:5,0:5:15:.:.:0,15,204:. 0/0:1,0:1:3:.:.:0,3,39:. 0/0:7,0:7:21:.:.:0,21,287:. 0/0:3,0:3:9:.:.:0,9,111:. 0/0:4,0:4:12:.:.:0,12,162:. 0/0:4,0:4:12:.:.:0,12,154:. 0/0:4,0:4:12:.:.:0,12,147:. 0/0:5,0:5:12:.:.:0,12,180:. 0/0:2,0:2:6:.:.:0,6,90:.
## NW_006799939.1 19028 . G A 83.85 . AC=1;AF=0.033;AN=30;BaseQRankSum=-1.602;DP=79;ExcessHet=3.0103;FS=3.09;InbreedingCoeff=-0.0545;MLEAC=1;MLEAF=0.033;MQ=52.96;MQRankSum=-2.744;QD=7.62;ReadPosRankSum=-0.191;SOR=2.093 GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:10,0:10:27:.:.:0,27,405:. 0/0:6,0:6:0:.:.:0,0,87:. 0|1:8,3:11:97:0|1:19009_G_A:97,0,320:19009 ./.:1,0:1:.:.:.:0,0,0:. 0/0:3,0:3:9:.:.:0,9,116:. 0/0:7,0:7:0:.:.:0,0,216:. 0/0:10,0:10:6:.:.:0,6,335:. 0/0:6,0:6:0:.:.:0,0,166:. 0/0:1,0:1:3:.:.:0,3,39:. 0/0:6,0:6:0:.:.:0,0,177:. 0/0:1,0:1:3:.:.:0,3,14:. 0/0:3,0:3:9:.:.:0,9,123:. 0/0:4,0:4:9:.:.:0,9,135:. 0/0:4,0:4:12:.:.:0,12,147:. 0/0:4,0:4:12:.:.:0,12,169:. 0/0:2,0:2:6:.:.:0,6,89:.
## NW_006799939.1 19149 . G A 2633.86 . AC=26;AF=0.813;AN=32;BaseQRankSum=0.431;DP=77;ExcessHet=5.4855;FS=6.445;InbreedingCoeff=-0.2699;MLEAC=24;MLEAF=0.75;MQ=45.74;MQRankSum=1.15;QD=28.55;ReadPosRankSum=0.842;SOR=0.123 GT:AD:DP:GQ:PGT:PID:PL:PS 0/1:1,4:5:30:.:.:143,0,30:. 1|1:0,4:4:12:1|1:19149_G_A:177,12,0:19149 0/1:1,2:3:36:.:.:75,0,36:. 1/1:0,5:5:15:.:.:202,15,0:. 1/1:0,1:1:3:.:.:43,3,0:. 0/1:2,8:10:56:.:.:305,0,56:. 0/1:1,3:4:33:.:.:108,0,33:. 1/1:0,3:3:9:.:.:135,9,0:. 1/1:0,3:3:9:.:.:135,9,0:. 1/1:0,4:4:12:.:.:160,12,0:. 1/1:0,6:6:18:.:.:246,18,0:. 1/1:0,6:6:18:.:.:261,18,0:. 1/1:0,4:4:12:.:.:171,12,0:. 1/1:0,7:7:21:.:.:311,21,0:. 0/1:1,2:3:34:.:.:67,0,34:. 0/1:1,4:5:29:.:.:164,0,29:.
## NW_006799939.1 19171 . C T 312.01 . AC=11;AF=0.344;AN=32;BaseQRankSum=0.566;DP=86;ExcessHet=14.6009;FS=13.547;InbreedingCoeff=-0.5693;MLEAC=11;MLEAF=0.344;MQ=46;MQRankSum=-1.383;QD=31.2;ReadPosRankSum=-0.366;SOR=2.985 GT:AD:DP:GQ:PGT:PID:PL:PS 0/1:4,1:5:25:.:.:25,0,158:. 0|1:4,1:5:19:1|0:19149_G_A:19,0,165:19149 0/1:4,1:5:30:.:.:30,0,146:. 0/0:4,0:4:12:.:.:0,12,169:. 0/1:3,2:5:73:.:.:73,0,109:. 0/1:7,1:8:19:.:.:19,0,291:. 0/1:3,2:5:68:.:.:68,0,120:. 0/0:6,0:6:18:.:.:0,18,241:. 0/0:4,0:4:12:.:.:0,12,175:. 0/1:5,1:6:27:.:.:27,0,192:. 0/1:5,1:6:27:.:.:27,0,207:. 0/0:4,0:4:12:.:.:0,12,179:. 0/1:6,1:7:24:.:.:24,0,249:. 0/1:6,1:7:13:.:.:13,0,240:. 0/1:1,1:2:38:.:.:39,0,38:. 0/0:6,0:6:0:.:.:0,0,169:.
## NW_006799939.1 19467 . T A 2931.49 . AC=17;AF=0.531;AN=32;BaseQRankSum=-0.306;DP=157;ExcessHet=33.3406;FS=2.909;InbreedingCoeff=-0.8939;MLEAC=17;MLEAF=0.531;MQ=58.71;MQRankSum=-0.431;QD=18.91;ReadPosRankSum=0.244;SOR=0.952 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:9,6:15:99:0|1:19467_T_A:213,0,349:19467 0|1:6,5:11:99:0|1:19467_T_A:188,0,234:19467 0|1:5,6:11:99:0|1:19467_T_A:237,0,183:19467 0|1:9,7:16:99:0|1:19467_T_A:267,0,357:19467 0|1:4,6:10:99:0|1:19467_T_A:240,0,126:19467 0|1:7,4:11:99:0|1:19467_T_A:142,0,270:19467 0|1:10,7:17:99:0|1:19467_T_A:255,0,394:19467 0|1:2,4:6:72:0|1:19467_T_A:156,0,72:19467 0|1:8,5:13:99:0|1:19467_T_A:186,0,321:19467 0|1:2,4:6:67:0|1:19467_T_A:151,0,67:19467 0|1:1,2:3:32:0|1:19467_T_A:81,0,32:19467 1|1:0,3:3:9:1|1:19467_T_A:127,9,0:19467 0|1:2,7:9:63:0|1:19467_T_A:277,0,63:19467 0|1:1,7:8:21:0|1:19467_T_A:291,0,21:19467 0|1:5,3:8:99:0|1:19467_T_A:108,0,194:19467 0|1:6,2:8:65:0|1:19467_T_A:65,0,233:19467
So view
acts much the same way as the Unix command
cat
, but has specific functionality for the
VCF format, and also handles compressed
BCF files natively.
One useful function when view
ing files is to be able to
subset or filter them. We’ve provided a file that lists the sample IDs
for a made-up population and can use bcftools view
to
extract genotypes from specific samples.
Run the code block below to extract SNPs only from a subset of samples:
cat data3/pop1.txt
echo "-------"
# cat: A Unix command that displays the contents of a file in the Terminal
# echo: A Unix command that simply prints the provided input to the screen
# These two commands are just to see the pop1.txt file
bcftools view -S data3/pop1.txt -H data3/poeFor_NW_006799939.vcf | head
# bcftools: A suite of programs to process VCF files
# view: The sub-program of bcftools to execute
# -S : A bcftools view option that specifies to only extract genotype columns from the samples listed in the provided file
# -H : A bcftools view option that specifies to exclude the VCF header from the output
## NOTE: ignore the "cannot write" error! This is just an artifact of our setup for the workshop today.
## SAMN03568550
## SAMN03568551
## SAMN03568552
## SAMN03568553
## SAMN03568554
## SAMN03568555
## SAMN03568556
## SAMN03568557
## -------
## NW_006799939.1 1985 . T C 31.46 . AC=0;AF=0.25;AN=4;DP=7;ExcessHet=0.3218;FS=0;InbreedingCoeff=0.3548;MLEAC=2;MLEAF=0.25;MQ=40;QD=31.46;SOR=1.609 GT:AD:DP:GQ:PGT:PID:PL:PS ./.:0,0:0:.:.:.:0,0,0:. 0/0:3,0:3:9:.:.:0,9,124:. 0/0:1,0:1:3:.:.:0,3,32:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:.
## NW_006799939.1 18613 . G A 32.84 . AC=3;AF=0.107;AN=16;BaseQRankSum=0.967;DP=54;ExcessHet=3.5218;FS=5.869;InbreedingCoeff=-0.0251;MLEAC=3;MLEAF=0.107;MQ=46.57;MQRankSum=-1.15;QD=2.19;ReadPosRankSum=1.38;SOR=2.91 GT:AD:DP:GQ:PGT:PID:PL:PS 0/1:4,1:5:17:.:.:17,0,159:. 0/0:3,0:3:9:.:.:0,9,119:. 0/0:5,0:5:15:.:.:0,15,181:. 0/0:8,1:9:1:.:.:0,1,324:. 0/0:2,0:2:6:.:.:0,6,80:. 0/1:3,1:4:20:.:.:20,0,103:. 0|1:5,1:6:25:0|1:18613_G_A:25,0,158:18613 0/0:5,0:5:15:.:.:0,15,193:.
## NW_006799939.1 18638 . G C 2025.74 . AC=14;AF=0.938;AN=16;BaseQRankSum=-0.14;DP=62;ExcessHet=3.1527;FS=6.816;InbreedingCoeff=-0.1453;MLEAC=25;MLEAF=0.781;MQ=54.98;MQRankSum=1.59;QD=29.2;ReadPosRankSum=0.589;SOR=0.617 GT:AD:DP:GQ:PGT:PID:PL:PS 1/1:0,4:4:12:.:.:140,12,0:. 1/1:0,4:4:12:.:.:150,12,0:. 1/1:0,4:4:12:.:.:159,12,0:. 0/1:1,8:9:18:.:.:286,0,18:. 1/1:0,2:2:6:.:.:80,6,0:. 1/1:0,5:5:15:.:.:202,15,0:. 0|1:1,5:6:27:1|0:18613_G_A:141,0,27:18613 1/1:0,5:5:15:.:.:198,15,0:.
## NW_006799939.1 18706 . C T 48.85 . AC=2;AF=0.1;AN=14;BaseQRankSum=0.967;DP=67;ExcessHet=0.2929;FS=0;InbreedingCoeff=0.0961;MLEAC=2;MLEAF=0.067;MQ=52.92;MQRankSum=-0.967;QD=12.21;ReadPosRankSum=-0.967;SOR=2.833 GT:AD:DP:GQ:PL 1/1:0,1:2:3:42,3,0 0/0:2,0:2:6:0,6,78 0/0:6,0:6:0:0,0,164 0/0:6,0:6:12:0,12,180 0/0:3,0:3:9:0,9,121 0/0:9,0:9:27:0,27,343 0/0:7,0:7:0:0,0,169 ./.:0,0:0:.:0,0,0
## NW_006799939.1 18860 . G C 306.82 . AC=2;AF=0.2;AN=16;BaseQRankSum=0.524;DP=50;ExcessHet=1.2974;FS=31.91;InbreedingCoeff=-0.0458;MLEAC=6;MLEAF=0.2;MQ=47.79;MQRankSum=-1.282;QD=25.57;ReadPosRankSum=-0.524;SOR=2.775 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:3,3:8:99:0|1:18860_G_C:117,0,112:18860 0/0:5,0:5:15:.:.:0,15,204:. 0/0:3,0:3:9:.:.:0,9,97:. 0|1:3,3:7:99:0|1:18860_G_C:117,0,109:18860 0/0:2,0:2:6:.:.:0,6,79:. 0/0:3,0:3:9:.:.:0,9,110:. 0/0:2,0:2:6:.:.:0,6,49:. 0/0:5,0:5:15:.:.:0,15,193:.
## NW_006799939.1 19009 . G A 51.6 . AC=1;AF=0.031;AN=16;BaseQRankSum=-0.992;DP=76;ExcessHet=3.0103;FS=3.01;InbreedingCoeff=-0.0974;MLEAC=1;MLEAF=0.031;MQ=56.02;MQRankSum=-1.732;QD=6.45;ReadPosRankSum=-1.668;SOR=1.911 GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:7,0:7:21:.:.:0,21,279:. 0/0:5,0:5:0:.:.:0,0,112:. 0|1:6,2:9:66:0|1:19009_G_A:66,0,246:19009 0/0:1,0:1:3:.:.:0,3,37:. 0/0:5,0:5:12:.:.:0,12,180:. 0/0:6,0:6:18:.:.:0,18,238:. 0/0:8,0:8:24:.:.:0,24,315:. 0/0:5,0:5:15:.:.:0,15,204:.
## NW_006799939.1 19028 . G A 83.85 . AC=1;AF=0.033;AN=14;BaseQRankSum=-1.602;DP=79;ExcessHet=3.0103;FS=3.09;InbreedingCoeff=-0.0545;MLEAC=1;MLEAF=0.033;MQ=52.96;MQRankSum=-2.744;QD=7.62;ReadPosRankSum=-0.191;SOR=2.093 GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:10,0:10:27:.:.:0,27,405:. 0/0:6,0:6:0:.:.:0,0,87:. 0|1:8,3:11:97:0|1:19009_G_A:97,0,320:19009 ./.:1,0:1:.:.:.:0,0,0:. 0/0:3,0:3:9:.:.:0,9,116:. 0/0:7,0:7:0:.:.:0,0,216:. 0/0:10,0:10:6:.:.:0,6,335:. 0/0:6,0:6:0:.:.:0,0,166:.
## NW_006799939.1 19149 . G A 2633.86 . AC=12;AF=0.813;AN=16;BaseQRankSum=0.431;DP=77;ExcessHet=5.4855;FS=6.445;InbreedingCoeff=-0.2699;MLEAC=24;MLEAF=0.75;MQ=45.74;MQRankSum=1.15;QD=28.55;ReadPosRankSum=0.842;SOR=0.123 GT:AD:DP:GQ:PGT:PID:PL:PS 0/1:1,4:5:30:.:.:143,0,30:. 1|1:0,4:4:12:1|1:19149_G_A:177,12,0:19149 0/1:1,2:3:36:.:.:75,0,36:. 1/1:0,5:5:15:.:.:202,15,0:. 1/1:0,1:1:3:.:.:43,3,0:. 0/1:2,8:10:56:.:.:305,0,56:. 0/1:1,3:4:33:.:.:108,0,33:. 1/1:0,3:3:9:.:.:135,9,0:.
## NW_006799939.1 19171 . C T 312.01 . AC=6;AF=0.344;AN=16;BaseQRankSum=0.566;DP=86;ExcessHet=14.6009;FS=13.547;InbreedingCoeff=-0.5693;MLEAC=11;MLEAF=0.344;MQ=46;MQRankSum=-1.383;QD=31.2;ReadPosRankSum=-0.366;SOR=2.985 GT:AD:DP:GQ:PGT:PID:PL:PS 0/1:4,1:5:25:.:.:25,0,158:. 0|1:4,1:5:19:1|0:19149_G_A:19,0,165:19149 0/1:4,1:5:30:.:.:30,0,146:. 0/0:4,0:4:12:.:.:0,12,169:. 0/1:3,2:5:73:.:.:73,0,109:. 0/1:7,1:8:19:.:.:19,0,291:. 0/1:3,2:5:68:.:.:68,0,120:. 0/0:6,0:6:18:.:.:0,18,241:.
## NW_006799939.1 19467 . T A 2931.49 . AC=8;AF=0.531;AN=16;BaseQRankSum=-0.306;DP=157;ExcessHet=33.3406;FS=2.909;InbreedingCoeff=-0.8939;MLEAC=17;MLEAF=0.531;MQ=58.71;MQRankSum=-0.431;QD=18.91;ReadPosRankSum=0.244;SOR=0.952 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:9,6:15:99:0|1:19467_T_A:213,0,349:19467 0|1:6,5:11:99:0|1:19467_T_A:188,0,234:19467 0|1:5,6:11:99:0|1:19467_T_A:237,0,183:19467 0|1:9,7:16:99:0|1:19467_T_A:267,0,357:19467 0|1:4,6:10:99:0|1:19467_T_A:240,0,126:19467 0|1:7,4:11:99:0|1:19467_T_A:142,0,270:19467 0|1:10,7:17:99:0|1:19467_T_A:255,0,394:19467 0|1:2,4:6:72:0|1:19467_T_A:156,0,72:19467
This can be very handy, but be warned - the INFO fields are not automatically recalculated, and will no longer be correct once you remove some individuals. In general, the INFO column and its information displays information about the site while the FORMAT fields display information about each sample.
bcftools
and
bedtools
Many of the tools we’ve demonstrated, samtools
,
bedtools
, bcftools
, are general purpose tools
for basic processing and manipulation of specific file formats used in
genomics. Many of the tasks they do could be done with native
bash
commands, but that would require a lot of piping and
probably some custom scripting on uncompressed files. However, if you
really start to get to know these tools and their versatility, you can
do some more advanced analyses purely with them. Let’s demonstrate by
calculating SNP density across 100k windows the specific 7 Mb scaffold
we are working with from the Amazon molly genome.
To do this, we’ll need:
Since we’ve already downloaded the genome and generated the index
file, let’s start with the second one, getting the window coordinates.
We can do this easily with the bedtools getwindows
command:
Run the code block below to generate a bed file with 100kb windows in the Amazon molly genome:
bedtools makewindows -g data3/poeFor_NW_006799939.fasta.fai -w 100000 > poeFor-windows-100k.bed
# bedtools: A suite of programs to process bed files
# makewindows: The sub-program of bedtools to execute
# -g : A genome size file, which needs two columns, the scaffold/chromosome name and the size of that feature
# -w : The size of the windows to partition each scaffold in -g into
# -b : The second interval file to check overlaps
# > : The Unix redirect operator to write the output of the command to the following file
head poeFor-windows-100k.bed
# Display the first few lines of the bed file containing windows
## NW_006799939.1 0 100000
## NW_006799939.1 100000 200000
## NW_006799939.1 200000 300000
## NW_006799939.1 300000 400000
## NW_006799939.1 400000 500000
## NW_006799939.1 500000 600000
## NW_006799939.1 600000 700000
## NW_006799939.1 700000 800000
## NW_006799939.1 800000 900000
## NW_006799939.1 900000 1000000
Now, we can use bedtools intersect
, which also works
with VCF files, to count how many SNPs appear in each
window.
Exercise: Use
bedtools intersect
to count the number of SNPs in each 100kb window in the Amazon molly genome. You will need to look in the help menu ofbedtools intersect
to find the right option to output a column with the number of overlaps per region! Save this file aspoeFor-windows-100k-snps.bed
.
## Write a bedtools intersect command to count how many SNPs overlap each 100kb window in the molly genome
## data3/poeFor_NW_006799939.vcf
## poeFor-windows-100k.bed
bedtools intersect -c -a poeFor-windows-100k.bed -b data3/poeFor_NW_006799939.vcf > poeFor-windows-100k-snps.bed
## Write a bedtools intersect command to count how many SNPs overlap each 100kb window in the molly genome
head poeFor-windows-100k-snps.bed
# Display the first few lines of the bed file containing windows and overlapping SNPs
## NW_006799939.1 0 100000 1489
## NW_006799939.1 100000 200000 1684
## NW_006799939.1 200000 300000 1357
## NW_006799939.1 300000 400000 1591
## NW_006799939.1 400000 500000 1770
## NW_006799939.1 500000 600000 1178
## NW_006799939.1 600000 700000 1562
## NW_006799939.1 700000 800000 1708
## NW_006799939.1 800000 900000 1672
## NW_006799939.1 900000 1000000 1148
BONUS Exercise: If you attended the R workshop last week or know R already, use the code block below to read in the file we just generated with SNPs per 100k window and plot the density along this scaffold.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.0
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
# Load the tidyverse
## Write code in R to plot SNP density across the scaffold in our bed file
# poeFor-windows-100k-snps.bed
am_snps = read_tsv("poeFor-windows-100k-snps.bed", col_names=c("scaffold", "start", "end", "num.snps"))
## Rows: 74 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): scaffold
## dbl (3): start, end, num.snps
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Read the bed file containing 1Mb windows and number of SNPs per window, and name the columns
snp_plot = ggplot(am_snps, aes(x=start, y=num.snps)) +
# Initialize a ggplot grob and load the data and aesthetics into it and store it as the snp_plot object
geom_line() +
# Plot the provided data as a line graph
#facet_wrap(~scaffold, scales="free") +
# Split the plot by scaffold
theme_classic()
# Set a different theme
## Write code in R to plot SNP density across the scaffold in our bed file
print(snp_plot)
# Display the plot
So far, we’ve been focused on interactive analysis of small files. However, in real bioinformatics work, you will likely find yourself with many samples that you want to run the same commands on. And often these commands can require significant compute time, such that you want to be able to run these commands non-interactively. As nice as these markdown notebooks are for teaching and data exploration, they are not the ideal format for this task. Instead, what we want are scripts.
A bash script (like an R script, if you attended the R workshop earlier), is just a text file that executes a series of commands in order. Bash scripts can get exceedingly complicated, but today we are just going to introduce a few basic concepts.
We are ultimately going to make a script that takes as input a VCF file and one or more bed files, and reports as output the average SNP density in regions of the genome inside bed intervals. Let’s start simple.
Using the file menu, go to New file –> Shell script and create a new file. Call it
snp-density.sh
.This is going to be our script in a new tab of our text editor, that we will modify as we go.
The first thing we need to do is a little bit of shell magic to get
the shell to treat the script we are writing as a command. This is
called a shebang (#!
) and basically says “run the script
with the following interpreter”. Type the following
line at the top of your script:
#!/bin/bash
This means that when we run snp-density.sh
on the
command line, the commands in the script will be interpreted as shell
commands.
We need to do one more thing, which is make our script
executable. We’ll use the chmod
command
for this.
In the Terminal below, make sure you are in the same directory as your script file, and then type the following command.
chmod +x snp-density.sh
We now have a script we can run.
In the code block below, try running your (empty) script. Note that the
./
specifies the relative path to the script, and means “execute the script snp-density.sh in the current working directory”. If omitted, you’ll get a “command not found” error.
chmod +x snp-density-0.sh
# Make the script executable
./snp-density-0.sh
# ./ : Execute the following file as a shell script
echo "done"
# echo: A Unix command that simply prints the provided input to the screen
## done
However, nothing will happen if we do, since there are currently no commands in the file to run.
Exercise: In order to have our shell script do anything, we need to put some commands in it. 1. Copy the
bedtools intersect
command from above into the script. 2. Instead of writing the output of the command to the file, pipe the output of the command toawk
. 3. Useawk
, to calculate the average number of SNPs per base across intervals. To do this, you’ll need to increment two counters for each row of the output: one with the number of snps in that row, the other with the length of the element in that row, and then divide the number of snps by the total length. Warning:awk
has a built-in function calledlength
, so you’ll need to call your length variable something else.
Now, try running your script again.
In the code block below, try running your script.
chmod +x snp-density-1.sh
# Make the script executable
./snp-density-1.sh
# ./ : Execute the following file as a shell script
## 0.0143416
We should get a number now.
However, we have a problem. Ideally, we’d like to be able to do this calculation on any VCF file and any bed file. Right now those files are hard-coded, which means to change what file the script operates on requires changing the script.
To make our script flexible, we need to introduce the concept of
variables. We’ve seen this in awk
commands; we can also
create variables in the shell. We do this just by assigning something
(text or the output of a command, usually) to a string.
Run the code block below to create a variable called TEXT, and print it out with the
echo
command.
TEXT="Hello World!"
# = : Assign the variable name on the left to the value on the right
echo $TEXT
# echo: A Unix command that simply prints the provided input to the screen
# $ : A symbol that the shell understands to mean that the following is the name of a variable
## Hello World!
Note that when we access what is stored in the variable we use the
$
notation, but we don’t use this when we assign to a
variable. Note also we cannot have spaces before or after the equal sign
or the shell will thing we are trying to run the TEXT
command, not create a variable called TEXT
.
Exercise: 1. Modify our shell script,
snp-density.sh
, so that it includes two variables, one calledVCF
and one calledBED
. For now, define these in the second and third lines of the script with the paths to the BED file and the VCF file we currently have. 2. Replace the hard-coded file names in thebedtools intersect
command options-a
and-b
with theBED
andVCF
variables. 3. In the code block below run your script again and make sure it still works.
chmod +x snp-density-2.sh
# Make the script executable
./snp-density-2.sh
# ./ : Execute the following file as a shell script
## 0.0143416
You should get the same answer as before.
Note that right now, we aren’t actually changing anything, we’ve just moved the files we are going to operate on to the top of the script, but we still would need to change the script each time we run it on different files. To make our script more general (able to take various inputs more easily), we should have a way to change the input file without having to edit the script. The way to do this is with command line options. You’ve seen this a lot in the tools we’ve run up to this point. For instance, when we run almost any shell command, it requires a filename, e.g.:
head poeFor-windows-100k.bed
In this case, the file name poeFor-windows-100k.bed
, is
a command line option, also known as a command
line argument or just option,
argument, or parameter. Commands and
scripts can take multiple arguments. Some are
positional, meaning that the way they are read into the
program depends on their placement in the command. Others have specific
flags, meaning the way they are read into the program
depends on the flag they come after in the command. For instance:
head -n 20 poeFor-windows-100k.bed
This is a command that hopefully looks pretty understandable to you
by now. This command, head
has two command line
options as we’ve written it:
-n 20
, which changes the
number of lines the command displays. Internally, this sets the value of
some variable to a different number.poeFor-windows-100k.bed
, which is the last listed argument.
Again this internally sets the value of some variable to the path to
that file (actually, for most bash commands, the first un-flagged option
is usually reserved for the file name).Well, we can modify our script to take input from the command line as
well. Shell scripts have access to special variables that can be used to
create simple, positional, command line options.
Specifically, the variables $1
, $2
,
$3
, etc will contain the first, second, third, etc.
arguments specified after the script is called.
Exercise: 1. Modify your script to replace the path to the bed file with
$1
, and the path to the VCF file with$2
using positional command line arguments. 2. Run the following code block to call your script again, this time taking arguments, with the bed file first and the vcf second.
chmod +x snp-density-3.sh
# Make the script executable
./snp-density-3.sh poeFor-windows-100k.bed data3/poeFor_NW_006799939.vcf
# ./ : Execute the following file as a shell script
## 0.0143416
If everything is working, you should get the same answer as before. But now we can give our script any bed file and any VCF file, right from the command line. Note that we don’t do any error checking - if you accidentally specify a VCF file that uses a different reference genome than your bed file, nothing in the script will prevent this.
This is better, but it still requires a lot of typing if we want to run this on 10 files. If, for example, we wanted to compute SNP density separately for each chromosome, or for different interval types (e.g., genes, introns, exons), we’d have to type out each bed file separately.
We can get around this by using loops.
The most common kind of loop we’ll use is a for
loop,
which iterates over elements.
Run the following code block to see a for loop in action
# a simple loop
for i in 1 2 3;
do
echo $i
done
## 1
## 2
## 3
Importantly, a loop can iterate over the output of another command.
Run the following code block to iterate over the results of the
ls
command in a for loop
# use the $() syntax to capture the output of a command into a loop; by default each newline will be a separate iteration
for FILE in $(ls)
do
echo $FILE
done
## bed_files
## Biotips-workshop-2023-Day1-instructor.html
## Biotips-workshop-2023-Day1-instructor.Rmd
## Biotips-workshop-2023-Day1-student.Rmd
## Biotips-workshop-2023-Day2-instructor.html
## Biotips-workshop-2023-Day2-instructor.Rmd
## Biotips-workshop-2023-Day2-student.Rmd
## Biotips-workshop-2023-Day3-instructor_files
## Biotips-workshop-2023-Day3-instructor.html
## Biotips-workshop-2023-Day3-instructor.Rmd
## Biotips-workshop-2023-Day3-student.Rmd
## Biotips-workshop-2023-Day4-instructor.Rmd
## Biotips-workshop-2023-Day4-slurm-instructor.Rmd
## coverage-submit.sh
## css
## data3
## end.html
## exon.bed
## img
## index.html
## links.html
## poeFor_beds
## poeFor-windows-100k.bed
## poeFor-windows-100k-snps.bed
## programs.html
## scripts
## snp-density-0.sh
## snp-density-1.sh
## snp-density-2.sh
## snp-density-3.sh
## snp-density-4.sh
## snp-density-final.sh
## snp-density.sh
## snp-density-simple.sh
## start.html
## tables
## terms.html
All this does right now is print out the files in our current working
directory, just like running the ls
command would.
Loops are very powerful. For example, let’s say we want to make a few different subsets of the annotation we have for the Amazon molly: coding exons, introns, intergenic regions, UTRs.
Let’s start by writing an awk
command that will read
through the GFF file associated with the Amazon molly
reference genome, find all the exons in it, and make a
bed file of just exons intervals. Since all the
information for a bed file is also contained in a
GFF file (scaffold, start, and end positions) we can
convert a GFF to a bed with
awk
:
BEGIN{OFS="\t"}; {print $1, $4-1, $5}
The BEGIN{OFS="\t"}
tells awk
to use a tab
separator in the output (required for bedtools
), and then
we print the first, fourth, and fifth columns, remembering that
bed and GFF files have different
coordinate systems.
Run the code block below to convert the exon coordinates in the Amazon molly GFF file to bed format:
awk 'BEGIN{OFS="\t"}; $3=="exon"{print $1, $4-1, $5}' data3/poeFor_NW_006799939.gff > exon.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
# > : The Unix redirect operator to write the output of the command to the following file
head exon.bed
## NW_006799939.1 844 1347
## NW_006799939.1 4648 4710
## NW_006799939.1 11357 11431
## NW_006799939.1 28599 28684
## NW_006799939.1 29916 30008
## NW_006799939.1 30191 30266
## NW_006799939.1 31518 31530
## NW_006799939.1 36141 36299
## NW_006799939.1 36412 36520
## NW_006799939.1 36633 36731
We could just copy and paste this command line to run this again to
get bed files for other features like genes, mRNAs,
CDS, etc. But we can also use a loop! Note that awk
, by
default doesn’t have access to shell variables, but we can pass them to
an awk
script with the -v
option.
mkdir -p poeFor_beds
# mkdir: The Unix command to make a new directory
# -p: This option tells mkdir to make any parent directories that do not exist
GFF="data3/poeFor_NW_006799939.gff"
# = : Assign the variable name on the left to the value on the right
for TYPE in exon CDS mRNA gene;
do
awk -v type=$TYPE 'BEGIN{OFS="\t"}; $3==type {print $1, $4-1, $5}' $GFF | bedtools sort -i - | bedtools merge > poeFor_beds/$TYPE.bed
done
# This loop runs an awk script on several different inputs and pipes it to bedtools
ls poeFor_beds
# ls: The Unix list directory contents command
## CDS.bed
## exon.bed
## gene.bed
## intergenic.bed
## intron.bed
## mRNA.bed
## UTR.bed
We are almost done; let’s make a few more using
bedtools complement
, which reports all regions of the
genome NOT in an interval, and bedtools subtract
, which
removes overlaps. For example, we don’t have an intergenic
feature in our GFF, but we can create one by getting everything that is
not a gene. And we don’t have a UTR
feature or an
intron
feature, but we can get them by taking our exons and
removing coding sequences (CDS) or by taking genes and removing
exons.
bedtools complement -i poeFor_beds/gene.bed -g data3/poeFor_NW_006799939.fasta.fai > poeFor_beds/intergenic.bed
# bedtools: A suite of programs to process bed files
# complement: The sub-program of bedtools to execute
# -i: The input bed file
# -g: A genome size file, which needs two columns, the scaffold/chromosome name and the size of that feature
# > : The Unix redirect operator to write the output of the command to the following file
bedtools subtract -a poeFor_beds/gene.bed -b poeFor_beds/exon.bed | bedtools sort -i - | bedtools merge > poeFor_beds/intron.bed
bedtools subtract -a poeFor_beds/exon.bed -b poeFor_beds/CDS.bed | bedtools sort -i - | bedtools merge > poeFor_beds/UTR.bed
# Two commands that subtract bed intervals in one file (-b) from another (-a) and pipe both with | and - into other bedtools commands and
# finally redirect to an output bed file.
wc -l poeFor_beds/*.bed
# Display the number of lines in each bed file in the poeFor_beds directory using wc -l and a wildcard (*)
# Note that here we have run a command on multiple files without an explicit loop, thanks to regex!
## 1800 poeFor_beds/CDS.bed
## 2053 poeFor_beds/exon.bed
## 174 poeFor_beds/gene.bed
## 175 poeFor_beds/intergenic.bed
## 1879 poeFor_beds/intron.bed
## 144 poeFor_beds/mRNA.bed
## 621 poeFor_beds/UTR.bed
## 6846 total
Now we have a directory of bed files, and we might want to run our script over each one, getting the average SNP density by feature type. We can do this with a loop.
Exercise: 1. Modify the code block below to add a for loop that iterates over the bed files in the poeFor_beds directory, and for each bed file calls the snp density script.
VCF="data3/poeFor_NW_006799939.vcf"
# = : Assign the variable name on the left to the value on the right
# add a for loop that iterates over each bed file in the poeFor_beds directory
for BED in $(ls poeFor_beds/*.bed)
do
echo "SNP density for $BED:"
./snp-density-3.sh $BED $VCF
echo "---"
done
# add a for loop that iterates over each bed file in the poeFor_beds directory
# Loop over each bed file in our directory and run our snp-density script on it and the VCF file
## SNP density for poeFor_beds/CDS.bed:
## 0.00824013
## ---
## SNP density for poeFor_beds/exon.bed:
## 0.0105178
## ---
## SNP density for poeFor_beds/gene.bed:
## 0.0143007
## ---
## SNP density for poeFor_beds/intergenic.bed:
## 0.0144225
## ---
## SNP density for poeFor_beds/intron.bed:
## 0.0148971
## ---
## SNP density for poeFor_beds/mRNA.bed:
## 0.0142947
## ---
## SNP density for poeFor_beds/UTR.bed:
## 0.0127176
## ---
We can actually make our script a bit better by putting the loop in the script, and passing a file of files, that is a single file that has all the bed files we want to operate on, one per line, as an option.
Exercise: 1. Re-write the
snp-density.sh
script to take as input a single file that contains a list of bed files (one per line) in addition to the VCF file and write a loop to run the SNP densityawk
command on each file in that file. Hint: thecat
command will return the lines in a file, so $(cat) will let you iterate over the lines of a file in a for loop 2. Run the code block below to generate the file containing all of our bed files and run thesnp-density.sh
script
VCF="data3/poeFor_NW_006799939.vcf"
# = : Assign the variable name on the left to the value on the right
ls poeFor_beds/*.bed > bed_files
# Redirect the output of ls to a file called bed_files
chmod +x snp-density-4.sh
# Make the script executable
./snp-density-4.sh bed_files $VCF
# ./ : Execute the following file as a shell script
## SNP density for poeFor_beds/CDS.bed:
## 0.00824013
## ---
## SNP density for poeFor_beds/exon.bed:
## 0.0105178
## ---
## SNP density for poeFor_beds/gene.bed:
## 0.0143007
## ---
## SNP density for poeFor_beds/intergenic.bed:
## 0.0144225
## ---
## SNP density for poeFor_beds/intron.bed:
## 0.0148971
## ---
## SNP density for poeFor_beds/mRNA.bed:
## 0.0142947
## ---
## SNP density for poeFor_beds/UTR.bed:
## 0.0127176
## ---
Progress! We can now calculate SNP density for any arbitrary set of bed elements, given a VCF file (or, we could even do this over multiple VCFs by using another loop).
However, this script is kind of brittle. What if we included a non-bed file in our list of files? Our script will fail cryptically. We might want to add some basic error checking. In particular, we are going to check that the files in the file of bed files actually exist. Note that it is possible, but complicated, to do more sophisticated error handling. In this case, we just want to make sure we don’t have typos or path errors in our file list.
To do this, we need to learn about one more bit of control
flow code: conditional statements, or if
statements. If statements are pretty intuitive
- they just test if
something is True
(in the
logical sense), and only run the following code if it is. We’ve seen a
lot of things like this in awk
, when only doing actions on
line that match a certain pattern. Those are implicit
if
statements. Here is an example of an explicit
if statement in bash syntax.
Run the code block below to see how an
if
statement works in bash. Change the value ofSPEAK
to see how it affects the behavior o the statement.
SPEAK="no"
# = : Assign the variable name on the left to the value on the right
if [[ $SPEAK == "yes" ]];
then
echo "Hello, World!"
else
echo "sshhhh"
fi
## sshhhh
The [[ ]]
syntax here is a bash test operator,
that returns True
or False.
The commands after
then
are run if the condition(s) inside the
[[]]
are True
, and the commands after
else
are run if the condition(s) are
False
.
Importantly, else
statements are not always necessary if
you don’t want to perform any actions if the
conditional (if
) statement is
False
.
Run the code block below to see an
if
statement without anelse
statement:
SPEAK="no"
# = : Assign the variable name on the left to the value on the right
if [[ $SPEAK == "yes" ]];
then
echo "Hello, World!"
fi
There are some particularly useful conditional tests built in to bash that let you check file properties:
-e filename
returns True if
filename
exists -s filename
returns
True if filename
exists and has a size > 0
-d filename
returns True if filename
is a directory
You can also check string properties:
-z string
returns True if string
is empty or null (0-length) -n string
returns True
if string
is not empty/null
Exercise: Write an
if
statement that will check if a file exists and has data, printing (withecho
) “file exists” if it exists and “file is missing” if it doesn’t exist. Try changing the FILE variable to point to something else.
FILE="snp-densitsy.sh"
# = : Assign the variable name on the left to the value on the right
## Write an if statement to check if FILE exists and has data
if [[ -s $FILE ]]
then
echo "file exists"
else
echo "file doesn't exist"
fi
## Write an if statement to check if FILE exists and has data
## file doesn't exist
Exercise: 1. In our
snp-density.sh
script, use anif
statement to check if each file in the input file of bed files exists and only run ourbedtools
command on the file if it does. Note that you don’t need to include an else statement if you don’t want. 2. BONUS: Another user-friendly behavior for scripts is to tell the user how to run the program. A full list of command line options is warranted for some programs, but for a simply one like this only a simple message is required. To do this, write anotherif
statement that checks if both of the command line options ($1
and$2
) have been provided and if not prints a user-friendly message about how to run the program, e.g:
echo "Usage: ./snp-density.sh <file of bed files to process> <vcf file>"
For this you will have to check two conditions with an AND operator using the
&&
syntax, e.g.if [[ test1 && test2 ]]
.
- Run the code block below to test your script. If you did the bonus, remove either
bed_files
or$VCF
from the command that calls your script to test the usage message:
VCF="data3/poeFor_NW_006799939.vcf"
# = : Assign the variable name on the left to the value on the right
echo "bed_doesnt_exist.bed" >> bed_files
# Add a non-existant file name to our list of bed files to check our error-handling if statement
chmod +x snp-density-final.sh
# Make the script executable
./snp-density-final.sh bed_files $VCF
# ./ : Execute the following file as a shell script
# Checking the Usage message if the vcf file is missing:
./snp-density-final.sh bed_files
# ./ : Execute the following file as a shell script
## SNP density for poeFor_beds/CDS.bed:
## 0.00824013
## ---
## SNP density for poeFor_beds/exon.bed:
## 0.0105178
## ---
## SNP density for poeFor_beds/gene.bed:
## 0.0143007
## ---
## SNP density for poeFor_beds/intergenic.bed:
## 0.0144225
## ---
## SNP density for poeFor_beds/intron.bed:
## 0.0148971
## ---
## SNP density for poeFor_beds/mRNA.bed:
## 0.0142947
## ---
## SNP density for poeFor_beds/UTR.bed:
## 0.0127176
## ---
## File bed_doesnt_exist.bed does not exist. Skipping!
## Usage: ./snp-density.sh <file of bed files to process> <vcf file>
Tomorrow, we’ll learn how to run scripts like this using the job scheduling software SLURM, so we can submit them to run non-interactively on the cluster.