Welcome to the third day of the FAS Informatics Bioinformatics Tips and Tricks Workshop!

If you’re viewing this file on the website, you are viewing the final, formatted version of the workshop. The workshop itself will take place in the RStudio program and you will edit and execute the code in this file. Please download the raw file here

Today we’re going to continue our tour and explanation of common genomics file formats and their associated tools by talking about variant files, that is files which indicate where variants in some sample relative to a reference are.

We’ll also introduce some basic concepts about bash scripting today, in preparation for tomorrow when we’ll talk about using Slurm and submitting jobs to the cluster.

Setting up our link to the data files


mkdir -p data3
ln -s -f /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day3/* data3
# ln: The Unix link command, which can create shortcuts to folders and files at the provided path to the second provided path
# -s: This option tells ln to create a symbolic link rather than a hard link (original files are not changed)
# -f: This option forces ln to create the link

ls -l data3
# Show the details of the files in the new linked directory

## total 36
## lrwxrwxrwx 1 gthomas informatics 109 Mar 29 18:24 Biotips-workshop-2023-Day3-student.Rmd -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day3/Biotips-workshop-2023-Day3-student.Rmd
## lrwxrwxrwx 1 gthomas informatics 121 Mar 29 18:24 GCF_000485575.1_Poecilia_formosa-5.1.2_genomic.fna -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day3/GCF_000485575.1_Poecilia_formosa-5.1.2_genomic.fna
## lrwxrwxrwx 1 gthomas informatics 125 Mar 29 18:24 GCF_000485575.1_Poecilia_formosa-5.1.2_genomic.fna.fai -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day3/GCF_000485575.1_Poecilia_formosa-5.1.2_genomic.fna.fai
## lrwxrwxrwx 1 gthomas informatics  82 Mar 29 18:24 genomic.gff -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day3/genomic.gff
## lrwxrwxrwx 1 gthomas informatics  96 Mar 29 18:24 poeFor_NW_006799939.fasta -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day3/poeFor_NW_006799939.fasta
## lrwxrwxrwx 1 gthomas informatics 100 Mar 29 18:24 poeFor_NW_006799939.fasta.fai -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day3/poeFor_NW_006799939.fasta.fai
## lrwxrwxrwx 1 gthomas informatics  94 Mar 29 18:24 poeFor_NW_006799939.gff -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day3/poeFor_NW_006799939.gff
## lrwxrwxrwx 1 gthomas informatics  94 Mar 29 18:24 poeFor_NW_006799939.vcf -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day3/poeFor_NW_006799939.vcf
## lrwxrwxrwx 1 gthomas informatics  79 Mar 29 18:24 pop1.txt -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day3/pop1.txt

VCF files

Another important type of genomic coordinate file is the VCF file, or variant call format file. This is another tab-delimited file that contains variants between a single genome or a sample of genomes and a reference genome. These variants can include single nucleotide changes, small insertions and deletions, or even larger structural variants. The most common type of variation you’ll see in a VCF file are the single nucleotide ones, so the content of a VCF file is commonly referred to as SNP calls.

We’ll now be working with a VCF file that contains SNP data from some Amazon molly (Poecilia formosa), a freshwater fish. The original data consists of hundreds of thousands of SNPs across the molly genome among 16 individuals, but we’ll be looking only at SNPs on a single scaffold today (about 100,000).

Let’s take a look at our file and explain a bit about the format.

Run the code block below to view the beginning of our VCF file:


head -n 43 data3/poeFor_NW_006799939.vcf 
# head: A Unix command to display the first few lines of a file
# -n 43: This tells head to display exactly the first 43 lines of the provided file

## ##fileformat=VCFv4.2
## ##FILTER=<ID=PASS,Description="All filters passed">
## ##ALT=<ID=NON_REF,Description="Represents any possible alternative allele not already represented at this location by REF and ALT">
## ##FILTER=<ID=FS_SOR_filter,Description="(vc.isSNP() && ((vc.hasAttribute('FS') && FS > 60.0) || (vc.hasAttribute('SOR') &&  SOR > 3.0))) || ((vc.isIndel() || vc.isMixed()) && ((vc.hasAttribute('FS') && FS > 200.0) || (vc.hasAttribute('SOR') &&  SOR > 10.0)))">
## ##FILTER=<ID=LowQual,Description="Low quality">
## ##FILTER=<ID=MQ_filter,Description="vc.isSNP() && ((vc.hasAttribute('MQ') && MQ < 40.0) || (vc.hasAttribute('MQRankSum') && MQRankSum < -12.5))">
## ##FILTER=<ID=QUAL_filter,Description="QUAL < 30.0">
## ##FILTER=<ID=RPRS_filter,Description="(vc.isSNP() && (vc.hasAttribute('ReadPosRankSum') && ReadPosRankSum < -8.0)) || ((vc.isIndel() || vc.isMixed()) && (vc.hasAttribute('ReadPosRankSum') && ReadPosRankSum < -20.0)) || (vc.hasAttribute('QD') && QD < 2.0)">
## ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
## ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
## ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
## ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
## ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
## ##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
## ##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
## ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
## ##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
## ##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
## ##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
## ##GATKCommandLine=<ID=GenomicsDBImport,CommandLine="GenomicsDBImport --genomicsdb-workspace-path results/Poecilia_formosa/GCF_000485575.1/04_genomicsDB/DB_L47 --batch-size 25 --sample-name-map results/Poecilia_formosa/GCF_000485575.1/04_genomicsDB/DB_mapfile_L47 --genomicsdb-shared-posixfs-optimizations true --intervals results/Poecilia_formosa/GCF_000485575.1/intervalFiles/list47.list --tmp-dir tmp/ --genomicsdb-segment-size 1048576 --genomicsdb-vcf-buffer-size 16384 --overwrite-existing-genomicsdb-workspace false --consolidate false --validate-sample-name-map false --merge-input-intervals false --reader-threads 1 --max-num-intervals-to-import-in-parallel 1 --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 0 --cloud-index-prefetch-buffer 0 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays  --disable-tool-default-read-filters false",Version="4.1.8.0",Date="July 28, 2022 12:36:49 AM GMT-05:00">
## ##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
## ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
## ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
## ##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
## ##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
## ##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
## ##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
## ##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
## ##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
## ##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
## ##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
## ##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
## ##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
## ##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
## ##INFO=<ID=RAW_MQandDP,Number=2,Type=Integer,Description="Raw data (sum of squared MQ and total depth) for improved RMS Mapping Quality calculation. Incompatible with deprecated RAW_MQ formulation.">
## ##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
## ##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
## ##contig=<ID=NW_006799939.1,length=7371162>
## ##source=GenomicsDBImport
## ##bcftools_viewVersion=1.12+htslib-1.12
## ##bcftools_viewCommand=view -S results/GCF_000485575.1/QC/final.samps.txt -t ^mtDNA -v snps -m2 -M2 -f .,PASS -e 'AF==1 | AF==0 | AF<0.01 | ALT="*" | F_MISSING > 0.75 | TYPE~"indel" | ref="N"' -O z -o results/GCF_000485575.1/QC/final_filtered.vcf.gz results/GCF_000485575.1/final.vcf.gz; Date=Fri Dec 23 20:36:54 2022
## ##bcftools_viewVersion=1.9+htslib-1.9
## ##bcftools_viewCommand=view final_filtered.vcf.gz NW_006799939.1; Date=Thu Mar 23 09:13:07 2023

So what we see here at the top of the file is the header of the VCF. Much like headers in BAM and SAM files, headers in VCF files contain information about how this file was created and the contents of the file. For example, the lines that begin with ##FORMAT describe the contents of a particular column of the file, while the lines begin with ##FILTER describe what the filter labels in the Filter column mean. Unlike the headers in BAM and SAM files, which begin with the @ character, the headers in a VCF file begin with the ## string (like GFF files).

Now that we know a bit about the header, let’s look at the actual content of the file, the tab-delimited rows containing SNP calls.

Run the code block below to view the beginning of our VCF file while skipping header lines:


grep -v "##" data3/poeFor_NW_006799939.vcf | head -n 2
# grep: The Unix string search command
# "##": The string to search for in the provided file - we just don't want to display these for this demonstration
# head: A Unix command to display the first few lines of a file
# | : The Unix pipe operator to pass output from one command as input to another command
# -n 2: This tells head to display exactly the first 2 lines of the provided file
## NOTE: ignore the "Broken pipe" and "error closing standard output" errors! This is just an artifact of our setup for the workshop today.

## #CHROM   POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMN03568550    SAMN03568551    SAMN03568552    SAMN03568553    SAMN03568554    SAMN03568555    SAMN03568556    SAMN03568557    SAMN03568558    SAMN03568559    SAMN03568562    SAMN03568563    SAMN03568565    SAMN03568566    SAMN03568567    SAMN03568568
## NW_006799939.1   1985    .   T   C   31.46   .   AC=2;AF=0.25;AN=8;DP=7;ExcessHet=0.3218;FS=0;InbreedingCoeff=0.3548;MLEAC=2;MLEAF=0.25;MQ=40;QD=31.46;SOR=1.609 GT:AD:DP:GQ:PGT:PID:PL:PS   ./.:0,0:0:.:.:.:0,0,0:. 0/0:3,0:3:9:.:.:0,9,124:.   0/0:1,0:1:3:.:.:0,3,32:.    ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:1,0:1:.:.:.:0,0,0:. 1|1:0,1:1:3:1|1:1977_A_G:45,3,0:1977    0/0:1,0:1:3:.:.:0,3,45:.

The first line here, beginning with a single # shows us the names of the columns, which may be referred to in the file header like we saw above. Each subsequent line represents one variant relative to the reference genome, with the following column definitions:

CHROM: The chromosome or assembly scaffold in the reference genome.
POS: The position of the variant in the reference genome.
ID: A name for the variant. If the variant has no name, just the . character.
REF: The allele in the reference genome.
ALT: The alternate allele in the the sample.
QUAL: The quality score of the variant call. How this is calculated depends on the variant calling program.
FILTER: The filter label for the current variant. If the variant isn’t filter, this will be PASS (or sometimes .).
INFO: Additional info for the current variant. Usually a semi-colon (;) delimited list of information.
FORMAT: A colon (:) delimited list of the keys for the information in the subsequent columns. These keys are defined in the file header. 10+. SAMPLE GENOTYPES: Each subsequent column contains a colon (:) delimited list that follows the keys in the FORMAT column for the genotypes of the samples. Each column corresponds to one sample with the name being the column header.

The genotypes are encoded in the sample columns with the GT key. In this file, it is the first entry. For example, with the keys defined in column 9, the genotype of the first sample:

./.:0,0:0:.:.:.:0,0,0:.

means that the GT (GENOTYPE from the header) is ./., the AD (ALLELIC DEPTH) is 0,0, the DP (DEPTH) is 0, and so on.

Genotypes use the following syntax, with 0 being the reference allele and 1 being the alternate allele:

0/0 means this sample is homozygous for the reference allele.
0/1 or 1/0 means this sample is heterozygous for the reference allele (0) and the alternate allele (1).
1/1 means this sample is homozygous for the alternate allele.
. is used for missing; this site was not genotyped for this individual, which happens to be the case for the sample outlined above.

For VCF files with multiple samples (like this one), sites can have more than two alleles, in which case the genotypes 0/2 or 0/3 or 2/2 may be seen.

There are many quirks to VCF files, but they are also highly standardized, so if you find yourself working with them be sure to read the very detailed specifications here:

VCF file specs

Note that the binary version of VCF files are BCF files, usually with the `.bcf` file extension. BCF files are more efficient to read than VCF files, and are typically compressed.

There are a myriad of ways to use and analyze a VCF, but for general processing of them, we will use bcftools. bcftools falls under the same development umbrella as samtools and thus shares much of the same usage and underlying philosophy to act like a native Unix command. It was preceded by vcftools, which still retains some useful functionality, but is no longer developed.

bcftools

Much like bedtools, bcftools has a wide range of functions so we won’t be able to get to everything today. Be sure to read the documentation to find out if it can accomplish the task you need with your VCF or BCF file.

One of the most basic things you can do to a file is to view it. Indeed, looking at your data is an important part of each step of a bioinformatics workflow.

Like samtools, bcftools has a dedicated view command.

Exercise: 1. Run the code block below to view the top of the Amazon molly VCF file. 2. Edit the bcftools view command to exclude the header from the output.


## Use the view command to display the VCF file, then edit it to exclude the header
## data3/poeFor_NW_006799939.vcf
bcftools view -H data3/poeFor_NW_006799939.vcf | head -n 10
## Use the view command to display the VCF file, then edit it to exclude the header
## NOTE: ignore the "cannot write" error! This is just an artifact of our setup for the workshop today.

## NW_006799939.1   1985    .   T   C   31.46   .   AC=2;AF=0.25;AN=8;DP=7;ExcessHet=0.3218;FS=0;InbreedingCoeff=0.3548;MLEAC=2;MLEAF=0.25;MQ=40;QD=31.46;SOR=1.609 GT:AD:DP:GQ:PGT:PID:PL:PS   ./.:0,0:0:.:.:.:0,0,0:. 0/0:3,0:3:9:.:.:0,9,124:.   0/0:1,0:1:3:.:.:0,3,32:.    ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:1,0:1:.:.:.:0,0,0:. 1|1:0,1:1:3:1|1:1977_A_G:45,3,0:1977    0/0:1,0:1:3:.:.:0,3,45:.
## NW_006799939.1   18613   .   G   A   32.84   .   AC=3;AF=0.107;AN=28;BaseQRankSum=0.967;DP=54;ExcessHet=3.5218;FS=5.869;InbreedingCoeff=-0.0251;MLEAC=3;MLEAF=0.107;MQ=46.57;MQRankSum=-1.15;QD=2.19;ReadPosRankSum=1.38;SOR=2.91    GT:AD:DP:GQ:PGT:PID:PL:PS   0/1:4,1:5:17:.:.:17,0,159:. 0/0:3,0:3:9:.:.:0,9,119:.   0/0:5,0:5:15:.:.:0,15,181:. 0/0:8,1:9:1:.:.:0,1,324:.   0/0:2,0:2:6:.:.:0,6,80:.    0/1:3,1:4:20:.:.:20,0,103:. 0|1:5,1:6:25:0|1:18613_G_A:25,0,158:18613   0/0:5,0:5:15:.:.:0,15,193:. 0/0:3,0:3:9:.:.:0,9,118:.   0/0:3,0:3:9:.:.:0,9,116:.   0/0:2,0:2:6:.:.:0,6,85:.    ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. 0/0:2,0:2:6:.:.:0,6,88:.    0/0:3,0:3:9:.:.:0,9,123:.   0/0:1,0:1:3:.:.:0,3,39:.
## NW_006799939.1   18638   .   G   C   2025.74 .   AC=30;AF=0.938;AN=32;BaseQRankSum=-0.14;DP=62;ExcessHet=3.1527;FS=6.816;InbreedingCoeff=-0.1453;MLEAC=25;MLEAF=0.781;MQ=54.98;MQRankSum=1.59;QD=29.2;ReadPosRankSum=0.589;SOR=0.617 GT:AD:DP:GQ:PGT:PID:PL:PS   1/1:0,4:4:12:.:.:140,12,0:. 1/1:0,4:4:12:.:.:150,12,0:. 1/1:0,4:4:12:.:.:159,12,0:. 0/1:1,8:9:18:.:.:286,0,18:. 1/1:0,2:2:6:.:.:80,6,0:.    1/1:0,5:5:15:.:.:202,15,0:. 0|1:1,5:6:27:1|0:18613_G_A:141,0,27:18613   1/1:0,5:5:15:.:.:198,15,0:. 1/1:0,4:4:12:.:.:170,12,0:. 1/1:0,3:3:9:.:.:124,9,0:.   1/1:0,3:3:9:.:.:108,9,0:.   1/1:0,1:1:3:.:.:34,3,0:.    1/1:0,1:1:3:.:.:35,3,0:.    1/1:0,1:1:3:.:.:32,3,0:.    1/1:0,5:5:15:.:.:198,15,0:. 1/1:0,1:1:3:.:.:37,3,0:.
## NW_006799939.1   18706   .   C   T   48.85   .   AC=3;AF=0.1;AN=30;BaseQRankSum=0.967;DP=67;ExcessHet=0.2929;FS=0;InbreedingCoeff=0.0961;MLEAC=2;MLEAF=0.067;MQ=52.92;MQRankSum=-0.967;QD=12.21;ReadPosRankSum=-0.967;SOR=2.833  GT:AD:DP:GQ:PL  1/1:0,1:2:3:42,3,0  0/0:2,0:2:6:0,6,78  0/0:6,0:6:0:0,0,164 0/0:6,0:6:12:0,12,180   0/0:3,0:3:9:0,9,121 0/0:9,0:9:27:0,27,343   0/0:7,0:7:0:0,0,169 ./.:0,0:0:.:0,0,0   0/0:4,0:4:9:0,9,135 0/0:4,0:4:12:0,12,180   0/0:3,0:3:9:0,9,124 0/0:4,0:4:0:0,0,93  0/0:4,0:4:12:0,12,165   0/0:5,0:5:0:0,0,138 0/0:5,0:5:0:0,0,43  0/1:2,1:3:31:31,0,74
## NW_006799939.1   18860   .   G   C   306.82  .   AC=6;AF=0.2;AN=30;BaseQRankSum=0.524;DP=50;ExcessHet=1.2974;FS=31.91;InbreedingCoeff=-0.0458;MLEAC=6;MLEAF=0.2;MQ=47.79;MQRankSum=-1.282;QD=25.57;ReadPosRankSum=-0.524;SOR=2.775   GT:AD:DP:GQ:PGT:PID:PL:PS   0|1:3,3:8:99:0|1:18860_G_C:117,0,112:18860  0/0:5,0:5:15:.:.:0,15,204:. 0/0:3,0:3:9:.:.:0,9,97:.    0|1:3,3:7:99:0|1:18860_G_C:117,0,109:18860  0/0:2,0:2:6:.:.:0,6,79:.    0/0:3,0:3:9:.:.:0,9,110:.   0/0:2,0:2:6:.:.:0,6,49:.    0/0:5,0:5:15:.:.:0,15,193:. 0/0:2,0:2:6:.:.:0,6,74:.    0|1:1,1:3:39:0|1:18860_G_C:39,0,39:18860    0/0:1,0:1:3:.:.:0,3,42:.    0/0:1,0:1:3:.:.:0,3,39:.    1|1:0,1:1:3:1|1:18860_G_C:45,3,0:18860  ./.:0,0:0:.:.:.:0,0,0:. 0|1:4,1:5:30:0|1:18860_G_C:30,0,165:18860   0/0:1,0:1:3:.:.:0,3,44:.
## NW_006799939.1   19009   .   G   A   51.6    .   AC=1;AF=0.031;AN=32;BaseQRankSum=-0.992;DP=76;ExcessHet=3.0103;FS=3.01;InbreedingCoeff=-0.0974;MLEAC=1;MLEAF=0.031;MQ=56.02;MQRankSum=-1.732;QD=6.45;ReadPosRankSum=-1.668;SOR=1.911    GT:AD:DP:GQ:PGT:PID:PL:PS   0/0:7,0:7:21:.:.:0,21,279:. 0/0:5,0:5:0:.:.:0,0,112:.   0|1:6,2:9:66:0|1:19009_G_A:66,0,246:19009   0/0:1,0:1:3:.:.:0,3,37:.    0/0:5,0:5:12:.:.:0,12,180:. 0/0:6,0:6:18:.:.:0,18,238:. 0/0:8,0:8:24:.:.:0,24,315:. 0/0:5,0:5:15:.:.:0,15,204:. 0/0:1,0:1:3:.:.:0,3,39:.    0/0:7,0:7:21:.:.:0,21,287:. 0/0:3,0:3:9:.:.:0,9,111:.   0/0:4,0:4:12:.:.:0,12,162:. 0/0:4,0:4:12:.:.:0,12,154:. 0/0:4,0:4:12:.:.:0,12,147:. 0/0:5,0:5:12:.:.:0,12,180:. 0/0:2,0:2:6:.:.:0,6,90:.
## NW_006799939.1   19028   .   G   A   83.85   .   AC=1;AF=0.033;AN=30;BaseQRankSum=-1.602;DP=79;ExcessHet=3.0103;FS=3.09;InbreedingCoeff=-0.0545;MLEAC=1;MLEAF=0.033;MQ=52.96;MQRankSum=-2.744;QD=7.62;ReadPosRankSum=-0.191;SOR=2.093    GT:AD:DP:GQ:PGT:PID:PL:PS   0/0:10,0:10:27:.:.:0,27,405:.   0/0:6,0:6:0:.:.:0,0,87:.    0|1:8,3:11:97:0|1:19009_G_A:97,0,320:19009  ./.:1,0:1:.:.:.:0,0,0:. 0/0:3,0:3:9:.:.:0,9,116:.   0/0:7,0:7:0:.:.:0,0,216:.   0/0:10,0:10:6:.:.:0,6,335:. 0/0:6,0:6:0:.:.:0,0,166:.   0/0:1,0:1:3:.:.:0,3,39:.    0/0:6,0:6:0:.:.:0,0,177:.   0/0:1,0:1:3:.:.:0,3,14:.    0/0:3,0:3:9:.:.:0,9,123:.   0/0:4,0:4:9:.:.:0,9,135:.   0/0:4,0:4:12:.:.:0,12,147:. 0/0:4,0:4:12:.:.:0,12,169:. 0/0:2,0:2:6:.:.:0,6,89:.
## NW_006799939.1   19149   .   G   A   2633.86 .   AC=26;AF=0.813;AN=32;BaseQRankSum=0.431;DP=77;ExcessHet=5.4855;FS=6.445;InbreedingCoeff=-0.2699;MLEAC=24;MLEAF=0.75;MQ=45.74;MQRankSum=1.15;QD=28.55;ReadPosRankSum=0.842;SOR=0.123 GT:AD:DP:GQ:PGT:PID:PL:PS   0/1:1,4:5:30:.:.:143,0,30:. 1|1:0,4:4:12:1|1:19149_G_A:177,12,0:19149   0/1:1,2:3:36:.:.:75,0,36:.  1/1:0,5:5:15:.:.:202,15,0:. 1/1:0,1:1:3:.:.:43,3,0:.    0/1:2,8:10:56:.:.:305,0,56:.    0/1:1,3:4:33:.:.:108,0,33:. 1/1:0,3:3:9:.:.:135,9,0:.   1/1:0,3:3:9:.:.:135,9,0:.   1/1:0,4:4:12:.:.:160,12,0:. 1/1:0,6:6:18:.:.:246,18,0:. 1/1:0,6:6:18:.:.:261,18,0:. 1/1:0,4:4:12:.:.:171,12,0:. 1/1:0,7:7:21:.:.:311,21,0:. 0/1:1,2:3:34:.:.:67,0,34:.  0/1:1,4:5:29:.:.:164,0,29:.
## NW_006799939.1   19171   .   C   T   312.01  .   AC=11;AF=0.344;AN=32;BaseQRankSum=0.566;DP=86;ExcessHet=14.6009;FS=13.547;InbreedingCoeff=-0.5693;MLEAC=11;MLEAF=0.344;MQ=46;MQRankSum=-1.383;QD=31.2;ReadPosRankSum=-0.366;SOR=2.985   GT:AD:DP:GQ:PGT:PID:PL:PS   0/1:4,1:5:25:.:.:25,0,158:. 0|1:4,1:5:19:1|0:19149_G_A:19,0,165:19149   0/1:4,1:5:30:.:.:30,0,146:. 0/0:4,0:4:12:.:.:0,12,169:. 0/1:3,2:5:73:.:.:73,0,109:. 0/1:7,1:8:19:.:.:19,0,291:. 0/1:3,2:5:68:.:.:68,0,120:. 0/0:6,0:6:18:.:.:0,18,241:. 0/0:4,0:4:12:.:.:0,12,175:. 0/1:5,1:6:27:.:.:27,0,192:. 0/1:5,1:6:27:.:.:27,0,207:. 0/0:4,0:4:12:.:.:0,12,179:. 0/1:6,1:7:24:.:.:24,0,249:. 0/1:6,1:7:13:.:.:13,0,240:. 0/1:1,1:2:38:.:.:39,0,38:.  0/0:6,0:6:0:.:.:0,0,169:.
## NW_006799939.1   19467   .   T   A   2931.49 .   AC=17;AF=0.531;AN=32;BaseQRankSum=-0.306;DP=157;ExcessHet=33.3406;FS=2.909;InbreedingCoeff=-0.8939;MLEAC=17;MLEAF=0.531;MQ=58.71;MQRankSum=-0.431;QD=18.91;ReadPosRankSum=0.244;SOR=0.952   GT:AD:DP:GQ:PGT:PID:PL:PS   0|1:9,6:15:99:0|1:19467_T_A:213,0,349:19467 0|1:6,5:11:99:0|1:19467_T_A:188,0,234:19467 0|1:5,6:11:99:0|1:19467_T_A:237,0,183:19467 0|1:9,7:16:99:0|1:19467_T_A:267,0,357:19467 0|1:4,6:10:99:0|1:19467_T_A:240,0,126:19467 0|1:7,4:11:99:0|1:19467_T_A:142,0,270:19467 0|1:10,7:17:99:0|1:19467_T_A:255,0,394:19467    0|1:2,4:6:72:0|1:19467_T_A:156,0,72:19467   0|1:8,5:13:99:0|1:19467_T_A:186,0,321:19467 0|1:2,4:6:67:0|1:19467_T_A:151,0,67:19467   0|1:1,2:3:32:0|1:19467_T_A:81,0,32:19467    1|1:0,3:3:9:1|1:19467_T_A:127,9,0:19467 0|1:2,7:9:63:0|1:19467_T_A:277,0,63:19467   0|1:1,7:8:21:0|1:19467_T_A:291,0,21:19467   0|1:5,3:8:99:0|1:19467_T_A:108,0,194:19467  0|1:6,2:8:65:0|1:19467_T_A:65,0,233:19467

So view acts much the same way as the Unix command cat, but has specific functionality for the VCF format, and also handles compressed BCF files natively.

One useful function when viewing files is to be able to subset or filter them. We’ve provided a file that lists the sample IDs for a made-up population and can use bcftools view to extract genotypes from specific samples.

Run the code block below to extract SNPs only from a subset of samples:


cat data3/pop1.txt
echo "-------"
# cat: A Unix command that displays the contents of a file in the Terminal
# echo: A Unix command that simply prints the provided input to the screen
# These two commands are just to see the pop1.txt file

bcftools view -S data3/pop1.txt -H data3/poeFor_NW_006799939.vcf | head
# bcftools: A suite of programs to process VCF files
# view: The sub-program of bcftools to execute
# -S : A bcftools view option that specifies to only extract genotype columns from the samples listed in the provided file
# -H : A bcftools view option that specifies to exclude the VCF header from the output
## NOTE: ignore the "cannot write" error! This is just an artifact of our setup for the workshop today.

## SAMN03568550
## SAMN03568551
## SAMN03568552
## SAMN03568553
## SAMN03568554
## SAMN03568555
## SAMN03568556
## SAMN03568557
## -------
## NW_006799939.1   1985    .   T   C   31.46   .   AC=0;AF=0.25;AN=4;DP=7;ExcessHet=0.3218;FS=0;InbreedingCoeff=0.3548;MLEAC=2;MLEAF=0.25;MQ=40;QD=31.46;SOR=1.609 GT:AD:DP:GQ:PGT:PID:PL:PS   ./.:0,0:0:.:.:.:0,0,0:. 0/0:3,0:3:9:.:.:0,9,124:.   0/0:1,0:1:3:.:.:0,3,32:.    ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:. ./.:0,0:0:.:.:.:0,0,0:.
## NW_006799939.1   18613   .   G   A   32.84   .   AC=3;AF=0.107;AN=16;BaseQRankSum=0.967;DP=54;ExcessHet=3.5218;FS=5.869;InbreedingCoeff=-0.0251;MLEAC=3;MLEAF=0.107;MQ=46.57;MQRankSum=-1.15;QD=2.19;ReadPosRankSum=1.38;SOR=2.91    GT:AD:DP:GQ:PGT:PID:PL:PS   0/1:4,1:5:17:.:.:17,0,159:. 0/0:3,0:3:9:.:.:0,9,119:.   0/0:5,0:5:15:.:.:0,15,181:. 0/0:8,1:9:1:.:.:0,1,324:.   0/0:2,0:2:6:.:.:0,6,80:.    0/1:3,1:4:20:.:.:20,0,103:. 0|1:5,1:6:25:0|1:18613_G_A:25,0,158:18613   0/0:5,0:5:15:.:.:0,15,193:.
## NW_006799939.1   18638   .   G   C   2025.74 .   AC=14;AF=0.938;AN=16;BaseQRankSum=-0.14;DP=62;ExcessHet=3.1527;FS=6.816;InbreedingCoeff=-0.1453;MLEAC=25;MLEAF=0.781;MQ=54.98;MQRankSum=1.59;QD=29.2;ReadPosRankSum=0.589;SOR=0.617 GT:AD:DP:GQ:PGT:PID:PL:PS   1/1:0,4:4:12:.:.:140,12,0:. 1/1:0,4:4:12:.:.:150,12,0:. 1/1:0,4:4:12:.:.:159,12,0:. 0/1:1,8:9:18:.:.:286,0,18:. 1/1:0,2:2:6:.:.:80,6,0:.    1/1:0,5:5:15:.:.:202,15,0:. 0|1:1,5:6:27:1|0:18613_G_A:141,0,27:18613   1/1:0,5:5:15:.:.:198,15,0:.
## NW_006799939.1   18706   .   C   T   48.85   .   AC=2;AF=0.1;AN=14;BaseQRankSum=0.967;DP=67;ExcessHet=0.2929;FS=0;InbreedingCoeff=0.0961;MLEAC=2;MLEAF=0.067;MQ=52.92;MQRankSum=-0.967;QD=12.21;ReadPosRankSum=-0.967;SOR=2.833  GT:AD:DP:GQ:PL  1/1:0,1:2:3:42,3,0  0/0:2,0:2:6:0,6,78  0/0:6,0:6:0:0,0,164 0/0:6,0:6:12:0,12,180   0/0:3,0:3:9:0,9,121 0/0:9,0:9:27:0,27,343   0/0:7,0:7:0:0,0,169 ./.:0,0:0:.:0,0,0
## NW_006799939.1   18860   .   G   C   306.82  .   AC=2;AF=0.2;AN=16;BaseQRankSum=0.524;DP=50;ExcessHet=1.2974;FS=31.91;InbreedingCoeff=-0.0458;MLEAC=6;MLEAF=0.2;MQ=47.79;MQRankSum=-1.282;QD=25.57;ReadPosRankSum=-0.524;SOR=2.775   GT:AD:DP:GQ:PGT:PID:PL:PS   0|1:3,3:8:99:0|1:18860_G_C:117,0,112:18860  0/0:5,0:5:15:.:.:0,15,204:. 0/0:3,0:3:9:.:.:0,9,97:.    0|1:3,3:7:99:0|1:18860_G_C:117,0,109:18860  0/0:2,0:2:6:.:.:0,6,79:.    0/0:3,0:3:9:.:.:0,9,110:.   0/0:2,0:2:6:.:.:0,6,49:.    0/0:5,0:5:15:.:.:0,15,193:.
## NW_006799939.1   19009   .   G   A   51.6    .   AC=1;AF=0.031;AN=16;BaseQRankSum=-0.992;DP=76;ExcessHet=3.0103;FS=3.01;InbreedingCoeff=-0.0974;MLEAC=1;MLEAF=0.031;MQ=56.02;MQRankSum=-1.732;QD=6.45;ReadPosRankSum=-1.668;SOR=1.911    GT:AD:DP:GQ:PGT:PID:PL:PS   0/0:7,0:7:21:.:.:0,21,279:. 0/0:5,0:5:0:.:.:0,0,112:.   0|1:6,2:9:66:0|1:19009_G_A:66,0,246:19009   0/0:1,0:1:3:.:.:0,3,37:.    0/0:5,0:5:12:.:.:0,12,180:. 0/0:6,0:6:18:.:.:0,18,238:. 0/0:8,0:8:24:.:.:0,24,315:. 0/0:5,0:5:15:.:.:0,15,204:.
## NW_006799939.1   19028   .   G   A   83.85   .   AC=1;AF=0.033;AN=14;BaseQRankSum=-1.602;DP=79;ExcessHet=3.0103;FS=3.09;InbreedingCoeff=-0.0545;MLEAC=1;MLEAF=0.033;MQ=52.96;MQRankSum=-2.744;QD=7.62;ReadPosRankSum=-0.191;SOR=2.093    GT:AD:DP:GQ:PGT:PID:PL:PS   0/0:10,0:10:27:.:.:0,27,405:.   0/0:6,0:6:0:.:.:0,0,87:.    0|1:8,3:11:97:0|1:19009_G_A:97,0,320:19009  ./.:1,0:1:.:.:.:0,0,0:. 0/0:3,0:3:9:.:.:0,9,116:.   0/0:7,0:7:0:.:.:0,0,216:.   0/0:10,0:10:6:.:.:0,6,335:. 0/0:6,0:6:0:.:.:0,0,166:.
## NW_006799939.1   19149   .   G   A   2633.86 .   AC=12;AF=0.813;AN=16;BaseQRankSum=0.431;DP=77;ExcessHet=5.4855;FS=6.445;InbreedingCoeff=-0.2699;MLEAC=24;MLEAF=0.75;MQ=45.74;MQRankSum=1.15;QD=28.55;ReadPosRankSum=0.842;SOR=0.123 GT:AD:DP:GQ:PGT:PID:PL:PS   0/1:1,4:5:30:.:.:143,0,30:. 1|1:0,4:4:12:1|1:19149_G_A:177,12,0:19149   0/1:1,2:3:36:.:.:75,0,36:.  1/1:0,5:5:15:.:.:202,15,0:. 1/1:0,1:1:3:.:.:43,3,0:.    0/1:2,8:10:56:.:.:305,0,56:.    0/1:1,3:4:33:.:.:108,0,33:. 1/1:0,3:3:9:.:.:135,9,0:.
## NW_006799939.1   19171   .   C   T   312.01  .   AC=6;AF=0.344;AN=16;BaseQRankSum=0.566;DP=86;ExcessHet=14.6009;FS=13.547;InbreedingCoeff=-0.5693;MLEAC=11;MLEAF=0.344;MQ=46;MQRankSum=-1.383;QD=31.2;ReadPosRankSum=-0.366;SOR=2.985    GT:AD:DP:GQ:PGT:PID:PL:PS   0/1:4,1:5:25:.:.:25,0,158:. 0|1:4,1:5:19:1|0:19149_G_A:19,0,165:19149   0/1:4,1:5:30:.:.:30,0,146:. 0/0:4,0:4:12:.:.:0,12,169:. 0/1:3,2:5:73:.:.:73,0,109:. 0/1:7,1:8:19:.:.:19,0,291:. 0/1:3,2:5:68:.:.:68,0,120:. 0/0:6,0:6:18:.:.:0,18,241:.
## NW_006799939.1   19467   .   T   A   2931.49 .   AC=8;AF=0.531;AN=16;BaseQRankSum=-0.306;DP=157;ExcessHet=33.3406;FS=2.909;InbreedingCoeff=-0.8939;MLEAC=17;MLEAF=0.531;MQ=58.71;MQRankSum=-0.431;QD=18.91;ReadPosRankSum=0.244;SOR=0.952    GT:AD:DP:GQ:PGT:PID:PL:PS   0|1:9,6:15:99:0|1:19467_T_A:213,0,349:19467 0|1:6,5:11:99:0|1:19467_T_A:188,0,234:19467 0|1:5,6:11:99:0|1:19467_T_A:237,0,183:19467 0|1:9,7:16:99:0|1:19467_T_A:267,0,357:19467 0|1:4,6:10:99:0|1:19467_T_A:240,0,126:19467 0|1:7,4:11:99:0|1:19467_T_A:142,0,270:19467 0|1:10,7:17:99:0|1:19467_T_A:255,0,394:19467    0|1:2,4:6:72:0|1:19467_T_A:156,0,72:19467

A note about subsetting

This can be very handy, but be warned - the INFO fields are not automatically recalculated, and will no longer be correct once you remove some individuals. In general, the INFO column and its information displays information about the site while the FORMAT fields display information about each sample.

SNP density with `bcftools` and `bedtools`

Many of the tools we’ve demonstrated, samtools, bedtools, bcftools, are general purpose tools for basic processing and manipulation of specific file formats used in genomics. Many of the tasks they do could be done with native bash commands, but that would require a lot of piping and probably some custom scripting on uncompressed files. However, if you really start to get to know these tools and their versatility, you can do some more advanced analyses purely with them. Let’s demonstrate by calculating SNP density across 100k windows the specific 7 Mb scaffold we are working with from the Amazon molly genome.

To do this, we’ll need:

The .fai index file of the Amazon molly genome, which we have already generated.
A bed file with the coordinates of the 100 kilobase windows in the scaffold
A VCF file with Amazon molly SNPs, which we have provided.

Since we’ve already downloaded the genome and generated the index file, let’s start with the second one, getting the window coordinates. We can do this easily with the bedtools getwindows command:

Run the code block below to generate a bed file with 100kb windows in the Amazon molly genome:


bedtools makewindows -g data3/poeFor_NW_006799939.fasta.fai -w 100000 > poeFor-windows-100k.bed
# bedtools: A suite of programs to process bed files
# makewindows: The sub-program of bedtools to execute
# -g : A genome size file, which needs two columns, the scaffold/chromosome name and the size of that feature
# -w : The size of the windows to partition each scaffold in -g into
# -b : The second interval file to check overlaps
# > : The Unix redirect operator to write the output of the command to the following file

head poeFor-windows-100k.bed
# Display the first few lines of the bed file containing windows

## NW_006799939.1   0   100000
## NW_006799939.1   100000  200000
## NW_006799939.1   200000  300000
## NW_006799939.1   300000  400000
## NW_006799939.1   400000  500000
## NW_006799939.1   500000  600000
## NW_006799939.1   600000  700000
## NW_006799939.1   700000  800000
## NW_006799939.1   800000  900000
## NW_006799939.1   900000  1000000

Now, we can use bedtools intersect, which also works with VCF files, to count how many SNPs appear in each window.

Exercise: Use bedtools intersect to count the number of SNPs in each 100kb window in the Amazon molly genome. You will need to look in the help menu of bedtools intersect to find the right option to output a column with the number of overlaps per region! Save this file as poeFor-windows-100k-snps.bed.


## Write a bedtools intersect command to count how many SNPs overlap each 100kb window in the molly genome
## data3/poeFor_NW_006799939.vcf
## poeFor-windows-100k.bed
bedtools intersect -c -a poeFor-windows-100k.bed -b data3/poeFor_NW_006799939.vcf > poeFor-windows-100k-snps.bed
## Write a bedtools intersect command to count how many SNPs overlap each 100kb window in the molly genome

head poeFor-windows-100k-snps.bed
# Display the first few lines of the bed file containing windows and overlapping SNPs

## NW_006799939.1   0   100000  1489
## NW_006799939.1   100000  200000  1684
## NW_006799939.1   200000  300000  1357
## NW_006799939.1   300000  400000  1591
## NW_006799939.1   400000  500000  1770
## NW_006799939.1   500000  600000  1178
## NW_006799939.1   600000  700000  1562
## NW_006799939.1   700000  800000  1708
## NW_006799939.1   800000  900000  1672
## NW_006799939.1   900000  1000000 1148

BONUS Exercise: If you attended the R workshop last week or know R already, use the code block below to read in the file we just generated with SNPs per 100k window and plot the density along this scaffold.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.0
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors

# Load the tidyverse

## Write code in R to plot SNP density across the scaffold in our bed file
# poeFor-windows-100k-snps.bed
am_snps = read_tsv("poeFor-windows-100k-snps.bed", col_names=c("scaffold", "start", "end", "num.snps"))

## Rows: 74 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): scaffold
## dbl (3): start, end, num.snps
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Read the bed file containing 1Mb windows and number of SNPs per window, and name the columns

snp_plot = ggplot(am_snps, aes(x=start, y=num.snps)) +
# Initialize a ggplot grob and load the data and aesthetics into it and store it as the snp_plot object
  
  geom_line() +
  # Plot the provided data as a line graph
  
  #facet_wrap(~scaffold, scales="free") +
  # Split the plot by scaffold
  
  theme_classic()
  # Set a different theme
## Write code in R to plot SNP density across the scaffold in our bed file

print(snp_plot)

# Display the plot

Scripts

So far, we’ve been focused on interactive analysis of small files. However, in real bioinformatics work, you will likely find yourself with many samples that you want to run the same commands on. And often these commands can require significant compute time, such that you want to be able to run these commands non-interactively. As nice as these markdown notebooks are for teaching and data exploration, they are not the ideal format for this task. Instead, what we want are scripts.

A bash script (like an R script, if you attended the R workshop earlier), is just a text file that executes a series of commands in order. Bash scripts can get exceedingly complicated, but today we are just going to introduce a few basic concepts.

We are ultimately going to make a script that takes as input a VCF file and one or more bed files, and reports as output the average SNP density in regions of the genome inside bed intervals. Let’s start simple.

Using the file menu, go to New file –> Shell script and create a new file. Call it snp-density.sh.This is going to be our script in a new tab of our text editor, that we will modify as we go.

The first thing we need to do is a little bit of shell magic to get the shell to treat the script we are writing as a command. This is called a shebang (#!) and basically says “run the script with the following interpreter”. Type the following line at the top of your script:

#!/bin/bash

This means that when we run snp-density.sh on the command line, the commands in the script will be interpreted as shell commands.

We need to do one more thing, which is make our script executable. We’ll use the chmod command for this.

In the Terminal below, make sure you are in the same directory as your script file, and then type the following command.

chmod +x snp-density.sh

We now have a script we can run.

In the code block below, try running your (empty) script. Note that the ./ specifies the relative path to the script, and means “execute the script snp-density.sh in the current working directory”. If omitted, you’ll get a “command not found” error.


chmod +x snp-density-0.sh
# Make the script executable

./snp-density-0.sh
# ./ : Execute the following file as a shell script

echo "done"
# echo: A Unix command that simply prints the provided input to the screen

## done

However, nothing will happen if we do, since there are currently no commands in the file to run.

Exercise: In order to have our shell script do anything, we need to put some commands in it. 1. Copy the bedtools intersect command from above into the script. 2. Instead of writing the output of the command to the file, pipe the output of the command to awk. 3. Use awk, to calculate the average number of SNPs per base across intervals. To do this, you’ll need to increment two counters for each row of the output: one with the number of snps in that row, the other with the length of the element in that row, and then divide the number of snps by the total length. Warning: awk has a built-in function called length, so you’ll need to call your length variable something else.

Now, try running your script again.

In the code block below, try running your script.


chmod +x snp-density-1.sh
# Make the script executable

./snp-density-1.sh
# ./ : Execute the following file as a shell script

## 0.0143416

We should get a number now.

However, we have a problem. Ideally, we’d like to be able to do this calculation on any VCF file and any bed file. Right now those files are hard-coded, which means to change what file the script operates on requires changing the script.

To make our script flexible, we need to introduce the concept of variables. We’ve seen this in awk commands; we can also create variables in the shell. We do this just by assigning something (text or the output of a command, usually) to a string.

Run the code block below to create a variable called TEXT, and print it out with the echo command.


TEXT="Hello World!"
# = : Assign the variable name on the left to the value on the right

echo $TEXT
# echo: A Unix command that simply prints the provided input to the screen
# $ : A symbol that the shell understands to mean that the following is the name of a variable

## Hello World!

Note that when we access what is stored in the variable we use the $ notation, but we don’t use this when we assign to a variable. Note also we cannot have spaces before or after the equal sign or the shell will thing we are trying to run the TEXT command, not create a variable called TEXT.

Exercise: 1. Modify our shell script, snp-density.sh, so that it includes two variables, one called VCF and one called BED. For now, define these in the second and third lines of the script with the paths to the BED file and the VCF file we currently have. 2. Replace the hard-coded file names in the bedtools intersect command options -a and -b with the BED and VCF variables. 3. In the code block below run your script again and make sure it still works.


chmod +x snp-density-2.sh
# Make the script executable

./snp-density-2.sh
# ./ : Execute the following file as a shell script

## 0.0143416

You should get the same answer as before.

Note that right now, we aren’t actually changing anything, we’ve just moved the files we are going to operate on to the top of the script, but we still would need to change the script each time we run it on different files. To make our script more general (able to take various inputs more easily), we should have a way to change the input file without having to edit the script. The way to do this is with command line options. You’ve seen this a lot in the tools we’ve run up to this point. For instance, when we run almost any shell command, it requires a filename, e.g.:


head poeFor-windows-100k.bed

In this case, the file name poeFor-windows-100k.bed, is a command line option, also known as a command line argument or just option, argument, or parameter. Commands and scripts can take multiple arguments. Some are positional, meaning that the way they are read into the program depends on their placement in the command. Others have specific flags, meaning the way they are read into the program depends on the flag they come after in the command. For instance:


head -n 20 poeFor-windows-100k.bed

This is a command that hopefully looks pretty understandable to you by now. This command, head has two command line options as we’ve written it:

A flag, -n 20, which changes the number of lines the command displays. Internally, this sets the value of some variable to a different number.
A positional argument, the file name poeFor-windows-100k.bed, which is the last listed argument. Again this internally sets the value of some variable to the path to that file (actually, for most bash commands, the first un-flagged option is usually reserved for the file name).

Well, we can modify our script to take input from the command line as well. Shell scripts have access to special variables that can be used to create simple, positional, command line options. Specifically, the variables $1, $2, $3, etc will contain the first, second, third, etc. arguments specified after the script is called.

Exercise: 1. Modify your script to replace the path to the bed file with $1, and the path to the VCF file with $2 using positional command line arguments. 2. Run the following code block to call your script again, this time taking arguments, with the bed file first and the vcf second.


chmod +x snp-density-3.sh
# Make the script executable

./snp-density-3.sh poeFor-windows-100k.bed data3/poeFor_NW_006799939.vcf 
# ./ : Execute the following file as a shell script

## 0.0143416

If everything is working, you should get the same answer as before. But now we can give our script any bed file and any VCF file, right from the command line. Note that we don’t do any error checking - if you accidentally specify a VCF file that uses a different reference genome than your bed file, nothing in the script will prevent this.

This is better, but it still requires a lot of typing if we want to run this on 10 files. If, for example, we wanted to compute SNP density separately for each chromosome, or for different interval types (e.g., genes, introns, exons), we’d have to type out each bed file separately.

We can get around this by using loops.

The most common kind of loop we’ll use is a for loop, which iterates over elements.

Run the following code block to see a for loop in action


# a simple loop
for i in 1 2 3;
do
  echo $i
done

## 1
## 2
## 3

Importantly, a loop can iterate over the output of another command.

Run the following code block to iterate over the results of the ls command in a for loop


# use the $() syntax to capture the output of a command into a loop; by default each newline will be a separate iteration

for FILE in $(ls)
do
  echo $FILE
done

## bed_files
## Biotips-workshop-2023-Day1-instructor.html
## Biotips-workshop-2023-Day1-instructor.Rmd
## Biotips-workshop-2023-Day1-student.Rmd
## Biotips-workshop-2023-Day2-instructor.html
## Biotips-workshop-2023-Day2-instructor.Rmd
## Biotips-workshop-2023-Day2-student.Rmd
## Biotips-workshop-2023-Day3-instructor_files
## Biotips-workshop-2023-Day3-instructor.html
## Biotips-workshop-2023-Day3-instructor.Rmd
## Biotips-workshop-2023-Day3-student.Rmd
## Biotips-workshop-2023-Day4-instructor.Rmd
## Biotips-workshop-2023-Day4-slurm-instructor.Rmd
## coverage-submit.sh
## css
## data3
## end.html
## exon.bed
## img
## index.html
## links.html
## poeFor_beds
## poeFor-windows-100k.bed
## poeFor-windows-100k-snps.bed
## programs.html
## scripts
## snp-density-0.sh
## snp-density-1.sh
## snp-density-2.sh
## snp-density-3.sh
## snp-density-4.sh
## snp-density-final.sh
## snp-density.sh
## snp-density-simple.sh
## start.html
## tables
## terms.html

All this does right now is print out the files in our current working directory, just like running the ls command would.

Loops are very powerful. For example, let’s say we want to make a few different subsets of the annotation we have for the Amazon molly: coding exons, introns, intergenic regions, UTRs.

Let’s start by writing an awk command that will read through the GFF file associated with the Amazon molly reference genome, find all the exons in it, and make a bed file of just exons intervals. Since all the information for a bed file is also contained in a GFF file (scaffold, start, and end positions) we can convert a GFF to a bed with awk:

BEGIN{OFS="\t"}; {print $1, $4-1, $5}

The BEGIN{OFS="\t"} tells awk to use a tab separator in the output (required for bedtools), and then we print the first, fourth, and fifth columns, remembering that bed and GFF files have different coordinate systems.

Run the code block below to convert the exon coordinates in the Amazon molly GFF file to bed format:


awk 'BEGIN{OFS="\t"}; $3=="exon"{print $1, $4-1, $5}' data3/poeFor_NW_006799939.gff > exon.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
# > : The Unix redirect operator to write the output of the command to the following file

head exon.bed

## NW_006799939.1   844 1347
## NW_006799939.1   4648    4710
## NW_006799939.1   11357   11431
## NW_006799939.1   28599   28684
## NW_006799939.1   29916   30008
## NW_006799939.1   30191   30266
## NW_006799939.1   31518   31530
## NW_006799939.1   36141   36299
## NW_006799939.1   36412   36520
## NW_006799939.1   36633   36731

We could just copy and paste this command line to run this again to get bed files for other features like genes, mRNAs, CDS, etc. But we can also use a loop! Note that awk, by default doesn’t have access to shell variables, but we can pass them to an awk script with the -v option.


mkdir -p poeFor_beds
# mkdir: The Unix command to make a new directory
# -p: This option tells mkdir to make any parent directories that do not exist

GFF="data3/poeFor_NW_006799939.gff"
# = : Assign the variable name on the left to the value on the right

for TYPE in exon CDS mRNA gene;
do
  awk -v type=$TYPE 'BEGIN{OFS="\t"}; $3==type {print $1, $4-1, $5}' $GFF | bedtools sort -i - | bedtools merge > poeFor_beds/$TYPE.bed
done
# This loop runs an awk script on several different inputs and pipes it to bedtools

ls poeFor_beds
# ls: The Unix list directory contents command

## CDS.bed
## exon.bed
## gene.bed
## intergenic.bed
## intron.bed
## mRNA.bed
## UTR.bed

We are almost done; let’s make a few more using bedtools complement, which reports all regions of the genome NOT in an interval, and bedtools subtract, which removes overlaps. For example, we don’t have an intergenic feature in our GFF, but we can create one by getting everything that is not a gene. And we don’t have a UTR feature or an intron feature, but we can get them by taking our exons and removing coding sequences (CDS) or by taking genes and removing exons.


bedtools complement -i poeFor_beds/gene.bed -g data3/poeFor_NW_006799939.fasta.fai > poeFor_beds/intergenic.bed
# bedtools: A suite of programs to process bed files
# complement: The sub-program of bedtools to execute
# -i: The input bed file
# -g:  A genome size file, which needs two columns, the scaffold/chromosome name and the size of that feature
# > : The Unix redirect operator to write the output of the command to the following file

bedtools subtract -a poeFor_beds/gene.bed -b poeFor_beds/exon.bed | bedtools sort -i - | bedtools merge > poeFor_beds/intron.bed
bedtools subtract -a poeFor_beds/exon.bed -b poeFor_beds/CDS.bed | bedtools sort -i - | bedtools merge > poeFor_beds/UTR.bed
# Two commands that subtract bed intervals in one file (-b) from another (-a) and pipe both with | and - into other bedtools commands and
# finally redirect to an output bed file.

wc -l poeFor_beds/*.bed
# Display the number of lines in each bed file in the poeFor_beds directory using wc -l and a wildcard (*)
# Note that here we have run a command on multiple files without an explicit loop, thanks to regex!

##   1800 poeFor_beds/CDS.bed
##   2053 poeFor_beds/exon.bed
##    174 poeFor_beds/gene.bed
##    175 poeFor_beds/intergenic.bed
##   1879 poeFor_beds/intron.bed
##    144 poeFor_beds/mRNA.bed
##    621 poeFor_beds/UTR.bed
##   6846 total

Now we have a directory of bed files, and we might want to run our script over each one, getting the average SNP density by feature type. We can do this with a loop.

Exercise: 1. Modify the code block below to add a for loop that iterates over the bed files in the poeFor_beds directory, and for each bed file calls the snp density script.


VCF="data3/poeFor_NW_006799939.vcf"
# = : Assign the variable name on the left to the value on the right

# add a for loop that iterates over each bed file in the poeFor_beds directory
for BED in $(ls poeFor_beds/*.bed)
do
  echo "SNP density for $BED:"
  ./snp-density-3.sh $BED $VCF
  echo "---"
done
# add a for loop that iterates over each bed file in the poeFor_beds directory

# Loop over each bed file in our directory and run our snp-density script on it and the VCF file

## SNP density for poeFor_beds/CDS.bed:
## 0.00824013
## ---
## SNP density for poeFor_beds/exon.bed:
## 0.0105178
## ---
## SNP density for poeFor_beds/gene.bed:
## 0.0143007
## ---
## SNP density for poeFor_beds/intergenic.bed:
## 0.0144225
## ---
## SNP density for poeFor_beds/intron.bed:
## 0.0148971
## ---
## SNP density for poeFor_beds/mRNA.bed:
## 0.0142947
## ---
## SNP density for poeFor_beds/UTR.bed:
## 0.0127176
## ---

We can actually make our script a bit better by putting the loop in the script, and passing a file of files, that is a single file that has all the bed files we want to operate on, one per line, as an option.

Exercise: 1. Re-write the snp-density.sh script to take as input a single file that contains a list of bed files (one per line) in addition to the VCF file and write a loop to run the SNP density awk command on each file in that file. Hint: the cat command will return the lines in a file, so $(cat ) will let you iterate over the lines of a file in a for loop 2. Run the code block below to generate the file containing all of our bed files and run the snp-density.sh script


VCF="data3/poeFor_NW_006799939.vcf"
# = : Assign the variable name on the left to the value on the right

ls poeFor_beds/*.bed > bed_files
# Redirect the output of ls to a file called bed_files

chmod +x snp-density-4.sh
# Make the script executable

./snp-density-4.sh bed_files $VCF
# ./ : Execute the following file as a shell script

## SNP density for poeFor_beds/CDS.bed:
## 0.00824013
## ---
## SNP density for poeFor_beds/exon.bed:
## 0.0105178
## ---
## SNP density for poeFor_beds/gene.bed:
## 0.0143007
## ---
## SNP density for poeFor_beds/intergenic.bed:
## 0.0144225
## ---
## SNP density for poeFor_beds/intron.bed:
## 0.0148971
## ---
## SNP density for poeFor_beds/mRNA.bed:
## 0.0142947
## ---
## SNP density for poeFor_beds/UTR.bed:
## 0.0127176
## ---

Progress! We can now calculate SNP density for any arbitrary set of bed elements, given a VCF file (or, we could even do this over multiple VCFs by using another loop).

However, this script is kind of brittle. What if we included a non-bed file in our list of files? Our script will fail cryptically. We might want to add some basic error checking. In particular, we are going to check that the files in the file of bed files actually exist. Note that it is possible, but complicated, to do more sophisticated error handling. In this case, we just want to make sure we don’t have typos or path errors in our file list.

To do this, we need to learn about one more bit of control flow code: conditional statements, or if statements. If statements are pretty intuitive - they just test if something is True (in the logical sense), and only run the following code if it is. We’ve seen a lot of things like this in awk, when only doing actions on line that match a certain pattern. Those are implicit if statements. Here is an example of an explicit if statement in bash syntax.

Run the code block below to see how an if statement works in bash. Change the value of SPEAK to see how it affects the behavior o the statement.


SPEAK="no"
# = : Assign the variable name on the left to the value on the right

if [[ $SPEAK == "yes" ]];
then
  echo "Hello, World!"
else
  echo "sshhhh"
fi

## sshhhh

The [[ ]] syntax here is a bash test operator, that returns True or False. The commands after then are run if the condition(s) inside the [[]] are True, and the commands after else are run if the condition(s) are False.

Importantly, else statements are not always necessary if you don’t want to perform any actions if the conditional (if) statement is False.

Run the code block below to see an if statement without an else statement:


SPEAK="no"
# = : Assign the variable name on the left to the value on the right

if [[ $SPEAK == "yes" ]];
then
  echo "Hello, World!"
fi

There are some particularly useful conditional tests built in to bash that let you check file properties:

-e filename returns True if filename exists -s filename returns True if filename exists and has a size > 0 -d filename returns True if filename is a directory

You can also check string properties:

-z string returns True if string is empty or null (0-length) -n string returns True if string is not empty/null

Exercise: Write an if statement that will check if a file exists and has data, printing (with echo) “file exists” if it exists and “file is missing” if it doesn’t exist. Try changing the FILE variable to point to something else.


FILE="snp-densitsy.sh"
# = : Assign the variable name on the left to the value on the right

## Write an if statement to check if FILE exists and has data
if [[ -s $FILE ]]
then
  echo "file exists"
else
  echo "file doesn't exist"
fi
## Write an if statement to check if FILE exists and has data

## file doesn't exist

Exercise: 1. In our snp-density.sh script, use an if statement to check if each file in the input file of bed files exists and only run our bedtools command on the file if it does. Note that you don’t need to include an else statement if you don’t want. 2. BONUS: Another user-friendly behavior for scripts is to tell the user how to run the program. A full list of command line options is warranted for some programs, but for a simply one like this only a simple message is required. To do this, write another if statement that checks if both of the command line options ($1 and $2) have been provided and if not prints a user-friendly message about how to run the program, e.g:

echo "Usage: ./snp-density.sh <file of bed files to process> <vcf file>"

For this you will have to check two conditions with an AND operator using the && syntax, e.g. if [[ test1 && test2 ]].

Run the code block below to test your script. If you did the bonus, remove either bed_files or $VCF from the command that calls your script to test the usage message:


VCF="data3/poeFor_NW_006799939.vcf"
# = : Assign the variable name on the left to the value on the right

echo "bed_doesnt_exist.bed" >> bed_files
# Add a non-existant file name to our list of bed files to check our error-handling if statement

chmod +x snp-density-final.sh
# Make the script executable

./snp-density-final.sh bed_files $VCF
# ./ : Execute the following file as a shell script

# Checking the Usage message if the vcf file is missing:
./snp-density-final.sh bed_files 
# ./ : Execute the following file as a shell script

## SNP density for poeFor_beds/CDS.bed:
## 0.00824013
## ---
## SNP density for poeFor_beds/exon.bed:
## 0.0105178
## ---
## SNP density for poeFor_beds/gene.bed:
## 0.0143007
## ---
## SNP density for poeFor_beds/intergenic.bed:
## 0.0144225
## ---
## SNP density for poeFor_beds/intron.bed:
## 0.0148971
## ---
## SNP density for poeFor_beds/mRNA.bed:
## 0.0142947
## ---
## SNP density for poeFor_beds/UTR.bed:
## 0.0127176
## ---
## File bed_doesnt_exist.bed does not exist. Skipping!
## Usage: ./snp-density.sh <file of bed files to process> <vcf file>

End of Day 3

Tomorrow, we’ll learn how to run scripts like this using the job scheduling software SLURM, so we can submit them to run non-interactively on the cluster.

Harvard Informatics Bioinformatics Tips Workshop

Day 3: VCF files and shell scripts

Tim Sackton, Gregg Thomas

March 29, 2023

Setting up our link to the data files

VCF files

Note that the binary version of VCF files are BCF files, usually with the `.bcf` file extension. BCF files are more efficient to read than VCF files, and are typically compressed.

bcftools

A note about subsetting

SNP density with `bcftools` and `bedtools`

Scripts

End of Day 3

Harvard Informatics Bioinformatics Tips Workshop

Day 3: VCF files and shell scripts

Tim Sackton, Gregg Thomas

March 29, 2023

Setting up our link to the data files

VCF files

Note that the binary version of VCF files are BCF files, usually with the .bcf file extension. BCF files are more efficient to read than VCF files, and are typically compressed.

bcftools

A note about subsetting

SNP density with bcftools and bedtools

Scripts

End of Day 3

Note that the binary version of VCF files are BCF files, usually with the `.bcf` file extension. BCF files are more efficient to read than VCF files, and are typically compressed.

SNP density with `bcftools` and `bedtools`