Welcome to the second day of the FAS Informatics Bioinformatics Tips & Tricks workshop!
If you’re viewing this file on the website, you are viewing the final, formatted version of the workshop. The workshop itself will take place in the RStudio program and you will edit and execute the code in this file. Please download the raw file here
Today we’re going to continue our tour and explanation of common genomics file formats and their associated tools by talking about interval files, that is, files that indicate regions of a genome (.bed files, .gff files).
We’ll be learning about how to view and manipulate these files using both the native commands present in the Linux command line as well as tools developed specifically for these file formats.
mkdir -p data2
ln -s -f /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/* data2
# ln: The Unix link command, which creates links (shortcuts) in the second provided path that point to the files and folders at the first provided path
# -s: This option tells ln to create a symbolic link rather than a hard link (original files are not changed)
# -f: This option forces ln to create the link
ls -l data2
# Show the details of the files in the new linked directory
## total 88
## lrwxrwxrwx 1 gthomas informatics 109 Mar 23 00:41 Biotips-workshop-2023-Day2-student.Rmd -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/Biotips-workshop-2023-Day2-student.Rmd
## lrwxrwxrwx 1 gthomas informatics 116 Mar 23 00:41 GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fna -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fna
## lrwxrwxrwx 1 gthomas informatics 120 Mar 23 00:41 GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fna.fai -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fna.fai
## lrwxrwxrwx 1 gthomas informatics 130 Mar 23 00:41 GCF_008822105.2_bTaeGut2.pat.W.v2_genomic-subsample.fna.fai -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/GCF_008822105.2_bTaeGut2.pat.W.v2_genomic-subsample.fna.fai
## lrwxrwxrwx 1 gthomas informatics 108 Mar 23 00:41 Macaca_mulatta.Mmul_8.0.1.86.chr.gff3 -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3
## lrwxrwxrwx 1 gthomas informatics 95 Mar 23 00:41 macaque-svs-filtered.bed -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/macaque-svs-filtered.bed
## lrwxrwxrwx 1 gthomas informatics 99 Mar 23 00:41 macaque-svs-filtered.n20.bed -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/macaque-svs-filtered.n20.bed
## lrwxrwxrwx 1 gthomas informatics 79 Mar 23 00:41 pop1.txt -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/pop1.txt
## lrwxrwxrwx 1 gthomas informatics 81 Mar 23 00:41 rheMac8.fa -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/rheMac8.fa
## lrwxrwxrwx 1 gthomas informatics 85 Mar 23 00:41 rheMac8.fa.fai -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/rheMac8.fa.fai
## lrwxrwxrwx 1 gthomas informatics 82 Mar 23 00:41 samples.txt -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/samples.txt
## lrwxrwxrwx 1 gthomas informatics 93 Mar 23 00:41 taeGut-windows-1mb.bed -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/taeGut-windows-1mb.bed
## lrwxrwxrwx 1 gthomas informatics 98 Mar 23 00:41 taeGut-windows-1mb-snps.bed -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/taeGut-windows-1mb-snps.bed
## lrwxrwxrwx 1 gthomas informatics 127 Mar 23 00:41 Taeniopygia_guttata_GCF_008822105.2.filtered.n100000.vcf -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/Taeniopygia_guttata_GCF_008822105.2.filtered.n100000.vcf
## lrwxrwxrwx 1 gthomas informatics 122 Mar 23 00:41 Taeniopygia_guttata_GCF_008822105.2.filtered.n1.vcf -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/Taeniopygia_guttata_GCF_008822105.2.filtered.n1.vcf
## lrwxrwxrwx 1 gthomas informatics 123 Mar 23 00:41 Taeniopygia_guttata_GCF_008822105.2.filtered.n20.vcf -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/Taeniopygia_guttata_GCF_008822105.2.filtered.n20.vcf
## lrwxrwxrwx 1 gthomas informatics 94 Mar 23 00:41 test-header-samples.vcf -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/test-header-samples.vcf
## lrwxrwxrwx 1 gthomas informatics 86 Mar 23 00:41 test-header.vcf -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/test-header.vcf
## lrwxrwxrwx 1 gthomas informatics 82 Mar 23 00:41 test-n1.vcf -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/test-n1.vcf
## lrwxrwxrwx 1 gthomas informatics 83 Mar 23 00:41 test-n20.vcf -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/test-n20.vcf
## lrwxrwxrwx 1 gthomas informatics 79 Mar 23 00:41 test.vcf -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/test.vcf
## lrwxrwxrwx 1 gthomas informatics 82 Mar 23 00:41 windows.bed -> /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2023/day2/windows.bed
Just to begin, I wanted to take a second to re-iterate a few concepts we learned yesterday. In general, the aim of a lot of the commands we run is to take text in a file that is formatted in a specific way and manipulate or process that text. This is central to the Unix philosophy:
formatted text -> command -> processed text
By default, most commands simply print their output to the screen. While this may not seem sensible when processing such large files, it is what makes two other operations possible: piping and redirecting.
Many times, instead of displaying output to the screen we will want to save the output to a file. Natively, Unix has the redirect operator, which is the > character. Note that this is distinct from the literal string ">", which we see as the header character in FASTA files. Rather, this is part of the command being run:
command > output_file.txt
If we are using grep to search for header lines in a FASTA file like we did yesterday, we may see a command like this:
grep '>' file.fa > headers.txt
In this example, '>' and > are doing 2 different things. The string literal '>', being quoted, is the string we are searching for in our file with grep; it is an input argument to our grep command. The second, unquoted > is the Unix redirect operator, which is placed at the end of the command and tells the shell to redirect the output into the provided file.
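To make the difference between the quoted '>' and the redirect > concrete, here is a minimal sketch. It assumes a made-up two-sequence FASTA file (toy.fa) written to a temporary directory, so the file names and sequences are purely illustrative and not part of the workshop data.

```shell
# Work in a temporary directory so nothing in data2 is touched
tmpdir=$(mktemp -d)

# Build a tiny two-sequence FASTA file (made-up sequences, for illustration only)
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$tmpdir/toy.fa"

# Quoted '>' is the search pattern; the unquoted > redirects grep's output to a file
grep '>' "$tmpdir/toy.fa" > "$tmpdir/headers.txt"

# grep printed nothing to the screen; the matching lines are in the file instead
cat "$tmpdir/headers.txt"
```

Only the final cat displays anything, because the output of grep went into headers.txt rather than to the screen.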
Many programs will also have built-in options to redirect output to a file. A common option is -o filename.txt, which would tell a command to write output to that file rather than display it on the screen. We saw this yesterday with samtools, e.g.:
samtools view -b -o output.bam input.sam
which would convert input.sam to BAM format and save it to output.bam. While -o is a common output option, it is not universal and it’s important to read the documentation for each tool you use to see the output options.
Piping uses the | operator. Remember that commands simply take text as input and process it in some way, and the result is output to the screen. If the output of one command is compatible with the input of another, then they can be strung together:
command1 input_file.txt | command2
Here, we’ve specified the input file for command1, but not for command2. Instead, the | operator says: take the output of command1 and use it as the input of command2. This is an extremely powerful way to construct basic pipelines and we did this a bit yesterday.
Pipes and redirects can be combined:
command1 input_file.txt | command2 > output_file.txt
Here, the text in input_file.txt is first processed by command1 and that processed text is piped to command2 as input. command2 does its processing of the text and then this is redirected to output_file.txt, which should now have the text processed by both commands.
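As a small sketch of a pipe combined with a redirect, the block below counts FASTA headers and saves the count to a file. The input file (toy.fa) and its sequences are invented for illustration; grep and wc stand in for command1 and command2.

```shell
# Work in a temporary directory with a made-up FASTA file
tmpdir=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > "$tmpdir/toy.fa"

# grep finds the header lines, wc -l counts them, and > saves the count to a file
grep '>' "$tmpdir/toy.fa" | wc -l > "$tmpdir/n_headers.txt"

# The file now holds text that was processed by both commands
cat "$tmpdir/n_headers.txt"
```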
Note that if the program you run has a -o option to save output and you use it, you can no longer pipe that output to another command:
command1 -o output_file.txt input_file.txt | command2
This will result in output_file.txt containing only the text processed by command1. Since the text from command1 was written to the file, there is nothing to pipe to command2, which may or may not display an error.
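One way to see this for yourself is with sort, which has a -o option for writing to a file. This is only a sketch with a made-up input file (chroms.txt); sort and wc play the roles of command1 and command2.

```shell
# Work in a temporary directory with a made-up list of chromosome names
tmpdir=$(mktemp -d)
printf 'chr2\nchr1\nchr3\n' > "$tmpdir/chroms.txt"

# With -o, sort writes to the named file, so the pipe receives no text
n_piped=$(sort -o "$tmpdir/sorted.txt" "$tmpdir/chroms.txt" | wc -l)
echo "lines received through the pipe with -o: $n_piped"

# Without -o, sort prints to the screen (standard output) and the pipe works
n_piped_no_o=$(sort "$tmpdir/chroms.txt" | wc -l)
echo "lines received through the pipe without -o: $n_piped_no_o"
```

In this case wc does not complain about the empty input; it simply reports zero lines, which is exactly the kind of silent result that is easy to miss.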
Today we’ll talk about bed files. Bed files are used to indicate regions of a genome, with each line in the file representing one region. The bed format is an extremely flexible format; the regions contained within it can represent anything. In its most basic and common form it is also an extremely simple format, consisting of three columns of text separated by a tab character. The first column represents the chromosome or assembly scaffold of the region, while the second indicates the starting coordinate and the third indicates the ending coordinate.
Bed files might have the .bed extension, and while it is best practice to use a file extension that properly describes the format of a file, it is not required. Any 3-column tab-delimited file that has the columns we described is a bed file.
We will talk about several different file types today that are used to reference locations in the genome. Unfortunately for all of us, for various reasons different file types use different coordinate styles. Bed files, which we will talk about first, use 0-based coordinates and do not include the end base in the interval (technically, this is called a right-open interval). So in a bed file, an interval that includes the first 100 bases of a chromosome would have start=0, end=100.
Gff files in contrast use 1-based coordinates and do include both the start and the end base in the interval (technically, this is called a closed interval). So in a gff file, an interval that includes the first 100 bases of a chromosome would have start=1, end=100.
It is worth noting that while the 1-based closed format of GFF files is more intuitive to read, it does suffer some issues. In particular, it is impossible to unambiguously encode a 0-length feature in a GFF file.
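The two coordinate styles can be compared with a short awk sketch. It assumes a made-up one-line bed file (first100.bed) holding the interval from the example above; the conversion shown is simply adding 1 to the start while keeping the end.

```shell
# A bed interval covering the first 100 bases of chr1: start=0, end=100
tmpdir=$(mktemp -d)
printf 'chr1\t0\t100\n' > "$tmpdir/first100.bed"

# 0-based, right-open -> 1-based, closed: add 1 to the start, keep the end
one_based=$(awk '{print $1, $2 + 1, $3}' "$tmpdir/first100.bed")
echo "$one_based"

# Length is end - start in bed coordinates (it would be end - start + 1 in GFF coordinates)
bed_len=$(awk '{print $3 - $2}' "$tmpdir/first100.bed")
echo "$bed_len"
```

Both styles describe the same 100 bases; only the way the start is written and the length formula differ.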
Today we’ll be working with a bed file that contains calls of structural variants (e.g. large deletions and duplications of segments of the genome; abbreviated SVs) from a small population of rhesus macaques (if you attended the R workshop earlier this month you might already be familiar with this dataset). Rhesus macaques are small, Old-World monkeys that are widespread across southern and eastern Asia and are a common model organism for the study of human disease and primate evolution. We sequenced these genomes to study the evolution of structural variation over different timescales.
First thing we should do is look at our data. We can do this a couple of ways here. With the RStudio setup with the VDI, we can just use our file browser on the right to navigate to the path of the file and open it in the text editor (this panel).
However, if we want to see things in a more Unix way, we can use a command to directly display the contents of the file in our Terminal.
Run the following command in the Terminal below to view the bed file containing macaque SVs.
Note that whenever you see the > character followed by green text, this is an exercise or action to be done by you!
less -S data2/macaque-svs-filtered.bed
less is a file viewing program that lets us look at parts of a file without loading the whole thing into memory. You can scroll through the file with <up arrow> and <down arrow> to move line-by-line, or with <spacebar> and b to move by page (one screen of text). The -S flag simply means do not wrap the lines to fit on the screen, so we can also scroll left and right with <left arrow> and <right arrow>.
Press q to quit and return to the Terminal interface.
So what do we see? We see, as described, three columns of text indicating the chromosome, start coordinate, and end coordinate of each SV (row). We also see a fourth column with a bunch of extra information. The fourth column in a bed file is an optional column meant to provide each region with a unique ID. In this case, the unique ID is just a long string of separate pieces of information delimited by a colon (:) character. In a way, I’ve made this column a sort of catch-all for other information not included in the base bed format (e.g. SV length, SV type), which is a common strategy in genomic file formats. Most of this information we can ignore, but I will point out that the TYPE of each SV is encoded as a string, with deletions being <DEL> and duplications being <DUP>. We may use this information later.
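Because the ID column is itself delimited, it can be taken apart with the same tools used on the whole file. The sketch below uses cut twice on two records copied from the macaque SV file into a made-up file (toy.bed). It assumes, based on those records, that the SV type is the third colon-delimited piece of the ID.

```shell
# Two records copied from the macaque SV file into a small tab-delimited test file
tmpdir=$(mktemp -d)
printf 'chr1\t89943\t90471\tchr1:89943:<DUP>:528:1907.19\n' > "$tmpdir/toy.bed"
printf 'chr1\t130740\t131675\tchr1:130740:<DEL>:935:285.63\n' >> "$tmpdir/toy.bed"

# First cut keeps tab-delimited column 4; second cut keeps colon-delimited piece 3
sv_types=$(cut -f 4 "$tmpdir/toy.bed" | cut -d ':' -f 3)
echo "$sv_types"
```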
In addition to this optional fourth column for an ID, bed files have several other common pieces of information that could be encoded in extra columns. Most of the time these extra columns are ignored by the tools that process bed files, but sometimes specific columns are used.
For more information on bed files and these extra columns, visit the following links:
So imagine we get this bed file from our collaborator who has called these SVs, and the first thing we should do is get a general idea about the variants called. What can we do from the command line?
The most basic thing we’ll want to know is how many structural variants have been called. Recalling that each line in a bed file represents one region, which in this case means one structural variant, we can simply count the number of lines in the file with the wc command.
Run the command below to count the number of SVs in the bed file. How many SVs are there?
wc -l data2/macaque-svs-filtered.bed
# wc: the Unix word count command
# -l: tells wc to only return the line count
## 3646 data2/macaque-svs-filtered.bed
Cool! We also may want to know how many of these SVs are deletions and how many are duplications. We can figure that out with grep.
Exercise: In the code block below, use grep to count the number of deletions and duplications separately. Remember that SV type is encoded in the fourth column of our bed file.
## Count the number of deletions
# data2/macaque-svs-filtered.bed
grep -c "<DEL>" data2/macaque-svs-filtered.bed
## Count the number of duplications
# data2/macaque-svs-filtered.bed
grep -c "<DUP>" data2/macaque-svs-filtered.bed
## 3214
## 432
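The two grep -c calls can also be replaced by a single pipeline that tallies every SV type it finds, which helps if a file contains types you did not expect. This is a sketch on three records copied from the SV file into a made-up file (toy.bed); it assumes the type is the third colon-delimited piece of column 4.

```shell
# Three records copied from the macaque SV file into a small tab-delimited test file
tmpdir=$(mktemp -d)
printf 'chr1\t89943\t90471\tchr1:89943:<DUP>:528:1907.19\n' > "$tmpdir/toy.bed"
printf 'chr1\t130740\t131675\tchr1:130740:<DEL>:935:285.63\n' >> "$tmpdir/toy.bed"
printf 'chr1\t525401\t525806\tchr1:525401:<DEL>:405:2986.21\n' >> "$tmpdir/toy.bed"

# Pull out the SV type, sort so identical types are adjacent, then count each one
type_counts=$(cut -f 4 "$tmpdir/toy.bed" | cut -d ':' -f 3 | sort | uniq -c)
echo "$type_counts"
```

uniq -c only merges adjacent identical lines, which is why the sort step comes first.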
awk basics
So we have a lot more deletions than duplications. If we didn’t have reason to believe that deletions are more common than duplications (which we think they are) we may want to ask our collaborator to re-check their calls. But we can do some more checking ourselves too. Maybe, on average, the deletions being called are smaller events than the duplications so it would be expected that there are more of them. To check whether that is the case, we could get the average length of deletions and duplications in our bed file. The first step of that is to get the length of each SV.
Yesterday, we started to learn about awk, which is a scripting language that is interpreted in the Unix shell. Basically what this means is that we can use awk much the same way as if we were programming in a text editor. awk’s appeal for us is that it is set up to automatically read through and process text files, line by line, which is a common task in bioinformatics. We could achieve the same functionality by writing a Python or R script, but because those are not integrated into the shell we would waste time writing code to read and write files. awk does that automatically, so for simple file operations it is an extremely useful tool for bioinformaticians to have.
Yesterday you learned the basic syntax of an awk command:
awk '{ action; other action }' input_file.txt
This means that awk reads input_file.txt line by line, and for each line performs both action and other action. A semi-colon (;) is used to delimit separate actions.
The simplest awk program we could write, then, would be something like this.
Run the code block below. What happens?
awk '{}' data2/macaque-svs-filtered.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
echo "done"
# echo: A Unix command that simply prints the provided input to the screen
## done
Here, awk has read through our bed file, but nothing is displayed to the screen because we didn’t code any actions for it to perform.
The most basic action we can code for an awk program is the print command.
Run the code block below:
awk '{print}' data2/macaque-svs-filtered.n20.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
## chr1 89943 90471 chr1:89943:<DUP>:528:1907.19
## chr1 130740 131675 chr1:130740:<DEL>:935:285.63
## chr1 218574 219534 chr1:218574:<DUP>:960:5699.01
## chr1 219608 220078 chr1:219608:<DUP>:470:2074.69
## chr1 519434 541582 chr1:519434:<DUP>:22148:1673.64
## chr1 519473 542033 chr1:519473:<DUP>:22560:2560.16
## chr1 520173 541800 chr1:520173:<DEL>:21627:2955.11
## chr1 525401 525806 chr1:525401:<DEL>:405:2986.21
## chr1 541132 590572 chr1:541132:<DEL>:49440:316.41
## chr1 552968 582234 chr1:552968:<DUP>:29266:189.32
## chr1 766381 766933 chr1:766381:<DEL>:552:5099.0
## chr1 1117696 1122022 chr1:1117696:<DEL>:4326:201.55
## chr1 1151866 1154542 chr1:1151866:<DEL>:2676:11284.32
## chr1 1166390 1167586 chr1:1166390:<DEL>:1196:15253.03
## chr1 1408621 1409766 chr1:1408621:<DEL>:1145:1112.53
## chr1 1409564 1410074 chr1:1409564:<DEL>:510:13091.76
## chr1 1564979 1565374 chr1:1564979:<DEL>:395:9231.44
## chr1 1602888 1604046 chr1:1602888:<DEL>:1158:1586.52
## chr1 1774887 1775498 chr1:1774887:<DEL>:611:1933.48
## chr1 1831576 1831983 chr1:1831576:<DEL>:407:3537.19
This time, now that we’ve given the instruction for awk to print, we see each line displayed on the screen. This is a good demonstration of awk, but doesn’t really do anything we couldn’t do before. We can view the contents of files with cat, less, head, tail, etc. awk, however, also splits each record (line) into fields (columns) based on some character delimiter (white space, which includes tabs, by default). This naturally turns our text file into a data table to manipulate right in the shell.
In awk, the fields or columns are identified by number and a special character, the dollar sign $, to indicate we want to access that column. So, for instance, if I wanted to access only the third column from a given record, I could do so with $3.
Run the code block below to use awk to print only the third column from the bed file with macaque SVs. We use a 20-line subset of the file (macaque-svs-filtered.n20.bed) so as not to overflow the text editor with output:
awk '{print $3}' data2/macaque-svs-filtered.n20.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
## 90471
## 131675
## 219534
## 220078
## 541582
## 542033
## 541800
## 525806
## 590572
## 582234
## 766933
## 1122022
## 1154542
## 1167586
## 1409766
## 1410074
## 1565374
## 1604046
## 1775498
## 1831983
Another functionality of awk, since it is a scripting language, is that there are basic operations it can perform on the input data. For instance, given two input columns that are numeric, awk can add, subtract, multiply, and divide them with the +, -, *, and / operators.
Exercise: In the code block below, use awk to print the length of each SV:
## Use awk to print the length of each SV
# data2/macaque-svs-filtered.n20.bed
awk '{print $3 - $2}' data2/macaque-svs-filtered.n20.bed
## 528
## 935
## 960
## 470
## 22148
## 22560
## 21627
## 405
## 49440
## 29266
## 552
## 4326
## 2676
## 1196
## 1145
## 510
## 395
## 1158
## 611
## 407
As a programmer (we are coding now!), one of the most important things I can tell you about programming is to always remember what data types you are operating on!
We won’t get into it too much here, but briefly, you should know about data types. Data types are the way different pieces of information are encoded. 3 is an integer. "hello world" is a string of characters. "3" is a character. This is important to remember because different functions and operators may perform different actions depending on the data type input to them, or they might not work at all with the wrong data type. For example, with algebraic operators like addition (+), 3 + 3 is a perfectly valid instruction to write. But what does 3 + "hello world" mean? Different programming languages may behave differently in this situation, some by erroring out and some by doing something you may not expect without leaving any trace that something is wrong. And different programming languages generally have different data types.
The command above worked because both column 3 and column 2 contain only integers, so awk correctly subtracts their values when the - operator is provided between them. The other columns in our bed file, however, contain character strings.
Run the code block below to try and perform an algebraic operation (-) on a column made up of integers and a column made of strings. What happens? What did you expect to happen?
awk '{print $3 - $1}' data2/macaque-svs-filtered.n20.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
## 90471
## 131675
## 219534
## 220078
## 541582
## 542033
## 541800
## 525806
## 590572
## 582234
## 766933
## 1122022
## 1154542
## 1167586
## 1409766
## 1410074
## 1565374
## 1604046
## 1775498
## 1831983
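This is only printing out the third column unchanged. That is because awk silently treats a string that does not start with a number (like chr1) as 0 when it is used in arithmetic, so each line is computing $3 - 0. awk is pretty good about not throwing errors, so if you didn’t catch this, either because of a typo or because you thought column 1 also contained integers, you may move forward in your analysis and get some strange results you’d struggle to explain later.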
All of which is to say (and to re-iterate) that you should always remember what data types you are operating on!
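One way to protect yourself from this kind of silent result is to check that a field looks numeric before doing arithmetic on it. The sketch below is only one possible guard, run on a made-up one-line file (toy.bed). The regular expression /^[0-9]+$/ is an assumption that the coordinates are non-negative whole numbers, and the warning message is invented for illustration.

```shell
# One record with the first three bed columns, in a made-up test file
tmpdir=$(mktemp -d)
printf 'chr1\t89943\t90471\n' > "$tmpdir/toy.bed"

# "chr1" does not start with a number, so awk treats it as 0 in arithmetic
silent=$(awk '{print $3 - $1}' "$tmpdir/toy.bed")
echo "$silent"

# A guard: only subtract when both fields consist entirely of digits
guarded=$(awk '{
    if ($3 ~ /^[0-9]+$/ && $1 ~ /^[0-9]+$/) {
        print $3 - $1
    } else {
        print "non-numeric field on line " NR
    }
}' "$tmpdir/toy.bed")
echo "$guarded"
```

The first command quietly prints the third column, while the guarded version reports the problem and which line it occurred on.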
awk
In programming, variables are names given to pieces of information, allowing the information to be used later on in the program. The column numbers used by awk with the $ notation are variables that are updated as every record is read.
awk has several default variables that are initialized when the command is run:
FS: field separator (default: white space)
OFS: output field separator, i.e. what character separates fields when printing
RS: record separator, i.e. what character records are split on (default: new line)
ORS: output record separator
NR: a running count of the number of records that is updated after each record. At the end of the program NR will equal the total number of records (lines in the file by default)
Most of these pertain to how awk separates records and fields. Like any other variables in a program, their values can be accessed and overwritten. For instance, we can change the field separator (FS) to be something other than white space (e.g. a tab character).
Run the code block below to change the FS variable to colon (:) and print out the first 3 fields. How is this different from the default behavior?
awk 'BEGIN{FS=":"}{print $1,$2,$3}' data2/macaque-svs-filtered.n20.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
## chr1 89943 90471 chr1 89943 <DUP>
## chr1 130740 131675 chr1 130740 <DEL>
## chr1 218574 219534 chr1 218574 <DUP>
## chr1 219608 220078 chr1 219608 <DUP>
## chr1 519434 541582 chr1 519434 <DUP>
## chr1 519473 542033 chr1 519473 <DUP>
## chr1 520173 541800 chr1 520173 <DEL>
## chr1 525401 525806 chr1 525401 <DEL>
## chr1 541132 590572 chr1 541132 <DEL>
## chr1 552968 582234 chr1 552968 <DUP>
## chr1 766381 766933 chr1 766381 <DEL>
## chr1 1117696 1122022 chr1 1117696 <DEL>
## chr1 1151866 1154542 chr1 1151866 <DEL>
## chr1 1166390 1167586 chr1 1166390 <DEL>
## chr1 1408621 1409766 chr1 1408621 <DEL>
## chr1 1409564 1410074 chr1 1409564 <DEL>
## chr1 1564979 1565374 chr1 1564979 <DEL>
## chr1 1602888 1604046 chr1 1602888 <DEL>
## chr1 1774887 1775498 chr1 1774887 <DEL>
## chr1 1831576 1831983 chr1 1831576 <DEL>
Now, the first field includes everything in the line up to the first colon in the last tab separated column. This is most of the line.
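FS has a counterpart for output, OFS. As a sketch of how the two work together, the block below splits on tabs only and also writes tab-separated output, adding the SV length as a new column. It uses one record copied from the SV file into a made-up file (toy.bed).

```shell
# One record copied from the macaque SV file into a small tab-delimited test file
tmpdir=$(mktemp -d)
printf 'chr1\t89943\t90471\tchr1:89943:<DUP>:528:1907.19\n' > "$tmpdir/toy.bed"

# Split input fields on tabs only, and join printed fields with tabs as well
with_len=$(awk 'BEGIN{FS="\t"; OFS="\t"} {print $1, $2, $3, $3 - $2}' "$tmpdir/toy.bed")
echo "$with_len"
```

Without setting OFS, the commas in the print statement would join the fields with single spaces, and the output would no longer be tab-delimited like the input.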
NR is also important. Rather than dealing with how fields and records are read, it simply counts the number of records as they are read.
Run the code block below to see how the value of NR changes for each record read:
awk '{print NR}' data2/macaque-svs-filtered.n20.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14
## 15
## 16
## 17
## 18
## 19
## 20
awk patterns and custom variables
Yesterday you learned a bit about regular expressions and how to use them with grep. Well, in actuality, awk also uses regular expressions to decide which records to process. By default, when no pattern is provided, every line in the file matches, so the action is applied to every line. However, you can use awk similarly to grep to display and process only lines that match some pattern.
Run the code block below to use awk to display only lines that represent duplications:
awk ' /<DUP>/ {print}' data2/macaque-svs-filtered.n20.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
## chr1 89943 90471 chr1:89943:<DUP>:528:1907.19
## chr1 218574 219534 chr1:218574:<DUP>:960:5699.01
## chr1 219608 220078 chr1:219608:<DUP>:470:2074.69
## chr1 519434 541582 chr1:519434:<DUP>:22148:1673.64
## chr1 519473 542033 chr1:519473:<DUP>:22560:2560.16
## chr1 552968 582234 chr1:552968:<DUP>:29266:189.32
This should be equivalent to the following:
grep "<DUP>" data2/macaque-svs-filtered.n20.bed
# grep: The Unix string search command
# "<DUP>": The string to search for in the provided file
## chr1 89943 90471 chr1:89943:<DUP>:528:1907.19
## chr1 218574 219534 chr1:218574:<DUP>:960:5699.01
## chr1 219608 220078 chr1:219608:<DUP>:470:2074.69
## chr1 519434 541582 chr1:519434:<DUP>:22148:1673.64
## chr1 519473 542033 chr1:519473:<DUP>:22560:2560.16
## chr1 552968 582234 chr1:552968:<DUP>:29266:189.32
However, with awk, we can also process the matching lines within the same command.
Exercise: Use a single awk command to print the length of every duplication in the macaque SV bed file.
## Use awk to print the length of every duplication
# data2/macaque-svs-filtered.n20.bed
awk '/<DUP>/ {print $3 - $2}' data2/macaque-svs-filtered.n20.bed
## 528
## 960
## 470
## 22148
## 22560
## 29266
We can also print lines based on the value in a certain column, using the same $ notation as before to refer to the column. For instance, we can print only SVs on the X chromosome.
Run the following block to print only lines of the bed file where the first column is “chrX”:
awk ' $1=="chrX"{print}' data2/macaque-svs-filtered.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
## chrX 1988 2464 chrX:1988:<DEL>:476:5630.03
## chrX 3478 4124 chrX:3478:<DEL>:646:2476.53
## chrX 7281 14220 chrX:7281:<DEL>:6939:307.26
## chrX 62980 63554 chrX:62980:<DEL>:574:1578.01
## chrX 64524 64940 chrX:64524:<DUP>:416:5057.12
## chrX 107557 108311 chrX:107557:<DEL>:754:2465.47
## chrX 165868 166394 chrX:165868:<DUP>:526:402.59
## chrX 207382 208215 chrX:207382:<DUP>:833:2285.44
## chrX 278868 279501 chrX:278868:<DEL>:633:7328.7
## chrX 302402 302996 chrX:302402:<DEL>:594:2752.0
## chrX 377764 378397 chrX:377764:<DEL>:633:317.17
## chrX 411443 411860 chrX:411443:<DEL>:417:2583.56
## chrX 420049 420650 chrX:420049:<DUP>:601:8107.15
## chrX 426741 427265 chrX:426741:<DEL>:524:4874.08
## chrX 427022 427595 chrX:427022:<DUP>:573:684.46
## chrX 489174 491968 chrX:489174:<DEL>:2794:2691.96
## chrX 583257 584278 chrX:583257:<DUP>:1021:3134.13
## chrX 586115 586579 chrX:586115:<DUP>:464:7513.0
## chrX 608503 611838 chrX:608503:<DUP>:3335:2762.6
## chrX 609545 612675 chrX:609545:<DEL>:3130:290.41
## chrX 610696 611235 chrX:610696:<DUP>:539:5966.77
## chrX 610406 610905 chrX:610406:<DEL>:499:12606.37
## chrX 611827 612606 chrX:611827:<DEL>:779:384.62
## chrX 631553 632399 chrX:631553:<DUP>:846:7035.83
## chrX 695765 696374 chrX:695765:<DEL>:609:3692.3
## chrX 711314 712014 chrX:711314:<DUP>:700:1999.33
## chrX 711189 712411 chrX:711189:<DEL>:1222:675.4
## chrX 739531 740094 chrX:739531:<DUP>:563:3943.29
## chrX 739910 740599 chrX:739910:<DEL>:689:289.41
## chrX 787138 788503 chrX:787138:<DUP>:1365:1836.45
## chrX 927279 927667 chrX:927279:<DUP>:388:10797.04
## chrX 1149286 1150206 chrX:1149286:<DUP>:920:348.36
## chrX 1149490 1150104 chrX:1149490:<DEL>:614:1282.46
## chrX 1177417 1178193 chrX:1177417:<DUP>:776:2832.53
## chrX 1280613 1280869 chrX:1280613:<DUP>:256:9117.2
## chrX 1300711 1301484 chrX:1300711:<DEL>:773:1067.94
## chrX 1427624 1428395 chrX:1427624:<DEL>:771:3446.11
## chrX 1700718 1701519 chrX:1700718:<DEL>:801:2508.96
## chrX 2670310 2670696 chrX:2670310:<DUP>:386:1656.48
## chrX 2714010 2716794 chrX:2714010:<DEL>:2784:28167.58
## chrX 2894333 2904887 chrX:2894333:<DEL>:10554:498.77
## chrX 3515446 3515791 chrX:3515446:<DEL>:345:5134.69
## chrX 4456589 4457158 chrX:4456589:<DEL>:569:1875.91
## chrX 6881120 6881959 chrX:6881120:<DEL>:839:4596.71
## chrX 8451528 8452433 chrX:8451528:<DEL>:905:550.93
## chrX 8454996 8475204 chrX:8454996:<DEL>:20208:487.62
## chrX 8472209 8473006 chrX:8472209:<DUP>:797:1077.54
## chrX 8475742 8479634 chrX:8475742:<DEL>:3892:4230.64
## chrX 8477637 8478720 chrX:8477637:<DEL>:1083:2629.3
## chrX 9886789 9887256 chrX:9886789:<DEL>:467:7498.94
## chrX 12506208 12514476 chrX:12506208:<DEL>:8268:1477.12
## chrX 12526487 12549496 chrX:12526487:<DEL>:23009:2340.64
## chrX 15866445 15867408 chrX:15866445:<DEL>:963:12513.77
## chrX 21462342 21464887 chrX:21462342:<DEL>:2545:773.76
## chrX 25341791 25343157 chrX:25341791:<DEL>:1366:23636.98
## chrX 28166208 28168644 chrX:28166208:<DEL>:2436:3149.73
## chrX 28814796 28817815 chrX:28814796:<DEL>:3019:17095.27
## chrX 30019364 30019954 chrX:30019364:<DEL>:590:360.75
## chrX 34457321 34457797 chrX:34457321:<DEL>:476:4343.45
## chrX 41105284 41105990 chrX:41105284:<DEL>:706:2020.33
## chrX 45125902 45126511 chrX:45125902:<DEL>:609:2411.45
## chrX 47425736 47428746 chrX:47425736:<DEL>:3010:3554.64
## chrX 47911677 47914869 chrX:47911677:<DEL>:3192:4711.68
## chrX 49566153 49574194 chrX:49566153:<DUP>:8041:4698.6
## chrX 49566444 49612048 chrX:49566444:<DEL>:45604:2093.93
## chrX 54566354 54588336 chrX:54566354:<DEL>:21982:2478.71
## chrX 55270771 55272939 chrX:55270771:<DUP>:2168:676.53
## chrX 61020469 61021120 chrX:61020469:<DEL>:651:1420.94
## chrX 73368438 73374565 chrX:73368438:<DEL>:6127:30670.27
## chrX 80022182 80022832 chrX:80022182:<DEL>:650:3318.89
## chrX 81382396 81382959 chrX:81382396:<DEL>:563:3110.35
## chrX 81774454 81775074 chrX:81774454:<DEL>:620:7519.69
## chrX 86223974 86225085 chrX:86223974:<DEL>:1111:1993.16
## chrX 86225397 86226678 chrX:86225397:<DEL>:1281:2184.25
## chrX 86600287 86600725 chrX:86600287:<DEL>:438:6027.45
## chrX 86613955 86615407 chrX:86613955:<DEL>:1452:1737.19
## chrX 88765180 88766033 chrX:88765180:<DEL>:853:465.27
## chrX 91174961 91176888 chrX:91174961:<DEL>:1927:4205.93
## chrX 92158227 92159027 chrX:92158227:<DEL>:800:15778.25
## chrX 92753089 92753972 chrX:92753089:<DEL>:883:3724.23
## chrX 97476307 97476914 chrX:97476307:<DEL>:607:745.23
## chrX 98845605 98847132 chrX:98845605:<DEL>:1527:1678.29
## chrX 103957969 103958451 chrX:103957969:<DEL>:482:11427.91
## chrX 106372466 106373502 chrX:106372466:<DEL>:1036:8307.51
## chrX 108518086 108520181 chrX:108518086:<DEL>:2095:3719.46
## chrX 111653104 111653670 chrX:111653104:<DEL>:566:158.36
## chrX 123944663 123946419 chrX:123944663:<DEL>:1756:2666.16
## chrX 124454196 124456326 chrX:124454196:<DEL>:2130:31891.72
## chrX 129169887 129170435 chrX:129169887:<DEL>:548:11203.65
## chrX 129328746 129330969 chrX:129328746:<DEL>:2223:5690.06
## chrX 130990616 130991273 chrX:130990616:<DEL>:657:22216.66
## chrX 135378002 135378668 chrX:135378002:<DEL>:666:3953.37
## chrX 135679612 135715923 chrX:135679612:<DEL>:36311:1521.72
## chrX 135682628 135718741 chrX:135682628:<DEL>:36113:4892.62
## chrX 137821409 137821980 chrX:137821409:<DEL>:571:17208.79
## chrX 145551156 145552312 chrX:145551156:<DUP>:1156:3422.51
## chrX 146387029 146422365 chrX:146387029:<DEL>:35336:4990.42
BEGIN and END
awk has two special patterns, BEGIN and END. These patterns are followed by instructions that are to be performed either before awk reads the first record (BEGIN) or after it has read the last record (END).
Recall that, by default, awk performs the specified actions on every record (line) in the input file. These two keywords allow us to perform setup and summary tasks before and after the records are read and processed.
Run the code block below to use awk to only print the total number of records (without using NR):
awk ' BEGIN{sum=0} {sum++} END{print sum}' data2/macaque-svs-filtered.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
## 3646
To break this down, we told awk that we want it to read every record in the bed file, but BEFORE doing that set the value of a new variable called sum to 0. Then, as every record is read, increment sum by 1 with the ++ operator. Finally, after all records have been read, print out the value of sum, which should now be the total number of lines in the file. Remember that awk already has a variable that does this, NR.
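As a quick check of that last point, here is a minimal sketch that counts records both ways. The file demo-count.bed and its three records are made up for this demonstration and are not part of the workshop data.

```shell
# Hypothetical three-record bed file, created only for this demonstration
printf 'chr1\t10\t20\nchr1\t30\t50\nchr2\t5\t15\n' > demo-count.bed

# Count records with a user-defined counter, as in the command above
count_sum=$(awk 'BEGIN{sum=0} {sum++} END{print sum}' demo-count.bed)

# Count records with awk's built-in NR variable instead
count_nr=$(awk 'END{print NR}' demo-count.bed)

# Both approaches should report the same number of records
echo "$count_sum $count_nr"
```

Both commands report the same count, so in practice NR is the shorter choice whenever every record in the input should be counted.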
In addition to the ++ operator, which adds 1 to a variable, it is useful to know about the += operator, which adds whatever is on the right side of the operator to the variable on the left side. So we could have written the code above as {sum += 1}. The ++ operator is a shortcut when we just need to increment a variable by 1, but the += operator allows us to increment a variable by more than 1, or even by the value of another variable (e.g., {sum += $1} would keep a running total of the first column of a file).
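To see += in action, the sketch below keeps a running total of a single column. The three input numbers are invented for illustration and are fed to awk through a pipe rather than read from a file.

```shell
# Hypothetical one-column input: three numbers, one per line.
# n counts the records with += 1 (equivalent to n++),
# while sum accumulates the value found in the first column.
result=$(printf '3\n5\n7\n' | awk '{n += 1; sum += $1} END {print n, sum}')

# Prints the record count followed by the column total
echo "$result"
```

Here n ends up as the number of records and sum as the total of the column, which are the two ingredients needed for an average.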
This command introduces another key concept in awk programs: user-defined variables. Here, sum is not one of awk’s built-in variables; we create and manipulate it on our own. We could easily have called it something else (e.g. random_data=0), but sum is a good descriptive name for its purpose. record_count would also be a good name for this.
awk
Great! Now we’ve got some new awk knowledge. Let’s try to put it all together to calculate the average length of all SVs in our bed file.
Exercise: In the code block below, write a single awk command that calculates the average length of the SVs in the bed file. This command will need to:
1. Calculate the length of each SV
2. Add the length to a running total
3. After reading all records, divide the final total length of all SVs by the total number of SVs in the file (hint: remember NR!)
## Write awk command to calculate average length of SVs
# data2/macaque-svs-filtered.bed
awk '{sum += $3 - $2} END {if (NR > 0) print sum / NR }' data2/macaque-svs-filtered.bed
## Write awk command to calculate average length of SVs
## 3615.02
Ok, so we now have the average length of ALL SVs. What about deletions and duplications separately?
Exercise: In the code block below, calculate the average length of duplications and deletions separately (2 commands). This can be done in several ways using the tools we’ve taught (i.e. grep, awk, pipes (|)) or just with a single awk command per SV type. Are deletions or duplications longer on average?
## Calculate the average length of deletions
# data2/macaque-svs-filtered.bed
grep "<DEL>" data2/macaque-svs-filtered.bed | awk '{sum += $3 - $2} END {if (NR > 0) print sum / NR }'
## Calculate the average length of deletions
## Calculate the average length of duplications
# data2/macaque-svs-filtered.bed
grep "<DUP>" data2/macaque-svs-filtered.bed | awk '{sum += $3 - $2} END {if (NR > 0) print sum / NR }'
## Calculate the average length of duplications
## 3161.33
## 6990.42
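The solutions above use grep and a pipe. The single-awk alternative mentioned in the exercise could look like the sketch below, which runs on a small made-up file (demo-svs.bed, not part of the workshop data) laid out like the macaque bed file. Because awk now reads every record but only acts on the matching ones, NR would count too many lines, so the sketch keeps its own counter n.

```shell
# Hypothetical four-record bed file in the same layout as macaque-svs-filtered.bed
printf 'chrX\t100\t200\tchrX:100:<DEL>:100:50.0\n' > demo-svs.bed
printf 'chrX\t300\t600\tchrX:300:<DEL>:300:60.0\n' >> demo-svs.bed
printf 'chrX\t1000\t1500\tchrX:1000:<DUP>:500:70.0\n' >> demo-svs.bed
printf 'chrX\t2000\t2700\tchrX:2000:<DUP>:700:80.0\n' >> demo-svs.bed

# Only records whose 4th column contains <DEL> update the running total and counter
del_mean=$(awk '$4 ~ /<DEL>/ {sum += $3 - $2; n++} END {if (n > 0) print sum / n}' demo-svs.bed)

# Same idea for duplications
dup_mean=$(awk '$4 ~ /<DUP>/ {sum += $3 - $2; n++} END {if (n > 0) print sum / n}' demo-svs.bed)

echo "DEL: $del_mean  DUP: $dup_mean"
```

The pattern in front of the braces plays the role that grep plays in the piped version: the action only runs for records that match it.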
We can do a lot of simple processing of bed files (and genomic files in general) with standard command-line tools like grep, awk, wc, etc. However, there are a lot of tasks that require software (commands) built specifically for these types of files. For bed files (and other interval files), bedtools is a great tool. It has a wide range of functions for working with these files, and is particularly powerful when you are interested in the overlap between regions in two files.
We’ll only have time to go over a small number of bedtools functions in this workshop, so be sure to check out the bedtools website for more in-depth documentation on all its functions:
Given a set of genomic regions in a bed file, one common task you may want to accomplish is to get the sequences contained within those intervals from the genome. bedtools can do this with the bedtools getfasta command. You can type bedtools getfasta -h in the Terminal below to see some documentation about this command. To do this, you will need:
1. A bed file of the intervals to extract
2. The genome sequence in FASTA format
3. An index (.fai file) of the input genome, though bedtools will create this automatically if it isn’t found.
We’ve provided the genome file for you. So let’s get the sequences of our macaque SVs in FASTA format.
Run the code block below to extract the sequences of the macaque SVs in the bed file in FASTA format:
bedtools getfasta -fi data2/rheMac8.fa -bed data2/macaque-svs-filtered.bed -fo macaque-svs-filtered.fa
# bedtools: A suite of programs to process bed files
# getfasta: The sub-program of bedtools to execute
# -fi: The genome fasta file as input
# -bed: The bed file as input
# -fo: The desired output fasta file
head macaque-svs-filtered.fa
# Display the first few lines of the new file with head
## >chr1:89943-90471
## TGGGTTGATGGTTTCTGGAGTTCAGGGTTGATTGTTTCTGGAGTTCTGGGTTGATTGTTTCTGGAGTTCAGGGTTGATTGGTCTGGAGTTCTGGGTTGATTGTTTTCTGGAGTTGGGGGTCGATTGTTTCTGGAGTTCGGGGTTGATTGTTTCTGGAGTTCGGGGTTGATTGTTTCTGGAGTTCGGGGTTGATTATTTCTGGAGTTCAGGGTTGATTGGTCTGGAGTTCTGGGTTGATTGTTTTCTGGAGTTCAGGGTTGATTGTTCTGGAGTTCTGGGTTGATTGTTTTCTGGAGTTCTGGGTTGATTGTTTCTGGAGTTCGGGGTTGATTGTTTCTGGAGTTTGTGGTTGATTGTTTCTGGAGTTCTGGGTTGATTGTTTCTGGAGTTCAGGGTTGATTGTTTCTGGAGTTCTGGGTTGATGGTTTCTGGAGTTCTGGGTTGATTGTTTTCTGGAGTTCAGGGTTGATTGTTTCTGGAGTTCTGGGTTGATTGTTTCTGGAGTTGGGGGTCGATTGTTTCTGGAGT
## >chr1:130740-131675
## CAGGCAGGTGGGGGGCTATCAGTGTCTATGCAGGCAGGTGGGGGTTCATCAGTGTCTATACAGGCAGGTGGGGGGACATTAGTGTGTATGCAGGCAGGTGAGGGGACATCTAGTGTCTATGCAGGCAGGTGGGGGGACATCCAGTGTCTATGCAGGCAGGTGGGCGGTCATCAGTGTGTATGCAGGCAGGTGGGGGGACACCCAGTGTTTATACAGGCAGGTGGGGGGAGGTCATCAGTGTCTATGCAGGCAGGTGGGGGGACATCCAGTGTCTATGCAGGCAGGTGGGGGGATGCCCAGTGTCTATGCAGGCAGGTGGGGGGATGCCCAGTGTCTATGCAGGCAGGTGGGGGGACACCCAGTGTTTATGCAGGCAGGTGGGGGGAGGTCATCAGTGTCTATGCAGGCAGGTGGGGGGACATCCAGTGTCTATGCAGGCAGGTGAGGGGACATCTAGTGTCTATGCAGGCAGGTGGGGGGTCATCAGTGTGTATGCAGGCAGGTGGGGGGACGTCAGTGTCTATGCAGGCAGGTGGGGGGTCATCCAGTATCCAGTATCTATGCAGGCAGGTGGGGGGGTCATCAAGTGTCTATGCAGGCAGGTGGGGGGACGTCAGTGTCTATGCAGGCAGGTGGGGGATGTCAGTGTCTATGCAGGCAGGTGGGGGGGTCCCCAGTGTCTATGCAGGGGGGTCATCAGTGTCTATGCAGGCAGATGGGGGGACATCAGTGTCTATGCAGGCAGGTGGGGGGACATCCAGTATCTATGCAGGCAGATTGGGGGGATGCCCAGTGTTTATGCAGGCAGATTGGGGGGACACCCAGTGTCTATGCAGGCAGGTGGGGGGCTATCAGTGTCTATGCAGGCAGGTGGGGGGGTCATCAGTGTCTATGCAGGCAGGTGGGGGGACATTAGTGTCTATGCAGGCAGGTGA
## >chr1:218574-219534
## TCTGTCACGGAGGAGGCGGGTCTTTCTCTGTCATGGAGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTGCCACGGGGGAGGAGGATCTTTCTCTGCCAATGGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGCGGGTCTTTCTCTGCCGTGGGGAAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCATGGGGGAGGCGGGTCTTTCTCTGCCACGGGGGAGGCGGGTCTTTCTCTGCCGTGGGGGAGGCAGGTCTTTCTCTGTCGTGGGGGAGGCGGGTCTTTCTCTGTCGTGGGGAAGGCGGGTCTTTCTCTGTCGTGGGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGCGGGTCTTTCTCCGTCATGGGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCCATCATGGGGGAGGCGGGTCTTTCTCTGCCTTCAGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTGCCATGTGGGAGGCGGGTCTTTCTCTGCCTTCAGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTGCCATGGGGGAGGCGGGTCTTTCTCCCTCATGGGGAGGCGGGTCTTTCTCCCTCATGGGGGAGGCGGGTCTTTCTCTGCCGTGGGGGAGGCGGGTCTTTCTCTGCCGTGGGGAAGGCGGGTCTTTCTCTGTCATAGGGGAGGCGGGTCTTTCTCTGTCGTGGGGGAGGCGGGTCTTTCTCTGTCATGGGAGAGGCGGGTCTTCGTCTCTCATGGGGGAGGCGGGTCTTCCTCCCTCATGGGGGAGGCGGGTCTTTCTCCGTCATGGGGGAGGCGGGTCTTTCTCT
## >chr1:219608-220078
## CTCTGTCACGTGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGCGGGTCTTTCTCTGCCATGGGGGAGGCGGGTCTTTCTCTGTCACGAGGGAGGCGGGTCTTTCTCTGTCATGGAGGAGGTGGGTCTTTCTCTGTCACGGGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGCGGGTCTTTCTCTGCCTTGAGGGAGGCTGGTCTTTCTCTGCCTTGAGGGAGGCGGGTCTTTCTCTGCCTTGAGGGAGGCGGGTCTTTCTCTGCCTTGAGGGAGGCGGGTCTTTCTCTGCCTTGAGGGAGGCGGGTCTTTCTCTGTCATAGGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGTGGGCCTTTCTCATGGGGGAGGCGGGTCTTTCTCCGTCATGGGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTGCCTTGAGGGAGGCGGGTCTTTCTCTGT
## >chr1:519434-541582
## TCCTGGGGTCAAAGGTGACAGAAGAGGCAGAGGCTGGAGCTTTCTGGAGAATTTACTGACCACAGCGTGGTGCACTTGACATCAGGCGCCCGCCATGGCCGGGCCTGGGTCTGAATGCTGCCCGGGACCAGCTGCCTGCGCTCCAGCAGCCCCTCCCTCCTGAAGGCCAGGTCCCCCGAGAAGAACGAGGCTGCAGAGTGATGTGGGGGCCAGCGGTGACTTCCTACCACACTGTTCTCAGGTGTAAGAGGGCTCGCTTCTGCCCAGGCATTGTCCGTGGAAGACACACAGCCGGCCACTGCAGCCTCAGTCCTGGGATGCCCTGGGGCTGGGTCACAGGGGGCCACGGGCCACGCTGGGAGGCCACAGTCCTGTCGTGCCATGCAGCTCCCTGTCCCCAGATGTCCGCTCAGGGATGCAGAGGGCAGAAACCACACTCGCTGCCTGAATTCTGGGAGCAGAGCCCGGTACCCACTGCCTGGCCGGGGCCTACCCTGGGACTCCAGCCCCTGTTCCCGCTGGCCCGGGCTTCCGGAGGCAACTGTGTCCCTATCCTGGCTCAAGGTCCAGGCTGCACCTGGAACCTGCACGGTCACTCCTCCAGGTCCTCAATGCTGGAGGACTCTCTCAGACAGGAAACCTTTGCGTTGGGCGCAGGGCGGGGTGCGGGGTGGTCACGGGGAATCGCAGGGCAAAACAGCACAGTGCAATCGCGCAGAGCCTGATATTGGCGGATGAAACATAAACTGCTTTCTGCACTTTGTGTCCTTAGGAAGGGTGTGGGGTGTTGGCGGAAGTAGGAAACAGAAGAGGAGCCTGGGCATGCAGCGGGTCTGTCAGAGAGCAGAGCCCTCGGAGCTGCAGTGCTTGGAGGGAGGCGGTTCACCTCTGCCCACTCTCTCCAtttctctctctctcattttccttttagagatggattcttgctctgagcctaggctggagtgcagtggtgtgattatagctcattgcagcctcgcccttccaggctcaagtgatcctcctgcctcagcctgtccagtagcCATACCCTACTAGGTCCTAGTTAGCCCCCAGAGGCGTGCACCACCACGCCCACTAATTGCAAAAATTTGTTggctgggcgcgatggctaacatctgtaatctttgggaggccaaggcgggcggatcacgaggtcaagagatggagaccatcctggctaacacggtgaaacccggtctctactaaaaatacaaaaaattagccgggtgtggtggcgggggcctgtagtcccagctactcaggaggctgaggcaggagaatggcgggaacccgggaagtggagcttgcagtgatctgagatcactccactgcactccagtctgggggacagagcgagactccgtctcaaaataaataaataaataaatatataataaataaataaaaataaaaataaaaCTAAGCCCTTCCTGATGGTCATTGGGGGGTTTGGGGGTTGGGGGGGGTGTCTGGCTATGGCTGGGGAACTCATTTGGTTTTCCTCCTCCTCCTCtttttattttttggtagagacggggtctcttgatttcccaggctgatctccaactcctgggctcaagcaatcctcctgcctcagcctcccaaagtgttgggattacaggcctgagacaccgtagctagccAGCtttctttttttttttgagacggagtcctgctgtcacccaggctggagtgcagtggcgagatctcagcggatcactgtgttatacgtaaattttcggtgtcgcaaaagaagtagcactcgaatgtacacttttctcagctaggaaatttacttctatagaaggggggtctcatagatggagcaatggtgagcatttggacaagggaggggaaggttcttattcctgacgcaggtagcgcctactgctgtgtggttcccttattggacagcgttagacctcacaatctaaatccgattggcCtttttttttttttgagatggagtcttgctg
tgtcgcccagactggagtacagtggtgcgatcttggctcactgcaagctctgcctcctgggttcatgccattcttctgccttagcctcctgagtagttgagactacaagtgtatgccatcatgtgcggctaatttttgtgtttttggtagaaagagatttcaccacgttggccaggatggtctcgatctcctgacctcgagatccacctgcctcggcctcccacagtgctgggattataggcatgagccactgcacctggccttaagtggttctttaaagtctgattcgttgtttctactttccctgatgagggtgggtgtcaaggagtgtggtattcttacataatgtctgatgtttggaatagcAttttttttttttttgaggcagagtctcactctgtcgcccatgctggagtgtagtggcaccatcttgtctcactgtaacctttgcctcccgggttcaaacgatcctcctgcctcagccttccacgtagctaggattacaggcgtccaccaccacggccggctagcttttatatttttagtagagacggggtttcaccatgttggccaggctgtacttgaacttctgacctcaatgatctgcccccctcagcctcccgaagtgctgggatacaggtgtgagccaccactccTCGCTCAAGTAATATGTTAAACTTATGCTTTCTTCTTTTCTTCTTTCttttttttttttttttttttggatggagtcttgttctgtctgcccaggcttgagggcatggcataactcggctcactgccctccgccgttccagtcatgcatatctgctgccttcagcctcctttagtacgggacacgaggccacctgccacccgtgcctggctatttttttatttttttttttttttttttttttttttttttttATCAGgacagagtctggctctgccgccaggctggagcttgcagtggcgtcagctcaacctgcaagctccgctccgcgggttcaacgccattctcatgcctcctcagcctccccgagtaattgggactacagcgcgcccgccaccgccccgctcagtttttgtattttttagcagagaggggttaccgtgtagccaggatgggtctcgattcctgacgcctcgtgatccgcccgtctcggctcccaagctgggattacaggcttgagccacgcgccccggcccggcatttttttcatttttagtaagaaacagggtttcaccgtgtttagccaggattggtgtcgatttcctgacccgtgatccgcccccctcggcctcccaaagtgctggattccaggcctgagcctgcaagccgggccTACTCTTTGGCTTTTAAAAGAATGGGCAACATTGCTTTTCTTTACTAACTTCTAATCTTTCCCTCTCTGACTCATCTCTCCTCCCACTTCTCTTGTTCTCCCTGTCAGTGTTCCTTTCCTAAGAGTTTTTCCCTGTCTATGATCTTTTTTTATAGGCTTTTTTCTAGTTTCTCTTTCTTTGTAATTGTGCGTTAATACTGGCCAATTGTTAGTGACAAATTCCTTGCCAAGAGATCCCTGACCCTAAACCAGCATATTCTGTCCATTCGTTTTAATCTGTACtttatttttcttgagatggagttccgctctgtcgccaggtgtggatggtgtagtggcacgttctcgctcactgtcaactcgccctccagggtcaacccgcaccatcctcgctgccttagcctccgagtacggggattgtacaagcgtccaccacccggcctggcgaggcgcttgatttttttatttcagtagagatgggggttttcatcgtgttagccagatggtcccccccatctcctggactcatgctccgcgcaccgccccttggcctccgcaagtgcgcgattaTGATCTCTCTCAAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNtctcctgcctcagccctctgagtagctgggattacaggcattttttgtatttttagtacagatggggtttcaccattttggtcaggctggtggaggactcctgacctcaaatgacctccccgcctggcctcccgaaatgctgggattacatgcgtgatccaccacgcccagccATACAGttcttatgttaagacaggctctctgtcgcccaggctggagtgcagtggcgcgatcacagctcactgtttgcctcgacctttcaagctcaagctgtcctcctgcctgagccgcccgcgtagccaggactgcaggggcacagtgccatgcccggctaatttttttttttgtgatggcgttttgctcttgttgcccaggctggagtgcggtggcgcaatcttggctcactgcaacctcctccccctgggttcaagcaattctcctgcctcagcctcccaagtagctgggattacagtcatgtaccaccacgcccggttcattttgtatttttttttagtagacaagggatttctccatgtcagtcaggctagtcctaaactcctgacctcaggtgacccgcccacctcagcctcccaaactgctgggattacaggcgtgagccactgtgcctggtcCTGGCTAATAttttttttttttttgagacggagcctcgctctgtcacccagactaaagtacagcggcgcaatctcagctcactgcaagctccgcctcccgggttcatggcattctcctgcctcagcctcccaagtagctgggactacaggctcctgtcacctcgcccggctaattttttgtatttttgtagagacggggtttcacagtgttagccaggatggcctcaatctcctgacctcgtgatccgcccacctcggcctcccaaagtgctgggattataggcgtgagccaccgcgcccagctgtttttttgtaatgttagtagacatggactttccccttgttacccaggctgggctcaaacttctgaggtataagagatgctcccgccttgaccttgtgaagttctgggattacagacgtgagcccccatgcccagtcAGGGGTttgtttgttttggtttttgtttttgtttttgagacagagtctcactctgtcgcccatgctggagtgcagccgtgcaattttggctcgctgcaacctctgcctcccgggttgaagtgattctcctgcttcagcctcccacgtagctgagaccacaggtgtgccaccgcgcctggctgatttttgtatttttagtggagacggggtctcaccatattggccaggatggcctcaaactccctacctcaggtgatctgcccgcctcggcctcccaaaatactacgttacatgcatgagccaccgtccctggcTGTGGTCAGGCTTTTGAGTTTAGATCCATGAAAGTGTGGCCGCGTCCCTGCTCCCTGCAGGAGGGAGGCCTGTGGGACCTTCTGCTGTGGCTGTTTACAAGGCTTTGCTCCTGGTGCCTAAGGCTGGAACCTTCTCTCTGCAGGAGGAGATGAGCAATTACTACCTCAGAGTCACCCAGAACGCCTTCCTAAACCACACGAGGCAACGCAGCAACAAGTGAGGGAGCCCCTCGGGTCCTGGGCCCCCGGGTAGGGCTGTGCAGCCGTCGCCCTTGGTTCCCACAGAGGGACCTCAGAGGCCCTGGATCACAGTGCTGGGCAGCACCCGTGGCCTCAACGTGTCCACCTCGGATGTCCCCTAGGAATGTCCCAGCTCGGGACAGCATGGGGCGTCACTGAGGAACATGCGGGGGCCTCCTGGGCAGAGCCGGGGTCAGTCCCGTCCTCACGGCCCTGTGCGATGCCGCCCCAGCTTGCACGTCCCTCTGCCCCTGGGTTTCCGCGGTCCTGTGCCAGCAAGGGAGGCGGTCTGATTGTCTGAGGCTCTGCTGGGGCCTCCATTGCAGGCTGTGGGTGCCCTGGGGTGGGAGATGGAGACACT
TTTGCTCCCACGGGAAGCTGGGCACGAGCAGGTCCTGTGTGTTTGGGCGGAGCCTGGGGCCTTGGCCCCCCCGCCCAGATGCTGGACAGGGTTGCTCCCTCCAGGCCTGGGGCCCTCCTCACATTGCGCGTCCTCCGTGAGCTGCTACCCAGAGGTCCCCAGTAGGTGGATAGCCCCATGGCCAGGCTCCCTAGCCCCTTTCAAATCCCCTTATTTTGAGTTTTCTTGGTCTCCTGGGCCCCTCCAGCCCCAGTCACGTGTCACACGGAGAATCAAGTCCTGCCGGTCGGCCGTGGCCGAGTCTTCAGGCGTGTTGGGCTCGCTGGCTCAGCTGCTGCCGGTAGACGCTCCCTGGAGCCCTGGCTCAGGTCCTTCCCAGAGAGGCAGGGCTGGGGCCCTGGTGAGCCTCCGCTGCACCCGGGCCCCCAAGGTCCTGCTCCTGGCTCGTGTGGCCACTCTTGGCATGGACTCTGGGTCCCGCATCCCTGCTCCCAGCACAGCAGGGCTCAGGCAGCAGGAGGAGTGGTGGTCCCGACGCTGCCTATCACGCTGGGTGAGGGTCAGCGGGGAAGCGCCACACGGGATGAGAACAGAGGCCCAGGTAGCCGGGCGGGGGGACAGCTGGGCGTGGTGGGGCCGGCGGTGACCAGGGGGACAGCTGGGCGTGGTGGGGCCGGCGGTGACCAAGGCTGTGCCACGTCCTCCCGATGTTTCCTGTGCTCACAAGCTGCCGCTTTAGATTCTCCGGGAAAGTCCCCCTGAAGGGACTAAGGAGCCCGCGTTCCCCTCGGGACAGCTTGGCCGGCAGCCCCAGCATTTCCTTCCCCATCCCTGCTCCGCAGATTCATGCTGGTCCTGGCCAGCCGCGACCCCAAGCAGTTACACCAGGACATCCACGACCGCATCGACGTGATGTTTTACTTCGACCCGCCCGGGCCAGAGGAGCGGGAGCGCCTGCTGAGAATGTATCTTGACAAGTATGTTCTTATGCCGGCAACAGAAGGAAAGCAGTAAGTGTCTCCCCTCACCCACCCCTGTCCAGGGACCCTCGCTCTGGGCCCACCCCCGGCCCTGCTCTCCGGACGCACACAGCAGGCCCAGTCTCCGGGGTGGCACCGCCTCCCTGCTTTGCGGTTTCGCACAGGAGCCCTGTGGGCCCCAAGGGTCCCAGAGGCTGCACCCAGGGATGTGCCACCACCCTTTCCTCATCCCCACCTGAGAACAGCCTGGTGGTGTCTCCTCGGGTTTGGGGGGCAGAGCCCACCATCACTTACAAACCTTCAACtttttgtttttgagacaaagtcttgctctgtgccccaggctggagtgcagtggcacgatctcagctgactgcaacctccgcctcctgggttcacgcgattctcctgcctcagcctcctgagtagctgcgattataggtgcctgccaccacgccccactgcttttcgcctttttgtagagatgcagtttcaccatgttggccagggtggtctcgaaaccctgacctcgggtgatctgcccgccttggcctcctacagtgcagggattacagatgccagccactgtgcccgaccACCCTCAGGCCCTGGCAGTGCAGGGAGGTGACGTGGAGTGTTGCTCTGAGACCCCCATGTTGGGATTTGAGGGAGACGCTCCTCATGAGAGCCCCGTGTTGGGACTGGAGAGGATCCTCACGGTCCCCTGCTGGCTGCTGGCCTGGCCTTCCTCCAGCTGCCACGCCTGGCCCTGGAGCCTCATGGTGTGGGGCGCGGCTCCGGCTGCACTTGTGCCCTGAGGCCTTCAGGCTCCCTGTCGCTGGCGGTGGGTGCAGCAGGCACGGCGGGCAGAGCCCTCCAGGTGATGACAGGCCCTGGGGCTGCACGCCGGCTGCCTCAGGAACACTCCAGATGAGCAGTGGCTGCTCCACCTCTTGGCGTCCCCAGGTCCCAGGTTTCTGAGTCCTTCTGTCCACCTGACCTAAATTCCTGCTCTCTCCA
GTGACAGCAAAAGCCGCTCTGTTCCAGAGAGAGCCTGGTTCCCCCTGCCAACCGCTCCGTGGCTGCCTGCTTCATGCTAGCCCAGCTGTCCCGGCCTCAGTTTCCCTTTGGCCCTCCCCTGCCCTGGGCTCTCCCACTCCCACGGCTGCTCATAGACCTGGCACAGTGACTTGGCTTCTATGACCTCCAGGGAGATGCTTTTGCTGGAATTCAGGGCTCTGCCACTGCCACTGTAACGGCCATGAGCCCTGTGGGTGCTGAGTGGGCAGGTGAGGGCAGGGCTGGTGTGAAGAGGGGGTGCGGCCATCTCCAGGCCCCACAGCAGCCACCACCTCCCTGCTCAGCCCAGACCTGGTTTGCATCAGGGAGAGGGCGGAGTTTGGCTGTCACAGGAAGAGTCCCTCCCAAGGGGGCATCTGGCATGGGTGCCCGCCTGGCTGCCTGTCTTCCAGCCCCCACCTCGTGGTGTGGGAGCCGCTGCCTTGGCCGGCCCACTTGGGAACTCCTTCCCCAGGCGCCTGAAGCTGGCCCAGTTTGACTATGGGAGGAAGTGCGAGGAGATCGCTGAGCTGACGAACGGCATGTCGGCCCGGGAGATCGCACAGCTGGCTCAGTCCTGGCAGGTGAGTGGGGCTCGGGCGCACCCACCCAGACAGGAGCCCAACTCCTGTGGAGACGCCGGGTTGCGCCTGTCCCAGCACCAGTGTCACACCGCAGCTTCTGTTGAGGGGTTTTCAGTGCACAGACGTGACACGGGGCACTCGCCCCAGTCGGCCACTCCACACACTGGCGCGCCCCTGCTCCTGCCCTGGGAAGTGTGGGGCATGTCCGTGGCTGACGGTCATAGGTCAGGAAGCCCGTCCGGCATCCTAGTATCCGGGCTCTGCCAGGTGGGGCGGGAGGCTTTCGATGCTCACCTTGGCAGACGGGCACCCCCTGGTGTGAATGGTCATCGGGACAGGCCCCGCCTGAGTTTGGTGGTGGGGCTGGAGGGATGTTGTGTTTCCCGGACCACGTCCGTTGGCTTGATCCTGCTTGACGGGCTCAGACACAGGGGCAGGAGTGACCTCTGATTGTCCCACAGCCGGCTGCTCCTTGGAGGACCCCCTCCTGCAGCTCCGTGGCTGCTGCAGGGACGGGGAGCCGGGACTCAGAGCAGTGTGGGCGTGGCCATCCAGAAAGCTTTGGTCTTTGGGGGTTGCTGGAAAAGCATAACCAGGTCTGTAGAAGGCACCAAGGCCATGCACAGGCATTGCTGCCTCTGGGGTCTGCAGAGTCTGTGACAACCTGGTCACTCAACCTAGCAGCGCTTTCGCGTGTGACAGGTTCATGAAGTAGCCAGTTACCTTGATTTGAACGTTGGAGCTGGGGACTATATGGGCTGTATTAGTCAGTTATGCCGCTGTGACAAAGAGCCTCAGATCTCAAACCCCATCCTTGTGGGTCAGCTGAGGTCTGTGTTCCAGGCCGTCTCCACTTGAGACCAGGTCTGTTTCCACAACTAAGCAAACAGAgaccgggccatggtgttgggctacatttgttcccagcatttgggaggtcgaagtcagcccagattatttgaaggcaggagtcaggaccagccttggggggggggggggggggggggggggggggaaagcaaggggagactccatctacaaaaaataaaaaaattagccggaccctaatgtggcacgcctgtaatgcagctcctgggagcctgaggtgggatgatcactgagtcccaggtaggccagaaatacagtgagcctgtggattgtgccactgcactccagcccgggttacagagcgagaccctggtctttaaaaataagaataaTTTGAgccgggcatggtggctcacgcctgtaatcccagcacgctgggaggccaaggggagaggatcacttgaggccaggagttcgagaccagcctggccaacatgtcgagccccacctctactaaaaatacaagaattggccgggcgcagtggtggt
gcatgcctgtattctcagctactcaggaggctgaggcaggagaatcgcttgaacccgggaggtggaggttgcagtgagctgagatggtgccattgaattccagcctggactattcaggatcctttgagattccataagaattttaggagtggttttcctatttttgtaaaacataatttgggttttcacagggaccgcgtttagtctctatgtcgctttgatgtctctcagcaatattCTGTGGttttctcttgttttcgagacggagtctcgctctgctgcccaggctggagtgcagtgttgtgatctcagctcactgcaacgttcccctcccgggttaaagtgattctcctgactcagcctcctgaggagctggaattccaggcaggcgccaccatgcccggctaatttttgtactaagagacggggttttgccatgttggccaggctggtctcgaacctctgacctcaggcaatccacccacctcagcctcctaaagtgctgagattaaaggcacgtgccaccacgcccggctaatttttgtatttttagtagagacgatgattcaccatgccggcgaggttggtcttgaactcctgacatgaggtaatccatctgcctctgcctcccaaagggctgggattcagacatgggccactgcgcccagccagttttcactgtacaagtctttcaccctcttggttaagtgaatttccaagcattttattcttgccgctgctgttgtaaatggaaacggtttcataattccccattcacattattcactgttgggatggagaactgcagctttctttgctgttgattttgtatcctgtaagtttgctgatgtcacggcattttttcttccaatatggattctaggattttctacatataagattatgtcatctgagaacaggtgatttttacctttcccttttcagtttggatgacttttctttttcttgtctaattgcactgtccagagcttccagtggtgtgtggaatagaagcggtaaagcattcttgcctggttccttacctcagaggaaaagctttgtttttcaccactgagtatgtcacctatgggcttgtgatgtgtggccttcattgtgtttagggtgtatccttcaattcttggtttggtgagtgtttttatcataaaagtgtgaggcgggtggatcacctgaggtcggcagttcgaggccagcctgaccaacgtgaagaaaccccatctctcctacaaatacaaacttagttgggcatggtggtgcatgcccgtaatcccagctactcgggaagctgagacaggagaatcgcatgaaggcggcaggcagaggttccagtgagccgagatcgcgccatttgcactccagcctgggcaagaagagcaaaattgtctccaaaaaaaaaaaaGTggccaggcacggtgactcacgcctgtaatcccagcactttgggaggccaaggtgggtggatcacgaggtcaggagatcgataccatcctggctaacacagtgaaaccctgtttctactataaatataaaacatcagctgggcatggtggcaggtgcctgtagtcccagctacctgggaggctggggcaaaagaatggcgtgaacccaggaagcggagcatgcagtgagctgagatgcctgggctacagagtgaggccccaactcaaaaaaaaaaaaaggtgttgtatttggtcgaatactttttctgcaacacttgagacagtcgtgtggtttccttcctccaccctgctaatatcgattgatttttgtatgttgaacatttcatatgcggaacattgattttcatatgttgaactatcgttgcattccaggaataaatcctgcttggtcggctgggcgcggtggctcaagcctgtaatcccagcactttgggaggccgagatgggcggatcacaaggtcaggagatcgagaccatcctgtctaacctggtgaaaccccgtctctactaaaaaatacaaaa
aactagccgggcgaggtggcgggcgcctgtagtcccagctactcaggaggctgaggcaggagaatggcgtgaacccgaaaggcggagcttgcagtgagctgagatgcggccactgcactccagcctgggtgacagagcgagactccgtctcaaaaaaaaaaaaaaaaaaaaaaaaaatcctgcttggtcagggtatagagtccttttagtgtgctgctgaattcactctgctggcattttgttgaggactttcccagtgatgctcatcagggatattggcctgtcatttttcttgtggtgtctttgtctgggtttgatatcagggtaatgctggcctcctaggatgagtgaggaaatgttcttcaatttgtccaagagtttgaggtgtgctgctgattcttcttaatgttttgtgaattgacacgtgaagacatcaggtccaggtcttgtgtttCaacttttacagcttgaagactttaggttcccagaaaaattgcaaaggtagcacagagagctcccgGGCCCGGGGCCTTGCCACGTAGTGAACGTCATGTGTCACTGTTGGCCCCACCTGGGACTGGGTCTTGCCCAGAATCCCACCCAGGAGGCCACGTGACATTTAGCTGTCACTTCTGGTGGGCTCTGCCAGGTCCCGTGCTTCCTGGTGGGGTGGCCCCATGAGCATCTGCTCATCCCCTTTCCTCCACTGGGCCCTGGGTGAGGTGCAGCCACTCGGGTGCACCCTGAGGGTTCCTGCACCTGTTTGAACTCTCTTGGGTCGGCTCAAGACCAAAAATGATGCTGAGCAGTCCTGGGCCTCTGATGCATAGTGGTGGTCCGGTTCCGGTCAGCGTCTCCTGCACTCCTGGGCCCCTGAGCCACAGTGGCGGTCCAGCTCCAGTCAGTGTCTCCCCACACAGTGGCTCTTGGCGAGGGGTGGGCGCTGTCAGTGGGGACGGGCACCACGTGGTCATCCCCATGGCAGGTCCCATCGTGGCAGCCGTGTTGTGGGAGGATGGTGCGCTGCTGCCCCTTTACCCTGTGAGATGAATCCTGCCTCTGGGAGGCACAGCCGGGATGGGGTGAGGGACCCCCTCAGCTGTCCGGGAAGCGTCCCCTGCCCTGTGCTTCCTCCAGGCGTCCTGGTGCACTCCCAAGCACGGTGCCCAGTGGGGGTGCCCAAACCTTCACCCTGACCCATGGGTGACTTCCCTTGGGGACTCCACGCCTTTCACTGGGACTGGGATGGAGAGCGACCTGTCCATGGCAGAAGGGCTGCACCTGAGGTGCTTGAAGCAACACCAAGGGCCACAGTCCCAGCAGCTCCAGCCTCCGCATGCTGGATGCCAAGTCCTGTGCCCAGGACAGGGAGGTGGAGGCACGGGTGATCTTGATGCTAGCACCTATGTGCCCCGAGGTTGGGCAGTGGCTGCCTCTGCTGTGGAGGCCTATGAAGGTGAGGGTCTGAGGATCTGTAGTGCACTGTGACCCGGGGGCACTGCCTGGCCACGGCTGAGACACGCAGAGGGTCTGCAATTCCCTCCTGCCTCTTGGGAGCTGCCCTGGGTCTGCAGTCAGTGGGGCTCGTCCTCGGGCTTTCCGTTATTAGAAAGTCACTGAGAAACTGCAGTGCTGAGGACGCAGGCAGGGCTGTGGCACTGCAGGGGCCGCTCCCGGTGTCCACACGCATGCTGGGCTCTGCCGAGGTGCCGGAAGCCTGTGTTTCACCCTGAGGCCGTCCTGGTGCCCCGGGTTTGGACCCTCCCCACCTCGGGGTCCTGGAGTGCGTTACGGGTGGGGGGTTCCCATGGTGGCCTCCCTCAGCTCCCTCTCTCCTCACTAGGACACGGCGTATGCCTCCGAGGATGGGGTCCTCACCGAGGCCATGTTGGATGCCCATGTTGAAGACTTTGTCGAGCAGCACCAGAAGAAAATGCGCTGGCTGAAGAGGGAGGGCCTGTCCTCATGGACCAGCACCCCTTAACCTGAGTCCGCGGTGA
GACCACACGTCACGGAGCCTGGCTGCGGACCCCTCCCACCCCTGCTTTTCCGGTCCCTGCACGTTTAGGAAATGCTTCCCCTAATAAACTCCCACAGGTGCCACAGCGCTGTGTCTATTGGCTGATGTGGTGCGGGGTTTGGGGTCCCCTAGTGTCCTTCTGGGGTCAAAGGTGATAGAAAAGACAGGCTGGAGCTTTCTGGAGAATTTAGGCACAGAAGGGTGGGCTTCACATGAGGTGCCTGCCACAGCGGGGTTGGCTGCCTGAATGCCACCCGGGACCGGCTGCTCGCGCTCCATCCTGCAGCTGTGGAGACGGGGGTGCCCCTTTGCCTCTCTCCACGAAGTGCAGGGCAAACAAGACACAGCGGTTTCAAACAGGCGATGGCCCGGACTGCGTGCCTCGCCGCCCCTGCGCCTTCCCCTGCCCCTGCTTTCCAGCTAGTCCCTGAAAACCTTGATGGggccgggcgcggtggcccatgatggattctcagcactttgtgaggccaaggcgggtggatcacctgaggttaagtgttccagcccagcctggccaacatggtgaaaccccatctctcctaaaaaaaaaaaagaaaagaaaaagaaaaattagccgagcgtcgtggcaggtgtctgaaatctcaggcactcaggaggctgaggcaggagaatcacttgaccccgggaagtggaggttgcagtaagctgagaccatgccattgcagtgcagcctggacaacaagagtcaaactctctcaaaaaaaaaaaaaGgccaggtcaggtggcatgtgcctgtggtcccagcttggtcccagattcttggtttggaggctgaggtaggaggatcacttgagcatgggaggatgaggttgcagtgagccaagatcgcttcagacactccagcctgggtgacagagtgagaccctgtctctaaataatcaaaaCCTTGATTACAGCCATGGGGTGGGGGTTGGGGGGCGTCTGGCTCGGCAGGGAACTATTGGGTTTTTCTGCTCTCtaatttttgtagagacagggtttctctttgttgcccaggctggtctccaactcctgggtcaagcgtcgatcttctgcctcggcctcccaagtggtgaggttacaggcgtgccaccgcacctgaccTGttttctttttttttttttttttttttttgagacggagtcagctctgtcacccagggctggagtgcagtgggcggtctcagctcactgcaagctccgcctcccgggttcacggccattctcctgcctcagcctcccgagtagctgggactacaggtgcgtgccacaacgcccggctaagtttttgtatttttagtagagacagggtttcactgtgttagccagggtggtctcaatctcctgaccttgggatccgcccgtctcggcctcccaaagtgctgggattacaggcttgagccaccgcccccggccCCttttttttttttttttttggcaagggagtcttgctcgcccagggtggagtgcagtgttgcaatctgggctcactgcaacctccacgtccagggtgtcaggcctctgagcccacgctaagccatcatatccccagtgacctgcatgtgtacatctgatggcctgaagcccctgaagatccgcagaagtgaaaacagtcttaactgatgacattccagccttgtgatttgttcctgccccaccctacctgatcaatgtactttgtaatgtcccccacccttaagaaggttctttgtaattctccccaccctggagaatgtactttgtgagatccacccccagcccccaaaatattgctcctaactccactgcctatcccaaaacctctcagaactaacggtaatcccagcaccctttgctgactctttttggactcagctggcctgcacccgggtgaagtaaacagccttgtggttcacacaaaacctgtttcgtggtgtcttcacacggacacgcgtgacacagggttcgaggaaatttcatg
cctgaacctccggagtagctgggattacaggcgaacggcaccatgcccaggttaatttttgtattttcggcagagacagaggcccaggtagccgggctggGGGACAGCTGGGTGTGGTGGGGCCGGCGGTGACCAGGGCTGTGCCGCGTCCTCCCGGTGTTTTCTGTGCCCACCAGCTGCCGCTTTAGATTCTCCGGGATAGTCTCCCTGAGGGGGCTGAGGAGCCTGTGTTCCCCTCGGGGCAGCTTGGCCGGCAGCCCCAACATTTCCTTCCTCATCCCTCCTCCGCAGATTCATGCTGGTCCTGGCCAGCCGCCACCCCGAGCAGTTGGACTGGGGCATCCATGACTGCATCGATGTGACGGTCCACTGCGACCTGCCACGGCAGGAGGGGCGGCAGCGCCAGGTGAGAATGTATTTTGACAAGTATGTTCTTAAGCCGGCCACAGAAGGGAAACAGTAAGTGTCCCGCCTCACCCGCCCCTGTCCAGGGACCCTCGCTCAGGGCCCACCCCGCCCCTGCTCTCCAGACGCACCCAGCAGGCCCAGTCTCCAGGGTGGGCACCACCTCCGTGCCCTGAGGTTTTGTGCGGGAGCCCTGTGGGCCCCGAGGGTCCCAGAGGCCGCATCCAGGAGGTCACGCCCCCTTTTCCTCATCCCCATCTGAGAACAGCCTGGTGGCGTCTCCTCAGGTTTGGGGGCAAAGTCCACCATCACTTAGAAACTTTCAGCAttccttttttttttttttcttaagacggactcttgctctgtcatccaggctggagtgcagtagcttgacctcggctcactgcaagctctgtctcccaggttcacgccgttctcctgcctcagcctcccaagtagctgggacaacaggcacccgacaccacgcccggctaatttttttgtgtttttttagtagagatgggtttgaccgtattagccaggatggtctcgatctcctgacctcgtgatccacctgcctcggcctcccaaagtggtgggattacaggtgtgagccaccgcatctgacctttttttgaggaagtctcactcttgtccccctggctggagtgcagtgccgggatctcagttcactgcaacctgtgcctcagcctcctgagtagttgggattataggtgcccgccaccgcgcctggctggtttttgtgtttttgtagagatggaatctaactccgtctcccaggctggagtacagtggtgtgatctcagcttactgcaacctccaccctccgggttcaaaccatcctcttgcctgagcctcctgaacagctgcgattacaggcgcccagcacaatgctcgcctcatttttttgtctttttagtagaaacagcttttcaccaaattgaccagactggtcttggacttctgatctcaagtgattcaccctcctcggcctccaaagtgcagggattgcagatgtgagccaccggacccggcctcttttatgttcctcttcagtaCTCAGAGGGCTGTGAGGAAATCCGGTGCCCGGCCACCCCCAGGCCCTGGCAGTGAGGGGAGGTGATGTGGAGTGTTACTCTGAGATTCCCATGTTTGGATTCGAGGGAGACGCTCATCATGAGACCCCTCCGTGTCGGGATTAGAGGGAGAGGCTCCTCATGGTCCCCTGCTGGCTGCTGGCCTGGCCTTCCTCCAGCTGCCACGCCCGGCCCTGGAGCCTCCTGGTGTGGGGCGCGGATCCGGCTGCACTTGTGCCTTGAGGCTCTCAGGCTCCCTGTCGCTGGCGGTGGGTGCAGCAGGCACGGCGGGCAGAGCCCTCCAGGTGATGAGAGCCCCCAGGAACACTCCAGATGAGCAGAGGCTGTTCCACCTCTTGGCGTCCCCAGGTCCCCGGTCTGAGTCCTTCTGTGCACCTGACCTAAATTCCTGCTGTCTCCTGTGACAACAAAAGCCACTCTGTTCCAGAGAGAGCCTGGTTCTCCCGTTGACCCCTCCGCTGCCGCCTGCTCCAT
GCTAGCCCAGCCGTCCAGGCCTCAGTTTCCCTTTGGCTCTCCCCTGCCCCGGTtcccagctgcttgggaggctgaggtaggaggatcatttgagtccaggagcttgaggttgcactgagctgtgactgtgccactgtactccagccttggcaacagagtgagacactgtcttaaaaaagaagaaTTTGggccagatgctgtgtttcatgcctgttcccagcatgctgggaggctgaggagagaagatcactcgaggccaggggttccagaccagcctgccaacatgttgaaccccgcctctacgaaaaatacaaaaattagccgggcgtggtgggtgggtgggtgccagtaatcccagctactcaggaggctgaggcagcaaaatctcttgaacctgggaggtggagattgtggtgagctgagatagtgccgctgtacttcaacctgagcaacagagtgagactccttatcaaaataaagaaaTCAATCAATCAATAAAAATAATCACAATAATTTGggctgggcgtggtggctcactcctgtaatcccagcactttgggaggcgtggatcggttgagttcgaggcaagcctggccaatgtggcgaaaccccatctccactacaaatacaaaaattagccaggtgtggtgacaggcacctgtaatcccagctgctcgggaggctgagacaggagaatctctggaacctaggaggcggaggttgcagtgagccaagatcacgtcagtgcgctccagcctgggtgacagagactgtctcaaaaaagaataataataaTTTgactgggtgtggcggctcactcttgtcatcccacactttgggaggccgaggcaggaggattgcttcagctcaggatttcgagactggcctggacaactggcctggacaacatggtgaaactccatctctacaaaaaatacaaaaattagccaggcatggtatcatgtgcctgtgatctcagctactcaggaagcagagatgggagcattgctggagcctgggagttggaggctgcaatgaaccatgttcgtgccactgcactccagtgtgggtgacagagtgagaccctgtctccaaaaggcatggtggctcacgcctgtaatccctgcactttgggaggccaagctgggtggatcacctgaggtcaagagttggagaccagcctggctaacgtggtgaaaccccatctctaggaaaaatagaaaaaATTggccaggtgcagtggctcacacctgtaatcccggcactttgggaggccgaggcgggcgaatgacctgagatcaggaattccagaccaaccacaccaatatggagaatccccgtctctactcaaaatacaaaatcagccgggcatggtagcaatcccagttactcaggaggccgaggcaggagaatcactggaggtgagccgagaccacgccattgcactgaagcctgagcaacgagagggaaactgtctcaaaaaataaTGCTAATAACAAGGGGGAGAGAACAGGAGTGTGGTCAGCAGCTGGGCCTGCCATAACCCCTGGGTCGTGTGTCCCCACAGCTCTGAAGGCTAGAGGCCCGAGGTCAGGGTGCCAGCTCGGTCCCCCCCGTGGAGTGTTCTCTGTTAGCTTCTCACATGGCAGGGAGAGTGACTGAGCTCTCGCTCTGGTGTCCCTTACGAGGACGTTCATCCCCCACTGCTCAGAGCGGCGGTGAGCCACCACGCCCAGCGCCAACTTTGTCCTTCAAGAGTTGTTTTTTTGTgccgggctcagtggctcatgcctggaatcccagcactttgaaatgccaaggtgggtggagcacctgaggtcaggagtttgactccagcctggtctaaatggtgaaaacctgcctctactaaacataaaaaaatcagctgggcatgttggtgtgtgcctgtaatcccagccactcgggaggctgaggcaggagaatcacttgaacccaagaggtggaggttgcagtgaact
gagatcatgtcactgcactgcagcctggatgacaagagtgagactcccttgcaagaaaaaacaaaaattaaaaaagaaGTTGTTGTcttttttttttttttttcccttggacaattcaagatgcctagagattccatatcaattttagtaatgcttcttctatattttaaaaagtaatttgggtttttacagggattgcattcagtctctgtattgccttAATGACTCTTAGCAATGttgttttttttatttattattttttttctagagatggagtctcactctgtcagccaggctggagtttagttgttggccaggatgggcccaatctaatgacgtcaggtgatccgcctgcccctggctcccaaattgctgggattcagacgtgggccaccatgcccagccagtttacattgtacatttctttcaccttcttggttcagtgaagctccaagtattttattctttcggatgctcttgtaaatggaaatggtttcgtcattccccgttcagattatacacttactatgaagaactgcagctttctttgctgttgattttgtatcctgtaactttgctgatgtcgtggggttgttttttccaatatggattctagattttcCTTTTCTTTTTCTtttttttgtttttttgttttttttttttttgatatggggtctccctctgtggcccaagctggagtggaatgcagcggcacgatcttgaatctgcgagctcctctgcccgggtccacgccattctcctgcctcagcctcctgagtagctgagactacaggtgcctgccatcacggccggctaattttgtgtattttttgtgcagatgaggtttcaccgtgttagccaggatggtctcgatctcctaactttgtgatcggcccgcctcggcctcccaatgctgAATGCTGTTGGGACTGGGTCTTGCCCCAGAATCCCACCCAGGAGGCCACCTGACGTTTAGCTGTGACTTCTGGTGGGCTCTGCCAGGTCCCATGCTTCCTGGTGGGGTGGCCCCGTGAACGTCTTCTCAGGCCCTTTCCTCCATTGGGCCCTGGGTGAGGTGCAGCCACTCGGGGGCACCCTGAGGGTTCCTGCACCTGTTTGAAGTCTCTTCGGTCGGCTTGAGACCAAAAATGATGTTTAGCAGCCCTGGCCCCCTGACGCACAGTGGCGGTCCTTCTCCGGTCAGTGTCCCCTGCACCCTTGGGCTCCTGACGCACAGTGGCGGTCCAGCTCCAGTCAGTGTCTCCCCACACAGTGGCTCTTGGCGAGGTGTGGGCGCTGCCAGAGGGGACGGGCACCACGTGGTCATCCCCATGGCAGGTCTGGTCGTGGCGGCCGTGTTGTGGGAGGATGGTGTGCTGCTGCCTCTGCACCCTGTGAGATGAATCCTGCCTCTGGGAGGCACAGCTGGGATGGGGTGAGGGACCCCCTCAGCTGTCCGGGAAGCGTCCCCTACCCTGTGCTTCCTCCAGGCGTCCTGGTGCACTCCCGAGCTCGGTGCCCTGTGGGCGTCCCCATGCCCAGACCCTGACCCACAGGTGCCTCCCCTTGGGGTCTCCACGCCTTTCCCTGGCCCTGGGATGCAGAGTGACCTGTCCATGGTAGAAGGGCTGGACCTGAGGTGCCTGAGACAGCACCAAGGGCACTGGTCCCAGCAGCTCCAGCCTCTGTGTGCTGGATGCCACACAGACACAAGACTCTTGGGAGACGCATTTTCCATCTGGCTCAGAGGGGGAGGGGGAGGCTTTGCAACCCAGCCCCTGCCCAGGCCCCTGGGAGGGTGGGTGCCTGCTGAGCCCCCGGGGCAGCAGGAGCGGGGCAGGCGGGGTCTTTGTTCTCACTCCCACAGCAGAGGCAGATGTGGGGGCGCCTGCTGGGGCCAGACCAAGGTGGGGTGGCCTGGAGACTGCTTCCAACCGTGGCCGGGAAGCAGGGAACCTGCCCGGCGTGTCTGAGGCCACACTCTCAGCTGGCCGGTCCAA
GCCTGCGGCTGGAGCTGGTGTCTGTTTAGCTAATAAAGTCCCACAGTTGCCTCACTGCCGTGTCTATTTGCTGATGCTGCGCGGGGTTTCAGGGGCCGCCTAGCCTCCTCCTGGGGTCAAAGGTGACAGAAGAGGCAGAGGCGGGAGCTTT
Let’s break this command down since there is a lot going on. Here is a table that explains each option:
Command line option | Description |
---|---|
bedtools | Call the main bedtools interface |
getfasta | The sub-program in the bedtools program to run |
-fi | The path to the input whole genome sequence file in FASTA format |
-bed | The path to the input bed file |
-fo | The desired name of the file to write the extracted sequences to in FASTA format |
We saw this a bit yesterday as well, but this is another framework for running commands. It still follows the Unix philosophy (formatted text -> command -> processed text), but while the Unix commands we’ve seen up to this point generally act on a single file, the use of multiple command line options allows us to specify multiple input files, here a FASTA file of sequences and a bed file of intervals. Also, like Unix commands, the default action of bedtools is to simply print output to the screen. This output can be redirected to a file with >, but here we also have an option (-fo) to tell the program directly to write output to that file instead.
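To make the two output routes concrete, here is a minimal sketch that runs the same extraction both ways and checks that the resulting files match. The toy-getfasta directory, the chrT sequence, and the interval are all made up for illustration and are not part of the workshop data. The sketch assumes bedtools is on your PATH and simply skips the comparison if it is not.

```bash
# Toy comparison of "print to screen and redirect with >" versus "-fo".
# All file names and sequences below are made up for this sketch.
mkdir -p toy-getfasta

# A one-sequence "genome" and a one-interval, tab-delimited bed file
printf '>chrT\nACGTACGTAC\n' > toy-getfasta/toy.fa
printf 'chrT\t2\t6\n' > toy-getfasta/toy.bed

if command -v bedtools > /dev/null 2>&1; then
    # Default behavior: output goes to the screen (STDOUT), so we redirect it with >
    bedtools getfasta -fi toy-getfasta/toy.fa -bed toy-getfasta/toy.bed > toy-getfasta/redirected.fa

    # Same extraction, but bedtools writes the file itself because of -fo
    bedtools getfasta -fi toy-getfasta/toy.fa -bed toy-getfasta/toy.bed -fo toy-getfasta/with-fo.fa

    # cmp -s prints nothing and succeeds only if the two files are identical
    if cmp -s toy-getfasta/redirected.fa toy-getfasta/with-fo.fa; then
        echo "Both approaches produced the same file"
    else
        echo "The two files differ"
    fi
else
    echo "bedtools not found; skipping the comparison"
fi
```

Which route to use is mostly a matter of taste. Redirecting with > keeps the command consistent with other Unix tools, while -fo keeps everything the program needs inside its own options.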
The use of a main program (bedtools) and a sub-program (getfasta) is also a norm among bioinformatics tools.
The downside of having multiple input files is that it makes piping with | difficult.
Try running the code block below to pipe output from grep to bedtools getfasta:
grep chr10 data2/macaque-svs-filtered.bed | bedtools getfasta -fi data2/rheMac8.fa -fo macaque-svs-filtered-chr10.fa
# grep: The Unix string search command
# chr10: The string to search for in the provided file
# | : The Unix pipe operator to pass output from one command as input to another command
# bedtools: A suite of programs to process bed files
# getfasta: The sub-program of bedtools to execute
# -fi: The genome fasta file as input
# -bed: The bed file as input (note that it is missing from the command above)
# -fo: The desired output fasta file
This doesn’t work because bedtools getfasta requires the -bed option to be specified. It doesn’t know that we’ve given it the bed formatted input through a pipe.
Luckily, many bioinformatics tools have a shortcut to help us pipe output to a specific input option.
- to pipe
For tools that require an input file to be specified with a command line option (like -bed above), we may still want to pipe the output from another command to it. We can do so with the - shortcut. Basically, when this is provided as an option in lieu of an actual path to a file, it tells the command to read the input for that option from the STDIN stream, which the pipe fills with the STDOUT of the previous command (what would otherwise be printed to the screen).
grep chr10 data2/macaque-svs-filtered.bed | bedtools getfasta -fi data2/rheMac8.fa -bed - -fo macaque-svs-filtered-chr10.fa
# grep: The Unix string search command
# chr10: The string to search for in the provided file
# | : The Unix pipe operator to pass output from one command as input to another command
# bedtools: A suite of programs to process bed files
# getfasta: The sub-program of bedtools to execute
# -fi: The genome fasta file as input
# - : Another way to pipe the output from the previous command to the input of the current command when an input option is required
# -bed: The bed file as input
# -fo: The desired output fasta file
head macaque-svs-filtered-chr10.fa
# Display the first few lines of the new file with head
## >chr10:52589-53460
## CACCCATCATGACAAGGCCAGGGTCACACACTATGGGATAGTCTAGGGGTCACCACGACTAGTTTGGGGTCAGACACCATGACCAGCCCAGGGTCACATACCACGGCCAGCCCAGGGTCACATACCACAGCCAGCCCAGGGTCACCCACCATGGCCAGCCCAGGGTCACCCACAAAGACCAGCCCAGGGTCACCCACCATGACCAGCCCTGGGTCACCCACCACGACCAGCCCTGGGTCACCCACCACAGCCATCCCAGGGTCACCCACCACGGTCACCCACCACGGCCAGCCCAGGGTCTCCCACCATGACCAGCCCAGGGTCATATACTATGGCCAGCCCAGGGTCACCCACCACGGCCAGCCCAGGGTCACATACCACAGCCAGCCCAGGCTCACCCACCACGGCCAGCCCAGGGTCACCCACCACGGCCAGCCCAGGGTCACATACCACAGCCAGCCCAGGGTCACCCACCACGGCCAGCCCAGGGTCATATACTATGGCCAGCCCAGGGTCACCCACCATGACCAGCCCAGGGTCATATACTATGGCCAGCCCAGGGTCACCCACCAGGGCCAGCCCAGGGTCACCCACCATGACCAGCCCAGGGTCATATACTATGGCCAGCCCAGGGTCACCCACCACGGCCAGCCCAGAGTCACATACCACGGCCTGCCCAGGGTCACCCACCACGGCCAGCCCAGGGTCACCCACCACGGCCAGCCCAGAGTCACATACCACGGCCAGCCCAGGGTCACCCACCAGGGCCAGCCCAGGGTCACTCACCACAGCCAGCCCTGGGTCACCCACCACAGCCAGCCCAGAGTCACATACCACGTCCAGCCCAGGGTCATCCACCATGAGCATCCCA
## >chr10:69728-70192
## GCTCAGAGGGAGACATGTGGGCACACCGTGTGTGCACAACCTCACAGTCAGAGGGGGAGACACGTGGGAACTTGTGCTCACAGCCCTCGGTCCCCCTTCGCTCAGCCTCACAGTCAGAGGGGGAGACACGTGGGAACTTGTGCTCACAACCCTCGGTCCCCCTTCGCTCAGCCTCACAGTCAGAGGGGAAGACACGTGGGAACTTGTGCTCACAGCCCTCGGTCCCCCTTCGCTCAGCCTCACAGTCAGAGGGGGAGACACGTGGGAACTTGTGCTTACAGCCCTCGGTCCCCCTTCGCTCAGCCTCACAGTCAGAGGGGGAGACACGTGGGAACTTGTGCTCACAGCCCTCGGCCCGCCTTCGCTCAGCCTCACAGTCAGAGGGGGAGACACGTGGGAACTTGTGCTCACAGCCCTCGGTCCCCCTTCGCTCAGCCTCACAGTCAGAGGGGGAGACACGTGGG
## >chr10:71484-71997
## GTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGAAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGG
## >chr10:131190-131574
## ggtggggacagggacaggtggggacaggggcaggtgggggcaggggcaggtggagcaggtgaggacagggagaggtggggcaggtggggacagggacaggtggggcaggttgaggacagggacaggtggggacaggggcaggtgggacaggttgaggacagggacaggtggggacacggacaggtggggacagggacaggtgggacaggttggggacagggacaggtgggggcaggggcaggtggagtaggtgaggacagggagaggtggggcaggtggggacagggacaggtggggacaggggcaggtggggcaggttgaggacagggacaggtgggacaggtcgggacagggacaggtggggacaggggcaggtgggaacag
## >chr10:155900-156342
## CAGTACCTCTACACACACACGAACACGCCTGGATTCTCCAGTACCTCTACACACACATGAACACGCCTGGATACTCCAGTGCCTCTATCCACACACGAACACGGCTGGATTCTCCAGTGCCTCTATCCACACACGGACACGCCTGGATTCTCCAGTGCCTCTACACACACACGGACACGCCTGGATTCTCCAGTACCTCTACACACACACGAACACGCCTGGATTCTCCAGTGCCTCTAGGCACACACGGACACGCCTGGATTCTCCAGTGCCTCTAGGCACACACGAACACGCTTGGATTCTCCAGTGCCTCTACACACACATGAACACGCCTGGATACTCCAGTGCCTCTAGGCACACACGAACACGCCTGGATTCTCCAGTGCCTCTAGGCACACACGAACACGCTTGGATTCTCCAGTACCTCTACACACACACGAAC
Did you spot the difference between this command and the one above it? Here, all we’ve added is -bed -, which tells getfasta that the input for the -bed option will come from the output of the previous grep command.
Note that not all command line tools accept this shortcut, but most of the ones we cover today do.
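The - shortcut is a general Unix convention rather than something special to bedtools, so it can be tried out with standard tools alone. The sketch below uses cat, which accepts - as a stand-in for one of its input files. It also shows /dev/stdin, a special path on Linux systems that refers to the same input stream and can often be passed to tools that do not understand -. The toy-dash directory, the file names, and the SV names are made up for illustration.

```bash
# Toy demonstration of "-" outside of bedtools, using only standard Unix tools.
# The directory, file names, and SV names here are made up for this sketch.
mkdir -p toy-dash

# A tiny tab-delimited bed file and a separate one-line header file
printf 'chrT\t2\t6\tsvA\nchrU\t1\t4\tsvB\n' > toy-dash/toy.bed
printf '#chrom\tstart\tend\tname\n' > toy-dash/header.txt

# cat joins the files it is given in order; here "-" stands in for the second
# file and is filled by whatever grep sends through the pipe
grep chrT toy-dash/toy.bed | cat toy-dash/header.txt - > toy-dash/with-dash.bed

# For tools that do not understand "-", the special path /dev/stdin can often
# be passed instead, since it refers to the same input stream
grep chrT toy-dash/toy.bed | cat toy-dash/header.txt /dev/stdin > toy-dash/with-dev-stdin.bed

# Show the result and confirm that both approaches gave identical files
cat toy-dash/with-dash.bed
cmp -s toy-dash/with-dash.bed toy-dash/with-dev-stdin.bed && echo "Both approaches produced the same file"
```

Whether a given tool supports - or works with /dev/stdin depends on how it reads its input, so it is worth checking the help menu before relying on either.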
Exercise: In the code block below, write a command that extracts the sequences of only the duplications in the bed file from the macaque genome. Output these sequences to a file called macaque-svs-filtered-dups.fa. BONUS: Figure out how to keep the SV name (4th column of bed file) as the header of the sequences in the output FASTA file (Hint: check the help menu of bedtools getfasta!).
## Use grep and bedtools to extract sequences of duplications only
# data2/macaque-svs-filtered.bed
# data2/rheMac8.fa
grep "<DUP>" data2/macaque-svs-filtered.bed | bedtools getfasta -fi data2/rheMac8.fa -bed - -name -fo macaque-svs-filtered-dups.fa
## Use grep and bedtools to extract sequences of duplications only
head macaque-svs-filtered-dups.fa
# View the first few lines of the file you created
## >chr1:89943:<DUP>:528:1907.19::chr1:89943-90471
## TGGGTTGATGGTTTCTGGAGTTCAGGGTTGATTGTTTCTGGAGTTCTGGGTTGATTGTTTCTGGAGTTCAGGGTTGATTGGTCTGGAGTTCTGGGTTGATTGTTTTCTGGAGTTGGGGGTCGATTGTTTCTGGAGTTCGGGGTTGATTGTTTCTGGAGTTCGGGGTTGATTGTTTCTGGAGTTCGGGGTTGATTATTTCTGGAGTTCAGGGTTGATTGGTCTGGAGTTCTGGGTTGATTGTTTTCTGGAGTTCAGGGTTGATTGTTCTGGAGTTCTGGGTTGATTGTTTTCTGGAGTTCTGGGTTGATTGTTTCTGGAGTTCGGGGTTGATTGTTTCTGGAGTTTGTGGTTGATTGTTTCTGGAGTTCTGGGTTGATTGTTTCTGGAGTTCAGGGTTGATTGTTTCTGGAGTTCTGGGTTGATGGTTTCTGGAGTTCTGGGTTGATTGTTTTCTGGAGTTCAGGGTTGATTGTTTCTGGAGTTCTGGGTTGATTGTTTCTGGAGTTGGGGGTCGATTGTTTCTGGAGT
## >chr1:218574:<DUP>:960:5699.01::chr1:218574-219534
## TCTGTCACGGAGGAGGCGGGTCTTTCTCTGTCATGGAGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTGCCACGGGGGAGGAGGATCTTTCTCTGCCAATGGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGCGGGTCTTTCTCTGCCGTGGGGAAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCATGGGGGAGGCGGGTCTTTCTCTGCCACGGGGGAGGCGGGTCTTTCTCTGCCGTGGGGGAGGCAGGTCTTTCTCTGTCGTGGGGGAGGCGGGTCTTTCTCTGTCGTGGGGAAGGCGGGTCTTTCTCTGTCGTGGGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGCGGGTCTTTCTCCGTCATGGGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCCATCATGGGGGAGGCGGGTCTTTCTCTGCCTTCAGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTGCCATGTGGGAGGCGGGTCTTTCTCTGCCTTCAGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTGCCATGGGGGAGGCGGGTCTTTCTCCCTCATGGGGAGGCGGGTCTTTCTCCCTCATGGGGGAGGCGGGTCTTTCTCTGCCGTGGGGGAGGCGGGTCTTTCTCTGCCGTGGGGAAGGCGGGTCTTTCTCTGTCATAGGGGAGGCGGGTCTTTCTCTGTCGTGGGGGAGGCGGGTCTTTCTCTGTCATGGGAGAGGCGGGTCTTCGTCTCTCATGGGGGAGGCGGGTCTTCCTCCCTCATGGGGGAGGCGGGTCTTTCTCCGTCATGGGGGAGGCGGGTCTTTCTCT
## >chr1:219608:<DUP>:470:2074.69::chr1:219608-220078
## CTCTGTCACGTGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGCGGGTCTTTCTCTGCCATGGGGGAGGCGGGTCTTTCTCTGTCACGAGGGAGGCGGGTCTTTCTCTGTCATGGAGGAGGTGGGTCTTTCTCTGTCACGGGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGCGGGTCTTTCTCTGCCTTGAGGGAGGCTGGTCTTTCTCTGCCTTGAGGGAGGCGGGTCTTTCTCTGCCTTGAGGGAGGCGGGTCTTTCTCTGCCTTGAGGGAGGCGGGTCTTTCTCTGCCTTGAGGGAGGCGGGTCTTTCTCTGTCATAGGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGTGGGCCTTTCTCATGGGGGAGGCGGGTCTTTCTCCGTCATGGGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTGCCTTGAGGGAGGCGGGTCTTTCTCTGT
## >chr1:519434:<DUP>:22148:1673.64::chr1:519434-541582
## TCCTGGGGTCAAAGGTGACAGAAGAGGCAGAGGCTGGAGCTTTCTGGAGAATTTACTGACCACAGCGTGGTGCACTTGACATCAGGCGCCCGCCATGGCCGGGCCTGGGTCTGAATGCTGCCCGGGACCAGCTGCCTGCGCTCCAGCAGCCCCTCCCTCCTGAAGGCCAGGTCCCCCGAGAAGAACGAGGCTGCAGAGTGATGTGGGGGCCAGCGGTGACTTCCTACCACACTGTTCTCAGGTGTAAGAGGGCTCGCTTCTGCCCAGGCATTGTCCGTGGAAGACACACAGCCGGCCACTGCAGCCTCAGTCCTGGGATGCCCTGGGGCTGGGTCACAGGGGGCCACGGGCCACGCTGGGAGGCCACAGTCCTGTCGTGCCATGCAGCTCCCTGTCCCCAGATGTCCGCTCAGGGATGCAGAGGGCAGAAACCACACTCGCTGCCTGAATTCTGGGAGCAGAGCCCGGTACCCACTGCCTGGCCGGGGCCTACCCTGGGACTCCAGCCCCTGTTCCCGCTGGCCCGGGCTTCCGGAGGCAACTGTGTCCCTATCCTGGCTCAAGGTCCAGGCTGCACCTGGAACCTGCACGGTCACTCCTCCAGGTCCTCAATGCTGGAGGACTCTCTCAGACAGGAAACCTTTGCGTTGGGCGCAGGGCGGGGTGCGGGGTGGTCACGGGGAATCGCAGGGCAAAACAGCACAGTGCAATCGCGCAGAGCCTGATATTGGCGGATGAAACATAAACTGCTTTCTGCACTTTGTGTCCTTAGGAAGGGTGTGGGGTGTTGGCGGAAGTAGGAAACAGAAGAGGAGCCTGGGCATGCAGCGGGTCTGTCAGAGAGCAGAGCCCTCGGAGCTGCAGTGCTTGGAGGGAGGCGGTTCACCTCTGCCCACTCTCTCCAtttctctctctctcattttccttttagagatggattcttgctctgagcctaggctggagtgcagtggtgtgattatagctcattgcagcctcgcccttccaggctcaagtgatcctcctgcctcagcctgtccagtagcCATACCCTACTAGGTCCTAGTTAGCCCCCAGAGGCGTGCACCACCACGCCCACTAATTGCAAAAATTTGTTggctgggcgcgatggctaacatctgtaatctttgggaggccaaggcgggcggatcacgaggtcaagagatggagaccatcctggctaacacggtgaaacccggtctctactaaaaatacaaaaaattagccgggtgtggtggcgggggcctgtagtcccagctactcaggaggctgaggcaggagaatggcgggaacccgggaagtggagcttgcagtgatctgagatcactccactgcactccagtctgggggacagagcgagactccgtctcaaaataaataaataaataaatatataataaataaataaaaataaaaataaaaCTAAGCCCTTCCTGATGGTCATTGGGGGGTTTGGGGGTTGGGGGGGGTGTCTGGCTATGGCTGGGGAACTCATTTGGTTTTCCTCCTCCTCCTCtttttattttttggtagagacggggtctcttgatttcccaggctgatctccaactcctgggctcaagcaatcctcctgcctcagcctcccaaagtgttgggattacaggcctgagacaccgtagctagccAGCtttctttttttttttgagacggagtcctgctgtcacccaggctggagtgcagtggcgagatctcagcggatcactgtgttatacgtaaattttcggtgtcgcaaaagaagtagcactcgaatgtacacttttctcagctaggaaatttacttctatagaaggggggtctcatagatggagcaatggtgagcatttggacaagggaggggaaggttcttattcctgacgcaggtagcgcctactgctgtgtggttcccttattggacagcgttagacctcacaatctaaatccgattggcCtttttttttttttgagatggagtcttgctg
tgtcgcccagactggagtacagtggtgcgatcttggctcactgcaagctctgcctcctgggttcatgccattcttctgccttagcctcctgagtagttgagactacaagtgtatgccatcatgtgcggctaatttttgtgtttttggtagaaagagatttcaccacgttggccaggatggtctcgatctcctgacctcgagatccacctgcctcggcctcccacagtgctgggattataggcatgagccactgcacctggccttaagtggttctttaaagtctgattcgttgtttctactttccctgatgagggtgggtgtcaaggagtgtggtattcttacataatgtctgatgtttggaatagcAttttttttttttttgaggcagagtctcactctgtcgcccatgctggagtgtagtggcaccatcttgtctcactgtaacctttgcctcccgggttcaaacgatcctcctgcctcagccttccacgtagctaggattacaggcgtccaccaccacggccggctagcttttatatttttagtagagacggggtttcaccatgttggccaggctgtacttgaacttctgacctcaatgatctgcccccctcagcctcccgaagtgctgggatacaggtgtgagccaccactccTCGCTCAAGTAATATGTTAAACTTATGCTTTCTTCTTTTCTTCTTTCttttttttttttttttttttggatggagtcttgttctgtctgcccaggcttgagggcatggcataactcggctcactgccctccgccgttccagtcatgcatatctgctgccttcagcctcctttagtacgggacacgaggccacctgccacccgtgcctggctatttttttatttttttttttttttttttttttttttttttttATCAGgacagagtctggctctgccgccaggctggagcttgcagtggcgtcagctcaacctgcaagctccgctccgcgggttcaacgccattctcatgcctcctcagcctccccgagtaattgggactacagcgcgcccgccaccgccccgctcagtttttgtattttttagcagagaggggttaccgtgtagccaggatgggtctcgattcctgacgcctcgtgatccgcccgtctcggctcccaagctgggattacaggcttgagccacgcgccccggcccggcatttttttcatttttagtaagaaacagggtttcaccgtgtttagccaggattggtgtcgatttcctgacccgtgatccgcccccctcggcctcccaaagtgctggattccaggcctgagcctgcaagccgggccTACTCTTTGGCTTTTAAAAGAATGGGCAACATTGCTTTTCTTTACTAACTTCTAATCTTTCCCTCTCTGACTCATCTCTCCTCCCACTTCTCTTGTTCTCCCTGTCAGTGTTCCTTTCCTAAGAGTTTTTCCCTGTCTATGATCTTTTTTTATAGGCTTTTTTCTAGTTTCTCTTTCTTTGTAATTGTGCGTTAATACTGGCCAATTGTTAGTGACAAATTCCTTGCCAAGAGATCCCTGACCCTAAACCAGCATATTCTGTCCATTCGTTTTAATCTGTACtttatttttcttgagatggagttccgctctgtcgccaggtgtggatggtgtagtggcacgttctcgctcactgtcaactcgccctccagggtcaacccgcaccatcctcgctgccttagcctccgagtacggggattgtacaagcgtccaccacccggcctggcgaggcgcttgatttttttatttcagtagagatgggggttttcatcgtgttagccagatggtcccccccatctcctggactcatgctccgcgcaccgccccttggcctccgcaagtgcgcgattaTGATCTCTCTCAAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNtctcctgcctcagccctctgagtagctgggattacaggcattttttgtatttttagtacagatggggtttcaccattttggtcaggctggtggaggactcctgacctcaaatgacctccccgcctggcctcccgaaatgctgggattacatgcgtgatccaccacgcccagccATACAGttcttatgttaagacaggctctctgtcgcccaggctggagtgcagtggcgcgatcacagctcactgtttgcctcgacctttcaagctcaagctgtcctcctgcctgagccgcccgcgtagccaggactgcaggggcacagtgccatgcccggctaatttttttttttgtgatggcgttttgctcttgttgcccaggctggagtgcggtggcgcaatcttggctcactgcaacctcctccccctgggttcaagcaattctcctgcctcagcctcccaagtagctgggattacagtcatgtaccaccacgcccggttcattttgtatttttttttagtagacaagggatttctccatgtcagtcaggctagtcctaaactcctgacctcaggtgacccgcccacctcagcctcccaaactgctgggattacaggcgtgagccactgtgcctggtcCTGGCTAATAttttttttttttttgagacggagcctcgctctgtcacccagactaaagtacagcggcgcaatctcagctcactgcaagctccgcctcccgggttcatggcattctcctgcctcagcctcccaagtagctgggactacaggctcctgtcacctcgcccggctaattttttgtatttttgtagagacggggtttcacagtgttagccaggatggcctcaatctcctgacctcgtgatccgcccacctcggcctcccaaagtgctgggattataggcgtgagccaccgcgcccagctgtttttttgtaatgttagtagacatggactttccccttgttacccaggctgggctcaaacttctgaggtataagagatgctcccgccttgaccttgtgaagttctgggattacagacgtgagcccccatgcccagtcAGGGGTttgtttgttttggtttttgtttttgtttttgagacagagtctcactctgtcgcccatgctggagtgcagccgtgcaattttggctcgctgcaacctctgcctcccgggttgaagtgattctcctgcttcagcctcccacgtagctgagaccacaggtgtgccaccgcgcctggctgatttttgtatttttagtggagacggggtctcaccatattggccaggatggcctcaaactccctacctcaggtgatctgcccgcctcggcctcccaaaatactacgttacatgcatgagccaccgtccctggcTGTGGTCAGGCTTTTGAGTTTAGATCCATGAAAGTGTGGCCGCGTCCCTGCTCCCTGCAGGAGGGAGGCCTGTGGGACCTTCTGCTGTGGCTGTTTACAAGGCTTTGCTCCTGGTGCCTAAGGCTGGAACCTTCTCTCTGCAGGAGGAGATGAGCAATTACTACCTCAGAGTCACCCAGAACGCCTTCCTAAACCACACGAGGCAACGCAGCAACAAGTGAGGGAGCCCCTCGGGTCCTGGGCCCCCGGGTAGGGCTGTGCAGCCGTCGCCCTTGGTTCCCACAGAGGGACCTCAGAGGCCCTGGATCACAGTGCTGGGCAGCACCCGTGGCCTCAACGTGTCCACCTCGGATGTCCCCTAGGAATGTCCCAGCTCGGGACAGCATGGGGCGTCACTGAGGAACATGCGGGGGCCTCCTGGGCAGAGCCGGGGTCAGTCCCGTCCTCACGGCCCTGTGCGATGCCGCCCCAGCTTGCACGTCCCTCTGCCCCTGGGTTTCCGCGGTCCTGTGCCAGCAAGGGAGGCGGTCTGATTGTCTGAGGCTCTGCTGGGGCCTCCATTGCAGGCTGTGGGTGCCCTGGGGTGGGAGATGGAGACACT
TTTGCTCCCACGGGAAGCTGGGCACGAGCAGGTCCTGTGTGTTTGGGCGGAGCCTGGGGCCTTGGCCCCCCCGCCCAGATGCTGGACAGGGTTGCTCCCTCCAGGCCTGGGGCCCTCCTCACATTGCGCGTCCTCCGTGAGCTGCTACCCAGAGGTCCCCAGTAGGTGGATAGCCCCATGGCCAGGCTCCCTAGCCCCTTTCAAATCCCCTTATTTTGAGTTTTCTTGGTCTCCTGGGCCCCTCCAGCCCCAGTCACGTGTCACACGGAGAATCAAGTCCTGCCGGTCGGCCGTGGCCGAGTCTTCAGGCGTGTTGGGCTCGCTGGCTCAGCTGCTGCCGGTAGACGCTCCCTGGAGCCCTGGCTCAGGTCCTTCCCAGAGAGGCAGGGCTGGGGCCCTGGTGAGCCTCCGCTGCACCCGGGCCCCCAAGGTCCTGCTCCTGGCTCGTGTGGCCACTCTTGGCATGGACTCTGGGTCCCGCATCCCTGCTCCCAGCACAGCAGGGCTCAGGCAGCAGGAGGAGTGGTGGTCCCGACGCTGCCTATCACGCTGGGTGAGGGTCAGCGGGGAAGCGCCACACGGGATGAGAACAGAGGCCCAGGTAGCCGGGCGGGGGGACAGCTGGGCGTGGTGGGGCCGGCGGTGACCAGGGGGACAGCTGGGCGTGGTGGGGCCGGCGGTGACCAAGGCTGTGCCACGTCCTCCCGATGTTTCCTGTGCTCACAAGCTGCCGCTTTAGATTCTCCGGGAAAGTCCCCCTGAAGGGACTAAGGAGCCCGCGTTCCCCTCGGGACAGCTTGGCCGGCAGCCCCAGCATTTCCTTCCCCATCCCTGCTCCGCAGATTCATGCTGGTCCTGGCCAGCCGCGACCCCAAGCAGTTACACCAGGACATCCACGACCGCATCGACGTGATGTTTTACTTCGACCCGCCCGGGCCAGAGGAGCGGGAGCGCCTGCTGAGAATGTATCTTGACAAGTATGTTCTTATGCCGGCAACAGAAGGAAAGCAGTAAGTGTCTCCCCTCACCCACCCCTGTCCAGGGACCCTCGCTCTGGGCCCACCCCCGGCCCTGCTCTCCGGACGCACACAGCAGGCCCAGTCTCCGGGGTGGCACCGCCTCCCTGCTTTGCGGTTTCGCACAGGAGCCCTGTGGGCCCCAAGGGTCCCAGAGGCTGCACCCAGGGATGTGCCACCACCCTTTCCTCATCCCCACCTGAGAACAGCCTGGTGGTGTCTCCTCGGGTTTGGGGGGCAGAGCCCACCATCACTTACAAACCTTCAACtttttgtttttgagacaaagtcttgctctgtgccccaggctggagtgcagtggcacgatctcagctgactgcaacctccgcctcctgggttcacgcgattctcctgcctcagcctcctgagtagctgcgattataggtgcctgccaccacgccccactgcttttcgcctttttgtagagatgcagtttcaccatgttggccagggtggtctcgaaaccctgacctcgggtgatctgcccgccttggcctcctacagtgcagggattacagatgccagccactgtgcccgaccACCCTCAGGCCCTGGCAGTGCAGGGAGGTGACGTGGAGTGTTGCTCTGAGACCCCCATGTTGGGATTTGAGGGAGACGCTCCTCATGAGAGCCCCGTGTTGGGACTGGAGAGGATCCTCACGGTCCCCTGCTGGCTGCTGGCCTGGCCTTCCTCCAGCTGCCACGCCTGGCCCTGGAGCCTCATGGTGTGGGGCGCGGCTCCGGCTGCACTTGTGCCCTGAGGCCTTCAGGCTCCCTGTCGCTGGCGGTGGGTGCAGCAGGCACGGCGGGCAGAGCCCTCCAGGTGATGACAGGCCCTGGGGCTGCACGCCGGCTGCCTCAGGAACACTCCAGATGAGCAGTGGCTGCTCCACCTCTTGGCGTCCCCAGGTCCCAGGTTTCTGAGTCCTTCTGTCCACCTGACCTAAATTCCTGCTCTCTCCA
GTGACAGCAAAAGCCGCTCTGTTCCAGAGAGAGCCTGGTTCCCCCTGCCAACCGCTCCGTGGCTGCCTGCTTCATGCTAGCCCAGCTGTCCCGGCCTCAGTTTCCCTTTGGCCCTCCCCTGCCCTGGGCTCTCCCACTCCCACGGCTGCTCATAGACCTGGCACAGTGACTTGGCTTCTATGACCTCCAGGGAGATGCTTTTGCTGGAATTCAGGGCTCTGCCACTGCCACTGTAACGGCCATGAGCCCTGTGGGTGCTGAGTGGGCAGGTGAGGGCAGGGCTGGTGTGAAGAGGGGGTGCGGCCATCTCCAGGCCCCACAGCAGCCACCACCTCCCTGCTCAGCCCAGACCTGGTTTGCATCAGGGAGAGGGCGGAGTTTGGCTGTCACAGGAAGAGTCCCTCCCAAGGGGGCATCTGGCATGGGTGCCCGCCTGGCTGCCTGTCTTCCAGCCCCCACCTCGTGGTGTGGGAGCCGCTGCCTTGGCCGGCCCACTTGGGAACTCCTTCCCCAGGCGCCTGAAGCTGGCCCAGTTTGACTATGGGAGGAAGTGCGAGGAGATCGCTGAGCTGACGAACGGCATGTCGGCCCGGGAGATCGCACAGCTGGCTCAGTCCTGGCAGGTGAGTGGGGCTCGGGCGCACCCACCCAGACAGGAGCCCAACTCCTGTGGAGACGCCGGGTTGCGCCTGTCCCAGCACCAGTGTCACACCGCAGCTTCTGTTGAGGGGTTTTCAGTGCACAGACGTGACACGGGGCACTCGCCCCAGTCGGCCACTCCACACACTGGCGCGCCCCTGCTCCTGCCCTGGGAAGTGTGGGGCATGTCCGTGGCTGACGGTCATAGGTCAGGAAGCCCGTCCGGCATCCTAGTATCCGGGCTCTGCCAGGTGGGGCGGGAGGCTTTCGATGCTCACCTTGGCAGACGGGCACCCCCTGGTGTGAATGGTCATCGGGACAGGCCCCGCCTGAGTTTGGTGGTGGGGCTGGAGGGATGTTGTGTTTCCCGGACCACGTCCGTTGGCTTGATCCTGCTTGACGGGCTCAGACACAGGGGCAGGAGTGACCTCTGATTGTCCCACAGCCGGCTGCTCCTTGGAGGACCCCCTCCTGCAGCTCCGTGGCTGCTGCAGGGACGGGGAGCCGGGACTCAGAGCAGTGTGGGCGTGGCCATCCAGAAAGCTTTGGTCTTTGGGGGTTGCTGGAAAAGCATAACCAGGTCTGTAGAAGGCACCAAGGCCATGCACAGGCATTGCTGCCTCTGGGGTCTGCAGAGTCTGTGACAACCTGGTCACTCAACCTAGCAGCGCTTTCGCGTGTGACAGGTTCATGAAGTAGCCAGTTACCTTGATTTGAACGTTGGAGCTGGGGACTATATGGGCTGTATTAGTCAGTTATGCCGCTGTGACAAAGAGCCTCAGATCTCAAACCCCATCCTTGTGGGTCAGCTGAGGTCTGTGTTCCAGGCCGTCTCCACTTGAGACCAGGTCTGTTTCCACAACTAAGCAAACAGAgaccgggccatggtgttgggctacatttgttcccagcatttgggaggtcgaagtcagcccagattatttgaaggcaggagtcaggaccagccttggggggggggggggggggggggggggggggaaagcaaggggagactccatctacaaaaaataaaaaaattagccggaccctaatgtggcacgcctgtaatgcagctcctgggagcctgaggtgggatgatcactgagtcccaggtaggccagaaatacagtgagcctgtggattgtgccactgcactccagcccgggttacagagcgagaccctggtctttaaaaataagaataaTTTGAgccgggcatggtggctcacgcctgtaatcccagcacgctgggaggccaaggggagaggatcacttgaggccaggagttcgagaccagcctggccaacatgtcgagccccacctctactaaaaatacaagaattggccgggcgcagtggtggt
gcatgcctgtattctcagctactcaggaggctgaggcaggagaatcgcttgaacccgggaggtggaggttgcagtgagctgagatggtgccattgaattccagcctggactattcaggatcctttgagattccataagaattttaggagtggttttcctatttttgtaaaacataatttgggttttcacagggaccgcgtttagtctctatgtcgctttgatgtctctcagcaatattCTGTGGttttctcttgttttcgagacggagtctcgctctgctgcccaggctggagtgcagtgttgtgatctcagctcactgcaacgttcccctcccgggttaaagtgattctcctgactcagcctcctgaggagctggaattccaggcaggcgccaccatgcccggctaatttttgtactaagagacggggttttgccatgttggccaggctggtctcgaacctctgacctcaggcaatccacccacctcagcctcctaaagtgctgagattaaaggcacgtgccaccacgcccggctaatttttgtatttttagtagagacgatgattcaccatgccggcgaggttggtcttgaactcctgacatgaggtaatccatctgcctctgcctcccaaagggctgggattcagacatgggccactgcgcccagccagttttcactgtacaagtctttcaccctcttggttaagtgaatttccaagcattttattcttgccgctgctgttgtaaatggaaacggtttcataattccccattcacattattcactgttgggatggagaactgcagctttctttgctgttgattttgtatcctgtaagtttgctgatgtcacggcattttttcttccaatatggattctaggattttctacatataagattatgtcatctgagaacaggtgatttttacctttcccttttcagtttggatgacttttctttttcttgtctaattgcactgtccagagcttccagtggtgtgtggaatagaagcggtaaagcattcttgcctggttccttacctcagaggaaaagctttgtttttcaccactgagtatgtcacctatgggcttgtgatgtgtggccttcattgtgtttagggtgtatccttcaattcttggtttggtgagtgtttttatcataaaagtgtgaggcgggtggatcacctgaggtcggcagttcgaggccagcctgaccaacgtgaagaaaccccatctctcctacaaatacaaacttagttgggcatggtggtgcatgcccgtaatcccagctactcgggaagctgagacaggagaatcgcatgaaggcggcaggcagaggttccagtgagccgagatcgcgccatttgcactccagcctgggcaagaagagcaaaattgtctccaaaaaaaaaaaaGTggccaggcacggtgactcacgcctgtaatcccagcactttgggaggccaaggtgggtggatcacgaggtcaggagatcgataccatcctggctaacacagtgaaaccctgtttctactataaatataaaacatcagctgggcatggtggcaggtgcctgtagtcccagctacctgggaggctggggcaaaagaatggcgtgaacccaggaagcggagcatgcagtgagctgagatgcctgggctacagagtgaggccccaactcaaaaaaaaaaaaaggtgttgtatttggtcgaatactttttctgcaacacttgagacagtcgtgtggtttccttcctccaccctgctaatatcgattgatttttgtatgttgaacatttcatatgcggaacattgattttcatatgttgaactatcgttgcattccaggaataaatcctgcttggtcggctgggcgcggtggctcaagcctgtaatcccagcactttgggaggccgagatgggcggatcacaaggtcaggagatcgagaccatcctgtctaacctggtgaaaccccgtctctactaaaaaatacaaaa
aactagccgggcgaggtggcgggcgcctgtagtcccagctactcaggaggctgaggcaggagaatggcgtgaacccgaaaggcggagcttgcagtgagctgagatgcggccactgcactccagcctgggtgacagagcgagactccgtctcaaaaaaaaaaaaaaaaaaaaaaaaaatcctgcttggtcagggtatagagtccttttagtgtgctgctgaattcactctgctggcattttgttgaggactttcccagtgatgctcatcagggatattggcctgtcatttttcttgtggtgtctttgtctgggtttgatatcagggtaatgctggcctcctaggatgagtgaggaaatgttcttcaatttgtccaagagtttgaggtgtgctgctgattcttcttaatgttttgtgaattgacacgtgaagacatcaggtccaggtcttgtgtttCaacttttacagcttgaagactttaggttcccagaaaaattgcaaaggtagcacagagagctcccgGGCCCGGGGCCTTGCCACGTAGTGAACGTCATGTGTCACTGTTGGCCCCACCTGGGACTGGGTCTTGCCCAGAATCCCACCCAGGAGGCCACGTGACATTTAGCTGTCACTTCTGGTGGGCTCTGCCAGGTCCCGTGCTTCCTGGTGGGGTGGCCCCATGAGCATCTGCTCATCCCCTTTCCTCCACTGGGCCCTGGGTGAGGTGCAGCCACTCGGGTGCACCCTGAGGGTTCCTGCACCTGTTTGAACTCTCTTGGGTCGGCTCAAGACCAAAAATGATGCTGAGCAGTCCTGGGCCTCTGATGCATAGTGGTGGTCCGGTTCCGGTCAGCGTCTCCTGCACTCCTGGGCCCCTGAGCCACAGTGGCGGTCCAGCTCCAGTCAGTGTCTCCCCACACAGTGGCTCTTGGCGAGGGGTGGGCGCTGTCAGTGGGGACGGGCACCACGTGGTCATCCCCATGGCAGGTCCCATCGTGGCAGCCGTGTTGTGGGAGGATGGTGCGCTGCTGCCCCTTTACCCTGTGAGATGAATCCTGCCTCTGGGAGGCACAGCCGGGATGGGGTGAGGGACCCCCTCAGCTGTCCGGGAAGCGTCCCCTGCCCTGTGCTTCCTCCAGGCGTCCTGGTGCACTCCCAAGCACGGTGCCCAGTGGGGGTGCCCAAACCTTCACCCTGACCCATGGGTGACTTCCCTTGGGGACTCCACGCCTTTCACTGGGACTGGGATGGAGAGCGACCTGTCCATGGCAGAAGGGCTGCACCTGAGGTGCTTGAAGCAACACCAAGGGCCACAGTCCCAGCAGCTCCAGCCTCCGCATGCTGGATGCCAAGTCCTGTGCCCAGGACAGGGAGGTGGAGGCACGGGTGATCTTGATGCTAGCACCTATGTGCCCCGAGGTTGGGCAGTGGCTGCCTCTGCTGTGGAGGCCTATGAAGGTGAGGGTCTGAGGATCTGTAGTGCACTGTGACCCGGGGGCACTGCCTGGCCACGGCTGAGACACGCAGAGGGTCTGCAATTCCCTCCTGCCTCTTGGGAGCTGCCCTGGGTCTGCAGTCAGTGGGGCTCGTCCTCGGGCTTTCCGTTATTAGAAAGTCACTGAGAAACTGCAGTGCTGAGGACGCAGGCAGGGCTGTGGCACTGCAGGGGCCGCTCCCGGTGTCCACACGCATGCTGGGCTCTGCCGAGGTGCCGGAAGCCTGTGTTTCACCCTGAGGCCGTCCTGGTGCCCCGGGTTTGGACCCTCCCCACCTCGGGGTCCTGGAGTGCGTTACGGGTGGGGGGTTCCCATGGTGGCCTCCCTCAGCTCCCTCTCTCCTCACTAGGACACGGCGTATGCCTCCGAGGATGGGGTCCTCACCGAGGCCATGTTGGATGCCCATGTTGAAGACTTTGTCGAGCAGCACCAGAAGAAAATGCGCTGGCTGAAGAGGGAGGGCCTGTCCTCATGGACCAGCACCCCTTAACCTGAGTCCGCGGTGA
GACCACACGTCACGGAGCCTGGCTGCGGACCCCTCCCACCCCTGCTTTTCCGGTCCCTGCACGTTTAGGAAATGCTTCCCCTAATAAACTCCCACAGGTGCCACAGCGCTGTGTCTATTGGCTGATGTGGTGCGGGGTTTGGGGTCCCCTAGTGTCCTTCTGGGGTCAAAGGTGATAGAAAAGACAGGCTGGAGCTTTCTGGAGAATTTAGGCACAGAAGGGTGGGCTTCACATGAGGTGCCTGCCACAGCGGGGTTGGCTGCCTGAATGCCACCCGGGACCGGCTGCTCGCGCTCCATCCTGCAGCTGTGGAGACGGGGGTGCCCCTTTGCCTCTCTCCACGAAGTGCAGGGCAAACAAGACACAGCGGTTTCAAACAGGCGATGGCCCGGACTGCGTGCCTCGCCGCCCCTGCGCCTTCCCCTGCCCCTGCTTTCCAGCTAGTCCCTGAAAACCTTGATGGggccgggcgcggtggcccatgatggattctcagcactttgtgaggccaaggcgggtggatcacctgaggttaagtgttccagcccagcctggccaacatggtgaaaccccatctctcctaaaaaaaaaaaagaaaagaaaaagaaaaattagccgagcgtcgtggcaggtgtctgaaatctcaggcactcaggaggctgaggcaggagaatcacttgaccccgggaagtggaggttgcagtaagctgagaccatgccattgcagtgcagcctggacaacaagagtcaaactctctcaaaaaaaaaaaaaGgccaggtcaggtggcatgtgcctgtggtcccagcttggtcccagattcttggtttggaggctgaggtaggaggatcacttgagcatgggaggatgaggttgcagtgagccaagatcgcttcagacactccagcctgggtgacagagtgagaccctgtctctaaataatcaaaaCCTTGATTACAGCCATGGGGTGGGGGTTGGGGGGCGTCTGGCTCGGCAGGGAACTATTGGGTTTTTCTGCTCTCtaatttttgtagagacagggtttctctttgttgcccaggctggtctccaactcctgggtcaagcgtcgatcttctgcctcggcctcccaagtggtgaggttacaggcgtgccaccgcacctgaccTGttttctttttttttttttttttttttttgagacggagtcagctctgtcacccagggctggagtgcagtgggcggtctcagctcactgcaagctccgcctcccgggttcacggccattctcctgcctcagcctcccgagtagctgggactacaggtgcgtgccacaacgcccggctaagtttttgtatttttagtagagacagggtttcactgtgttagccagggtggtctcaatctcctgaccttgggatccgcccgtctcggcctcccaaagtgctgggattacaggcttgagccaccgcccccggccCCttttttttttttttttttggcaagggagtcttgctcgcccagggtggagtgcagtgttgcaatctgggctcactgcaacctccacgtccagggtgtcaggcctctgagcccacgctaagccatcatatccccagtgacctgcatgtgtacatctgatggcctgaagcccctgaagatccgcagaagtgaaaacagtcttaactgatgacattccagccttgtgatttgttcctgccccaccctacctgatcaatgtactttgtaatgtcccccacccttaagaaggttctttgtaattctccccaccctggagaatgtactttgtgagatccacccccagcccccaaaatattgctcctaactccactgcctatcccaaaacctctcagaactaacggtaatcccagcaccctttgctgactctttttggactcagctggcctgcacccgggtgaagtaaacagccttgtggttcacacaaaacctgtttcgtggtgtcttcacacggacacgcgtgacacagggttcgaggaaatttcatg
cctgaacctccggagtagctgggattacaggcgaacggcaccatgcccaggttaatttttgtattttcggcagagacagaggcccaggtagccgggctggGGGACAGCTGGGTGTGGTGGGGCCGGCGGTGACCAGGGCTGTGCCGCGTCCTCCCGGTGTTTTCTGTGCCCACCAGCTGCCGCTTTAGATTCTCCGGGATAGTCTCCCTGAGGGGGCTGAGGAGCCTGTGTTCCCCTCGGGGCAGCTTGGCCGGCAGCCCCAACATTTCCTTCCTCATCCCTCCTCCGCAGATTCATGCTGGTCCTGGCCAGCCGCCACCCCGAGCAGTTGGACTGGGGCATCCATGACTGCATCGATGTGACGGTCCACTGCGACCTGCCACGGCAGGAGGGGCGGCAGCGCCAGGTGAGAATGTATTTTGACAAGTATGTTCTTAAGCCGGCCACAGAAGGGAAACAGTAAGTGTCCCGCCTCACCCGCCCCTGTCCAGGGACCCTCGCTCAGGGCCCACCCCGCCCCTGCTCTCCAGACGCACCCAGCAGGCCCAGTCTCCAGGGTGGGCACCACCTCCGTGCCCTGAGGTTTTGTGCGGGAGCCCTGTGGGCCCCGAGGGTCCCAGAGGCCGCATCCAGGAGGTCACGCCCCCTTTTCCTCATCCCCATCTGAGAACAGCCTGGTGGCGTCTCCTCAGGTTTGGGGGCAAAGTCCACCATCACTTAGAAACTTTCAGCAttccttttttttttttttcttaagacggactcttgctctgtcatccaggctggagtgcagtagcttgacctcggctcactgcaagctctgtctcccaggttcacgccgttctcctgcctcagcctcccaagtagctgggacaacaggcacccgacaccacgcccggctaatttttttgtgtttttttagtagagatgggtttgaccgtattagccaggatggtctcgatctcctgacctcgtgatccacctgcctcggcctcccaaagtggtgggattacaggtgtgagccaccgcatctgacctttttttgaggaagtctcactcttgtccccctggctggagtgcagtgccgggatctcagttcactgcaacctgtgcctcagcctcctgagtagttgggattataggtgcccgccaccgcgcctggctggtttttgtgtttttgtagagatggaatctaactccgtctcccaggctggagtacagtggtgtgatctcagcttactgcaacctccaccctccgggttcaaaccatcctcttgcctgagcctcctgaacagctgcgattacaggcgcccagcacaatgctcgcctcatttttttgtctttttagtagaaacagcttttcaccaaattgaccagactggtcttggacttctgatctcaagtgattcaccctcctcggcctccaaagtgcagggattgcagatgtgagccaccggacccggcctcttttatgttcctcttcagtaCTCAGAGGGCTGTGAGGAAATCCGGTGCCCGGCCACCCCCAGGCCCTGGCAGTGAGGGGAGGTGATGTGGAGTGTTACTCTGAGATTCCCATGTTTGGATTCGAGGGAGACGCTCATCATGAGACCCCTCCGTGTCGGGATTAGAGGGAGAGGCTCCTCATGGTCCCCTGCTGGCTGCTGGCCTGGCCTTCCTCCAGCTGCCACGCCCGGCCCTGGAGCCTCCTGGTGTGGGGCGCGGATCCGGCTGCACTTGTGCCTTGAGGCTCTCAGGCTCCCTGTCGCTGGCGGTGGGTGCAGCAGGCACGGCGGGCAGAGCCCTCCAGGTGATGAGAGCCCCCAGGAACACTCCAGATGAGCAGAGGCTGTTCCACCTCTTGGCGTCCCCAGGTCCCCGGTCTGAGTCCTTCTGTGCACCTGACCTAAATTCCTGCTGTCTCCTGTGACAACAAAAGCCACTCTGTTCCAGAGAGAGCCTGGTTCTCCCGTTGACCCCTCCGCTGCCGCCTGCTCCAT
GCTAGCCCAGCCGTCCAGGCCTCAGTTTCCCTTTGGCTCTCCCCTGCCCCGGTtcccagctgcttgggaggctgaggtaggaggatcatttgagtccaggagcttgaggttgcactgagctgtgactgtgccactgtactccagccttggcaacagagtgagacactgtcttaaaaaagaagaaTTTGggccagatgctgtgtttcatgcctgttcccagcatgctgggaggctgaggagagaagatcactcgaggccaggggttccagaccagcctgccaacatgttgaaccccgcctctacgaaaaatacaaaaattagccgggcgtggtgggtgggtgggtgccagtaatcccagctactcaggaggctgaggcagcaaaatctcttgaacctgggaggtggagattgtggtgagctgagatagtgccgctgtacttcaacctgagcaacagagtgagactccttatcaaaataaagaaaTCAATCAATCAATAAAAATAATCACAATAATTTGggctgggcgtggtggctcactcctgtaatcccagcactttgggaggcgtggatcggttgagttcgaggcaagcctggccaatgtggcgaaaccccatctccactacaaatacaaaaattagccaggtgtggtgacaggcacctgtaatcccagctgctcgggaggctgagacaggagaatctctggaacctaggaggcggaggttgcagtgagccaagatcacgtcagtgcgctccagcctgggtgacagagactgtctcaaaaaagaataataataaTTTgactgggtgtggcggctcactcttgtcatcccacactttgggaggccgaggcaggaggattgcttcagctcaggatttcgagactggcctggacaactggcctggacaacatggtgaaactccatctctacaaaaaatacaaaaattagccaggcatggtatcatgtgcctgtgatctcagctactcaggaagcagagatgggagcattgctggagcctgggagttggaggctgcaatgaaccatgttcgtgccactgcactccagtgtgggtgacagagtgagaccctgtctccaaaaggcatggtggctcacgcctgtaatccctgcactttgggaggccaagctgggtggatcacctgaggtcaagagttggagaccagcctggctaacgtggtgaaaccccatctctaggaaaaatagaaaaaATTggccaggtgcagtggctcacacctgtaatcccggcactttgggaggccgaggcgggcgaatgacctgagatcaggaattccagaccaaccacaccaatatggagaatccccgtctctactcaaaatacaaaatcagccgggcatggtagcaatcccagttactcaggaggccgaggcaggagaatcactggaggtgagccgagaccacgccattgcactgaagcctgagcaacgagagggaaactgtctcaaaaaataaTGCTAATAACAAGGGGGAGAGAACAGGAGTGTGGTCAGCAGCTGGGCCTGCCATAACCCCTGGGTCGTGTGTCCCCACAGCTCTGAAGGCTAGAGGCCCGAGGTCAGGGTGCCAGCTCGGTCCCCCCCGTGGAGTGTTCTCTGTTAGCTTCTCACATGGCAGGGAGAGTGACTGAGCTCTCGCTCTGGTGTCCCTTACGAGGACGTTCATCCCCCACTGCTCAGAGCGGCGGTGAGCCACCACGCCCAGCGCCAACTTTGTCCTTCAAGAGTTGTTTTTTTGTgccgggctcagtggctcatgcctggaatcccagcactttgaaatgccaaggtgggtggagcacctgaggtcaggagtttgactccagcctggtctaaatggtgaaaacctgcctctactaaacataaaaaaatcagctgggcatgttggtgtgtgcctgtaatcccagccactcgggaggctgaggcaggagaatcacttgaacccaagaggtggaggttgcagtgaact
gagatcatgtcactgcactgcagcctggatgacaagagtgagactcccttgcaagaaaaaacaaaaattaaaaaagaaGTTGTTGTcttttttttttttttttcccttggacaattcaagatgcctagagattccatatcaattttagtaatgcttcttctatattttaaaaagtaatttgggtttttacagggattgcattcagtctctgtattgccttAATGACTCTTAGCAATGttgttttttttatttattattttttttctagagatggagtctcactctgtcagccaggctggagtttagttgttggccaggatgggcccaatctaatgacgtcaggtgatccgcctgcccctggctcccaaattgctgggattcagacgtgggccaccatgcccagccagtttacattgtacatttctttcaccttcttggttcagtgaagctccaagtattttattctttcggatgctcttgtaaatggaaatggtttcgtcattccccgttcagattatacacttactatgaagaactgcagctttctttgctgttgattttgtatcctgtaactttgctgatgtcgtggggttgttttttccaatatggattctagattttcCTTTTCTTTTTCTtttttttgtttttttgttttttttttttttgatatggggtctccctctgtggcccaagctggagtggaatgcagcggcacgatcttgaatctgcgagctcctctgcccgggtccacgccattctcctgcctcagcctcctgagtagctgagactacaggtgcctgccatcacggccggctaattttgtgtattttttgtgcagatgaggtttcaccgtgttagccaggatggtctcgatctcctaactttgtgatcggcccgcctcggcctcccaatgctgAATGCTGTTGGGACTGGGTCTTGCCCCAGAATCCCACCCAGGAGGCCACCTGACGTTTAGCTGTGACTTCTGGTGGGCTCTGCCAGGTCCCATGCTTCCTGGTGGGGTGGCCCCGTGAACGTCTTCTCAGGCCCTTTCCTCCATTGGGCCCTGGGTGAGGTGCAGCCACTCGGGGGCACCCTGAGGGTTCCTGCACCTGTTTGAAGTCTCTTCGGTCGGCTTGAGACCAAAAATGATGTTTAGCAGCCCTGGCCCCCTGACGCACAGTGGCGGTCCTTCTCCGGTCAGTGTCCCCTGCACCCTTGGGCTCCTGACGCACAGTGGCGGTCCAGCTCCAGTCAGTGTCTCCCCACACAGTGGCTCTTGGCGAGGTGTGGGCGCTGCCAGAGGGGACGGGCACCACGTGGTCATCCCCATGGCAGGTCTGGTCGTGGCGGCCGTGTTGTGGGAGGATGGTGTGCTGCTGCCTCTGCACCCTGTGAGATGAATCCTGCCTCTGGGAGGCACAGCTGGGATGGGGTGAGGGACCCCCTCAGCTGTCCGGGAAGCGTCCCCTACCCTGTGCTTCCTCCAGGCGTCCTGGTGCACTCCCGAGCTCGGTGCCCTGTGGGCGTCCCCATGCCCAGACCCTGACCCACAGGTGCCTCCCCTTGGGGTCTCCACGCCTTTCCCTGGCCCTGGGATGCAGAGTGACCTGTCCATGGTAGAAGGGCTGGACCTGAGGTGCCTGAGACAGCACCAAGGGCACTGGTCCCAGCAGCTCCAGCCTCTGTGTGCTGGATGCCACACAGACACAAGACTCTTGGGAGACGCATTTTCCATCTGGCTCAGAGGGGGAGGGGGAGGCTTTGCAACCCAGCCCCTGCCCAGGCCCCTGGGAGGGTGGGTGCCTGCTGAGCCCCCGGGGCAGCAGGAGCGGGGCAGGCGGGGTCTTTGTTCTCACTCCCACAGCAGAGGCAGATGTGGGGGCGCCTGCTGGGGCCAGACCAAGGTGGGGTGGCCTGGAGACTGCTTCCAACCGTGGCCGGGAAGCAGGGAACCTGCCCGGCGTGTCTGAGGCCACACTCTCAGCTGGCCGGTCCAA
GCCTGCGGCTGGAGCTGGTGTCTGTTTAGCTAATAAAGTCCCACAGTTGCCTCACTGCCGTGTCTATTTGCTGATGCTGCGCGGGGTTTCAGGGGCCGCCTAGCCTCCTCCTGGGGTCAAAGGTGACAGAAGAGGCAGAGGCGGGAGCTTT
## >chr1:519473:<DUP>:22560:2560.16::chr1:519473-542033
## CTTTCTGGAGAATTTACTGACCACAGCGTGGTGCACTTGACATCAGGCGCCCGCCATGGCCGGGCCTGGGTCTGAATGCTGCCCGGGACCAGCTGCCTGCGCTCCAGCAGCCCCTCCCTCCTGAAGGCCAGGTCCCCCGAGAAGAACGAGGCTGCAGAGTGATGTGGGGGCCAGCGGTGACTTCCTACCACACTGTTCTCAGGTGTAAGAGGGCTCGCTTCTGCCCAGGCATTGTCCGTGGAAGACACACAGCCGGCCACTGCAGCCTCAGTCCTGGGATGCCCTGGGGCTGGGTCACAGGGGGCCACGGGCCACGCTGGGAGGCCACAGTCCTGTCGTGCCATGCAGCTCCCTGTCCCCAGATGTCCGCTCAGGGATGCAGAGGGCAGAAACCACACTCGCTGCCTGAATTCTGGGAGCAGAGCCCGGTACCCACTGCCTGGCCGGGGCCTACCCTGGGACTCCAGCCCCTGTTCCCGCTGGCCCGGGCTTCCGGAGGCAACTGTGTCCCTATCCTGGCTCAAGGTCCAGGCTGCACCTGGAACCTGCACGGTCACTCCTCCAGGTCCTCAATGCTGGAGGACTCTCTCAGACAGGAAACCTTTGCGTTGGGCGCAGGGCGGGGTGCGGGGTGGTCACGGGGAATCGCAGGGCAAAACAGCACAGTGCAATCGCGCAGAGCCTGATATTGGCGGATGAAACATAAACTGCTTTCTGCACTTTGTGTCCTTAGGAAGGGTGTGGGGTGTTGGCGGAAGTAGGAAACAGAAGAGGAGCCTGGGCATGCAGCGGGTCTGTCAGAGAGCAGAGCCCTCGGAGCTGCAGTGCTTGGAGGGAGGCGGTTCACCTCTGCCCACTCTCTCCAtttctctctctctcattttccttttagagatggattcttgctctgagcctaggctggagtgcagtggtgtgattatagctcattgcagcctcgcccttccaggctcaagtgatcctcctgcctcagcctgtccagtagcCATACCCTACTAGGTCCTAGTTAGCCCCCAGAGGCGTGCACCACCACGCCCACTAATTGCAAAAATTTGTTggctgggcgcgatggctaacatctgtaatctttgggaggccaaggcgggcggatcacgaggtcaagagatggagaccatcctggctaacacggtgaaacccggtctctactaaaaatacaaaaaattagccgggtgtggtggcgggggcctgtagtcccagctactcaggaggctgaggcaggagaatggcgggaacccgggaagtggagcttgcagtgatctgagatcactccactgcactccagtctgggggacagagcgagactccgtctcaaaataaataaataaataaatatataataaataaataaaaataaaaataaaaCTAAGCCCTTCCTGATGGTCATTGGGGGGTTTGGGGGTTGGGGGGGGTGTCTGGCTATGGCTGGGGAACTCATTTGGTTTTCCTCCTCCTCCTCtttttattttttggtagagacggggtctcttgatttcccaggctgatctccaactcctgggctcaagcaatcctcctgcctcagcctcccaaagtgttgggattacaggcctgagacaccgtagctagccAGCtttctttttttttttgagacggagtcctgctgtcacccaggctggagtgcagtggcgagatctcagcggatcactgtgttatacgtaaattttcggtgtcgcaaaagaagtagcactcgaatgtacacttttctcagctaggaaatttacttctatagaaggggggtctcatagatggagcaatggtgagcatttggacaagggaggggaaggttcttattcctgacgcaggtagcgcctactgctgtgtggttcccttattggacagcgttagacctcacaatctaaatccgattggcCtttttttttttttgagatggagtcttgctgtgtcgcccagactggagtacagtggtgcgatcttggctc
actgcaagctctgcctcctgggttcatgccattcttctgccttagcctcctgagtagttgagactacaagtgtatgccatcatgtgcggctaatttttgtgtttttggtagaaagagatttcaccacgttggccaggatggtctcgatctcctgacctcgagatccacctgcctcggcctcccacagtgctgggattataggcatgagccactgcacctggccttaagtggttctttaaagtctgattcgttgtttctactttccctgatgagggtgggtgtcaaggagtgtggtattcttacataatgtctgatgtttggaatagcAttttttttttttttgaggcagagtctcactctgtcgcccatgctggagtgtagtggcaccatcttgtctcactgtaacctttgcctcccgggttcaaacgatcctcctgcctcagccttccacgtagctaggattacaggcgtccaccaccacggccggctagcttttatatttttagtagagacggggtttcaccatgttggccaggctgtacttgaacttctgacctcaatgatctgcccccctcagcctcccgaagtgctgggatacaggtgtgagccaccactccTCGCTCAAGTAATATGTTAAACTTATGCTTTCTTCTTTTCTTCTTTCttttttttttttttttttttggatggagtcttgttctgtctgcccaggcttgagggcatggcataactcggctcactgccctccgccgttccagtcatgcatatctgctgccttcagcctcctttagtacgggacacgaggccacctgccacccgtgcctggctatttttttatttttttttttttttttttttttttttttttttATCAGgacagagtctggctctgccgccaggctggagcttgcagtggcgtcagctcaacctgcaagctccgctccgcgggttcaacgccattctcatgcctcctcagcctccccgagtaattgggactacagcgcgcccgccaccgccccgctcagtttttgtattttttagcagagaggggttaccgtgtagccaggatgggtctcgattcctgacgcctcgtgatccgcccgtctcggctcccaagctgggattacaggcttgagccacgcgccccggcccggcatttttttcatttttagtaagaaacagggtttcaccgtgtttagccaggattggtgtcgatttcctgacccgtgatccgcccccctcggcctcccaaagtgctggattccaggcctgagcctgcaagccgggccTACTCTTTGGCTTTTAAAAGAATGGGCAACATTGCTTTTCTTTACTAACTTCTAATCTTTCCCTCTCTGACTCATCTCTCCTCCCACTTCTCTTGTTCTCCCTGTCAGTGTTCCTTTCCTAAGAGTTTTTCCCTGTCTATGATCTTTTTTTATAGGCTTTTTTCTAGTTTCTCTTTCTTTGTAATTGTGCGTTAATACTGGCCAATTGTTAGTGACAAATTCCTTGCCAAGAGATCCCTGACCCTAAACCAGCATATTCTGTCCATTCGTTTTAATCTGTACtttatttttcttgagatggagttccgctctgtcgccaggtgtggatggtgtagtggcacgttctcgctcactgtcaactcgccctccagggtcaacccgcaccatcctcgctgccttagcctccgagtacggggattgtacaagcgtccaccacccggcctggcgaggcgcttgatttttttatttcagtagagatgggggttttcatcgtgttagccagatggtcccccccatctcctggactcatgctccgcgcaccgccccttggcctccgcaagtgcgcgattaTGATCTCTCTCAAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNtctcctgcctcagccctctgagtagct
gggattacaggcattttttgtatttttagtacagatggggtttcaccattttggtcaggctggtggaggactcctgacctcaaatgacctccccgcctggcctcccgaaatgctgggattacatgcgtgatccaccacgcccagccATACAGttcttatgttaagacaggctctctgtcgcccaggctggagtgcagtggcgcgatcacagctcactgtttgcctcgacctttcaagctcaagctgtcctcctgcctgagccgcccgcgtagccaggactgcaggggcacagtgccatgcccggctaatttttttttttgtgatggcgttttgctcttgttgcccaggctggagtgcggtggcgcaatcttggctcactgcaacctcctccccctgggttcaagcaattctcctgcctcagcctcccaagtagctgggattacagtcatgtaccaccacgcccggttcattttgtatttttttttagtagacaagggatttctccatgtcagtcaggctagtcctaaactcctgacctcaggtgacccgcccacctcagcctcccaaactgctgggattacaggcgtgagccactgtgcctggtcCTGGCTAATAttttttttttttttgagacggagcctcgctctgtcacccagactaaagtacagcggcgcaatctcagctcactgcaagctccgcctcccgggttcatggcattctcctgcctcagcctcccaagtagctgggactacaggctcctgtcacctcgcccggctaattttttgtatttttgtagagacggggtttcacagtgttagccaggatggcctcaatctcctgacctcgtgatccgcccacctcggcctcccaaagtgctgggattataggcgtgagccaccgcgcccagctgtttttttgtaatgttagtagacatggactttccccttgttacccaggctgggctcaaacttctgaggtataagagatgctcccgccttgaccttgtgaagttctgggattacagacgtgagcccccatgcccagtcAGGGGTttgtttgttttggtttttgtttttgtttttgagacagagtctcactctgtcgcccatgctggagtgcagccgtgcaattttggctcgctgcaacctctgcctcccgggttgaagtgattctcctgcttcagcctcccacgtagctgagaccacaggtgtgccaccgcgcctggctgatttttgtatttttagtggagacggggtctcaccatattggccaggatggcctcaaactccctacctcaggtgatctgcccgcctcggcctcccaaaatactacgttacatgcatgagccaccgtccctggcTGTGGTCAGGCTTTTGAGTTTAGATCCATGAAAGTGTGGCCGCGTCCCTGCTCCCTGCAGGAGGGAGGCCTGTGGGACCTTCTGCTGTGGCTGTTTACAAGGCTTTGCTCCTGGTGCCTAAGGCTGGAACCTTCTCTCTGCAGGAGGAGATGAGCAATTACTACCTCAGAGTCACCCAGAACGCCTTCCTAAACCACACGAGGCAACGCAGCAACAAGTGAGGGAGCCCCTCGGGTCCTGGGCCCCCGGGTAGGGCTGTGCAGCCGTCGCCCTTGGTTCCCACAGAGGGACCTCAGAGGCCCTGGATCACAGTGCTGGGCAGCACCCGTGGCCTCAACGTGTCCACCTCGGATGTCCCCTAGGAATGTCCCAGCTCGGGACAGCATGGGGCGTCACTGAGGAACATGCGGGGGCCTCCTGGGCAGAGCCGGGGTCAGTCCCGTCCTCACGGCCCTGTGCGATGCCGCCCCAGCTTGCACGTCCCTCTGCCCCTGGGTTTCCGCGGTCCTGTGCCAGCAAGGGAGGCGGTCTGATTGTCTGAGGCTCTGCTGGGGCCTCCATTGCAGGCTGTGGGTGCCCTGGGGTGGGAGATGGAGACACTTTTGCTCCCACGGGAAGCTGGGCACGAGCAGGTCCTGTG
TGTTTGGGCGGAGCCTGGGGCCTTGGCCCCCCCGCCCAGATGCTGGACAGGGTTGCTCCCTCCAGGCCTGGGGCCCTCCTCACATTGCGCGTCCTCCGTGAGCTGCTACCCAGAGGTCCCCAGTAGGTGGATAGCCCCATGGCCAGGCTCCCTAGCCCCTTTCAAATCCCCTTATTTTGAGTTTTCTTGGTCTCCTGGGCCCCTCCAGCCCCAGTCACGTGTCACACGGAGAATCAAGTCCTGCCGGTCGGCCGTGGCCGAGTCTTCAGGCGTGTTGGGCTCGCTGGCTCAGCTGCTGCCGGTAGACGCTCCCTGGAGCCCTGGCTCAGGTCCTTCCCAGAGAGGCAGGGCTGGGGCCCTGGTGAGCCTCCGCTGCACCCGGGCCCCCAAGGTCCTGCTCCTGGCTCGTGTGGCCACTCTTGGCATGGACTCTGGGTCCCGCATCCCTGCTCCCAGCACAGCAGGGCTCAGGCAGCAGGAGGAGTGGTGGTCCCGACGCTGCCTATCACGCTGGGTGAGGGTCAGCGGGGAAGCGCCACACGGGATGAGAACAGAGGCCCAGGTAGCCGGGCGGGGGGACAGCTGGGCGTGGTGGGGCCGGCGGTGACCAGGGGGACAGCTGGGCGTGGTGGGGCCGGCGGTGACCAAGGCTGTGCCACGTCCTCCCGATGTTTCCTGTGCTCACAAGCTGCCGCTTTAGATTCTCCGGGAAAGTCCCCCTGAAGGGACTAAGGAGCCCGCGTTCCCCTCGGGACAGCTTGGCCGGCAGCCCCAGCATTTCCTTCCCCATCCCTGCTCCGCAGATTCATGCTGGTCCTGGCCAGCCGCGACCCCAAGCAGTTACACCAGGACATCCACGACCGCATCGACGTGATGTTTTACTTCGACCCGCCCGGGCCAGAGGAGCGGGAGCGCCTGCTGAGAATGTATCTTGACAAGTATGTTCTTATGCCGGCAACAGAAGGAAAGCAGTAAGTGTCTCCCCTCACCCACCCCTGTCCAGGGACCCTCGCTCTGGGCCCACCCCCGGCCCTGCTCTCCGGACGCACACAGCAGGCCCAGTCTCCGGGGTGGCACCGCCTCCCTGCTTTGCGGTTTCGCACAGGAGCCCTGTGGGCCCCAAGGGTCCCAGAGGCTGCACCCAGGGATGTGCCACCACCCTTTCCTCATCCCCACCTGAGAACAGCCTGGTGGTGTCTCCTCGGGTTTGGGGGGCAGAGCCCACCATCACTTACAAACCTTCAACtttttgtttttgagacaaagtcttgctctgtgccccaggctggagtgcagtggcacgatctcagctgactgcaacctccgcctcctgggttcacgcgattctcctgcctcagcctcctgagtagctgcgattataggtgcctgccaccacgccccactgcttttcgcctttttgtagagatgcagtttcaccatgttggccagggtggtctcgaaaccctgacctcgggtgatctgcccgccttggcctcctacagtgcagggattacagatgccagccactgtgcccgaccACCCTCAGGCCCTGGCAGTGCAGGGAGGTGACGTGGAGTGTTGCTCTGAGACCCCCATGTTGGGATTTGAGGGAGACGCTCCTCATGAGAGCCCCGTGTTGGGACTGGAGAGGATCCTCACGGTCCCCTGCTGGCTGCTGGCCTGGCCTTCCTCCAGCTGCCACGCCTGGCCCTGGAGCCTCATGGTGTGGGGCGCGGCTCCGGCTGCACTTGTGCCCTGAGGCCTTCAGGCTCCCTGTCGCTGGCGGTGGGTGCAGCAGGCACGGCGGGCAGAGCCCTCCAGGTGATGACAGGCCCTGGGGCTGCACGCCGGCTGCCTCAGGAACACTCCAGATGAGCAGTGGCTGCTCCACCTCTTGGCGTCCCCAGGTCCCAGGTTTCTGAGTCCTTCTGTCCACCTGACCTAAATTCCTGCTCTCTCCAGTGACAGCAAAAGCCGCTCTGTTCCAGAGAGAGCCTGGT
TCCCCCTGCCAACCGCTCCGTGGCTGCCTGCTTCATGCTAGCCCAGCTGTCCCGGCCTCAGTTTCCCTTTGGCCCTCCCCTGCCCTGGGCTCTCCCACTCCCACGGCTGCTCATAGACCTGGCACAGTGACTTGGCTTCTATGACCTCCAGGGAGATGCTTTTGCTGGAATTCAGGGCTCTGCCACTGCCACTGTAACGGCCATGAGCCCTGTGGGTGCTGAGTGGGCAGGTGAGGGCAGGGCTGGTGTGAAGAGGGGGTGCGGCCATCTCCAGGCCCCACAGCAGCCACCACCTCCCTGCTCAGCCCAGACCTGGTTTGCATCAGGGAGAGGGCGGAGTTTGGCTGTCACAGGAAGAGTCCCTCCCAAGGGGGCATCTGGCATGGGTGCCCGCCTGGCTGCCTGTCTTCCAGCCCCCACCTCGTGGTGTGGGAGCCGCTGCCTTGGCCGGCCCACTTGGGAACTCCTTCCCCAGGCGCCTGAAGCTGGCCCAGTTTGACTATGGGAGGAAGTGCGAGGAGATCGCTGAGCTGACGAACGGCATGTCGGCCCGGGAGATCGCACAGCTGGCTCAGTCCTGGCAGGTGAGTGGGGCTCGGGCGCACCCACCCAGACAGGAGCCCAACTCCTGTGGAGACGCCGGGTTGCGCCTGTCCCAGCACCAGTGTCACACCGCAGCTTCTGTTGAGGGGTTTTCAGTGCACAGACGTGACACGGGGCACTCGCCCCAGTCGGCCACTCCACACACTGGCGCGCCCCTGCTCCTGCCCTGGGAAGTGTGGGGCATGTCCGTGGCTGACGGTCATAGGTCAGGAAGCCCGTCCGGCATCCTAGTATCCGGGCTCTGCCAGGTGGGGCGGGAGGCTTTCGATGCTCACCTTGGCAGACGGGCACCCCCTGGTGTGAATGGTCATCGGGACAGGCCCCGCCTGAGTTTGGTGGTGGGGCTGGAGGGATGTTGTGTTTCCCGGACCACGTCCGTTGGCTTGATCCTGCTTGACGGGCTCAGACACAGGGGCAGGAGTGACCTCTGATTGTCCCACAGCCGGCTGCTCCTTGGAGGACCCCCTCCTGCAGCTCCGTGGCTGCTGCAGGGACGGGGAGCCGGGACTCAGAGCAGTGTGGGCGTGGCCATCCAGAAAGCTTTGGTCTTTGGGGGTTGCTGGAAAAGCATAACCAGGTCTGTAGAAGGCACCAAGGCCATGCACAGGCATTGCTGCCTCTGGGGTCTGCAGAGTCTGTGACAACCTGGTCACTCAACCTAGCAGCGCTTTCGCGTGTGACAGGTTCATGAAGTAGCCAGTTACCTTGATTTGAACGTTGGAGCTGGGGACTATATGGGCTGTATTAGTCAGTTATGCCGCTGTGACAAAGAGCCTCAGATCTCAAACCCCATCCTTGTGGGTCAGCTGAGGTCTGTGTTCCAGGCCGTCTCCACTTGAGACCAGGTCTGTTTCCACAACTAAGCAAACAGAgaccgggccatggtgttgggctacatttgttcccagcatttgggaggtcgaagtcagcccagattatttgaaggcaggagtcaggaccagccttggggggggggggggggggggggggggggggaaagcaaggggagactccatctacaaaaaataaaaaaattagccggaccctaatgtggcacgcctgtaatgcagctcctgggagcctgaggtgggatgatcactgagtcccaggtaggccagaaatacagtgagcctgtggattgtgccactgcactccagcccgggttacagagcgagaccctggtctttaaaaataagaataaTTTGAgccgggcatggtggctcacgcctgtaatcccagcacgctgggaggccaaggggagaggatcacttgaggccaggagttcgagaccagcctggccaacatgtcgagccccacctctactaaaaatacaagaattggccgggcgcagtggtggtgcatgcctgtattctcagctactcaggaggctgaggcag
gagaatcgcttgaacccgggaggtggaggttgcagtgagctgagatggtgccattgaattccagcctggactattcaggatcctttgagattccataagaattttaggagtggttttcctatttttgtaaaacataatttgggttttcacagggaccgcgtttagtctctatgtcgctttgatgtctctcagcaatattCTGTGGttttctcttgttttcgagacggagtctcgctctgctgcccaggctggagtgcagtgttgtgatctcagctcactgcaacgttcccctcccgggttaaagtgattctcctgactcagcctcctgaggagctggaattccaggcaggcgccaccatgcccggctaatttttgtactaagagacggggttttgccatgttggccaggctggtctcgaacctctgacctcaggcaatccacccacctcagcctcctaaagtgctgagattaaaggcacgtgccaccacgcccggctaatttttgtatttttagtagagacgatgattcaccatgccggcgaggttggtcttgaactcctgacatgaggtaatccatctgcctctgcctcccaaagggctgggattcagacatgggccactgcgcccagccagttttcactgtacaagtctttcaccctcttggttaagtgaatttccaagcattttattcttgccgctgctgttgtaaatggaaacggtttcataattccccattcacattattcactgttgggatggagaactgcagctttctttgctgttgattttgtatcctgtaagtttgctgatgtcacggcattttttcttccaatatggattctaggattttctacatataagattatgtcatctgagaacaggtgatttttacctttcccttttcagtttggatgacttttctttttcttgtctaattgcactgtccagagcttccagtggtgtgtggaatagaagcggtaaagcattcttgcctggttccttacctcagaggaaaagctttgtttttcaccactgagtatgtcacctatgggcttgtgatgtgtggccttcattgtgtttagggtgtatccttcaattcttggtttggtgagtgtttttatcataaaagtgtgaggcgggtggatcacctgaggtcggcagttcgaggccagcctgaccaacgtgaagaaaccccatctctcctacaaatacaaacttagttgggcatggtggtgcatgcccgtaatcccagctactcgggaagctgagacaggagaatcgcatgaaggcggcaggcagaggttccagtgagccgagatcgcgccatttgcactccagcctgggcaagaagagcaaaattgtctccaaaaaaaaaaaaGTggccaggcacggtgactcacgcctgtaatcccagcactttgggaggccaaggtgggtggatcacgaggtcaggagatcgataccatcctggctaacacagtgaaaccctgtttctactataaatataaaacatcagctgggcatggtggcaggtgcctgtagtcccagctacctgggaggctggggcaaaagaatggcgtgaacccaggaagcggagcatgcagtgagctgagatgcctgggctacagagtgaggccccaactcaaaaaaaaaaaaaggtgttgtatttggtcgaatactttttctgcaacacttgagacagtcgtgtggtttccttcctccaccctgctaatatcgattgatttttgtatgttgaacatttcatatgcggaacattgattttcatatgttgaactatcgttgcattccaggaataaatcctgcttggtcggctgggcgcggtggctcaagcctgtaatcccagcactttgggaggccgagatgggcggatcacaaggtcaggagatcgagaccatcctgtctaacctggtgaaaccccgtctctactaaaaaatacaaaaaactagccgggcgaggtggcgggcgcctgtagtcccagc
tactcaggaggctgaggcaggagaatggcgtgaacccgaaaggcggagcttgcagtgagctgagatgcggccactgcactccagcctgggtgacagagcgagactccgtctcaaaaaaaaaaaaaaaaaaaaaaaaaatcctgcttggtcagggtatagagtccttttagtgtgctgctgaattcactctgctggcattttgttgaggactttcccagtgatgctcatcagggatattggcctgtcatttttcttgtggtgtctttgtctgggtttgatatcagggtaatgctggcctcctaggatgagtgaggaaatgttcttcaatttgtccaagagtttgaggtgtgctgctgattcttcttaatgttttgtgaattgacacgtgaagacatcaggtccaggtcttgtgtttCaacttttacagcttgaagactttaggttcccagaaaaattgcaaaggtagcacagagagctcccgGGCCCGGGGCCTTGCCACGTAGTGAACGTCATGTGTCACTGTTGGCCCCACCTGGGACTGGGTCTTGCCCAGAATCCCACCCAGGAGGCCACGTGACATTTAGCTGTCACTTCTGGTGGGCTCTGCCAGGTCCCGTGCTTCCTGGTGGGGTGGCCCCATGAGCATCTGCTCATCCCCTTTCCTCCACTGGGCCCTGGGTGAGGTGCAGCCACTCGGGTGCACCCTGAGGGTTCCTGCACCTGTTTGAACTCTCTTGGGTCGGCTCAAGACCAAAAATGATGCTGAGCAGTCCTGGGCCTCTGATGCATAGTGGTGGTCCGGTTCCGGTCAGCGTCTCCTGCACTCCTGGGCCCCTGAGCCACAGTGGCGGTCCAGCTCCAGTCAGTGTCTCCCCACACAGTGGCTCTTGGCGAGGGGTGGGCGCTGTCAGTGGGGACGGGCACCACGTGGTCATCCCCATGGCAGGTCCCATCGTGGCAGCCGTGTTGTGGGAGGATGGTGCGCTGCTGCCCCTTTACCCTGTGAGATGAATCCTGCCTCTGGGAGGCACAGCCGGGATGGGGTGAGGGACCCCCTCAGCTGTCCGGGAAGCGTCCCCTGCCCTGTGCTTCCTCCAGGCGTCCTGGTGCACTCCCAAGCACGGTGCCCAGTGGGGGTGCCCAAACCTTCACCCTGACCCATGGGTGACTTCCCTTGGGGACTCCACGCCTTTCACTGGGACTGGGATGGAGAGCGACCTGTCCATGGCAGAAGGGCTGCACCTGAGGTGCTTGAAGCAACACCAAGGGCCACAGTCCCAGCAGCTCCAGCCTCCGCATGCTGGATGCCAAGTCCTGTGCCCAGGACAGGGAGGTGGAGGCACGGGTGATCTTGATGCTAGCACCTATGTGCCCCGAGGTTGGGCAGTGGCTGCCTCTGCTGTGGAGGCCTATGAAGGTGAGGGTCTGAGGATCTGTAGTGCACTGTGACCCGGGGGCACTGCCTGGCCACGGCTGAGACACGCAGAGGGTCTGCAATTCCCTCCTGCCTCTTGGGAGCTGCCCTGGGTCTGCAGTCAGTGGGGCTCGTCCTCGGGCTTTCCGTTATTAGAAAGTCACTGAGAAACTGCAGTGCTGAGGACGCAGGCAGGGCTGTGGCACTGCAGGGGCCGCTCCCGGTGTCCACACGCATGCTGGGCTCTGCCGAGGTGCCGGAAGCCTGTGTTTCACCCTGAGGCCGTCCTGGTGCCCCGGGTTTGGACCCTCCCCACCTCGGGGTCCTGGAGTGCGTTACGGGTGGGGGGTTCCCATGGTGGCCTCCCTCAGCTCCCTCTCTCCTCACTAGGACACGGCGTATGCCTCCGAGGATGGGGTCCTCACCGAGGCCATGTTGGATGCCCATGTTGAAGACTTTGTCGAGCAGCACCAGAAGAAAATGCGCTGGCTGAAGAGGGAGGGCCTGTCCTCATGGACCAGCACCCCTTAACCTGAGTCCGCGGTGAGACCACACGTCACGGAGCCTGGCTGCGGACCCCTCCCAC
CCCTGCTTTTCCGGTCCCTGCACGTTTAGGAAATGCTTCCCCTAATAAACTCCCACAGGTGCCACAGCGCTGTGTCTATTGGCTGATGTGGTGCGGGGTTTGGGGTCCCCTAGTGTCCTTCTGGGGTCAAAGGTGATAGAAAAGACAGGCTGGAGCTTTCTGGAGAATTTAGGCACAGAAGGGTGGGCTTCACATGAGGTGCCTGCCACAGCGGGGTTGGCTGCCTGAATGCCACCCGGGACCGGCTGCTCGCGCTCCATCCTGCAGCTGTGGAGACGGGGGTGCCCCTTTGCCTCTCTCCACGAAGTGCAGGGCAAACAAGACACAGCGGTTTCAAACAGGCGATGGCCCGGACTGCGTGCCTCGCCGCCCCTGCGCCTTCCCCTGCCCCTGCTTTCCAGCTAGTCCCTGAAAACCTTGATGGggccgggcgcggtggcccatgatggattctcagcactttgtgaggccaaggcgggtggatcacctgaggttaagtgttccagcccagcctggccaacatggtgaaaccccatctctcctaaaaaaaaaaaagaaaagaaaaagaaaaattagccgagcgtcgtggcaggtgtctgaaatctcaggcactcaggaggctgaggcaggagaatcacttgaccccgggaagtggaggttgcagtaagctgagaccatgccattgcagtgcagcctggacaacaagagtcaaactctctcaaaaaaaaaaaaaGgccaggtcaggtggcatgtgcctgtggtcccagcttggtcccagattcttggtttggaggctgaggtaggaggatcacttgagcatgggaggatgaggttgcagtgagccaagatcgcttcagacactccagcctgggtgacagagtgagaccctgtctctaaataatcaaaaCCTTGATTACAGCCATGGGGTGGGGGTTGGGGGGCGTCTGGCTCGGCAGGGAACTATTGGGTTTTTCTGCTCTCtaatttttgtagagacagggtttctctttgttgcccaggctggtctccaactcctgggtcaagcgtcgatcttctgcctcggcctcccaagtggtgaggttacaggcgtgccaccgcacctgaccTGttttctttttttttttttttttttttttgagacggagtcagctctgtcacccagggctggagtgcagtgggcggtctcagctcactgcaagctccgcctcccgggttcacggccattctcctgcctcagcctcccgagtagctgggactacaggtgcgtgccacaacgcccggctaagtttttgtatttttagtagagacagggtttcactgtgttagccagggtggtctcaatctcctgaccttgggatccgcccgtctcggcctcccaaagtgctgggattacaggcttgagccaccgcccccggccCCttttttttttttttttttggcaagggagtcttgctcgcccagggtggagtgcagtgttgcaatctgggctcactgcaacctccacgtccagggtgtcaggcctctgagcccacgctaagccatcatatccccagtgacctgcatgtgtacatctgatggcctgaagcccctgaagatccgcagaagtgaaaacagtcttaactgatgacattccagccttgtgatttgttcctgccccaccctacctgatcaatgtactttgtaatgtcccccacccttaagaaggttctttgtaattctccccaccctggagaatgtactttgtgagatccacccccagcccccaaaatattgctcctaactccactgcctatcccaaaacctctcagaactaacggtaatcccagcaccctttgctgactctttttggactcagctggcctgcacccgggtgaagtaaacagccttgtggttcacacaaaacctgtttcgtggtgtcttcacacggacacgcgtgacacagggttcgaggaaatttcatgcctgaacctccggagtagctgggattacaggcgaacggc
accatgcccaggttaatttttgtattttcggcagagacagaggcccaggtagccgggctggGGGACAGCTGGGTGTGGTGGGGCCGGCGGTGACCAGGGCTGTGCCGCGTCCTCCCGGTGTTTTCTGTGCCCACCAGCTGCCGCTTTAGATTCTCCGGGATAGTCTCCCTGAGGGGGCTGAGGAGCCTGTGTTCCCCTCGGGGCAGCTTGGCCGGCAGCCCCAACATTTCCTTCCTCATCCCTCCTCCGCAGATTCATGCTGGTCCTGGCCAGCCGCCACCCCGAGCAGTTGGACTGGGGCATCCATGACTGCATCGATGTGACGGTCCACTGCGACCTGCCACGGCAGGAGGGGCGGCAGCGCCAGGTGAGAATGTATTTTGACAAGTATGTTCTTAAGCCGGCCACAGAAGGGAAACAGTAAGTGTCCCGCCTCACCCGCCCCTGTCCAGGGACCCTCGCTCAGGGCCCACCCCGCCCCTGCTCTCCAGACGCACCCAGCAGGCCCAGTCTCCAGGGTGGGCACCACCTCCGTGCCCTGAGGTTTTGTGCGGGAGCCCTGTGGGCCCCGAGGGTCCCAGAGGCCGCATCCAGGAGGTCACGCCCCCTTTTCCTCATCCCCATCTGAGAACAGCCTGGTGGCGTCTCCTCAGGTTTGGGGGCAAAGTCCACCATCACTTAGAAACTTTCAGCAttccttttttttttttttcttaagacggactcttgctctgtcatccaggctggagtgcagtagcttgacctcggctcactgcaagctctgtctcccaggttcacgccgttctcctgcctcagcctcccaagtagctgggacaacaggcacccgacaccacgcccggctaatttttttgtgtttttttagtagagatgggtttgaccgtattagccaggatggtctcgatctcctgacctcgtgatccacctgcctcggcctcccaaagtggtgggattacaggtgtgagccaccgcatctgacctttttttgaggaagtctcactcttgtccccctggctggagtgcagtgccgggatctcagttcactgcaacctgtgcctcagcctcctgagtagttgggattataggtgcccgccaccgcgcctggctggtttttgtgtttttgtagagatggaatctaactccgtctcccaggctggagtacagtggtgtgatctcagcttactgcaacctccaccctccgggttcaaaccatcctcttgcctgagcctcctgaacagctgcgattacaggcgcccagcacaatgctcgcctcatttttttgtctttttagtagaaacagcttttcaccaaattgaccagactggtcttggacttctgatctcaagtgattcaccctcctcggcctccaaagtgcagggattgcagatgtgagccaccggacccggcctcttttatgttcctcttcagtaCTCAGAGGGCTGTGAGGAAATCCGGTGCCCGGCCACCCCCAGGCCCTGGCAGTGAGGGGAGGTGATGTGGAGTGTTACTCTGAGATTCCCATGTTTGGATTCGAGGGAGACGCTCATCATGAGACCCCTCCGTGTCGGGATTAGAGGGAGAGGCTCCTCATGGTCCCCTGCTGGCTGCTGGCCTGGCCTTCCTCCAGCTGCCACGCCCGGCCCTGGAGCCTCCTGGTGTGGGGCGCGGATCCGGCTGCACTTGTGCCTTGAGGCTCTCAGGCTCCCTGTCGCTGGCGGTGGGTGCAGCAGGCACGGCGGGCAGAGCCCTCCAGGTGATGAGAGCCCCCAGGAACACTCCAGATGAGCAGAGGCTGTTCCACCTCTTGGCGTCCCCAGGTCCCCGGTCTGAGTCCTTCTGTGCACCTGACCTAAATTCCTGCTGTCTCCTGTGACAACAAAAGCCACTCTGTTCCAGAGAGAGCCTGGTTCTCCCGTTGACCCCTCCGCTGCCGCCTGCTCCATGCTAGCCCAGCCGTCCAGGCCTCAGTTTCCCTTTGGCTC
TCCCCTGCCCCGGTtcccagctgcttgggaggctgaggtaggaggatcatttgagtccaggagcttgaggttgcactgagctgtgactgtgccactgtactccagccttggcaacagagtgagacactgtcttaaaaaagaagaaTTTGggccagatgctgtgtttcatgcctgttcccagcatgctgggaggctgaggagagaagatcactcgaggccaggggttccagaccagcctgccaacatgttgaaccccgcctctacgaaaaatacaaaaattagccgggcgtggtgggtgggtgggtgccagtaatcccagctactcaggaggctgaggcagcaaaatctcttgaacctgggaggtggagattgtggtgagctgagatagtgccgctgtacttcaacctgagcaacagagtgagactccttatcaaaataaagaaaTCAATCAATCAATAAAAATAATCACAATAATTTGggctgggcgtggtggctcactcctgtaatcccagcactttgggaggcgtggatcggttgagttcgaggcaagcctggccaatgtggcgaaaccccatctccactacaaatacaaaaattagccaggtgtggtgacaggcacctgtaatcccagctgctcgggaggctgagacaggagaatctctggaacctaggaggcggaggttgcagtgagccaagatcacgtcagtgcgctccagcctgggtgacagagactgtctcaaaaaagaataataataaTTTgactgggtgtggcggctcactcttgtcatcccacactttgggaggccgaggcaggaggattgcttcagctcaggatttcgagactggcctggacaactggcctggacaacatggtgaaactccatctctacaaaaaatacaaaaattagccaggcatggtatcatgtgcctgtgatctcagctactcaggaagcagagatgggagcattgctggagcctgggagttggaggctgcaatgaaccatgttcgtgccactgcactccagtgtgggtgacagagtgagaccctgtctccaaaaggcatggtggctcacgcctgtaatccctgcactttgggaggccaagctgggtggatcacctgaggtcaagagttggagaccagcctggctaacgtggtgaaaccccatctctaggaaaaatagaaaaaATTggccaggtgcagtggctcacacctgtaatcccggcactttgggaggccgaggcgggcgaatgacctgagatcaggaattccagaccaaccacaccaatatggagaatccccgtctctactcaaaatacaaaatcagccgggcatggtagcaatcccagttactcaggaggccgaggcaggagaatcactggaggtgagccgagaccacgccattgcactgaagcctgagcaacgagagggaaactgtctcaaaaaataaTGCTAATAACAAGGGGGAGAGAACAGGAGTGTGGTCAGCAGCTGGGCCTGCCATAACCCCTGGGTCGTGTGTCCCCACAGCTCTGAAGGCTAGAGGCCCGAGGTCAGGGTGCCAGCTCGGTCCCCCCCGTGGAGTGTTCTCTGTTAGCTTCTCACATGGCAGGGAGAGTGACTGAGCTCTCGCTCTGGTGTCCCTTACGAGGACGTTCATCCCCCACTGCTCAGAGCGGCGGTGAGCCACCACGCCCAGCGCCAACTTTGTCCTTCAAGAGTTGTTTTTTTGTgccgggctcagtggctcatgcctggaatcccagcactttgaaatgccaaggtgggtggagcacctgaggtcaggagtttgactccagcctggtctaaatggtgaaaacctgcctctactaaacataaaaaaatcagctgggcatgttggtgtgtgcctgtaatcccagccactcgggaggctgaggcaggagaatcacttgaacccaagaggtggaggttgcagtgaactgagatcatgtcactgcactgcagcctggatgacaagagt
gagactcccttgcaagaaaaaacaaaaattaaaaaagaaGTTGTTGTcttttttttttttttttcccttggacaattcaagatgcctagagattccatatcaattttagtaatgcttcttctatattttaaaaagtaatttgggtttttacagggattgcattcagtctctgtattgccttAATGACTCTTAGCAATGttgttttttttatttattattttttttctagagatggagtctcactctgtcagccaggctggagtttagttgttggccaggatgggcccaatctaatgacgtcaggtgatccgcctgcccctggctcccaaattgctgggattcagacgtgggccaccatgcccagccagtttacattgtacatttctttcaccttcttggttcagtgaagctccaagtattttattctttcggatgctcttgtaaatggaaatggtttcgtcattccccgttcagattatacacttactatgaagaactgcagctttctttgctgttgattttgtatcctgtaactttgctgatgtcgtggggttgttttttccaatatggattctagattttcCTTTTCTTTTTCTtttttttgtttttttgttttttttttttttgatatggggtctccctctgtggcccaagctggagtggaatgcagcggcacgatcttgaatctgcgagctcctctgcccgggtccacgccattctcctgcctcagcctcctgagtagctgagactacaggtgcctgccatcacggccggctaattttgtgtattttttgtgcagatgaggtttcaccgtgttagccaggatggtctcgatctcctaactttgtgatcggcccgcctcggcctcccaatgctgAATGCTGTTGGGACTGGGTCTTGCCCCAGAATCCCACCCAGGAGGCCACCTGACGTTTAGCTGTGACTTCTGGTGGGCTCTGCCAGGTCCCATGCTTCCTGGTGGGGTGGCCCCGTGAACGTCTTCTCAGGCCCTTTCCTCCATTGGGCCCTGGGTGAGGTGCAGCCACTCGGGGGCACCCTGAGGGTTCCTGCACCTGTTTGAAGTCTCTTCGGTCGGCTTGAGACCAAAAATGATGTTTAGCAGCCCTGGCCCCCTGACGCACAGTGGCGGTCCTTCTCCGGTCAGTGTCCCCTGCACCCTTGGGCTCCTGACGCACAGTGGCGGTCCAGCTCCAGTCAGTGTCTCCCCACACAGTGGCTCTTGGCGAGGTGTGGGCGCTGCCAGAGGGGACGGGCACCACGTGGTCATCCCCATGGCAGGTCTGGTCGTGGCGGCCGTGTTGTGGGAGGATGGTGTGCTGCTGCCTCTGCACCCTGTGAGATGAATCCTGCCTCTGGGAGGCACAGCTGGGATGGGGTGAGGGACCCCCTCAGCTGTCCGGGAAGCGTCCCCTACCCTGTGCTTCCTCCAGGCGTCCTGGTGCACTCCCGAGCTCGGTGCCCTGTGGGCGTCCCCATGCCCAGACCCTGACCCACAGGTGCCTCCCCTTGGGGTCTCCACGCCTTTCCCTGGCCCTGGGATGCAGAGTGACCTGTCCATGGTAGAAGGGCTGGACCTGAGGTGCCTGAGACAGCACCAAGGGCACTGGTCCCAGCAGCTCCAGCCTCTGTGTGCTGGATGCCACACAGACACAAGACTCTTGGGAGACGCATTTTCCATCTGGCTCAGAGGGGGAGGGGGAGGCTTTGCAACCCAGCCCCTGCCCAGGCCCCTGGGAGGGTGGGTGCCTGCTGAGCCCCCGGGGCAGCAGGAGCGGGGCAGGCGGGGTCTTTGTTCTCACTCCCACAGCAGAGGCAGATGTGGGGGCGCCTGCTGGGGCCAGACCAAGGTGGGGTGGCCTGGAGACTGCTTCCAACCGTGGCCGGGAAGCAGGGAACCTGCCCGGCGTGTCTGAGGCCACACTCTCAGCTGGCCGGTCCAAGCCTGCGGCTGGAGCTGGTGTCTGTTTAGCTAATAAAGT
CCCACAGTTGCCTCACTGCCGTGTCTATTTGCTGATGCTGCGCGGGGTTTCAGGGGCCGCCTAGCCTCCTCCTGGGGTCAAAGGTGACAGAAGAGGCAGAGGCGGGAGCTTTCTGGAGAATTTACTGACCACAGCGTGGTGCACTTGACATCAGGTGCCCGCCATGGCCGGGCCGTGGTCTGAAGGCTGCCCGGGACCAGCTGCCTGCGCTCCAGCAGCCCCTCCCTCCTGAAGGCCGGGCCCCCGAGAAGAACGAGGCTGCAGAGTGATGTGGGGGCCAGCGGTGACTTCCTACCACACTGTTCTCAGGTGTAAGAGGCCGCTTCTGCCCAGGCATTGTCCATGGAAGACACACAGCCGGCCACTGCAGCCTCGGTTCTGGGATGCCCTGCGGCTGGGTCACAGGGGGCCACGGGCCACGCTGGGAGGCCACAGTCCTGTCGTGCCACGCAGCTCCCTGTCCCCAGAGGTCTGCTCAGATGCAGAGATCAGAAACCACACTCGCTGCCTGAATTCTGGGAGCAGAGCCCGGTACCCACTGCCTGGCCGGGGCCTACCCTGGG
Like we said, bedtools has a ton of features (we could write a whole workshop about it), and I wanted to give one more example before we move on. Something else we might want to do with the regions in a bed file would be to merge ones that are overlapping or within some distance of each other. For instance, we may think the method we used to call SVs may be slightly inaccurate and is calling the same polymorphism as separate mutations in different individuals, so we want to merge overlapping events.
For this we can use bedtools merge. There is one catch, however.
Run the code block below to see what happens when we run bedtools merge on the bed file with macaque SVs:
bedtools merge -i data2/macaque-svs-filtered.bed
# bedtools: A suite of programs to process bed files
# merge : The sub-program of bedtools to execute
The input bed file must be sorted! There are a couple of ways we could do this. If you look at the documentation for bedtools merge, they suggest using the native Unix sort command. However, bedtools itself also has a sort command. Let’s try that.
Run the code block below to sort the bed file with macaque SVs and then merge overlapping SV calls:
bedtools sort -i data2/macaque-svs-filtered.bed | bedtools merge > macaque-svs-filtered.sorted.merged.bed
# bedtools: A suite of programs to process bed files
# sort: The sub-program of bedtools to execute
# -i: The input bed file
# | : The Unix pipe operator to pass output from one command as input to another command
# bedtools: A suite of programs to process bed files
# merge: The sub-program of bedtools to execute
# > : The Unix redirect operator to write the output of the command to the following file
wc -l data2/macaque-svs-filtered.bed
wc -l macaque-svs-filtered.sorted.merged.bed
# Use wc -l to count the number of un-merged SVs in the original file and the number after merging
## 3646 data2/macaque-svs-filtered.bed
## 3372 macaque-svs-filtered.sorted.merged.bed
So we merged a few hundred calls. Note that because bedtools merge only requires one input file, we can default back to the standard Unix piping procedure without having to use the - shortcut (though we still could specify -i -).
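To make the sort-then-merge logic concrete, here is a minimal sketch that reproduces the idea using only sort and awk on a tiny made-up bed file. The file names toy-svs.bed and toy-svs.merged.bed and all coordinates are invented for illustration. This is not how bedtools is implemented internally; it is just the same rule (combine intervals on the same chromosome that overlap or touch) written out by hand, and it shows why the input has to be sorted first.

```shell
# Create a small, unsorted toy bed file (hypothetical data, tab-delimited)
printf 'chr1\t100\t200\nchr2\t50\t80\nchr1\t150\t300\nchr1\t400\t500\n' > toy-svs.bed

# Sort by chromosome, then numerically by start, then merge in a single pass:
# an interval is folded into the current one if it is on the same chromosome
# and starts at or before the current end
sort -k1,1 -k2,2n toy-svs.bed | awk 'BEGIN{OFS="\t"}
  NR==1 {cur_chr=$1; cur_start=$2; cur_end=$3; next}
  $1==cur_chr && $2+0<=cur_end+0 {if($3+0>cur_end+0){cur_end=$3}; next}
  {print cur_chr, cur_start, cur_end; cur_chr=$1; cur_start=$2; cur_end=$3}
  END {if(NR>0){print cur_chr, cur_start, cur_end}}' > toy-svs.merged.bed

# Display the merged intervals
cat toy-svs.merged.bed
```

Because the single pass only ever compares a line to the interval immediately before it, overlapping intervals that are far apart in an unsorted file would never be compared, which is the reason for the sorting requirement.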
Of course, in actuality we would only want to merge duplications with other duplications and deletions with other deletions.
Exercise: In the code block below, write a command that merges only duplications with other duplications. Save the result in a file called macaque-svs-filtered-dups.sorted.merged.bed. BONUS: Adjust the settings to merge any duplications within 1000bp of each other as well as directly overlapping (Hint: Check the help menu of bedtools merge!).
## Use the tools you've learned to merge only duplications with other duplications
# data2/macaque-svs-filtered.bed
grep "<DUP>" data2/macaque-svs-filtered.bed | bedtools sort | bedtools merge -d 1000 > macaque-svs-filtered-dups.sorted.merged.bed
## Use the tools you've learned to merge only duplications with other duplications
grep -c "<DUP>" data2/macaque-svs-filtered.bed
wc -l macaque-svs-filtered-dups.sorted.merged.bed
# Count the number of lines in the original file and the new file to confirm we merged some duplications
## 432
## 379 macaque-svs-filtered-dups.sorted.merged.bed
In the context of our macaque SVs, a natural question would be how many of the mutations affect genic regions, and may therefore affect some cellular function. To know this, we need another file that contains the regions of the macaque genome that contain genes. This information could easily be contained in a bed file, but genes are complex, structured regions of the genome: they have exons, introns, multiple transcripts, and may have other information associated with them that is difficult to encode in a bed file.
In GFF files, we refer to the regions in the file as features.
The format for encoding information about genic regions (commonly called a genome annotation) is the GFF format. GFF stands for General Feature Format. There is a related format, GTF, which stands for Gene Transfer Format; it is very similar to GFF but slightly dated, so we will only talk about GFF files today.
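GFF files are also tab-delimited files, with each row in the file referencing a particular region in the genome and each column a piece of information about that feature. This probably sounds similar to the bed format, but GFF has more required columns. GFF files by definition have the following columns:
1. seqid: the name of the sequence (e.g. chromosome or scaffold) the feature is on
2. source: the program or database that generated the feature
3. type: the type of feature (e.g. gene, mRNA, exon, CDS)
4. start: the start coordinate of the feature (1-based)
5. end: the end coordinate of the feature (inclusive)
6. score: a numeric score for the feature, or . if there is none
7. strand: + (forward strand) or - (reverse strand)
8. phase: for CDS features, where the next codon begins relative to the start of the feature: 0, 1, or 2
9. attributes: a semicolon-separated list of tag=value pairs with additional information (e.g. ID, Parent, Name)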
For more detailed information on GFF files, see the following links:
Let’s take a look at a GFF file and talk about it a bit.
Run the code block below to view the first few lines of a GFF file:
grep -v "biological_region" -m50 data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3
# grep: The Unix string search command
# -v: This option tells grep to print lines that DO NOT contain the following string
# "biological_region": The string to search for in the provided file - we just don't want to display these for this demonstration
# -m50: This option tells grep to only display the first 50 matches
## ##gff-version 3
## ##sequence-region 1 1 225584828
## ##sequence-region 10 1 92844088
## ##sequence-region 11 1 133663169
## ##sequence-region 12 1 125506784
## ##sequence-region 13 1 108979918
## ##sequence-region 14 1 127894412
## ##sequence-region 15 1 111343173
## ##sequence-region 16 1 77216781
## ##sequence-region 17 1 95684472
## ##sequence-region 18 1 70235451
## ##sequence-region 19 1 53671032
## ##sequence-region 2 1 204787373
## ##sequence-region 20 1 74971481
## ##sequence-region 3 1 185818997
## ##sequence-region 4 1 172585720
## ##sequence-region 5 1 190429646
## ##sequence-region 6 1 180051392
## ##sequence-region 7 1 169600520
## ##sequence-region 8 1 144306982
## ##sequence-region 9 1 129882849
## ##sequence-region MT 1 16564
## ##sequence-region X 1 149150640
## ##sequence-region Y 1 11753682
## #!genome-build Mmul_8.0.1
## #!genome-version Mmul_8.0.1
## #!genome-date 2015-11
## #!genome-build-accession NCBI:GCA_000772875.3
## #!genebuild-last-updated 2016-02
## 1 Mmul_8.0.1 chromosome 1 225584828 . . . ID=chromosome:1;Alias=CM002977.3,NC_027893.1
## ###
## 1 ensembl gene 25432 42232 . + . ID=gene:ENSMMUG00000005947;Name=SAMD11;biotype=protein_coding;description=sterile alpha motif domain containing 11 [Source:HGNC Symbol%3BAcc:HGNC:28706];gene_id=ENSMMUG00000005947;logic_name=ensembl;version=3
## 1 ensembl mRNA 25432 35202 . + . ID=transcript:ENSMMUT00000015569;Parent=gene:ENSMMUG00000005947;Name=SAMD11-208;biotype=protein_coding;transcript_id=ENSMMUT00000015569;version=3
## 1 ensembl exon 25432 25503 . + . Parent=transcript:ENSMMUT00000015569;Name=ENSMMUE00000311984;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=ENSMMUE00000311984;rank=1;version=2
## 1 ensembl CDS 25432 25503 . + 0 ID=CDS:ENSMMUP00000014582;Parent=transcript:ENSMMUT00000015569;protein_id=ENSMMUP00000014582
## 1 ensembl exon 29573 29754 . + . Parent=transcript:ENSMMUT00000015569;Name=ENSMMUE00000311983;constitutive=1;ensembl_end_phase=2;ensembl_phase=0;exon_id=ENSMMUE00000311983;rank=2;version=1
## 1 ensembl CDS 29573 29754 . + 0 ID=CDS:ENSMMUP00000014582;Parent=transcript:ENSMMUT00000015569;protein_id=ENSMMUP00000014582
## 1 ensembl exon 30429 30479 . + . Parent=transcript:ENSMMUT00000015569;Name=ENSMMUE00000311982;constitutive=0;ensembl_end_phase=2;ensembl_phase=2;exon_id=ENSMMUE00000311982;rank=3;version=1
## 1 ensembl CDS 30429 30479 . + 1 ID=CDS:ENSMMUP00000014582;Parent=transcript:ENSMMUT00000015569;protein_id=ENSMMUP00000014582
## 1 ensembl exon 34224 34348 . + . Parent=transcript:ENSMMUT00000015569;Name=ENSMMUE00000339755;constitutive=0;ensembl_end_phase=1;ensembl_phase=2;exon_id=ENSMMUE00000339755;rank=4;version=1
## 1 ensembl CDS 34224 34348 . + 1 ID=CDS:ENSMMUP00000014582;Parent=transcript:ENSMMUT00000015569;protein_id=ENSMMUP00000014582
## 1 ensembl exon 35177 35202 . + . Parent=transcript:ENSMMUT00000015569;Name=ENSMMUE00000394552;constitutive=0;ensembl_end_phase=0;ensembl_phase=1;exon_id=ENSMMUE00000394552;rank=5;version=1
## 1 ensembl CDS 35177 35202 . + 2 ID=CDS:ENSMMUP00000014582;Parent=transcript:ENSMMUT00000015569;protein_id=ENSMMUP00000014582
## 1 ensembl mRNA 25432 40770 . + . ID=transcript:ENSMMUT00000047681;Parent=gene:ENSMMUG00000005947;Name=SAMD11-207;biotype=protein_coding;transcript_id=ENSMMUT00000047681;version=2
## 1 ensembl exon 25432 25503 . + . Parent=transcript:ENSMMUT00000047681;Name=ENSMMUE00000311984;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=ENSMMUE00000311984;rank=1;version=2
## 1 ensembl CDS 25432 25503 . + 0 ID=CDS:ENSMMUP00000040704;Parent=transcript:ENSMMUT00000047681;protein_id=ENSMMUP00000040704
## 1 ensembl exon 29573 29754 . + . Parent=transcript:ENSMMUT00000047681;Name=ENSMMUE00000311983;constitutive=1;ensembl_end_phase=2;ensembl_phase=0;exon_id=ENSMMUE00000311983;rank=2;version=1
## 1 ensembl CDS 29573 29754 . + 0 ID=CDS:ENSMMUP00000040704;Parent=transcript:ENSMMUT00000047681;protein_id=ENSMMUP00000040704
## 1 ensembl exon 30429 30479 . + . Parent=transcript:ENSMMUT00000047681;Name=ENSMMUE00000311982;constitutive=0;ensembl_end_phase=2;ensembl_phase=2;exon_id=ENSMMUE00000311982;rank=3;version=1
## 1 ensembl CDS 30429 30479 . + 1 ID=CDS:ENSMMUP00000040704;Parent=transcript:ENSMMUT00000047681;protein_id=ENSMMUP00000040704
We’ll just point out a couple of things. First, this file also has a header, like a BAM file, though this is not required for GFF files. In general, the GFF format is less standardized than others we’ve gone over in the workshop. Next you’ll note that columns 1, 4, and 5 hold the same information as the three columns that define a bed file (sequence, start, and end), although with a different interval encoding: GFF coordinates are 1-based and include the end position, while bed coordinates are 0-based and exclude it. This means GFF files are (sort of) easy to convert to bed files, though with loss of information. It also means some bedtools programs can process GFF files as well.
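As a rough illustration of that conversion, the sketch below turns a single GFF gene line into a bed line with awk. The file names toy-gene.gff3 and toy-gene.bed are made up for this example, and the choice of which columns to carry over into the bed name, score, and strand fields is just one reasonable assumption, not a fixed standard.

```shell
# One hypothetical GFF line (tab-delimited), modeled on the SAMD11 gene shown above
printf '1\tensembl\tgene\t25432\t42232\t.\t+\t.\tID=gene:ENSMMUG00000005947\n' > toy-gene.gff3

# GFF is 1-based and end-inclusive; bed is 0-based and end-exclusive,
# so subtract 1 from the start (column 4) and leave the end (column 5) unchanged.
# Header lines starting with # are skipped.
awk -F'\t' 'BEGIN{OFS="\t"} !/^#/ && $3=="gene"{print $1, $4-1, $5, $9, ".", $7}' toy-gene.gff3 > toy-gene.bed

# Display the converted line
cat toy-gene.bed
```

Everything from the source, type, and phase columns is dropped here, which is the loss of information mentioned above.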
Features in a GFF file are generally nested: genes are made up of transcripts and transcripts are made up of exons. All of these features are encoded in this file and are usually linked to each other by IDs in the last column, though this is not always standardized. This can make the strand column slightly confusing to work with for features nested under the same parental feature. For features on the positive strand (+), it is straightforward: they are ordered by start coordinate. For features nested under the same parental feature on the negative strand (-) though, the correct order is the reverse sorting by the end coordinate. Many of the tools we work with will consider and correct for strand, but it is always a good thing to consider if you ever parse GFF files on your own.
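Here is a small sketch of that reverse ordering, using three invented exons from a single negative-strand transcript. The file names, the transcript ID T1, and the coordinates are all hypothetical and exist only to show the sorting step.

```shell
# Hypothetical exons of one transcript on the negative strand (tab-delimited GFF-style lines)
printf '1\ttoy\texon\t100\t200\t.\t-\t.\tParent=transcript:T1\n' > toy-exons.gff3
printf '1\ttoy\texon\t700\t900\t.\t-\t.\tParent=transcript:T1\n' >> toy-exons.gff3
printf '1\ttoy\texon\t400\t500\t.\t-\t.\tParent=transcript:T1\n' >> toy-exons.gff3

# Reverse numeric sort on the end coordinate (column 5) gives the order of the exons
# along a negative-strand transcript: the exon with the largest end comes first
sort -k5,5nr toy-exons.gff3 | awk -F'\t' '{print "exon " NR ": " $4 "-" $5}' > toy-exons.ordered.txt

# Display the exons in transcript order
cat toy-exons.ordered.txt
```

For a positive-strand transcript, the equivalent step would be an ordinary numeric sort on the start coordinate (column 4).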
Because of all the quirks with GFF files, there are many tools out there to help process and analyze them, with gffread being a relatively stable one. We won’t be demonstrating these today though.
Exercise: In the code block below, write an awk command that counts the number of genes in the macaque annotation. Be sure to only check the feature name column (third column) because any feature that has a gene as a parent will also have a “gene id” in the last column that would return that line if it was searched for the string “gene”. Also be sure to only get exact matches for the word “gene”, else pseudogenes might be included in the count:
## Write awk command to count the number of genes in the macaque annotation
# data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3
awk 'BEGIN{g=0} $3=="gene"{g++}; END{print g}' data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3
## 20852
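To see why the exact match on the third column matters, here is a sketch on a three-line made-up GFF file (toy-features.gff and its IDs are invented for illustration) that compares three ways of counting:

```shell
# One gene, one transcript of that gene, and one pseudogene (made-up example data)
printf 'chr1\tensembl\tgene\t100\t900\t.\t+\t.\tID=gene:G1\n' > toy-features.gff
printf 'chr1\tensembl\tmRNA\t100\t900\t.\t+\t.\tID=transcript:T1;Parent=gene:G1\n' >> toy-features.gff
printf 'chr1\tensembl\tpseudogene\t2000\t2500\t.\t-\t.\tID=gene:G2\n' >> toy-features.gff

grep -c "gene" toy-features.gff
# grep -c: Count lines containing "gene" anywhere; every line matches because of the IDs in the last column
## 3

awk '$3 ~ /gene/{g++} END{print g+0}' toy-features.gff
# $3 ~ /gene/: A partial match on column 3 only; this still counts the pseudogene
## 2

awk '$3=="gene"{g++} END{print g+0}' toy-features.gff
# $3=="gene": An exact match on column 3; only the gene is counted
# g+0: Forces a numeric 0 to be printed if nothing matched
## 1
```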
BONUS Exercise: In the code block below, write an awk command that calculates the average number of transcripts per gene in the macaque annotation. This requires initializing 2 counter variables at the beginning, searching for 2 patterns separately within your awk script, and then doing some math at the end:
## Write awk command to calculate average number of transcripts per gene
# data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3
awk 'BEGIN{g=0;t=0} {if($3=="gene"){g++};if($3=="mRNA"){t++}} END{print t, g, t/g}' data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3
awk 'BEGIN{g=0;t=0} $3=="gene"{g++}; $3=="mRNA"{t++} END{print t, g, t/g}' data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3
## 44732 20852 2.14521
## 44732 20852 2.14521
bedtools intersect
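So, how many of our SVs in our macaque population overlap with genes?
For this we can use bedtools intersect, which takes two
interval files (either bed or GFF) and reports the regions where their
features overlap. Even though it accepts GFF as input, the annotation
contains many feature types, so we first need to pull out only the gene
features. The chromosome names also have to match between the two files:
the SV bed file uses names like chr1, while the annotation leaves off the
chr prefix, so we add it as we go.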
Run the code block below to retrieve only the genes from the macaque annotation GFF file:
awk 'BEGIN{OFS="\t"} $3=="gene"{print "chr"$0}' data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3 > macaque-genes.gff
# awk: A command line scripting language command
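# '' : Within the single quotes is the user defined script for awk to run on the provided file
# $3=="gene" : Select only lines where the third column is exactly "gene"
# print "chr"$0 : Print the whole selected line ($0) with "chr" added to the front so chromosome names match the SV bed file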
# > : The Unix redirect operator to write the output of the command to the following file
head macaque-genes.gff
# Display the first few lines of the new file with head
## chr1 ensembl gene 25432 42232 . + . ID=gene:ENSMMUG00000005947;Name=SAMD11;biotype=protein_coding;description=sterile alpha motif domain containing 11 [Source:HGNC Symbol%3BAcc:HGNC:28706];gene_id=ENSMMUG00000005947;logic_name=ensembl;version=3
## chr1 ensembl gene 40822 57414 . - . ID=gene:ENSMMUG00000015800;Name=NOC2L;biotype=protein_coding;description=NOC2 like nucleolar associated transcriptional repressor [Source:HGNC Symbol%3BAcc:HGNC:24517];gene_id=ENSMMUG00000015800;logic_name=ensembl;version=3
## chr1 ensembl gene 58784 63064 . + . ID=gene:ENSMMUG00000015802;Name=KLHL17;biotype=protein_coding;description=kelch like family member 17 [Source:HGNC Symbol%3BAcc:HGNC:24023];gene_id=ENSMMUG00000015802;logic_name=ensembl;version=3
## chr1 ensembl gene 64366 72839 . + . ID=gene:ENSMMUG00000015804;Name=PLEKHN1;biotype=protein_coding;description=pleckstrin homology domain containing N1 [Source:HGNC Symbol%3BAcc:HGNC:25284];gene_id=ENSMMUG00000015804;logic_name=ensembl;version=3
## chr1 ensembl gene 73276 79439 . - . ID=gene:ENSMMUG00000022525;Name=PERM1;biotype=protein_coding;description=PPARGC1 and ESRR induced regulator%2C muscle 1 [Source:HGNC Symbol%3BAcc:HGNC:28208];gene_id=ENSMMUG00000022525;logic_name=ensembl;version=3
## chr1 ensembl gene 87794 89166 . - . ID=gene:ENSMMUG00000008350;biotype=protein_coding;gene_id=ENSMMUG00000008350;logic_name=ensembl;version=3
## chr1 ensembl gene 97478 101905 . - . ID=gene:ENSMMUG00000001817;Name=HES4;biotype=protein_coding;description=hes family bHLH transcription factor 4 [Source:HGNC Symbol%3BAcc:HGNC:24149];gene_id=ENSMMUG00000001817;logic_name=ensembl;version=2
## chr1 ensembl gene 116734 118310 . + . ID=gene:ENSMMUG00000001819;Name=ISG15;biotype=protein_coding;description=ISG15 ubiquitin-like modifier [Source:HGNC Symbol%3BAcc:HGNC:4053];gene_id=ENSMMUG00000001819;logic_name=ensembl;version=3
## chr1 ensembl gene 120996 155534 . + . ID=gene:ENSMMUG00000000838;Name=AGRN;biotype=protein_coding;description=agrin [Source:HGNC Symbol%3BAcc:HGNC:329];gene_id=ENSMMUG00000000838;logic_name=ensembl;version=3
## chr1 ensembl gene 173834 176671 . - . ID=gene:ENSMMUG00000032293;Name=RNF223;biotype=protein_coding;description=ring finger protein 223 [Source:HGNC Symbol%3BAcc:HGNC:40020];gene_id=ENSMMUG00000032293;logic_name=ensembl;version=2
Now we can get the overlaps between genes and SVs in our sample of macaques.
Run the code block below to use bedtools intersect to get the overlapping regions between two interval files:
bedtools intersect -a data2/macaque-svs-filtered.bed -b macaque-genes.gff > macaque-svs-genes-intersect.bed
# bedtools: A suite of programs to process bed files
# intersect: The sub-program of bedtools to execute
# -a : The first interval file to check for overlaps
# -b : The second interval file to check overlaps
wc -l data2/macaque-svs-filtered.bed
wc -l macaque-svs-genes-intersect.bed
# Use wc -l to count the number of lines in the original bed file and those in the bed file that overlaps with genes
## 3646 data2/macaque-svs-filtered.bed
## 1702 macaque-svs-genes-intersect.bed
Ok great, we’ve got the regions where SVs overlap with genes in the macaque genome. Let’s take a look at this file.
Run the code block below to view the first few lines of the bed file with SVs that overlap with genes:
head macaque-svs-genes-intersect.bed
# Display the first few lines of the bed file containing SVs that overlap with genes
## chr1 130740 131675 chr1:130740:<DEL>:935:285.63
## chr1 562048 562264 chr1:541132:<DEL>:49440:316.41
## chr1 569143 590572 chr1:541132:<DEL>:49440:316.41
## chr1 562048 562264 chr1:552968:<DUP>:29266:189.32
## chr1 569143 582234 chr1:552968:<DUP>:29266:189.32
## chr1 1117696 1122022 chr1:1117696:<DEL>:4326:201.55
## chr1 1151866 1154542 chr1:1151866:<DEL>:2676:11284.32
## chr1 1166390 1167586 chr1:1166390:<DEL>:1196:15253.03
## chr1 1408621 1409766 chr1:1408621:<DEL>:1145:1112.53
## chr1 1409564 1410074 chr1:1409564:<DEL>:510:13091.76
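This is the same format as the input bed file, but note two things. The
start and end columns now give only the portion of each SV that overlaps
a gene, and an SV that overlaps more than one gene gets one line per gene
(see chr1:541132:<DEL>:49440:316.41 above). So the line count is the
number of SV-gene overlaps, not the number of SVs.
bedtools intersect can add additional columns with more
information about the overlap, and the criteria for what counts as an
overlap can be made more precise.
Let’s try it out.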
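To get a feel for what the default output represents, here is a brute-force awk sketch of the same idea on toy data. This is not how bedtools works internally (it compares every SV to every gene, which would be very slow on real files), and all file names and feature names are made up for illustration:

```shell
# Toy SVs (bed: 0-based start, end excluded) and toy genes (GFF: 1-based start, end included)
printf 'chr1\t100\t500\tsv1\n' > toy-svs.bed
printf 'chr1\t900\t950\tsv2\n' >> toy-svs.bed
printf 'chr1\tensembl\tgene\t151\t200\t.\t+\t.\tID=gene:G1\n' > toy-genes2.gff
printf 'chr1\tensembl\tgene\t401\t700\t.\t-\t.\tID=gene:G2\n' >> toy-genes2.gff

awk 'BEGIN{FS=OFS="\t"}
  # First file (genes): store chromosome, start converted to 0-based, and end
  NR==FNR {chr[FNR]=$1; s[FNR]=$4-1; e[FNR]=$5; n=FNR; next}
  # Second file (SVs): compare to every gene and print the shared region
  {
    for(i=1; i<=n; i++){
      if($1==chr[i] && $2<e[i] && $3>s[i]){
        os = ($2>s[i]) ? $2 : s[i]   # overlap start: the larger of the two starts
        oe = ($3<e[i]) ? $3 : e[i]   # overlap end: the smaller of the two ends
        print $1, os, oe, $4
      }
    }
  }' toy-genes2.gff toy-svs.bed > toy-overlaps.bed
cat toy-overlaps.bed
## chr1 150 200 sv1
## chr1 400 500 sv1

cut -f4 toy-overlaps.bed | sort -u | wc -l
# cut -f4 | sort -u: Keep the SV name column and remove duplicate names before counting
```

Here sv1 appears twice, once per gene it overlaps and with its coordinates clipped to each overlap, while sv2 overlaps nothing and is dropped. The final command counts distinct SVs rather than overlap records, which gives 1 for this toy data.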
Exercise: Read the documentation of bedtools intersect and do the following. Don’t save the output to a file, just pipe it to wc -l:
1. Count only the SVs that DO NOT overlap with any genes.
2. Count only SVs that have at least 90% of their sequence overlapping a gene.
3. Count only SVs that have at least 90% of their sequence overlapping a gene AND for which that overlap also encompasses at least 90% of the gene.
# data2/macaque-svs-filtered.bed
## Count SVs that DO NOT overlap with genes
bedtools intersect -v -a data2/macaque-svs-filtered.bed -b macaque-genes.gff | wc -l
## Count SVs that have at least 90% of their sequence overlap with a gene
bedtools intersect -f 0.9 -a data2/macaque-svs-filtered.bed -b macaque-genes.gff | wc -l
## Count SVs that have at least 90% of their sequence overlap with at least 90% of a gene's sequence
bedtools intersect -f 0.9 -r -a data2/macaque-svs-filtered.bed -b macaque-genes.gff | wc -l
## 2112
## 1437
## 7
bedtools intersect can also output the original features from both
files, followed by the number of overlapping bases, with the -wo
option.
Run the code block below to perform an intersect between macaque SVs and genes with the -wo option:
bedtools intersect -wo -a data2/macaque-svs-filtered.bed -b macaque-genes.gff | head
# bedtools: A suite of programs to process bed files
# intersect: The sub-program of bedtools to execute
# -wo : A bedtools intersect option that specifies to write both features and the number of overlapping bases to the output file
# -a : The first interval file to check for overlaps
# -b : The second interval file to check overlaps
# | : The Unix pipe operator to pass output from one command as input to another command
## chr1 130740 131675 chr1:130740:<DEL>:935:285.63 chr1 ensembl gene 120996 155534 . + . ID=gene:ENSMMUG00000000838;Name=AGRN;biotype=protein_coding;description=agrin [Source:HGNC Symbol%3BAcc:HGNC:329];gene_id=ENSMMUG00000000838;logic_name=ensembl;version=3 935
## chr1 541132 590572 chr1:541132:<DEL>:49440:316.41 chr1 ensembl gene 562049 562264 . + . ID=gene:ENSMMUG00000045301;biotype=protein_coding;gene_id=ENSMMUG00000045301;logic_name=ensembl;version=1 216
## chr1 541132 590572 chr1:541132:<DEL>:49440:316.41 chr1 ensembl gene 569144 591870 . + . ID=gene:ENSMMUG00000001549;biotype=protein_coding;gene_id=ENSMMUG00000001549;logic_name=ensembl;version=3 21429
## chr1 552968 582234 chr1:552968:<DUP>:29266:189.32 chr1 ensembl gene 562049 562264 . + . ID=gene:ENSMMUG00000045301;biotype=protein_coding;gene_id=ENSMMUG00000045301;logic_name=ensembl;version=1 216
## chr1 552968 582234 chr1:552968:<DUP>:29266:189.32 chr1 ensembl gene 569144 591870 . + . ID=gene:ENSMMUG00000001549;biotype=protein_coding;gene_id=ENSMMUG00000001549;logic_name=ensembl;version=3 13091
## chr1 1117696 1122022 chr1:1117696:<DEL>:4326:201.55 chr1 ensembl gene 1085483 1236570 . + . ID=gene:ENSMMUG00000018911;Name=PRKCZ;biotype=protein_coding;description=protein kinase C zeta [Source:HGNC Symbol%3BAcc:HGNC:9412];gene_id=ENSMMUG00000018911;logic_name=ensembl;version=3 4326
## chr1 1151866 1154542 chr1:1151866:<DEL>:2676:11284.32 chr1 ensembl gene 1085483 1236570 . + . ID=gene:ENSMMUG00000018911;Name=PRKCZ;biotype=protein_coding;description=protein kinase C zeta [Source:HGNC Symbol%3BAcc:HGNC:9412];gene_id=ENSMMUG00000018911;logic_name=ensembl;version=3 2676
## chr1 1166390 1167586 chr1:1166390:<DEL>:1196:15253.03 chr1 ensembl gene 1085483 1236570 . + . ID=gene:ENSMMUG00000018911;Name=PRKCZ;biotype=protein_coding;description=protein kinase C zeta [Source:HGNC Symbol%3BAcc:HGNC:9412];gene_id=ENSMMUG00000018911;logic_name=ensembl;version=3 1196
## chr1 1408621 1409766 chr1:1408621:<DEL>:1145:1112.53 chr1 ensembl gene 1407566 1447126 . - . ID=gene:ENSMMUG00000012345;Name=MORN1;biotype=protein_coding;description=MORN repeat containing 1 [Source:HGNC Symbol%3BAcc:HGNC:25852];gene_id=ENSMMUG00000012345;logic_name=ensembl;version=3 1145
## chr1 1409564 1410074 chr1:1409564:<DEL>:510:13091.76 chr1 ensembl gene 1407566 1447126 . - . ID=gene:ENSMMUG00000012345;Name=MORN1;biotype=protein_coding;description=MORN repeat containing 1 [Source:HGNC Symbol%3BAcc:HGNC:25852];gene_id=ENSMMUG00000012345;logic_name=ensembl;version=3 510
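The last column of the -wo output makes it easy to work out what fraction of each SV falls within a gene. The sketch below uses two rows modeled on the first two lines of the output above, with the SV names and gene attributes shortened to made-up values (svA, svB, G1, G2) and saved to a made-up file name:

```shell
# Two rows in -wo layout: 4 bed columns, 9 GFF columns, then the number of overlapping bases
printf 'chr1\t130740\t131675\tsvA\tchr1\tensembl\tgene\t120996\t155534\t.\t+\t.\tID=gene:G1\t935\n' > toy-wo.tsv
printf 'chr1\t541132\t590572\tsvB\tchr1\tensembl\tgene\t562049\t562264\t.\t+\t.\tID=gene:G2\t216\n' >> toy-wo.tsv

awk 'BEGIN{FS="\t"} {printf "%s\t%s\t%d\t%.3f\n", $4, $13, $NF, $NF/($3-$2)}' toy-wo.tsv > toy-wo-fractions.tsv
# FS="\t": Split on tabs only, since real GFF attribute columns can contain spaces
# $NF: The last column, which is the number of overlapping bases
# $3-$2: The length of the SV in bed coordinates
cat toy-wo-fractions.tsv
## svA ID=gene:G1 935 1.000
## svB ID=gene:G2 216 0.004

awk -F'\t' '$NF/($3-$2) >= 0.9{print $4}' toy-wo.tsv
# Print the names of SVs with at least 90% of their length inside the gene
## svA
```

The final filter applies the same kind of threshold to the SV as `-f 0.9` did in the exercise, just calculated by hand from the overlap column.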
That’s it for day 2! Join us next week to learn about VCF files, shell scripts, conda environments, and the cluster.