Welcome to the second day of the FAS Informatics Bioinformatics Tips & Tricks workshop!

If you’re viewing this file on the website, you are viewing the final, formatted version of the workshop. The workshop itself will take place in the RStudio program and you will edit and execute the code in this file. Please download the raw file here

Today we’re going to continue our tour and explanation of common genomics file formats and their associated tools by talking about interval files, that is, files that indicate regions of a genome (e.g. .bed and .gff files).

We’ll be learning about how to view and manipulate these files using both the native commands present in the Linux command line as well as tools developed specifically for these file formats.

Command input and output

Just to begin, I wanted to take a second to re-iterate a few concepts we learned yesterday. In general, the aim of a lot of the commands we run is to take text in a file that is formatted in a specific way and manipulate or process that text. This is central to the Unix philosophy:

formatted text -> command -> processed text

How do commands output text?

  1. By default, most commands simply print their output to the screen. While this may not seem sensible when processing such large files, it is essential for two other operations: piping and redirecting.

  2. Many times, instead of displaying output to the screen we will want to save the output to a file. Natively, Unix has the redirect operator, which is the > character. Note that this is distinct from the literal string ">", which we see as the header character in FASTA files. Rather, this is part of the command being run:

command > output_file.txt

If we are using grep to search for header lines in a FASTA file like we did yesterday, we may see a command like this:

grep '>' file.fa > headers.txt

In this example, '>' and > are doing 2 different things. The string literal '>', being quoted, is the string we are searching for in our file with grep – it is an input argument to our grep command. The second, unquoted > is the Unix redirect operator, which is placed at the end of the command and tells the shell to redirect the output into the provided file.
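As a small, self-contained illustration of the two roles of >, the sketch below builds a toy FASTA file and saves its header lines to a second file. The file names toy.fa and headers.txt and the sequences are made up for this example.

```shell
# Work in a temporary directory so no real files are touched
workdir=$(mktemp -d)

# Build a toy FASTA file with two records (hypothetical example data)
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$workdir/toy.fa"

# Quoted '>' is the search pattern; unquoted > redirects grep's output to a file
grep '>' "$workdir/toy.fa" > "$workdir/headers.txt"

# Show what was saved
cat "$workdir/headers.txt"
```

Without the quotes, the shell would read the first > as a redirect as well and could overwrite the FASTA file, so the pattern should always be quoted.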

Many programs will also have built-in options to redirect output to a file. A common option is -o filename.txt, which would tell a command to write output to that file rather than display it on the screen. We saw this yesterday with samtools, e.g.:

samtools view -b -o output.bam input.sam

which would convert input.sam to BAM format and save it to output.bam. While -o is a common output option, it is not universal and it’s important to read the documentation for each tool you use to see the output options.

  3. The other way output can be used in a Unix command is by piping it to another command with the | operator. Remember that commands simply take text as input and process it in some way that is output to the screen. If the output of one command is compatible with another, then they can be strung together:

command1 input_file.txt | command2

Here, we’ve specified the input file for command1, but not for command2. Instead, the | operator says: take the output of command1 and use it as the input of command2. This is an extremely powerful way to construct basic pipelines and we did this a bit yesterday.

Pipes and redirects can be combined:

command1 input_file.txt | command2 > output_file.txt

Here, the text in input_file.txt is first processed by command1 and that processed text is piped to command2 as input. command2 does its processing of the text and then this is redirected to output_file.txt, which should now have the text processed by both commands.
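A minimal sketch of a pipe combined with a redirect, using a made-up toy FASTA file (toy.fa and n_headers.txt are hypothetical names): grep plays the role of command1, wc -l plays command2, and the final count ends up in a file.

```shell
# Work in a temporary directory so no real files are touched
workdir=$(mktemp -d)

# A toy FASTA file with three records (hypothetical example data)
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > "$workdir/toy.fa"

# grep selects the header lines, wc -l counts them, and > saves the count to a file
grep '>' "$workdir/toy.fa" | wc -l > "$workdir/n_headers.txt"

# The file now holds text processed by both commands
cat "$workdir/n_headers.txt"
```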

Note that if you use a program’s -o option to save its output to a file, you can no longer pipe that output to another command:

command1 -o output_file.txt input_file.txt | command2

This will result in output_file.txt containing only the text processed by command1. Since the text from command1 was written to the file, there is nothing to pipe to command2, which may or may not display an error.

Bed files

Today we’ll talk about bed files. Bed files are used to indicate regions of a genome with each line in the file representing one region. The bed format is an extremely flexible format – the regions contained within it can represent anything. In its most basic and common form it is also an extremely simple format, consisting of three columns of text separated by a tab character. The first column represents the chromosome or assembly scaffold of the region, while the second indicates the starting coordinate and the third indicates the ending coordinate.

Bed files might have the .bed extension, and while it is best practice to use a file extension that properly describes the format of a file it is not required. Any 3 column tab delimited file that has the columns we described is a bed file.

A warning about coordinate systems

We will talk about several different file types today that are used to reference locations in the genome. Unfortunately for all of us, for various reasons different file types use different coordinate styles. Bed files, which we will talk about first, use 0-based coordinates and do not include the end base in the interval (technically, this is called a right-open interval). So in a bed file, an interval that includes the first 100 bases of a chromosome would have start=0, end=100.

Gff files in contrast use 1-based coordinates and do include both the start and the end base in the interval (technically, this is called a closed interval). So in a gff file, an interval that includes the first 100 bases of a chromosome would have start=1, end=100.

It is worth noting that while the 1-based closed format of GFF files is more intuitive to read, it does suffer some issues. In particular, it is impossible to unambiguously encode a 0-length feature in a GFF file.
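To make the two conventions concrete, here is a small sketch that converts the "first 100 bases" interval from bed-style to GFF-style coordinates and computes its length both ways. The variable names are made up for this example, and only the coordinates are converted, not the full GFF column layout.

```shell
# A bed interval covering the first 100 bases of a chromosome (0-based, end excluded)
bed_line=$(printf 'chr1\t0\t100')

# Convert to 1-based closed coordinates (GFF style): add 1 to the start, keep the end
gff_coords=$(printf '%s\n' "$bed_line" | awk '{print $1, $2 + 1, $3}')

# Length is end - start for bed, and end - start + 1 for 1-based closed intervals
bed_len=$(printf '%s\n' "$bed_line" | awk '{print $3 - $2}')
gff_len=$(printf '%s\n' "$gff_coords" | awk '{print $3 - $2 + 1}')

echo "bed: $bed_line -> GFF-style: $gff_coords"
echo "bed length: $bed_len, GFF-style length: $gff_len"
```

Both lengths come out the same, as they should, since the two lines describe the same 100 bases. In bed coordinates a 0-length feature is simply one where start equals end.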

Bed file example - Macaque structural variants

Today we’ll be working with a bed file that contains calls of structural variants (e.g. large deletions and duplications of segments of the genome; abbreviated SVs) from a small population of rhesus macaques (if you attended the R workshop earlier this month you might already be familiar with this dataset). Rhesus macaques are small, Old-World monkeys that are widespread across southern and eastern Asia and are a common model organism for the study of human disease and primate evolution. We sequenced these genomes to study the evolution of structural variation over different timescales.

First thing we should do is look at our data. We can do this a couple of ways here. With the RStudio setup with the VDI, we can just use our file browser on the right to navigate to the path of the file and open it in the text editor (this panel).

However, if we want to see things in a more Unix way, we can use a command to directly display the contents of the file in our Terminal.

Run the following command in the Terminal below to view the bed file containing macaque SVs.

Note that whenever you see the > character followed by green text, this is an exercise or action to be done by you!

less -S data2/macaque-svs-filtered.bed

less is a file viewing program that lets us look at parts of a file without loading the whole thing into memory. You can scroll through the file with <up arrow> and <down arrow> to move line-by-line, or with <spacebar> and b to move by page (one screen of text). The -S flag simply means do not wrap the lines to fit on the screen, so we can also scroll left and right with <left arrow> and <right arrow>. Press q to quit and return to the Terminal interface.

So what do we see? We see, as described, three columns of text indicating the chromosome, start coordinate, and end coordinate of each SV (row). We also see a fourth column with a bunch of extra information. The fourth column in a bed file is an optional column meant to provide each region with a unique ID. In this case, the unique ID is just a long string of separate pieces of information delimited by a colon (:) character. In a way, I’ve made this column a sort of catch-all for other information not included in the base bed format (e.g. SV length, SV type), which is a common strategy in genomic file formats. Most of this information we can ignore, but I will point out that the type of each SV is encoded as a string, with deletions being <DEL> and duplications being <DUP>. We may use this information later.
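As a rough sketch of how the colon-delimited ID column can be picked apart, the example below uses the standard cut command on one line copied from the file. It assumes the SV type is always the third colon-separated piece of the ID, which holds for the lines shown in this workshop.

```shell
# One line copied from the macaque SV bed file
line=$(printf 'chr1\t130740\t131675\tchr1:130740:<DEL>:935:285.63')

# First cut keeps the 4th tab-separated column (tab is cut's default delimiter);
# second cut splits that column on ":" and keeps the 3rd piece, the SV type
sv_type=$(printf '%s\n' "$line" | cut -f 4 | cut -d ':' -f 3)

echo "$sv_type"
```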

In addition to this optional fourth column for an ID, bed files have several other common pieces of information that could be encoded in extra columns. Most of the time these extra columns are ignored by the tools that process bed files, but sometimes specific columns are used.

For more information on bed files and these extra columns, visit the following links:

Summarizing SVs from the command line

So imagine we get this bed file from our collaborator who has called these SVs, and the first thing we should do is get a general idea about the variants called. What can we do from the command line?

The most basic thing we’ll want to know is how many structural variants have been called. Recalling that each line in a bed file represents one region, which in this case means one structural variant, we can simply count the number of lines in the file with the wc command.

Run the command below to count the number of SVs in the bed file. How many SVs are there?


wc -l data2/macaque-svs-filtered.bed
# wc: the Unix word count command
# -l: tells wc to only return the line count
## 3646 data2/macaque-svs-filtered.bed

Cool! We also may want to know how many of these SVs are deletions and how many are duplications. We can figure that out with grep.

Exercise: In the code block below, use grep to count the number of deletions and duplications separately. Remember that SV type is encoded in the fourth column of our bed file.


## Count the number of deletions
# data2/macaque-svs-filtered.bed
grep -c "<DEL>" data2/macaque-svs-filtered.bed


## Count the number of duplications
# data2/macaque-svs-filtered.bed
grep -c "<DUP>" data2/macaque-svs-filtered.bed
## 3214
## 432

awk basics

So we have a lot more deletions than duplications. If we didn’t have reason to believe that deletions are more common than duplications (which we think they are) we may want to ask our collaborator to re-check their calls. But we can do some more checking ourselves too. Maybe, on average, the deletions being called are smaller events than the duplications so it would be expected that there are more of them. To check whether that is the case, we could get the average length of deletions and duplications in our bed file. The first step of that is to get the length of each SV.

Yesterday, we started to learn about awk, which is a scripting language that is interpreted in the Unix shell. Basically what this means is that we can use awk much the same way as if we were programming in a text editor. awk’s appeal for us is that it is set up to automatically read through and process text files, line by line, which is a common task in bioinformatics. We could achieve the same functionality by writing a Python or R script, but because those are not integrated into the shell we would waste time writing code to read and write files. awk does that automatically, so for simple file operations it is an extremely useful tool for bioinformaticians to have.

Yesterday you learned the basic syntax of an awk command:

awk '{ action; other action }' input_file.txt

This means that awk reads input_file.txt line by line, and for each line performs both action and other action. A semi-colon (;) is used to delimit separate actions.
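A minimal sketch of two actions in one awk command, run on a made-up two-line file (demo.bed is a hypothetical name; the two records are copied from the macaque bed file). For each line, the first action prints the chromosome and the second prints the SV length.

```shell
# Work in a temporary directory so no real files are touched
workdir=$(mktemp -d)

# Two SV records copied from the macaque bed file
printf 'chr1\t89943\t90471\nchr1\t130740\t131675\n' > "$workdir/demo.bed"

# Two actions separated by a semi-colon, both run for every line
result=$(awk '{ print $1; print $3 - $2 }' "$workdir/demo.bed")

echo "$result"
```

Each input line produces two output lines here, one per action.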

The simplest awk program we could write, then, would be something like this.

Run the code block below. What happens?


awk '{}' data2/macaque-svs-filtered.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file

echo "done"
# echo: A Unix command that simply prints the provided input to the screen
## done

Here, awk has read through our bed file, but nothing is displayed to the screen because we didn’t code any actions for it to perform.

The most basic action we can code for an awk program is the print command.

Run the code block below:


awk '{print}' data2/macaque-svs-filtered.n20.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
## chr1 89943   90471   chr1:89943:<DUP>:528:1907.19
## chr1 130740  131675  chr1:130740:<DEL>:935:285.63
## chr1 218574  219534  chr1:218574:<DUP>:960:5699.01
## chr1 219608  220078  chr1:219608:<DUP>:470:2074.69
## chr1 519434  541582  chr1:519434:<DUP>:22148:1673.64
## chr1 519473  542033  chr1:519473:<DUP>:22560:2560.16
## chr1 520173  541800  chr1:520173:<DEL>:21627:2955.11
## chr1 525401  525806  chr1:525401:<DEL>:405:2986.21
## chr1 541132  590572  chr1:541132:<DEL>:49440:316.41
## chr1 552968  582234  chr1:552968:<DUP>:29266:189.32
## chr1 766381  766933  chr1:766381:<DEL>:552:5099.0
## chr1 1117696 1122022 chr1:1117696:<DEL>:4326:201.55
## chr1 1151866 1154542 chr1:1151866:<DEL>:2676:11284.32
## chr1 1166390 1167586 chr1:1166390:<DEL>:1196:15253.03
## chr1 1408621 1409766 chr1:1408621:<DEL>:1145:1112.53
## chr1 1409564 1410074 chr1:1409564:<DEL>:510:13091.76
## chr1 1564979 1565374 chr1:1564979:<DEL>:395:9231.44
## chr1 1602888 1604046 chr1:1602888:<DEL>:1158:1586.52
## chr1 1774887 1775498 chr1:1774887:<DEL>:611:1933.48
## chr1 1831576 1831983 chr1:1831576:<DEL>:407:3537.19

This time, now that we’ve given the instruction for awk to print we see each line displayed on the screen. This is a good demonstration of awk, but doesn’t really do anything we couldn’t do before. We can view the contents of files with cat, less, head, tail, etc. awk, however, also splits each record (line) into fields (columns) based on some character delimiter (by default, any run of whitespace, which includes tabs). This naturally turns our text file into a data table to manipulate right in the shell.

In awk, the fields or columns are identified by number and a special character, the dollar sign $, to indicate we want to access that column. So, for instance, if I wanted to access only the third column from a given record, I could do so with $3.

Run the code block below to use awk to print only the third column from the bed file with macaque SVs. We use a file containing just 20 of the SVs (macaque-svs-filtered.n20.bed) so we don’t overflow the text editor with output:


awk '{print $3}' data2/macaque-svs-filtered.n20.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
## 90471
## 131675
## 219534
## 220078
## 541582
## 542033
## 541800
## 525806
## 590572
## 582234
## 766933
## 1122022
## 1154542
## 1167586
## 1409766
## 1410074
## 1565374
## 1604046
## 1775498
## 1831983

Another feature of awk, since it is a scripting language, is that it can perform basic operations on the input data. For instance, given two input columns that are numeric, awk can add, subtract, multiply, and divide them with the +, -, *, and / operators.

Exercise: In the code block below, use awk to print the length of each SV:


## Use awk to print the length of each SV
# data2/macaque-svs-filtered.n20.bed
awk '{print $3 - $2}' data2/macaque-svs-filtered.n20.bed
## 528
## 935
## 960
## 470
## 22148
## 22560
## 21627
## 405
## 49440
## 29266
## 552
## 4326
## 2676
## 1196
## 1145
## 510
## 395
## 1158
## 611
## 407

A note on data types

As a programmer (we are coding now!), one of the most important things I can tell you about programming is to always remember what data types you are operating on!

We won’t get into it too much here, but briefly, you should know about data types. Data types are the way different pieces of information are encoded. 3 is an integer. "hello world" is a string of characters. "3" is a character. This is important to remember because different functions and operators may perform different actions depending on the data type input to them, or they might not work at all with the wrong data type. For example, with algebraic operators like addition (+), 3 + 3 is a perfectly valid instruction to write. But what does 3 + "hello world" mean? Different programming languages may behave differently in this situation, some by erroring out and some by doing something you may not expect and not leaving any trace that something is wrong. And different programming languages generally have different data types.

The command above worked because both column 3 and column 2 contain only integers, so awk correctly subtracts their values when the - operator is provided between them. The other columns in our bed file, however, contain character strings.

Run the code block below to try and perform an algebraic operation (-) on a column made up of integers and a column made of strings. What happens? What did you expect to happen?


awk '{print $3 - $1}' data2/macaque-svs-filtered.n20.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
## 90471
## 131675
## 219534
## 220078
## 541582
## 542033
## 541800
## 525806
## 590572
## 582234
## 766933
## 1122022
## 1154542
## 1167586
## 1409766
## 1410074
## 1565374
## 1604046
## 1775498
## 1831983

This is only printing out the third column unchanged. That is because awk silently treats a string that does not look like a number, like chr1, as 0 in arithmetic, so $3 - $1 becomes $3 - 0. awk is pretty good about not throwing errors, so if you didn’t catch this, either because of a typo or because you thought column 1 also contained integers, you may move forward in your analysis and get some strange results you’d struggle to explain later.

All of which is to say (and to re-iterate) that you should always remember what data types you are operating on!
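One way to catch this kind of mistake is to check that a column looks like a number before doing arithmetic on it. The sketch below is only one possible guard, using awk’s regular expression match operator (~) and an if/else statement, on a made-up two-column line.

```shell
# Subtracting a text column: awk treats "chr1" as 0 and reports no error
only_text=$(echo "chr1 90471" | awk '{print $2 - $1}')

# A simple guard: only subtract when column 1 consists entirely of digits
guarded=$(echo "chr1 90471" | awk '{
  if ($1 ~ /^[0-9]+$/) {
    print $2 - $1
  } else {
    print "column 1 is not an integer: " $1
  }
}')

echo "$only_text"
echo "$guarded"
```

The first command quietly prints the second column unchanged, while the guarded version reports the problem instead.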

Variables in awk

In programming, variables are names given to pieces of information, allowing the information to be used later on in the program. The column numbers used by awk with the $ notation are variables that are updated as every record is read.

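awk has several default variables that are initialized when the command is run, including:

  FS: the input field separator (whitespace by default)
  OFS: the output field separator (a single space by default)
  RS: the input record separator (a newline by default)
  ORS: the output record separator (a newline by default)
  NF: the number of fields in the current record
  NR: the number of records read so far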

Most of these pertain to how awk separates records and fields. Like any other variables in a program, their values can be accessed and overwritten. For instance, we can change the field separator (FS) to be something other than white space (e.g. a tab character).

Run the code block below to change the FS variable to colon (:) and print out the first 3 fields. How is this different from the default behavior?


awk 'BEGIN{FS=":"}{print $1,$2,$3}' data2/macaque-svs-filtered.n20.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
## chr1 89943   90471   chr1 89943 <DUP>
## chr1 130740  131675  chr1 130740 <DEL>
## chr1 218574  219534  chr1 218574 <DUP>
## chr1 219608  220078  chr1 219608 <DUP>
## chr1 519434  541582  chr1 519434 <DUP>
## chr1 519473  542033  chr1 519473 <DUP>
## chr1 520173  541800  chr1 520173 <DEL>
## chr1 525401  525806  chr1 525401 <DEL>
## chr1 541132  590572  chr1 541132 <DEL>
## chr1 552968  582234  chr1 552968 <DUP>
## chr1 766381  766933  chr1 766381 <DEL>
## chr1 1117696 1122022 chr1 1117696 <DEL>
## chr1 1151866 1154542 chr1 1151866 <DEL>
## chr1 1166390 1167586 chr1 1166390 <DEL>
## chr1 1408621 1409766 chr1 1408621 <DEL>
## chr1 1409564 1410074 chr1 1409564 <DEL>
## chr1 1564979 1565374 chr1 1564979 <DEL>
## chr1 1602888 1604046 chr1 1602888 <DEL>
## chr1 1774887 1775498 chr1 1774887 <DEL>
## chr1 1831576 1831983 chr1 1831576 <DEL>

Now, the first field includes everything in the line up to the first colon in the last tab separated column. This is most of the line.
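For bed files, a common variation is to set the separator to a tab. The sketch below sets both FS and OFS (awk’s built-in output field separator), so the printed columns stay tab-delimited like the input; the line is copied from the macaque bed file.

```shell
# One line copied from the macaque SV bed file
line=$(printf 'chr1\t89943\t90471\tchr1:89943:<DUP>:528:1907.19')

# FS controls how input is split; OFS controls what print puts between fields
tabbed=$(printf '%s\n' "$line" | awk 'BEGIN{FS="\t"; OFS="\t"} {print $1, $2, $3}')

printf '%s\n' "$tabbed"
```

Without OFS set, print would join the three fields with single spaces, and the output would no longer be a tab-delimited bed file.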

NR is also important. Rather than dealing with how fields and records are read, it simply counts the number of records as they are read.

Run the code block below to see how the value of NR changes for each record read:


awk '{print NR}' data2/macaque-svs-filtered.n20.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14
## 15
## 16
## 17
## 18
## 19
## 20
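NR can also be combined with other fields or used in a condition. As a sketch on a made-up three-line file (demo.bed is a hypothetical name, with records copied from the macaque bed file), the first command numbers each record and the second keeps only the first two records, much like head -n 2. Conditions like this are covered in the next section.

```shell
# Work in a temporary directory so no real files are touched
workdir=$(mktemp -d)

# Three SV records copied from the macaque bed file
printf 'chr1\t89943\t90471\nchr1\t130740\t131675\nchr1\t218574\t219534\n' > "$workdir/demo.bed"

# Prefix every record with its record number
numbered=$(awk '{print NR, $1, $2, $3}' "$workdir/demo.bed")

# Use NR in a condition to keep only the first two records
first_two=$(awk 'NR <= 2 {print}' "$workdir/demo.bed")

echo "$numbered"
echo "$first_two"
```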

awk patterns and custom variables

Yesterday you learned a bit about regular expressions and how to use them with grep. awk can also use regular expressions as patterns to decide which records to process. When no pattern is provided, as in the commands above, every line in the file matches, so the action is performed on every line. However, you can use awk similarly to grep to display and process only lines that match some pattern.

Run the code block below to use awk to display only lines that represent duplications:


awk ' /<DUP>/ {print}' data2/macaque-svs-filtered.n20.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
## chr1 89943   90471   chr1:89943:<DUP>:528:1907.19
## chr1 218574  219534  chr1:218574:<DUP>:960:5699.01
## chr1 219608  220078  chr1:219608:<DUP>:470:2074.69
## chr1 519434  541582  chr1:519434:<DUP>:22148:1673.64
## chr1 519473  542033  chr1:519473:<DUP>:22560:2560.16
## chr1 552968  582234  chr1:552968:<DUP>:29266:189.32

This should be equivalent to the following:


grep "<DUP>" data2/macaque-svs-filtered.n20.bed
# grep: The Unix string search command
# "<DUP>": The string to search for in the provided file
## chr1 89943   90471   chr1:89943:<DUP>:528:1907.19
## chr1 218574  219534  chr1:218574:<DUP>:960:5699.01
## chr1 219608  220078  chr1:219608:<DUP>:470:2074.69
## chr1 519434  541582  chr1:519434:<DUP>:22148:1673.64
## chr1 519473  542033  chr1:519473:<DUP>:22560:2560.16
## chr1 552968  582234  chr1:552968:<DUP>:29266:189.32

However, with awk, we can also process the matching lines within the same command.

Exercise: Use a single awk command to print the length of every duplication in the macaque SV bed file.


## Use awk to print the length of every duplication
# data2/macaque-svs-filtered.n20.bed
awk '/<DUP>/ {print $3 - $2}' data2/macaque-svs-filtered.n20.bed
## 528
## 960
## 470
## 22148
## 22560
## 29266

We can also select lines based on the value in a particular column, using the same $ notation as before to refer to the column. For instance, we can print only SVs on the X chromosome.

Run the following block to print only lines of the bed file where the first column is “chrX”:


awk ' $1=="chrX"{print}' data2/macaque-svs-filtered.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
## chrX 1988    2464    chrX:1988:<DEL>:476:5630.03
## chrX 3478    4124    chrX:3478:<DEL>:646:2476.53
## chrX 7281    14220   chrX:7281:<DEL>:6939:307.26
## chrX 62980   63554   chrX:62980:<DEL>:574:1578.01
## chrX 64524   64940   chrX:64524:<DUP>:416:5057.12
## chrX 107557  108311  chrX:107557:<DEL>:754:2465.47
## chrX 165868  166394  chrX:165868:<DUP>:526:402.59
## chrX 207382  208215  chrX:207382:<DUP>:833:2285.44
## chrX 278868  279501  chrX:278868:<DEL>:633:7328.7
## chrX 302402  302996  chrX:302402:<DEL>:594:2752.0
## chrX 377764  378397  chrX:377764:<DEL>:633:317.17
## chrX 411443  411860  chrX:411443:<DEL>:417:2583.56
## chrX 420049  420650  chrX:420049:<DUP>:601:8107.15
## chrX 426741  427265  chrX:426741:<DEL>:524:4874.08
## chrX 427022  427595  chrX:427022:<DUP>:573:684.46
## chrX 489174  491968  chrX:489174:<DEL>:2794:2691.96
## chrX 583257  584278  chrX:583257:<DUP>:1021:3134.13
## chrX 586115  586579  chrX:586115:<DUP>:464:7513.0
## chrX 608503  611838  chrX:608503:<DUP>:3335:2762.6
## chrX 609545  612675  chrX:609545:<DEL>:3130:290.41
## chrX 610696  611235  chrX:610696:<DUP>:539:5966.77
## chrX 610406  610905  chrX:610406:<DEL>:499:12606.37
## chrX 611827  612606  chrX:611827:<DEL>:779:384.62
## chrX 631553  632399  chrX:631553:<DUP>:846:7035.83
## chrX 695765  696374  chrX:695765:<DEL>:609:3692.3
## chrX 711314  712014  chrX:711314:<DUP>:700:1999.33
## chrX 711189  712411  chrX:711189:<DEL>:1222:675.4
## chrX 739531  740094  chrX:739531:<DUP>:563:3943.29
## chrX 739910  740599  chrX:739910:<DEL>:689:289.41
## chrX 787138  788503  chrX:787138:<DUP>:1365:1836.45
## chrX 927279  927667  chrX:927279:<DUP>:388:10797.04
## chrX 1149286 1150206 chrX:1149286:<DUP>:920:348.36
## chrX 1149490 1150104 chrX:1149490:<DEL>:614:1282.46
## chrX 1177417 1178193 chrX:1177417:<DUP>:776:2832.53
## chrX 1280613 1280869 chrX:1280613:<DUP>:256:9117.2
## chrX 1300711 1301484 chrX:1300711:<DEL>:773:1067.94
## chrX 1427624 1428395 chrX:1427624:<DEL>:771:3446.11
## chrX 1700718 1701519 chrX:1700718:<DEL>:801:2508.96
## chrX 2670310 2670696 chrX:2670310:<DUP>:386:1656.48
## chrX 2714010 2716794 chrX:2714010:<DEL>:2784:28167.58
## chrX 2894333 2904887 chrX:2894333:<DEL>:10554:498.77
## chrX 3515446 3515791 chrX:3515446:<DEL>:345:5134.69
## chrX 4456589 4457158 chrX:4456589:<DEL>:569:1875.91
## chrX 6881120 6881959 chrX:6881120:<DEL>:839:4596.71
## chrX 8451528 8452433 chrX:8451528:<DEL>:905:550.93
## chrX 8454996 8475204 chrX:8454996:<DEL>:20208:487.62
## chrX 8472209 8473006 chrX:8472209:<DUP>:797:1077.54
## chrX 8475742 8479634 chrX:8475742:<DEL>:3892:4230.64
## chrX 8477637 8478720 chrX:8477637:<DEL>:1083:2629.3
## chrX 9886789 9887256 chrX:9886789:<DEL>:467:7498.94
## chrX 12506208    12514476    chrX:12506208:<DEL>:8268:1477.12
## chrX 12526487    12549496    chrX:12526487:<DEL>:23009:2340.64
## chrX 15866445    15867408    chrX:15866445:<DEL>:963:12513.77
## chrX 21462342    21464887    chrX:21462342:<DEL>:2545:773.76
## chrX 25341791    25343157    chrX:25341791:<DEL>:1366:23636.98
## chrX 28166208    28168644    chrX:28166208:<DEL>:2436:3149.73
## chrX 28814796    28817815    chrX:28814796:<DEL>:3019:17095.27
## chrX 30019364    30019954    chrX:30019364:<DEL>:590:360.75
## chrX 34457321    34457797    chrX:34457321:<DEL>:476:4343.45
## chrX 41105284    41105990    chrX:41105284:<DEL>:706:2020.33
## chrX 45125902    45126511    chrX:45125902:<DEL>:609:2411.45
## chrX 47425736    47428746    chrX:47425736:<DEL>:3010:3554.64
## chrX 47911677    47914869    chrX:47911677:<DEL>:3192:4711.68
## chrX 49566153    49574194    chrX:49566153:<DUP>:8041:4698.6
## chrX 49566444    49612048    chrX:49566444:<DEL>:45604:2093.93
## chrX 54566354    54588336    chrX:54566354:<DEL>:21982:2478.71
## chrX 55270771    55272939    chrX:55270771:<DUP>:2168:676.53
## chrX 61020469    61021120    chrX:61020469:<DEL>:651:1420.94
## chrX 73368438    73374565    chrX:73368438:<DEL>:6127:30670.27
## chrX 80022182    80022832    chrX:80022182:<DEL>:650:3318.89
## chrX 81382396    81382959    chrX:81382396:<DEL>:563:3110.35
## chrX 81774454    81775074    chrX:81774454:<DEL>:620:7519.69
## chrX 86223974    86225085    chrX:86223974:<DEL>:1111:1993.16
## chrX 86225397    86226678    chrX:86225397:<DEL>:1281:2184.25
## chrX 86600287    86600725    chrX:86600287:<DEL>:438:6027.45
## chrX 86613955    86615407    chrX:86613955:<DEL>:1452:1737.19
## chrX 88765180    88766033    chrX:88765180:<DEL>:853:465.27
## chrX 91174961    91176888    chrX:91174961:<DEL>:1927:4205.93
## chrX 92158227    92159027    chrX:92158227:<DEL>:800:15778.25
## chrX 92753089    92753972    chrX:92753089:<DEL>:883:3724.23
## chrX 97476307    97476914    chrX:97476307:<DEL>:607:745.23
## chrX 98845605    98847132    chrX:98845605:<DEL>:1527:1678.29
## chrX 103957969   103958451   chrX:103957969:<DEL>:482:11427.91
## chrX 106372466   106373502   chrX:106372466:<DEL>:1036:8307.51
## chrX 108518086   108520181   chrX:108518086:<DEL>:2095:3719.46
## chrX 111653104   111653670   chrX:111653104:<DEL>:566:158.36
## chrX 123944663   123946419   chrX:123944663:<DEL>:1756:2666.16
## chrX 124454196   124456326   chrX:124454196:<DEL>:2130:31891.72
## chrX 129169887   129170435   chrX:129169887:<DEL>:548:11203.65
## chrX 129328746   129330969   chrX:129328746:<DEL>:2223:5690.06
## chrX 130990616   130991273   chrX:130990616:<DEL>:657:22216.66
## chrX 135378002   135378668   chrX:135378002:<DEL>:666:3953.37
## chrX 135679612   135715923   chrX:135679612:<DEL>:36311:1521.72
## chrX 135682628   135718741   chrX:135682628:<DEL>:36113:4892.62
## chrX 137821409   137821980   chrX:137821409:<DEL>:571:17208.79
## chrX 145551156   145552312   chrX:145551156:<DUP>:1156:3422.51
## chrX 146387029   146422365   chrX:146387029:<DEL>:35336:4990.42

BEGIN and END

awk has two special patterns, BEGIN and END. These patterns are followed by instructions that are to be performed either before (BEGIN) or after (END) awk reads every record in the file. Recall that, by default, awk performs the specified actions on every record (line) in the input file. These two keywords allow us to perform summary tasks both before and after the records are read and processed.

Run the code block below to use awk to only print the total number of records (without using NR):


awk ' BEGIN{sum=0} {sum++} END{print sum}' data2/macaque-svs-filtered.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
## 3646

To break this down, we told awk that we want it to read every record in the bed file, but BEFORE doing that set the value of a new variable called sum to 0. Then, as every record is read, increment sum by 1 with the ++ operator. Finally, after all records have been read, print out the value of sum, which should now be the total number of lines in the file. Remember that awk already has a variable that does this, NR.

In addition to the ++ operator, which adds 1 to a variable, it is useful to know about the += operator, which adds whatever is on the right side of the operator to the variable on the left side. So we could have written the code above as {sum += 1}. The ++ operator is a shortcut when we just need to increment a variable by 1, but the += operator allows us to increment a variable by more than 1, or even by another variable (e.g., {sum += $1} would keep a running total of the first column of a file).

This command introduces another key concept in awk programs: user-defined variables. Here, sum is not part of awk’s default namespace – we create and manipulate this variable on our own. We could have easily called it something else (e.g. random_data=0), but sum seems to be a good descriptive name for its purpose. record_count would also be a good name for this.
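As another sketch of user-defined variables, the command below keeps two of them at once: a running total of SV lengths and the longest SV seen so far. The variable names (total_length, longest, len) and the file name demo.bed are made up for this example; the three records are copied from the macaque bed file.

```shell
# Work in a temporary directory so no real files are touched
workdir=$(mktemp -d)

# Three SV records copied from the macaque bed file
printf 'chr1\t89943\t90471\nchr1\t519434\t541582\nchr1\t130740\t131675\n' > "$workdir/demo.bed"

# BEGIN sets the starting values, the middle block updates them for every record,
# and END prints both once all records have been read
summary=$(awk 'BEGIN{total_length=0; longest=0}
               {len = $3 - $2; total_length += len; if (len > longest) longest = len}
               END{print total_length, longest}' "$workdir/demo.bed")

echo "$summary"
```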

Average SV length with awk

Great! Now we’ve got some new awk knowledge. Let’s try and put it all together to calculate the average length of all SVs in our bed file.

Exercise: In the code block below, write a single awk command that calculates the average length of the SVs in the bed file. This command will need to:

  1. Calculate the length of each SV
  2. Add the length to a running total
  3. After reading all records, divide the final total length of all SVs by the total number of SVs in the file (hint: remember NR!)


## Write awk command to calculate average length of SVs
# data2/macaque-svs-filtered.bed
awk '{sum += $3 - $2} END {if (NR > 0) print sum / NR }' data2/macaque-svs-filtered.bed
## 3615.02

Ok, so we now have the average length of ALL SVs. What about deletions and duplications separately?

Exercise: In the code block below, calculate the average length of duplications and deletions separately (2 commands). This can be done in several ways using the tools we’ve taught (i.e. grep, awk, pipes (|)) or just with a single awk command per SV type. Are deletions or duplications longer on average?


## Calculate the average length of deletions
# data2/macaque-svs-filtered.bed
grep "<DEL>" data2/macaque-svs-filtered.bed | awk '{sum += $3 - $2} END {if (NR > 0) print sum / NR }'
## 3161.33


## Calculate the average length of duplications
# data2/macaque-svs-filtered.bed
grep "<DUP>" data2/macaque-svs-filtered.bed | awk '{sum += $3 - $2} END {if (NR > 0) print sum / NR }'
## 6990.42
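
The exercise also mentions doing this with a single awk command per SV type. One possible sketch is shown here on a made-up three-line bed file, so the file contents and the chrToy name are assumptions for illustration only. The key difference from the grep version is that awk now reads every line, so NR would count deletions and duplications together. We therefore keep our own counter, n, which only increases on matching lines:

```shell
# Make a tiny toy bed file (made-up intervals, not the workshop data)
tmpdir=$(mktemp -d)
printf 'chrToy\t100\t200\t<DEL>\nchrToy\t300\t700\t<DUP>\nchrToy\t800\t1100\t<DEL>\n' > "$tmpdir/toy-svs.bed"

# /<DEL>/ restricts the action to lines containing <DEL>, like grep did above
# n counts only the matching lines, so we divide by n instead of NR
del_avg=$(awk '/<DEL>/ {sum += $3 - $2; n++} END {if (n > 0) print sum / n}' "$tmpdir/toy-svs.bed")
dup_avg=$(awk '/<DUP>/ {sum += $3 - $2; n++} END {if (n > 0) print sum / n}' "$tmpdir/toy-svs.bed")

echo "Average deletion length: $del_avg"
echo "Average duplication length: $dup_avg"
```

Either approach should give the same answer on the real file. The grep version is arguably easier to read, while the awk-only version avoids starting a second program.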

bedtools

We can do a lot of simple processing of bed files (and genomic files in general) with standard Unix commands like grep, awk, wc, etc. However, many tasks require software built specifically for these types of files. For bed files (and other interval files), bedtools is a great option. It has a wide range of functions for working with these files, and is particularly powerful when you are interested in the overlap between regions in two files.

We’ll only have time to go over a small number of bedtools functions in this workshop, so be sure to check out the bedtools website for more in-depth documentation on all its functions:

bedtools website

bedtools getfasta

Given a set of genomic regions in a bed file, one common task you may want to accomplish is to get the sequences contained within those intervals from the genome. bedtools can do this with the bedtools getfasta command. You can type bedtools getfasta -h in the Terminal below to see some documentation about this command. To extract sequences with getfasta, you will need:

  1. The bed file with regions of interest.
  2. The whole genome fasta file from which the coordinates in the bed file are derived.
  3. A sequence index (.fai file) of the input genome – though bedtools will create this automatically if it isn’t found.

We’ve provided the genome file for you. So let’s get the sequences of our macaque SVs in FASTA format.

Run the code block below to extract the sequences of the macaque SVs in the bed file in FASTA format:


bedtools getfasta -fi data2/rheMac8.fa -bed data2/macaque-svs-filtered.bed -fo macaque-svs-filtered.fa
# bedtools: A suite of programs to process bed files
# getfasta: The sub-program of bedtools to execute
# -fi: The genome fasta file as input
# -bed: The bed file as input
# -fo: The desired output fasta file

head macaque-svs-filtered.fa
# Display the first few lines of the new file with head
## >chr1:89943-90471
## TGGGTTGATGGTTTCTGGAGTTCAGGGTTGATTGTTTCTGGAGTTCTGGGTTGATTGTTTCTGGAGTTCAGGGTTGATTGGTCTGGAGTTCTGGGTTGATTGTTTTCTGGAGTTGGGGGTCGATTGTTTCTGGAGTTCGGGGTTGATTGTTTCTGGAGTTCGGGGTTGATTGTTTCTGGAGTTCGGGGTTGATTATTTCTGGAGTTCAGGGTTGATTGGTCTGGAGTTCTGGGTTGATTGTTTTCTGGAGTTCAGGGTTGATTGTTCTGGAGTTCTGGGTTGATTGTTTTCTGGAGTTCTGGGTTGATTGTTTCTGGAGTTCGGGGTTGATTGTTTCTGGAGTTTGTGGTTGATTGTTTCTGGAGTTCTGGGTTGATTGTTTCTGGAGTTCAGGGTTGATTGTTTCTGGAGTTCTGGGTTGATGGTTTCTGGAGTTCTGGGTTGATTGTTTTCTGGAGTTCAGGGTTGATTGTTTCTGGAGTTCTGGGTTGATTGTTTCTGGAGTTGGGGGTCGATTGTTTCTGGAGT
## >chr1:130740-131675
## CAGGCAGGTGGGGGGCTATCAGTGTCTATGCAGGCAGGTGGGGGTTCATCAGTGTCTATACAGGCAGGTGGGGGGACATTAGTGTGTATGCAGGCAGGTGAGGGGACATCTAGTGTCTATGCAGGCAGGTGGGGGGACATCCAGTGTCTATGCAGGCAGGTGGGCGGTCATCAGTGTGTATGCAGGCAGGTGGGGGGACACCCAGTGTTTATACAGGCAGGTGGGGGGAGGTCATCAGTGTCTATGCAGGCAGGTGGGGGGACATCCAGTGTCTATGCAGGCAGGTGGGGGGATGCCCAGTGTCTATGCAGGCAGGTGGGGGGATGCCCAGTGTCTATGCAGGCAGGTGGGGGGACACCCAGTGTTTATGCAGGCAGGTGGGGGGAGGTCATCAGTGTCTATGCAGGCAGGTGGGGGGACATCCAGTGTCTATGCAGGCAGGTGAGGGGACATCTAGTGTCTATGCAGGCAGGTGGGGGGTCATCAGTGTGTATGCAGGCAGGTGGGGGGACGTCAGTGTCTATGCAGGCAGGTGGGGGGTCATCCAGTATCCAGTATCTATGCAGGCAGGTGGGGGGGTCATCAAGTGTCTATGCAGGCAGGTGGGGGGACGTCAGTGTCTATGCAGGCAGGTGGGGGATGTCAGTGTCTATGCAGGCAGGTGGGGGGGTCCCCAGTGTCTATGCAGGGGGGTCATCAGTGTCTATGCAGGCAGATGGGGGGACATCAGTGTCTATGCAGGCAGGTGGGGGGACATCCAGTATCTATGCAGGCAGATTGGGGGGATGCCCAGTGTTTATGCAGGCAGATTGGGGGGACACCCAGTGTCTATGCAGGCAGGTGGGGGGCTATCAGTGTCTATGCAGGCAGGTGGGGGGGTCATCAGTGTCTATGCAGGCAGGTGGGGGGACATTAGTGTCTATGCAGGCAGGTGA
## >chr1:218574-219534
## TCTGTCACGGAGGAGGCGGGTCTTTCTCTGTCATGGAGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTGCCACGGGGGAGGAGGATCTTTCTCTGCCAATGGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGCGGGTCTTTCTCTGCCGTGGGGAAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCATGGGGGAGGCGGGTCTTTCTCTGCCACGGGGGAGGCGGGTCTTTCTCTGCCGTGGGGGAGGCAGGTCTTTCTCTGTCGTGGGGGAGGCGGGTCTTTCTCTGTCGTGGGGAAGGCGGGTCTTTCTCTGTCGTGGGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGCGGGTCTTTCTCCGTCATGGGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCCATCATGGGGGAGGCGGGTCTTTCTCTGCCTTCAGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTGCCATGTGGGAGGCGGGTCTTTCTCTGCCTTCAGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTGCCATGGGGGAGGCGGGTCTTTCTCCCTCATGGGGAGGCGGGTCTTTCTCCCTCATGGGGGAGGCGGGTCTTTCTCTGCCGTGGGGGAGGCGGGTCTTTCTCTGCCGTGGGGAAGGCGGGTCTTTCTCTGTCATAGGGGAGGCGGGTCTTTCTCTGTCGTGGGGGAGGCGGGTCTTTCTCTGTCATGGGAGAGGCGGGTCTTCGTCTCTCATGGGGGAGGCGGGTCTTCCTCCCTCATGGGGGAGGCGGGTCTTTCTCCGTCATGGGGGAGGCGGGTCTTTCTCT
## >chr1:219608-220078
## CTCTGTCACGTGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGCGGGTCTTTCTCTGCCATGGGGGAGGCGGGTCTTTCTCTGTCACGAGGGAGGCGGGTCTTTCTCTGTCATGGAGGAGGTGGGTCTTTCTCTGTCACGGGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGCGGGTCTTTCTCTGCCTTGAGGGAGGCTGGTCTTTCTCTGCCTTGAGGGAGGCGGGTCTTTCTCTGCCTTGAGGGAGGCGGGTCTTTCTCTGCCTTGAGGGAGGCGGGTCTTTCTCTGCCTTGAGGGAGGCGGGTCTTTCTCTGTCATAGGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGTGGGCCTTTCTCATGGGGGAGGCGGGTCTTTCTCCGTCATGGGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTGCCTTGAGGGAGGCGGGTCTTTCTCTGT
## >chr1:519434-541582
## TCCTGGGGTCAAAGGTGACAGAAGAGGCAGAGGCTGGAGCTTTCTGGAGAATTTACTGACCACAGCGTGGTGCACTTGACATCAGGCGCCCGCCATGGCCGGGCCTGGGTCTGAATGCTGCCCGGGACCAGCTGCCTGCGCTCCAGCAGCCCCTCCCTCCTGAAGGCCAGGTCCCCCGAGAAGAACGAGGCTGCAGAGTGATGTGGGGGCCAGCGGTGACTTCCTACCACACTGTTCTCAGGTGTAAGAGGGCTCGCTTCTGCCCAGGCATTGTCCGTGGAAGACACACAGCCGGCCACTGCAGCCTCAGTCCTGGGATGCCCTGGGGCTGGGTCACAGGGGGCCACGGGCCACGCTGGGAGGCCACAGTCCTGTCGTGCCATGCAGCTCCCTGTCCCCAGATGTCCGCTCAGGGATGCAGAGGGCAGAAACCACACTCGCTGCCTGAATTCTGGGAGCAGAGCCCGGTACCCACTGCCTGGCCGGGGCCTACCCTGGGACTCCAGCCCCTGTTCCCGCTGGCCCGGGCTTCCGGAGGCAACTGTGTCCCTATCCTGGCTCAAGGTCCAGGCTGCACCTGGAACCTGCACGGTCACTCCTCCAGGTCCTCAATGCTGGAGGACTCTCTCAGACAGGAAACCTTTGCGTTGGGCGCAGGGCGGGGTGCGGGGTGGTCACGGGGAATCGCAGGGCAAAACAGCACAGTGCAATCGCGCAGAGCCTGATATTGGCGGATGAAACATAAACTGCTTTCTGCACTTTGTGTCCTTAGGAAGGGTGTGGGGTGTTGGCGGAAGTAGGAAACAGAAGAGGAGCCTGGGCATGCAGCGGGTCTGTCAGAGAGCAGAGCCCTCGGAGCTGCAGTGCTTGGAGGGAGGCGGTTCACCTCTGCCCACTCTCTCCAtttctctctctctcattttccttttagagatggattcttgctctgagcctaggctggagtgcagtggtgtgattatagctcattgcagcctcgcccttccaggctcaagtgatcctcctgcctcagcctgtccagtagcCATACCCTACTAGGTCCTAGTTAGCCCCCAGAGGCGTGCACCACCACGCCCACTAATTGCAAAAATTTGTTggctgggcgcgatggctaacatctgtaatctttgggaggccaaggcgggcggatcacgaggtcaagagatggagaccatcctggctaacacggtgaaacccggtctctactaaaaatacaaaaaattagccgggtgtggtggcgggggcctgtagtcccagctactcaggaggctgaggcaggagaatggcgggaacccgggaagtggagcttgcagtgatctgagatcactccactgcactccagtctgggggacagagcgagactccgtctcaaaataaataaataaataaatatataataaataaataaaaataaaaataaaaCTAAGCCCTTCCTGATGGTCATTGGGGGGTTTGGGGGTTGGGGGGGGTGTCTGGCTATGGCTGGGGAACTCATTTGGTTTTCCTCCTCCTCCTCtttttattttttggtagagacggggtctcttgatttcccaggctgatctccaactcctgggctcaagcaatcctcctgcctcagcctcccaaagtgttgggattacaggcctgagacaccgtagctagccAGCtttctttttttttttgagacggagtcctgctgtcacccaggctggagtgcagtggcgagatctcagcggatcactgtgttatacgtaaattttcggtgtcgcaaaagaagtagcactcgaatgtacacttttctcagctaggaaatttacttctatagaaggggggtctcatagatggagcaatggtgagcatttggacaagggaggggaaggttcttattcctgacgcaggtagcgcctactgctgtgtggttcccttattggacagcgttagacctcacaatctaaatccgattggcCtttttttttttttgagatggagtcttgctg
tgtcgcccagactggagtacagtggtgcgatcttggctcactgcaagctctgcctcctgggttcatgccattcttctgccttagcctcctgagtagttgagactacaagtgtatgccatcatgtgcggctaatttttgtgtttttggtagaaagagatttcaccacgttggccaggatggtctcgatctcctgacctcgagatccacctgcctcggcctcccacagtgctgggattataggcatgagccactgcacctggccttaagtggttctttaaagtctgattcgttgtttctactttccctgatgagggtgggtgtcaaggagtgtggtattcttacataatgtctgatgtttggaatagcAttttttttttttttgaggcagagtctcactctgtcgcccatgctggagtgtagtggcaccatcttgtctcactgtaacctttgcctcccgggttcaaacgatcctcctgcctcagccttccacgtagctaggattacaggcgtccaccaccacggccggctagcttttatatttttagtagagacggggtttcaccatgttggccaggctgtacttgaacttctgacctcaatgatctgcccccctcagcctcccgaagtgctgggatacaggtgtgagccaccactccTCGCTCAAGTAATATGTTAAACTTATGCTTTCTTCTTTTCTTCTTTCttttttttttttttttttttggatggagtcttgttctgtctgcccaggcttgagggcatggcataactcggctcactgccctccgccgttccagtcatgcatatctgctgccttcagcctcctttagtacgggacacgaggccacctgccacccgtgcctggctatttttttatttttttttttttttttttttttttttttttttATCAGgacagagtctggctctgccgccaggctggagcttgcagtggcgtcagctcaacctgcaagctccgctccgcgggttcaacgccattctcatgcctcctcagcctccccgagtaattgggactacagcgcgcccgccaccgccccgctcagtttttgtattttttagcagagaggggttaccgtgtagccaggatgggtctcgattcctgacgcctcgtgatccgcccgtctcggctcccaagctgggattacaggcttgagccacgcgccccggcccggcatttttttcatttttagtaagaaacagggtttcaccgtgtttagccaggattggtgtcgatttcctgacccgtgatccgcccccctcggcctcccaaagtgctggattccaggcctgagcctgcaagccgggccTACTCTTTGGCTTTTAAAAGAATGGGCAACATTGCTTTTCTTTACTAACTTCTAATCTTTCCCTCTCTGACTCATCTCTCCTCCCACTTCTCTTGTTCTCCCTGTCAGTGTTCCTTTCCTAAGAGTTTTTCCCTGTCTATGATCTTTTTTTATAGGCTTTTTTCTAGTTTCTCTTTCTTTGTAATTGTGCGTTAATACTGGCCAATTGTTAGTGACAAATTCCTTGCCAAGAGATCCCTGACCCTAAACCAGCATATTCTGTCCATTCGTTTTAATCTGTACtttatttttcttgagatggagttccgctctgtcgccaggtgtggatggtgtagtggcacgttctcgctcactgtcaactcgccctccagggtcaacccgcaccatcctcgctgccttagcctccgagtacggggattgtacaagcgtccaccacccggcctggcgaggcgcttgatttttttatttcagtagagatgggggttttcatcgtgttagccagatggtcccccccatctcctggactcatgctccgcgcaccgccccttggcctccgcaagtgcgcgattaTGATCTCTCTCAAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNtctcctgcctcagccctctgagtagctgggattacaggcattttttgtatttttagtacagatggggtttcaccattttggtcaggctggtggaggactcctgacctcaaatgacctccccgcctggcctcccgaaatgctgggattacatgcgtgatccaccacgcccagccATACAGttcttatgttaagacaggctctctgtcgcccaggctggagtgcagtggcgcgatcacagctcactgtttgcctcgacctttcaagctcaagctgtcctcctgcctgagccgcccgcgtagccaggactgcaggggcacagtgccatgcccggctaatttttttttttgtgatggcgttttgctcttgttgcccaggctggagtgcggtggcgcaatcttggctcactgcaacctcctccccctgggttcaagcaattctcctgcctcagcctcccaagtagctgggattacagtcatgtaccaccacgcccggttcattttgtatttttttttagtagacaagggatttctccatgtcagtcaggctagtcctaaactcctgacctcaggtgacccgcccacctcagcctcccaaactgctgggattacaggcgtgagccactgtgcctggtcCTGGCTAATAttttttttttttttgagacggagcctcgctctgtcacccagactaaagtacagcggcgcaatctcagctcactgcaagctccgcctcccgggttcatggcattctcctgcctcagcctcccaagtagctgggactacaggctcctgtcacctcgcccggctaattttttgtatttttgtagagacggggtttcacagtgttagccaggatggcctcaatctcctgacctcgtgatccgcccacctcggcctcccaaagtgctgggattataggcgtgagccaccgcgcccagctgtttttttgtaatgttagtagacatggactttccccttgttacccaggctgggctcaaacttctgaggtataagagatgctcccgccttgaccttgtgaagttctgggattacagacgtgagcccccatgcccagtcAGGGGTttgtttgttttggtttttgtttttgtttttgagacagagtctcactctgtcgcccatgctggagtgcagccgtgcaattttggctcgctgcaacctctgcctcccgggttgaagtgattctcctgcttcagcctcccacgtagctgagaccacaggtgtgccaccgcgcctggctgatttttgtatttttagtggagacggggtctcaccatattggccaggatggcctcaaactccctacctcaggtgatctgcccgcctcggcctcccaaaatactacgttacatgcatgagccaccgtccctggcTGTGGTCAGGCTTTTGAGTTTAGATCCATGAAAGTGTGGCCGCGTCCCTGCTCCCTGCAGGAGGGAGGCCTGTGGGACCTTCTGCTGTGGCTGTTTACAAGGCTTTGCTCCTGGTGCCTAAGGCTGGAACCTTCTCTCTGCAGGAGGAGATGAGCAATTACTACCTCAGAGTCACCCAGAACGCCTTCCTAAACCACACGAGGCAACGCAGCAACAAGTGAGGGAGCCCCTCGGGTCCTGGGCCCCCGGGTAGGGCTGTGCAGCCGTCGCCCTTGGTTCCCACAGAGGGACCTCAGAGGCCCTGGATCACAGTGCTGGGCAGCACCCGTGGCCTCAACGTGTCCACCTCGGATGTCCCCTAGGAATGTCCCAGCTCGGGACAGCATGGGGCGTCACTGAGGAACATGCGGGGGCCTCCTGGGCAGAGCCGGGGTCAGTCCCGTCCTCACGGCCCTGTGCGATGCCGCCCCAGCTTGCACGTCCCTCTGCCCCTGGGTTTCCGCGGTCCTGTGCCAGCAAGGGAGGCGGTCTGATTGTCTGAGGCTCTGCTGGGGCCTCCATTGCAGGCTGTGGGTGCCCTGGGGTGGGAGATGGAGACACT
TTTGCTCCCACGGGAAGCTGGGCACGAGCAGGTCCTGTGTGTTTGGGCGGAGCCTGGGGCCTTGGCCCCCCCGCCCAGATGCTGGACAGGGTTGCTCCCTCCAGGCCTGGGGCCCTCCTCACATTGCGCGTCCTCCGTGAGCTGCTACCCAGAGGTCCCCAGTAGGTGGATAGCCCCATGGCCAGGCTCCCTAGCCCCTTTCAAATCCCCTTATTTTGAGTTTTCTTGGTCTCCTGGGCCCCTCCAGCCCCAGTCACGTGTCACACGGAGAATCAAGTCCTGCCGGTCGGCCGTGGCCGAGTCTTCAGGCGTGTTGGGCTCGCTGGCTCAGCTGCTGCCGGTAGACGCTCCCTGGAGCCCTGGCTCAGGTCCTTCCCAGAGAGGCAGGGCTGGGGCCCTGGTGAGCCTCCGCTGCACCCGGGCCCCCAAGGTCCTGCTCCTGGCTCGTGTGGCCACTCTTGGCATGGACTCTGGGTCCCGCATCCCTGCTCCCAGCACAGCAGGGCTCAGGCAGCAGGAGGAGTGGTGGTCCCGACGCTGCCTATCACGCTGGGTGAGGGTCAGCGGGGAAGCGCCACACGGGATGAGAACAGAGGCCCAGGTAGCCGGGCGGGGGGACAGCTGGGCGTGGTGGGGCCGGCGGTGACCAGGGGGACAGCTGGGCGTGGTGGGGCCGGCGGTGACCAAGGCTGTGCCACGTCCTCCCGATGTTTCCTGTGCTCACAAGCTGCCGCTTTAGATTCTCCGGGAAAGTCCCCCTGAAGGGACTAAGGAGCCCGCGTTCCCCTCGGGACAGCTTGGCCGGCAGCCCCAGCATTTCCTTCCCCATCCCTGCTCCGCAGATTCATGCTGGTCCTGGCCAGCCGCGACCCCAAGCAGTTACACCAGGACATCCACGACCGCATCGACGTGATGTTTTACTTCGACCCGCCCGGGCCAGAGGAGCGGGAGCGCCTGCTGAGAATGTATCTTGACAAGTATGTTCTTATGCCGGCAACAGAAGGAAAGCAGTAAGTGTCTCCCCTCACCCACCCCTGTCCAGGGACCCTCGCTCTGGGCCCACCCCCGGCCCTGCTCTCCGGACGCACACAGCAGGCCCAGTCTCCGGGGTGGCACCGCCTCCCTGCTTTGCGGTTTCGCACAGGAGCCCTGTGGGCCCCAAGGGTCCCAGAGGCTGCACCCAGGGATGTGCCACCACCCTTTCCTCATCCCCACCTGAGAACAGCCTGGTGGTGTCTCCTCGGGTTTGGGGGGCAGAGCCCACCATCACTTACAAACCTTCAACtttttgtttttgagacaaagtcttgctctgtgccccaggctggagtgcagtggcacgatctcagctgactgcaacctccgcctcctgggttcacgcgattctcctgcctcagcctcctgagtagctgcgattataggtgcctgccaccacgccccactgcttttcgcctttttgtagagatgcagtttcaccatgttggccagggtggtctcgaaaccctgacctcgggtgatctgcccgccttggcctcctacagtgcagggattacagatgccagccactgtgcccgaccACCCTCAGGCCCTGGCAGTGCAGGGAGGTGACGTGGAGTGTTGCTCTGAGACCCCCATGTTGGGATTTGAGGGAGACGCTCCTCATGAGAGCCCCGTGTTGGGACTGGAGAGGATCCTCACGGTCCCCTGCTGGCTGCTGGCCTGGCCTTCCTCCAGCTGCCACGCCTGGCCCTGGAGCCTCATGGTGTGGGGCGCGGCTCCGGCTGCACTTGTGCCCTGAGGCCTTCAGGCTCCCTGTCGCTGGCGGTGGGTGCAGCAGGCACGGCGGGCAGAGCCCTCCAGGTGATGACAGGCCCTGGGGCTGCACGCCGGCTGCCTCAGGAACACTCCAGATGAGCAGTGGCTGCTCCACCTCTTGGCGTCCCCAGGTCCCAGGTTTCTGAGTCCTTCTGTCCACCTGACCTAAATTCCTGCTCTCTCCA
GTGACAGCAAAAGCCGCTCTGTTCCAGAGAGAGCCTGGTTCCCCCTGCCAACCGCTCCGTGGCTGCCTGCTTCATGCTAGCCCAGCTGTCCCGGCCTCAGTTTCCCTTTGGCCCTCCCCTGCCCTGGGCTCTCCCACTCCCACGGCTGCTCATAGACCTGGCACAGTGACTTGGCTTCTATGACCTCCAGGGAGATGCTTTTGCTGGAATTCAGGGCTCTGCCACTGCCACTGTAACGGCCATGAGCCCTGTGGGTGCTGAGTGGGCAGGTGAGGGCAGGGCTGGTGTGAAGAGGGGGTGCGGCCATCTCCAGGCCCCACAGCAGCCACCACCTCCCTGCTCAGCCCAGACCTGGTTTGCATCAGGGAGAGGGCGGAGTTTGGCTGTCACAGGAAGAGTCCCTCCCAAGGGGGCATCTGGCATGGGTGCCCGCCTGGCTGCCTGTCTTCCAGCCCCCACCTCGTGGTGTGGGAGCCGCTGCCTTGGCCGGCCCACTTGGGAACTCCTTCCCCAGGCGCCTGAAGCTGGCCCAGTTTGACTATGGGAGGAAGTGCGAGGAGATCGCTGAGCTGACGAACGGCATGTCGGCCCGGGAGATCGCACAGCTGGCTCAGTCCTGGCAGGTGAGTGGGGCTCGGGCGCACCCACCCAGACAGGAGCCCAACTCCTGTGGAGACGCCGGGTTGCGCCTGTCCCAGCACCAGTGTCACACCGCAGCTTCTGTTGAGGGGTTTTCAGTGCACAGACGTGACACGGGGCACTCGCCCCAGTCGGCCACTCCACACACTGGCGCGCCCCTGCTCCTGCCCTGGGAAGTGTGGGGCATGTCCGTGGCTGACGGTCATAGGTCAGGAAGCCCGTCCGGCATCCTAGTATCCGGGCTCTGCCAGGTGGGGCGGGAGGCTTTCGATGCTCACCTTGGCAGACGGGCACCCCCTGGTGTGAATGGTCATCGGGACAGGCCCCGCCTGAGTTTGGTGGTGGGGCTGGAGGGATGTTGTGTTTCCCGGACCACGTCCGTTGGCTTGATCCTGCTTGACGGGCTCAGACACAGGGGCAGGAGTGACCTCTGATTGTCCCACAGCCGGCTGCTCCTTGGAGGACCCCCTCCTGCAGCTCCGTGGCTGCTGCAGGGACGGGGAGCCGGGACTCAGAGCAGTGTGGGCGTGGCCATCCAGAAAGCTTTGGTCTTTGGGGGTTGCTGGAAAAGCATAACCAGGTCTGTAGAAGGCACCAAGGCCATGCACAGGCATTGCTGCCTCTGGGGTCTGCAGAGTCTGTGACAACCTGGTCACTCAACCTAGCAGCGCTTTCGCGTGTGACAGGTTCATGAAGTAGCCAGTTACCTTGATTTGAACGTTGGAGCTGGGGACTATATGGGCTGTATTAGTCAGTTATGCCGCTGTGACAAAGAGCCTCAGATCTCAAACCCCATCCTTGTGGGTCAGCTGAGGTCTGTGTTCCAGGCCGTCTCCACTTGAGACCAGGTCTGTTTCCACAACTAAGCAAACAGAgaccgggccatggtgttgggctacatttgttcccagcatttgggaggtcgaagtcagcccagattatttgaaggcaggagtcaggaccagccttggggggggggggggggggggggggggggggaaagcaaggggagactccatctacaaaaaataaaaaaattagccggaccctaatgtggcacgcctgtaatgcagctcctgggagcctgaggtgggatgatcactgagtcccaggtaggccagaaatacagtgagcctgtggattgtgccactgcactccagcccgggttacagagcgagaccctggtctttaaaaataagaataaTTTGAgccgggcatggtggctcacgcctgtaatcccagcacgctgggaggccaaggggagaggatcacttgaggccaggagttcgagaccagcctggccaacatgtcgagccccacctctactaaaaatacaagaattggccgggcgcagtggtggt
gcatgcctgtattctcagctactcaggaggctgaggcaggagaatcgcttgaacccgggaggtggaggttgcagtgagctgagatggtgccattgaattccagcctggactattcaggatcctttgagattccataagaattttaggagtggttttcctatttttgtaaaacataatttgggttttcacagggaccgcgtttagtctctatgtcgctttgatgtctctcagcaatattCTGTGGttttctcttgttttcgagacggagtctcgctctgctgcccaggctggagtgcagtgttgtgatctcagctcactgcaacgttcccctcccgggttaaagtgattctcctgactcagcctcctgaggagctggaattccaggcaggcgccaccatgcccggctaatttttgtactaagagacggggttttgccatgttggccaggctggtctcgaacctctgacctcaggcaatccacccacctcagcctcctaaagtgctgagattaaaggcacgtgccaccacgcccggctaatttttgtatttttagtagagacgatgattcaccatgccggcgaggttggtcttgaactcctgacatgaggtaatccatctgcctctgcctcccaaagggctgggattcagacatgggccactgcgcccagccagttttcactgtacaagtctttcaccctcttggttaagtgaatttccaagcattttattcttgccgctgctgttgtaaatggaaacggtttcataattccccattcacattattcactgttgggatggagaactgcagctttctttgctgttgattttgtatcctgtaagtttgctgatgtcacggcattttttcttccaatatggattctaggattttctacatataagattatgtcatctgagaacaggtgatttttacctttcccttttcagtttggatgacttttctttttcttgtctaattgcactgtccagagcttccagtggtgtgtggaatagaagcggtaaagcattcttgcctggttccttacctcagaggaaaagctttgtttttcaccactgagtatgtcacctatgggcttgtgatgtgtggccttcattgtgtttagggtgtatccttcaattcttggtttggtgagtgtttttatcataaaagtgtgaggcgggtggatcacctgaggtcggcagttcgaggccagcctgaccaacgtgaagaaaccccatctctcctacaaatacaaacttagttgggcatggtggtgcatgcccgtaatcccagctactcgggaagctgagacaggagaatcgcatgaaggcggcaggcagaggttccagtgagccgagatcgcgccatttgcactccagcctgggcaagaagagcaaaattgtctccaaaaaaaaaaaaGTggccaggcacggtgactcacgcctgtaatcccagcactttgggaggccaaggtgggtggatcacgaggtcaggagatcgataccatcctggctaacacagtgaaaccctgtttctactataaatataaaacatcagctgggcatggtggcaggtgcctgtagtcccagctacctgggaggctggggcaaaagaatggcgtgaacccaggaagcggagcatgcagtgagctgagatgcctgggctacagagtgaggccccaactcaaaaaaaaaaaaaggtgttgtatttggtcgaatactttttctgcaacacttgagacagtcgtgtggtttccttcctccaccctgctaatatcgattgatttttgtatgttgaacatttcatatgcggaacattgattttcatatgttgaactatcgttgcattccaggaataaatcctgcttggtcggctgggcgcggtggctcaagcctgtaatcccagcactttgggaggccgagatgggcggatcacaaggtcaggagatcgagaccatcctgtctaacctggtgaaaccccgtctctactaaaaaatacaaaa
aactagccgggcgaggtggcgggcgcctgtagtcccagctactcaggaggctgaggcaggagaatggcgtgaacccgaaaggcggagcttgcagtgagctgagatgcggccactgcactccagcctgggtgacagagcgagactccgtctcaaaaaaaaaaaaaaaaaaaaaaaaaatcctgcttggtcagggtatagagtccttttagtgtgctgctgaattcactctgctggcattttgttgaggactttcccagtgatgctcatcagggatattggcctgtcatttttcttgtggtgtctttgtctgggtttgatatcagggtaatgctggcctcctaggatgagtgaggaaatgttcttcaatttgtccaagagtttgaggtgtgctgctgattcttcttaatgttttgtgaattgacacgtgaagacatcaggtccaggtcttgtgtttCaacttttacagcttgaagactttaggttcccagaaaaattgcaaaggtagcacagagagctcccgGGCCCGGGGCCTTGCCACGTAGTGAACGTCATGTGTCACTGTTGGCCCCACCTGGGACTGGGTCTTGCCCAGAATCCCACCCAGGAGGCCACGTGACATTTAGCTGTCACTTCTGGTGGGCTCTGCCAGGTCCCGTGCTTCCTGGTGGGGTGGCCCCATGAGCATCTGCTCATCCCCTTTCCTCCACTGGGCCCTGGGTGAGGTGCAGCCACTCGGGTGCACCCTGAGGGTTCCTGCACCTGTTTGAACTCTCTTGGGTCGGCTCAAGACCAAAAATGATGCTGAGCAGTCCTGGGCCTCTGATGCATAGTGGTGGTCCGGTTCCGGTCAGCGTCTCCTGCACTCCTGGGCCCCTGAGCCACAGTGGCGGTCCAGCTCCAGTCAGTGTCTCCCCACACAGTGGCTCTTGGCGAGGGGTGGGCGCTGTCAGTGGGGACGGGCACCACGTGGTCATCCCCATGGCAGGTCCCATCGTGGCAGCCGTGTTGTGGGAGGATGGTGCGCTGCTGCCCCTTTACCCTGTGAGATGAATCCTGCCTCTGGGAGGCACAGCCGGGATGGGGTGAGGGACCCCCTCAGCTGTCCGGGAAGCGTCCCCTGCCCTGTGCTTCCTCCAGGCGTCCTGGTGCACTCCCAAGCACGGTGCCCAGTGGGGGTGCCCAAACCTTCACCCTGACCCATGGGTGACTTCCCTTGGGGACTCCACGCCTTTCACTGGGACTGGGATGGAGAGCGACCTGTCCATGGCAGAAGGGCTGCACCTGAGGTGCTTGAAGCAACACCAAGGGCCACAGTCCCAGCAGCTCCAGCCTCCGCATGCTGGATGCCAAGTCCTGTGCCCAGGACAGGGAGGTGGAGGCACGGGTGATCTTGATGCTAGCACCTATGTGCCCCGAGGTTGGGCAGTGGCTGCCTCTGCTGTGGAGGCCTATGAAGGTGAGGGTCTGAGGATCTGTAGTGCACTGTGACCCGGGGGCACTGCCTGGCCACGGCTGAGACACGCAGAGGGTCTGCAATTCCCTCCTGCCTCTTGGGAGCTGCCCTGGGTCTGCAGTCAGTGGGGCTCGTCCTCGGGCTTTCCGTTATTAGAAAGTCACTGAGAAACTGCAGTGCTGAGGACGCAGGCAGGGCTGTGGCACTGCAGGGGCCGCTCCCGGTGTCCACACGCATGCTGGGCTCTGCCGAGGTGCCGGAAGCCTGTGTTTCACCCTGAGGCCGTCCTGGTGCCCCGGGTTTGGACCCTCCCCACCTCGGGGTCCTGGAGTGCGTTACGGGTGGGGGGTTCCCATGGTGGCCTCCCTCAGCTCCCTCTCTCCTCACTAGGACACGGCGTATGCCTCCGAGGATGGGGTCCTCACCGAGGCCATGTTGGATGCCCATGTTGAAGACTTTGTCGAGCAGCACCAGAAGAAAATGCGCTGGCTGAAGAGGGAGGGCCTGTCCTCATGGACCAGCACCCCTTAACCTGAGTCCGCGGTGA
GACCACACGTCACGGAGCCTGGCTGCGGACCCCTCCCACCCCTGCTTTTCCGGTCCCTGCACGTTTAGGAAATGCTTCCCCTAATAAACTCCCACAGGTGCCACAGCGCTGTGTCTATTGGCTGATGTGGTGCGGGGTTTGGGGTCCCCTAGTGTCCTTCTGGGGTCAAAGGTGATAGAAAAGACAGGCTGGAGCTTTCTGGAGAATTTAGGCACAGAAGGGTGGGCTTCACATGAGGTGCCTGCCACAGCGGGGTTGGCTGCCTGAATGCCACCCGGGACCGGCTGCTCGCGCTCCATCCTGCAGCTGTGGAGACGGGGGTGCCCCTTTGCCTCTCTCCACGAAGTGCAGGGCAAACAAGACACAGCGGTTTCAAACAGGCGATGGCCCGGACTGCGTGCCTCGCCGCCCCTGCGCCTTCCCCTGCCCCTGCTTTCCAGCTAGTCCCTGAAAACCTTGATGGggccgggcgcggtggcccatgatggattctcagcactttgtgaggccaaggcgggtggatcacctgaggttaagtgttccagcccagcctggccaacatggtgaaaccccatctctcctaaaaaaaaaaaagaaaagaaaaagaaaaattagccgagcgtcgtggcaggtgtctgaaatctcaggcactcaggaggctgaggcaggagaatcacttgaccccgggaagtggaggttgcagtaagctgagaccatgccattgcagtgcagcctggacaacaagagtcaaactctctcaaaaaaaaaaaaaGgccaggtcaggtggcatgtgcctgtggtcccagcttggtcccagattcttggtttggaggctgaggtaggaggatcacttgagcatgggaggatgaggttgcagtgagccaagatcgcttcagacactccagcctgggtgacagagtgagaccctgtctctaaataatcaaaaCCTTGATTACAGCCATGGGGTGGGGGTTGGGGGGCGTCTGGCTCGGCAGGGAACTATTGGGTTTTTCTGCTCTCtaatttttgtagagacagggtttctctttgttgcccaggctggtctccaactcctgggtcaagcgtcgatcttctgcctcggcctcccaagtggtgaggttacaggcgtgccaccgcacctgaccTGttttctttttttttttttttttttttttgagacggagtcagctctgtcacccagggctggagtgcagtgggcggtctcagctcactgcaagctccgcctcccgggttcacggccattctcctgcctcagcctcccgagtagctgggactacaggtgcgtgccacaacgcccggctaagtttttgtatttttagtagagacagggtttcactgtgttagccagggtggtctcaatctcctgaccttgggatccgcccgtctcggcctcccaaagtgctgggattacaggcttgagccaccgcccccggccCCttttttttttttttttttggcaagggagtcttgctcgcccagggtggagtgcagtgttgcaatctgggctcactgcaacctccacgtccagggtgtcaggcctctgagcccacgctaagccatcatatccccagtgacctgcatgtgtacatctgatggcctgaagcccctgaagatccgcagaagtgaaaacagtcttaactgatgacattccagccttgtgatttgttcctgccccaccctacctgatcaatgtactttgtaatgtcccccacccttaagaaggttctttgtaattctccccaccctggagaatgtactttgtgagatccacccccagcccccaaaatattgctcctaactccactgcctatcccaaaacctctcagaactaacggtaatcccagcaccctttgctgactctttttggactcagctggcctgcacccgggtgaagtaaacagccttgtggttcacacaaaacctgtttcgtggtgtcttcacacggacacgcgtgacacagggttcgaggaaatttcatg
cctgaacctccggagtagctgggattacaggcgaacggcaccatgcccaggttaatttttgtattttcggcagagacagaggcccaggtagccgggctggGGGACAGCTGGGTGTGGTGGGGCCGGCGGTGACCAGGGCTGTGCCGCGTCCTCCCGGTGTTTTCTGTGCCCACCAGCTGCCGCTTTAGATTCTCCGGGATAGTCTCCCTGAGGGGGCTGAGGAGCCTGTGTTCCCCTCGGGGCAGCTTGGCCGGCAGCCCCAACATTTCCTTCCTCATCCCTCCTCCGCAGATTCATGCTGGTCCTGGCCAGCCGCCACCCCGAGCAGTTGGACTGGGGCATCCATGACTGCATCGATGTGACGGTCCACTGCGACCTGCCACGGCAGGAGGGGCGGCAGCGCCAGGTGAGAATGTATTTTGACAAGTATGTTCTTAAGCCGGCCACAGAAGGGAAACAGTAAGTGTCCCGCCTCACCCGCCCCTGTCCAGGGACCCTCGCTCAGGGCCCACCCCGCCCCTGCTCTCCAGACGCACCCAGCAGGCCCAGTCTCCAGGGTGGGCACCACCTCCGTGCCCTGAGGTTTTGTGCGGGAGCCCTGTGGGCCCCGAGGGTCCCAGAGGCCGCATCCAGGAGGTCACGCCCCCTTTTCCTCATCCCCATCTGAGAACAGCCTGGTGGCGTCTCCTCAGGTTTGGGGGCAAAGTCCACCATCACTTAGAAACTTTCAGCAttccttttttttttttttcttaagacggactcttgctctgtcatccaggctggagtgcagtagcttgacctcggctcactgcaagctctgtctcccaggttcacgccgttctcctgcctcagcctcccaagtagctgggacaacaggcacccgacaccacgcccggctaatttttttgtgtttttttagtagagatgggtttgaccgtattagccaggatggtctcgatctcctgacctcgtgatccacctgcctcggcctcccaaagtggtgggattacaggtgtgagccaccgcatctgacctttttttgaggaagtctcactcttgtccccctggctggagtgcagtgccgggatctcagttcactgcaacctgtgcctcagcctcctgagtagttgggattataggtgcccgccaccgcgcctggctggtttttgtgtttttgtagagatggaatctaactccgtctcccaggctggagtacagtggtgtgatctcagcttactgcaacctccaccctccgggttcaaaccatcctcttgcctgagcctcctgaacagctgcgattacaggcgcccagcacaatgctcgcctcatttttttgtctttttagtagaaacagcttttcaccaaattgaccagactggtcttggacttctgatctcaagtgattcaccctcctcggcctccaaagtgcagggattgcagatgtgagccaccggacccggcctcttttatgttcctcttcagtaCTCAGAGGGCTGTGAGGAAATCCGGTGCCCGGCCACCCCCAGGCCCTGGCAGTGAGGGGAGGTGATGTGGAGTGTTACTCTGAGATTCCCATGTTTGGATTCGAGGGAGACGCTCATCATGAGACCCCTCCGTGTCGGGATTAGAGGGAGAGGCTCCTCATGGTCCCCTGCTGGCTGCTGGCCTGGCCTTCCTCCAGCTGCCACGCCCGGCCCTGGAGCCTCCTGGTGTGGGGCGCGGATCCGGCTGCACTTGTGCCTTGAGGCTCTCAGGCTCCCTGTCGCTGGCGGTGGGTGCAGCAGGCACGGCGGGCAGAGCCCTCCAGGTGATGAGAGCCCCCAGGAACACTCCAGATGAGCAGAGGCTGTTCCACCTCTTGGCGTCCCCAGGTCCCCGGTCTGAGTCCTTCTGTGCACCTGACCTAAATTCCTGCTGTCTCCTGTGACAACAAAAGCCACTCTGTTCCAGAGAGAGCCTGGTTCTCCCGTTGACCCCTCCGCTGCCGCCTGCTCCAT
GCTAGCCCAGCCGTCCAGGCCTCAGTTTCCCTTTGGCTCTCCCCTGCCCCGGTtcccagctgcttgggaggctgaggtaggaggatcatttgagtccaggagcttgaggttgcactgagctgtgactgtgccactgtactccagccttggcaacagagtgagacactgtcttaaaaaagaagaaTTTGggccagatgctgtgtttcatgcctgttcccagcatgctgggaggctgaggagagaagatcactcgaggccaggggttccagaccagcctgccaacatgttgaaccccgcctctacgaaaaatacaaaaattagccgggcgtggtgggtgggtgggtgccagtaatcccagctactcaggaggctgaggcagcaaaatctcttgaacctgggaggtggagattgtggtgagctgagatagtgccgctgtacttcaacctgagcaacagagtgagactccttatcaaaataaagaaaTCAATCAATCAATAAAAATAATCACAATAATTTGggctgggcgtggtggctcactcctgtaatcccagcactttgggaggcgtggatcggttgagttcgaggcaagcctggccaatgtggcgaaaccccatctccactacaaatacaaaaattagccaggtgtggtgacaggcacctgtaatcccagctgctcgggaggctgagacaggagaatctctggaacctaggaggcggaggttgcagtgagccaagatcacgtcagtgcgctccagcctgggtgacagagactgtctcaaaaaagaataataataaTTTgactgggtgtggcggctcactcttgtcatcccacactttgggaggccgaggcaggaggattgcttcagctcaggatttcgagactggcctggacaactggcctggacaacatggtgaaactccatctctacaaaaaatacaaaaattagccaggcatggtatcatgtgcctgtgatctcagctactcaggaagcagagatgggagcattgctggagcctgggagttggaggctgcaatgaaccatgttcgtgccactgcactccagtgtgggtgacagagtgagaccctgtctccaaaaggcatggtggctcacgcctgtaatccctgcactttgggaggccaagctgggtggatcacctgaggtcaagagttggagaccagcctggctaacgtggtgaaaccccatctctaggaaaaatagaaaaaATTggccaggtgcagtggctcacacctgtaatcccggcactttgggaggccgaggcgggcgaatgacctgagatcaggaattccagaccaaccacaccaatatggagaatccccgtctctactcaaaatacaaaatcagccgggcatggtagcaatcccagttactcaggaggccgaggcaggagaatcactggaggtgagccgagaccacgccattgcactgaagcctgagcaacgagagggaaactgtctcaaaaaataaTGCTAATAACAAGGGGGAGAGAACAGGAGTGTGGTCAGCAGCTGGGCCTGCCATAACCCCTGGGTCGTGTGTCCCCACAGCTCTGAAGGCTAGAGGCCCGAGGTCAGGGTGCCAGCTCGGTCCCCCCCGTGGAGTGTTCTCTGTTAGCTTCTCACATGGCAGGGAGAGTGACTGAGCTCTCGCTCTGGTGTCCCTTACGAGGACGTTCATCCCCCACTGCTCAGAGCGGCGGTGAGCCACCACGCCCAGCGCCAACTTTGTCCTTCAAGAGTTGTTTTTTTGTgccgggctcagtggctcatgcctggaatcccagcactttgaaatgccaaggtgggtggagcacctgaggtcaggagtttgactccagcctggtctaaatggtgaaaacctgcctctactaaacataaaaaaatcagctgggcatgttggtgtgtgcctgtaatcccagccactcgggaggctgaggcaggagaatcacttgaacccaagaggtggaggttgcagtgaact
gagatcatgtcactgcactgcagcctggatgacaagagtgagactcccttgcaagaaaaaacaaaaattaaaaaagaaGTTGTTGTcttttttttttttttttcccttggacaattcaagatgcctagagattccatatcaattttagtaatgcttcttctatattttaaaaagtaatttgggtttttacagggattgcattcagtctctgtattgccttAATGACTCTTAGCAATGttgttttttttatttattattttttttctagagatggagtctcactctgtcagccaggctggagtttagttgttggccaggatgggcccaatctaatgacgtcaggtgatccgcctgcccctggctcccaaattgctgggattcagacgtgggccaccatgcccagccagtttacattgtacatttctttcaccttcttggttcagtgaagctccaagtattttattctttcggatgctcttgtaaatggaaatggtttcgtcattccccgttcagattatacacttactatgaagaactgcagctttctttgctgttgattttgtatcctgtaactttgctgatgtcgtggggttgttttttccaatatggattctagattttcCTTTTCTTTTTCTtttttttgtttttttgttttttttttttttgatatggggtctccctctgtggcccaagctggagtggaatgcagcggcacgatcttgaatctgcgagctcctctgcccgggtccacgccattctcctgcctcagcctcctgagtagctgagactacaggtgcctgccatcacggccggctaattttgtgtattttttgtgcagatgaggtttcaccgtgttagccaggatggtctcgatctcctaactttgtgatcggcccgcctcggcctcccaatgctgAATGCTGTTGGGACTGGGTCTTGCCCCAGAATCCCACCCAGGAGGCCACCTGACGTTTAGCTGTGACTTCTGGTGGGCTCTGCCAGGTCCCATGCTTCCTGGTGGGGTGGCCCCGTGAACGTCTTCTCAGGCCCTTTCCTCCATTGGGCCCTGGGTGAGGTGCAGCCACTCGGGGGCACCCTGAGGGTTCCTGCACCTGTTTGAAGTCTCTTCGGTCGGCTTGAGACCAAAAATGATGTTTAGCAGCCCTGGCCCCCTGACGCACAGTGGCGGTCCTTCTCCGGTCAGTGTCCCCTGCACCCTTGGGCTCCTGACGCACAGTGGCGGTCCAGCTCCAGTCAGTGTCTCCCCACACAGTGGCTCTTGGCGAGGTGTGGGCGCTGCCAGAGGGGACGGGCACCACGTGGTCATCCCCATGGCAGGTCTGGTCGTGGCGGCCGTGTTGTGGGAGGATGGTGTGCTGCTGCCTCTGCACCCTGTGAGATGAATCCTGCCTCTGGGAGGCACAGCTGGGATGGGGTGAGGGACCCCCTCAGCTGTCCGGGAAGCGTCCCCTACCCTGTGCTTCCTCCAGGCGTCCTGGTGCACTCCCGAGCTCGGTGCCCTGTGGGCGTCCCCATGCCCAGACCCTGACCCACAGGTGCCTCCCCTTGGGGTCTCCACGCCTTTCCCTGGCCCTGGGATGCAGAGTGACCTGTCCATGGTAGAAGGGCTGGACCTGAGGTGCCTGAGACAGCACCAAGGGCACTGGTCCCAGCAGCTCCAGCCTCTGTGTGCTGGATGCCACACAGACACAAGACTCTTGGGAGACGCATTTTCCATCTGGCTCAGAGGGGGAGGGGGAGGCTTTGCAACCCAGCCCCTGCCCAGGCCCCTGGGAGGGTGGGTGCCTGCTGAGCCCCCGGGGCAGCAGGAGCGGGGCAGGCGGGGTCTTTGTTCTCACTCCCACAGCAGAGGCAGATGTGGGGGCGCCTGCTGGGGCCAGACCAAGGTGGGGTGGCCTGGAGACTGCTTCCAACCGTGGCCGGGAAGCAGGGAACCTGCCCGGCGTGTCTGAGGCCACACTCTCAGCTGGCCGGTCCAA
GCCTGCGGCTGGAGCTGGTGTCTGTTTAGCTAATAAAGTCCCACAGTTGCCTCACTGCCGTGTCTATTTGCTGATGCTGCGCGGGGTTTCAGGGGCCGCCTAGCCTCCTCCTGGGGTCAAAGGTGACAGAAGAGGCAGAGGCGGGAGCTTT

Let’s break this command down since there is a lot going on. Here is a table that explains each option:

Command line option Description
bedtools Call the main bedtools interface
getfasta The sub-program in the bedtools program to run
-fi The path to the input whole genome sequence file in FASTA format
-bed The path to the input bed file
-fo The desired name of the file to write the extracted sequences to in FASTA format
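
To make the headers in the output (e.g. >chr1:89943-90471) a little more concrete, here is a toy awk imitation of what extracting an interval means. This is only a sketch for illustration and not how bedtools itself is implemented. The ten-base sequence and the chrToy interval are made up. It relies on the bed convention that the start coordinate is 0-based and the end coordinate is not included, which is also why $3 - $2 gives the interval length:

```shell
# Toy one-line "genome" and a toy bed interval (both made up for illustration)
toy_seq="ACGTACGTAC"

record=$(printf 'chrToy\t2\t6\n' | awk -v seq="$toy_seq" '{
    # Header in the same chrom:start-end style seen in the getfasta output
    print ">" $1 ":" $2 "-" $3
    # awk substr() counts from 1 while bed starts count from 0, hence $2 + 1
    # The number of bases to take is the interval length, $3 - $2
    print substr(seq, $2 + 1, $3 - $2)
}')

echo "$record"
```

The interval 2-6 covers four bases, so the extracted sequence is four characters long, just as the first real sequence above should be 90471 - 89943 = 528 bases long.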

We saw this a bit yesterday as well, but this is another framework for running commands. It still follows the Unix philosophy (formatted text -> command -> processed text). However, while the Unix commands we’ve seen up to this point generally act on a single file, the use of multiple command line options allows us to specify multiple input files: here, a FASTA file of sequences and a bed file of intervals. Also, like Unix commands, the default action of bedtools is to simply print output to the screen. This output can be redirected to a file with >, but here we also have an option (-fo) to tell the program directly to write output to that file instead.
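
Some standard Unix commands offer the same choice between redirecting and naming an output file with an option. As an analogy only (this uses sort rather than bedtools, and the file names are made up), the two commands below should produce identical files:

```shell
# Work in a temporary directory with a small made-up input file
tmpdir=$(mktemp -d)
printf 'chr2\nchr10\nchr1\n' > "$tmpdir/toy-chroms.txt"

# Way 1: let sort print to the screen stream and redirect it with >
sort "$tmpdir/toy-chroms.txt" > "$tmpdir/sorted-redirect.txt"

# Way 2: tell sort which file to write with its -o option
sort -o "$tmpdir/sorted-option.txt" "$tmpdir/toy-chroms.txt"

# cmp -s is silent and succeeds only if the two files are identical
if cmp -s "$tmpdir/sorted-redirect.txt" "$tmpdir/sorted-option.txt"; then
    same="identical"
else
    same="different"
fi
echo "$same"
```

Which style to use is mostly a matter of taste, though an explicit output option can be handy when a program also prints progress messages that you don’t want mixed into your results.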

The use of a main program (bedtools) and a sub-program (getfasta) is also a norm among bioinformatics tools.

The downside of having multiple input files is that it makes piping with | difficult.

Try running the code block below to pipe output from grep to bedtools getfasta:


grep chr10 data2/macaque-svs-filtered.bed | bedtools getfasta -fi data2/rheMac8.fa -fo macaque-svs-filtered-chr10.fa
# grep: The Unix string search command
# chr10: The string to search for in the provided file
# | : The Unix pipe operator to pass output from one command as input to another command
# bedtools: A suite of programs to process bed files
# getfasta: The sub-program of bedtools to execute
# -fi: The genome fasta file as input
# (no -bed option here: we are hoping the piped bed lines will be used as the input)
# -fo: The desired output fasta file

This doesn’t work because bedtools getfasta requires the -bed option to be specified. It doesn’t know that we’ve given it the bed formatted input through a pipe.

Luckily, many bioinformatics tools have a shortcut to help us pipe output to a specific input option.

Using - to pipe

For tools that require an input file to be specified with a command line option (like -bed above), we may still want to pipe the output from another command to it. We can do so with the - shortcut. Basically, when - is provided in lieu of an actual path to a file, it tells the command to read the input for that option from the STDIN stream (standard input), which is where the pipe delivers the output of the previous command.


grep chr10 data2/macaque-svs-filtered.bed | bedtools getfasta -fi data2/rheMac8.fa -bed - -fo macaque-svs-filtered-chr10.fa
# grep: The Unix string search command
# chr10: The string to search for in the provided file
# | : The Unix pipe operator to pass output from one command as input to another command
# bedtools: A suite of programs to process bed files
# getfasta: The sub-program of bedtools to execute
# -fi: The genome fasta file as input
# -bed: The bed file as input
# - : Given in place of a file path, this tells -bed to read the piped output of the previous command
# -fo: The desired output fasta file


head macaque-svs-filtered-chr10.fa
# Display the first few lines of the new file with head
## >chr10:52589-53460
## CACCCATCATGACAAGGCCAGGGTCACACACTATGGGATAGTCTAGGGGTCACCACGACTAGTTTGGGGTCAGACACCATGACCAGCCCAGGGTCACATACCACGGCCAGCCCAGGGTCACATACCACAGCCAGCCCAGGGTCACCCACCATGGCCAGCCCAGGGTCACCCACAAAGACCAGCCCAGGGTCACCCACCATGACCAGCCCTGGGTCACCCACCACGACCAGCCCTGGGTCACCCACCACAGCCATCCCAGGGTCACCCACCACGGTCACCCACCACGGCCAGCCCAGGGTCTCCCACCATGACCAGCCCAGGGTCATATACTATGGCCAGCCCAGGGTCACCCACCACGGCCAGCCCAGGGTCACATACCACAGCCAGCCCAGGCTCACCCACCACGGCCAGCCCAGGGTCACCCACCACGGCCAGCCCAGGGTCACATACCACAGCCAGCCCAGGGTCACCCACCACGGCCAGCCCAGGGTCATATACTATGGCCAGCCCAGGGTCACCCACCATGACCAGCCCAGGGTCATATACTATGGCCAGCCCAGGGTCACCCACCAGGGCCAGCCCAGGGTCACCCACCATGACCAGCCCAGGGTCATATACTATGGCCAGCCCAGGGTCACCCACCACGGCCAGCCCAGAGTCACATACCACGGCCTGCCCAGGGTCACCCACCACGGCCAGCCCAGGGTCACCCACCACGGCCAGCCCAGAGTCACATACCACGGCCAGCCCAGGGTCACCCACCAGGGCCAGCCCAGGGTCACTCACCACAGCCAGCCCTGGGTCACCCACCACAGCCAGCCCAGAGTCACATACCACGTCCAGCCCAGGGTCATCCACCATGAGCATCCCA
## >chr10:69728-70192
## GCTCAGAGGGAGACATGTGGGCACACCGTGTGTGCACAACCTCACAGTCAGAGGGGGAGACACGTGGGAACTTGTGCTCACAGCCCTCGGTCCCCCTTCGCTCAGCCTCACAGTCAGAGGGGGAGACACGTGGGAACTTGTGCTCACAACCCTCGGTCCCCCTTCGCTCAGCCTCACAGTCAGAGGGGAAGACACGTGGGAACTTGTGCTCACAGCCCTCGGTCCCCCTTCGCTCAGCCTCACAGTCAGAGGGGGAGACACGTGGGAACTTGTGCTTACAGCCCTCGGTCCCCCTTCGCTCAGCCTCACAGTCAGAGGGGGAGACACGTGGGAACTTGTGCTCACAGCCCTCGGCCCGCCTTCGCTCAGCCTCACAGTCAGAGGGGGAGACACGTGGGAACTTGTGCTCACAGCCCTCGGTCCCCCTTCGCTCAGCCTCACAGTCAGAGGGGGAGACACGTGGG
## >chr10:71484-71997
## GTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGAAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGGAGGAGAGGCTCACACCCCACTGGGGCTGCCGGAGTTGGG
## >chr10:131190-131574
## ggtggggacagggacaggtggggacaggggcaggtgggggcaggggcaggtggagcaggtgaggacagggagaggtggggcaggtggggacagggacaggtggggcaggttgaggacagggacaggtggggacaggggcaggtgggacaggttgaggacagggacaggtggggacacggacaggtggggacagggacaggtgggacaggttggggacagggacaggtgggggcaggggcaggtggagtaggtgaggacagggagaggtggggcaggtggggacagggacaggtggggacaggggcaggtggggcaggttgaggacagggacaggtgggacaggtcgggacagggacaggtggggacaggggcaggtgggaacag
## >chr10:155900-156342
## CAGTACCTCTACACACACACGAACACGCCTGGATTCTCCAGTACCTCTACACACACATGAACACGCCTGGATACTCCAGTGCCTCTATCCACACACGAACACGGCTGGATTCTCCAGTGCCTCTATCCACACACGGACACGCCTGGATTCTCCAGTGCCTCTACACACACACGGACACGCCTGGATTCTCCAGTACCTCTACACACACACGAACACGCCTGGATTCTCCAGTGCCTCTAGGCACACACGGACACGCCTGGATTCTCCAGTGCCTCTAGGCACACACGAACACGCTTGGATTCTCCAGTGCCTCTACACACACATGAACACGCCTGGATACTCCAGTGCCTCTAGGCACACACGAACACGCCTGGATTCTCCAGTGCCTCTAGGCACACACGAACACGCTTGGATTCTCCAGTACCTCTACACACACACGAAC

Did you spot the difference between this command and the one above it?

Here, the key change is -bed -, which tells getfasta to read the input for the -bed option from the output of the preceding grep command rather than from a file.

Note that not all command line tools accept this shortcut, but most of the ones we cover today do.
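As a small illustration of the same convention with only standard Unix tools, the sketch below pipes the output of grep into cat, which also treats - as "read from the piped input here". The files demo.bed and header.txt and their contents are made up for this example and are not part of the workshop data.

```shell
# Work in a throwaway directory so nothing in the workshop folder is touched
tmpdir=$(mktemp -d)

# A tiny, made-up bed file (tab-separated: chromosome, start, end)
printf 'chr1\t100\t200\nchr10\t300\t400\nchr10\t500\t600\n' > "$tmpdir/demo.bed"

# A made-up header line stored in its own file
printf '#chrom\tstart\tend\n' > "$tmpdir/header.txt"

# grep's output is piped to cat; the - marks where the piped text goes,
# so the header file is printed first and the chr10 lines after it
grep "chr10" "$tmpdir/demo.bed" | cat "$tmpdir/header.txt" - > "$tmpdir/chr10-with-header.bed"

# Display the combined file
cat "$tmpdir/chr10-with-header.bed"
```

Because the - is an ordinary argument, its position matters: here it comes after header.txt, so the piped lines land after the header.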

Exercise: In the code block below, write a command that extracts the sequences of only the duplications in the bed file from the macaque genome. Output these sequences to a file called macaque-svs-filtered-dups.fa. BONUS: Figure out how to keep the SV name (4th column of bed file) as the header of the sequences in the output FASTA file (Hint: check the help menu of bedtools getfasta!).


## Use grep and bedtools to extract sequences of duplications only
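# Input bed file: data2/macaque-svs-filtered.bed
# Input genome FASTA file: data2/rheMac8.fa
# -name: Include the name column (4th column of the bed file) in each FASTA header (the BONUS)
grep "<DUP>" data2/macaque-svs-filtered.bed | bedtools getfasta -fi data2/rheMac8.fa -bed - -name -fo macaque-svs-filtered-dups.fa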
## Use grep and bedtools to extract sequences of duplications only
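
One way to sanity-check the grep step of a command like this before passing its output to bedtools is to look at what the pattern actually selects. The sketch below does that on a tiny, made-up bed file whose 4th column loosely mimics the SV names in this workshop; the file name toy-svs.bed and all of its records are hypothetical.

```shell
# Build a toy bed file in a throwaway directory
tmpdir=$(mktemp -d)
toy="$tmpdir/toy-svs.bed"

# Columns: chromosome, start, end, name (all values are invented)
printf 'chr1\t1000\t1500\tchr1:1000:<DUP>:500\n'  > "$toy"
printf 'chr1\t2000\t2300\tchr1:2000:<DEL>:300\n' >> "$toy"
printf 'chr2\t5000\t5960\tchr2:5000:<DUP>:960\n' >> "$toy"

# The pattern must be quoted: unquoted, < and > would be read by the shell
# as redirect operators instead of being passed to grep as text
n_dups=$(grep -c "<DUP>" "$toy")
echo "Duplications found: $n_dups"

# Show only the name column of the matching lines
grep "<DUP>" "$toy" | cut -f4
```

If the count or the names look wrong on the real file, the grep pattern is the place to look before re-running bedtools.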

head macaque-svs-filtered-dups.fa
# View the first few lines of the file you created
## >chr1:89943:<DUP>:528:1907.19::chr1:89943-90471
## TGGGTTGATGGTTTCTGGAGTTCAGGGTTGATTGTTTCTGGAGTTCTGGGTTGATTGTTTCTGGAGTTCAGGGTTGATTGGTCTGGAGTTCTGGGTTGATTGTTTTCTGGAGTTGGGGGTCGATTGTTTCTGGAGTTCGGGGTTGATTGTTTCTGGAGTTCGGGGTTGATTGTTTCTGGAGTTCGGGGTTGATTATTTCTGGAGTTCAGGGTTGATTGGTCTGGAGTTCTGGGTTGATTGTTTTCTGGAGTTCAGGGTTGATTGTTCTGGAGTTCTGGGTTGATTGTTTTCTGGAGTTCTGGGTTGATTGTTTCTGGAGTTCGGGGTTGATTGTTTCTGGAGTTTGTGGTTGATTGTTTCTGGAGTTCTGGGTTGATTGTTTCTGGAGTTCAGGGTTGATTGTTTCTGGAGTTCTGGGTTGATGGTTTCTGGAGTTCTGGGTTGATTGTTTTCTGGAGTTCAGGGTTGATTGTTTCTGGAGTTCTGGGTTGATTGTTTCTGGAGTTGGGGGTCGATTGTTTCTGGAGT
## >chr1:218574:<DUP>:960:5699.01::chr1:218574-219534
## TCTGTCACGGAGGAGGCGGGTCTTTCTCTGTCATGGAGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTGCCACGGGGGAGGAGGATCTTTCTCTGCCAATGGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGCGGGTCTTTCTCTGCCGTGGGGAAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCATGGGGGAGGCGGGTCTTTCTCTGCCACGGGGGAGGCGGGTCTTTCTCTGCCGTGGGGGAGGCAGGTCTTTCTCTGTCGTGGGGGAGGCGGGTCTTTCTCTGTCGTGGGGAAGGCGGGTCTTTCTCTGTCGTGGGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGCGGGTCTTTCTCCGTCATGGGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCCATCATGGGGGAGGCGGGTCTTTCTCTGCCTTCAGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTGCCATGTGGGAGGCGGGTCTTTCTCTGCCTTCAGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTGCCATGGGGGAGGCGGGTCTTTCTCCCTCATGGGGAGGCGGGTCTTTCTCCCTCATGGGGGAGGCGGGTCTTTCTCTGCCGTGGGGGAGGCGGGTCTTTCTCTGCCGTGGGGAAGGCGGGTCTTTCTCTGTCATAGGGGAGGCGGGTCTTTCTCTGTCGTGGGGGAGGCGGGTCTTTCTCTGTCATGGGAGAGGCGGGTCTTCGTCTCTCATGGGGGAGGCGGGTCTTCCTCCCTCATGGGGGAGGCGGGTCTTTCTCCGTCATGGGGGAGGCGGGTCTTTCTCT
## >chr1:219608:<DUP>:470:2074.69::chr1:219608-220078
## CTCTGTCACGTGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGCGGGTCTTTCTCTGCCATGGGGGAGGCGGGTCTTTCTCTGTCACGAGGGAGGCGGGTCTTTCTCTGTCATGGAGGAGGTGGGTCTTTCTCTGTCACGGGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGCGGGTCTTTCTCTGCCTTGAGGGAGGCTGGTCTTTCTCTGCCTTGAGGGAGGCGGGTCTTTCTCTGCCTTGAGGGAGGCGGGTCTTTCTCTGCCTTGAGGGAGGCGGGTCTTTCTCTGCCTTGAGGGAGGCGGGTCTTTCTCTGTCATAGGGGAGGCGGGTCTTTCTCTCTCATGGGGGAGGTGGGCCTTTCTCATGGGGGAGGCGGGTCTTTCTCCGTCATGGGGGAGGCGGGTCTTTCTCTGTCATGGGGGAGGCGGGTCTTTCTCTGCCTTGAGGGAGGCGGGTCTTTCTCTGT
## >chr1:519434:<DUP>:22148:1673.64::chr1:519434-541582
## TCCTGGGGTCAAAGGTGACAGAAGAGGCAGAGGCTGGAGCTTTCTGGAGAATTTACTGACCACAGCGTGGTGCACTTGACATCAGGCGCCCGCCATGGCCGGGCCTGGGTCTGAATGCTGCCCGGGACCAGCTGCCTGCGCTCCAGCAGCCCCTCCCTCCTGAAGGCCAGGTCCCCCGAGAAGAACGAGGCTGCAGAGTGATGTGGGGGCCAGCGGTGACTTCCTACCACACTGTTCTCAGGTGTAAGAGGGCTCGCTTCTGCCCAGGCATTGTCCGTGGAAGACACACAGCCGGCCACTGCAGCCTCAGTCCTGGGATGCCCTGGGGCTGGGTCACAGGGGGCCACGGGCCACGCTGGGAGGCCACAGTCCTGTCGTGCCATGCAGCTCCCTGTCCCCAGATGTCCGCTCAGGGATGCAGAGGGCAGAAACCACACTCGCTGCCTGAATTCTGGGAGCAGAGCCCGGTACCCACTGCCTGGCCGGGGCCTACCCTGGGACTCCAGCCCCTGTTCCCGCTGGCCCGGGCTTCCGGAGGCAACTGTGTCCCTATCCTGGCTCAAGGTCCAGGCTGCACCTGGAACCTGCACGGTCACTCCTCCAGGTCCTCAATGCTGGAGGACTCTCTCAGACAGGAAACCTTTGCGTTGGGCGCAGGGCGGGGTGCGGGGTGGTCACGGGGAATCGCAGGGCAAAACAGCACAGTGCAATCGCGCAGAGCCTGATATTGGCGGATGAAACATAAACTGCTTTCTGCACTTTGTGTCCTTAGGAAGGGTGTGGGGTGTTGGCGGAAGTAGGAAACAGAAGAGGAGCCTGGGCATGCAGCGGGTCTGTCAGAGAGCAGAGCCCTCGGAGCTGCAGTGCTTGGAGGGAGGCGGTTCACCTCTGCCCACTCTCTCCAtttctctctctctcattttccttttagagatggattcttgctctgagcctaggctggagtgcagtggtgtgattatagctcattgcagcctcgcccttccaggctcaagtgatcctcctgcctcagcctgtccagtagcCATACCCTACTAGGTCCTAGTTAGCCCCCAGAGGCGTGCACCACCACGCCCACTAATTGCAAAAATTTGTTggctgggcgcgatggctaacatctgtaatctttgggaggccaaggcgggcggatcacgaggtcaagagatggagaccatcctggctaacacggtgaaacccggtctctactaaaaatacaaaaaattagccgggtgtggtggcgggggcctgtagtcccagctactcaggaggctgaggcaggagaatggcgggaacccgggaagtggagcttgcagtgatctgagatcactccactgcactccagtctgggggacagagcgagactccgtctcaaaataaataaataaataaatatataataaataaataaaaataaaaataaaaCTAAGCCCTTCCTGATGGTCATTGGGGGGTTTGGGGGTTGGGGGGGGTGTCTGGCTATGGCTGGGGAACTCATTTGGTTTTCCTCCTCCTCCTCtttttattttttggtagagacggggtctcttgatttcccaggctgatctccaactcctgggctcaagcaatcctcctgcctcagcctcccaaagtgttgggattacaggcctgagacaccgtagctagccAGCtttctttttttttttgagacggagtcctgctgtcacccaggctggagtgcagtggcgagatctcagcggatcactgtgttatacgtaaattttcggtgtcgcaaaagaagtagcactcgaatgtacacttttctcagctaggaaatttacttctatagaaggggggtctcatagatggagcaatggtgagcatttggacaagggaggggaaggttcttattcctgacgcaggtagcgcctactgctgtgtggttcccttattggacagcgttagacctcacaatctaaatccgattggcCtttttttttttttgagatggagtcttgctg
tgtcgcccagactggagtacagtggtgcgatcttggctcactgcaagctctgcctcctgggttcatgccattcttctgccttagcctcctgagtagttgagactacaagtgtatgccatcatgtgcggctaatttttgtgtttttggtagaaagagatttcaccacgttggccaggatggtctcgatctcctgacctcgagatccacctgcctcggcctcccacagtgctgggattataggcatgagccactgcacctggccttaagtggttctttaaagtctgattcgttgtttctactttccctgatgagggtgggtgtcaaggagtgtggtattcttacataatgtctgatgtttggaatagcAttttttttttttttgaggcagagtctcactctgtcgcccatgctggagtgtagtggcaccatcttgtctcactgtaacctttgcctcccgggttcaaacgatcctcctgcctcagccttccacgtagctaggattacaggcgtccaccaccacggccggctagcttttatatttttagtagagacggggtttcaccatgttggccaggctgtacttgaacttctgacctcaatgatctgcccccctcagcctcccgaagtgctgggatacaggtgtgagccaccactccTCGCTCAAGTAATATGTTAAACTTATGCTTTCTTCTTTTCTTCTTTCttttttttttttttttttttggatggagtcttgttctgtctgcccaggcttgagggcatggcataactcggctcactgccctccgccgttccagtcatgcatatctgctgccttcagcctcctttagtacgggacacgaggccacctgccacccgtgcctggctatttttttatttttttttttttttttttttttttttttttttATCAGgacagagtctggctctgccgccaggctggagcttgcagtggcgtcagctcaacctgcaagctccgctccgcgggttcaacgccattctcatgcctcctcagcctccccgagtaattgggactacagcgcgcccgccaccgccccgctcagtttttgtattttttagcagagaggggttaccgtgtagccaggatgggtctcgattcctgacgcctcgtgatccgcccgtctcggctcccaagctgggattacaggcttgagccacgcgccccggcccggcatttttttcatttttagtaagaaacagggtttcaccgtgtttagccaggattggtgtcgatttcctgacccgtgatccgcccccctcggcctcccaaagtgctggattccaggcctgagcctgcaagccgggccTACTCTTTGGCTTTTAAAAGAATGGGCAACATTGCTTTTCTTTACTAACTTCTAATCTTTCCCTCTCTGACTCATCTCTCCTCCCACTTCTCTTGTTCTCCCTGTCAGTGTTCCTTTCCTAAGAGTTTTTCCCTGTCTATGATCTTTTTTTATAGGCTTTTTTCTAGTTTCTCTTTCTTTGTAATTGTGCGTTAATACTGGCCAATTGTTAGTGACAAATTCCTTGCCAAGAGATCCCTGACCCTAAACCAGCATATTCTGTCCATTCGTTTTAATCTGTACtttatttttcttgagatggagttccgctctgtcgccaggtgtggatggtgtagtggcacgttctcgctcactgtcaactcgccctccagggtcaacccgcaccatcctcgctgccttagcctccgagtacggggattgtacaagcgtccaccacccggcctggcgaggcgcttgatttttttatttcagtagagatgggggttttcatcgtgttagccagatggtcccccccatctcctggactcatgctccgcgcaccgccccttggcctccgcaagtgcgcgattaTGATCTCTCTCAAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNtctcctgcctcagccctctgagtagctgggattacaggcattttttgtatttttagtacagatggggtttcaccattttggtcaggctggtggaggactcctgacctcaaatgacctccccgcctggcctcccgaaatgctgggattacatgcgtgatccaccacgcccagccATACAGttcttatgttaagacaggctctctgtcgcccaggctggagtgcagtggcgcgatcacagctcactgtttgcctcgacctttcaagctcaagctgtcctcctgcctgagccgcccgcgtagccaggactgcaggggcacagtgccatgcccggctaatttttttttttgtgatggcgttttgctcttgttgcccaggctggagtgcggtggcgcaatcttggctcactgcaacctcctccccctgggttcaagcaattctcctgcctcagcctcccaagtagctgggattacagtcatgtaccaccacgcccggttcattttgtatttttttttagtagacaagggatttctccatgtcagtcaggctagtcctaaactcctgacctcaggtgacccgcccacctcagcctcccaaactgctgggattacaggcgtgagccactgtgcctggtcCTGGCTAATAttttttttttttttgagacggagcctcgctctgtcacccagactaaagtacagcggcgcaatctcagctcactgcaagctccgcctcccgggttcatggcattctcctgcctcagcctcccaagtagctgggactacaggctcctgtcacctcgcccggctaattttttgtatttttgtagagacggggtttcacagtgttagccaggatggcctcaatctcctgacctcgtgatccgcccacctcggcctcccaaagtgctgggattataggcgtgagccaccgcgcccagctgtttttttgtaatgttagtagacatggactttccccttgttacccaggctgggctcaaacttctgaggtataagagatgctcccgccttgaccttgtgaagttctgggattacagacgtgagcccccatgcccagtcAGGGGTttgtttgttttggtttttgtttttgtttttgagacagagtctcactctgtcgcccatgctggagtgcagccgtgcaattttggctcgctgcaacctctgcctcccgggttgaagtgattctcctgcttcagcctcccacgtagctgagaccacaggtgtgccaccgcgcctggctgatttttgtatttttagtggagacggggtctcaccatattggccaggatggcctcaaactccctacctcaggtgatctgcccgcctcggcctcccaaaatactacgttacatgcatgagccaccgtccctggcTGTGGTCAGGCTTTTGAGTTTAGATCCATGAAAGTGTGGCCGCGTCCCTGCTCCCTGCAGGAGGGAGGCCTGTGGGACCTTCTGCTGTGGCTGTTTACAAGGCTTTGCTCCTGGTGCCTAAGGCTGGAACCTTCTCTCTGCAGGAGGAGATGAGCAATTACTACCTCAGAGTCACCCAGAACGCCTTCCTAAACCACACGAGGCAACGCAGCAACAAGTGAGGGAGCCCCTCGGGTCCTGGGCCCCCGGGTAGGGCTGTGCAGCCGTCGCCCTTGGTTCCCACAGAGGGACCTCAGAGGCCCTGGATCACAGTGCTGGGCAGCACCCGTGGCCTCAACGTGTCCACCTCGGATGTCCCCTAGGAATGTCCCAGCTCGGGACAGCATGGGGCGTCACTGAGGAACATGCGGGGGCCTCCTGGGCAGAGCCGGGGTCAGTCCCGTCCTCACGGCCCTGTGCGATGCCGCCCCAGCTTGCACGTCCCTCTGCCCCTGGGTTTCCGCGGTCCTGTGCCAGCAAGGGAGGCGGTCTGATTGTCTGAGGCTCTGCTGGGGCCTCCATTGCAGGCTGTGGGTGCCCTGGGGTGGGAGATGGAGACACT
TTTGCTCCCACGGGAAGCTGGGCACGAGCAGGTCCTGTGTGTTTGGGCGGAGCCTGGGGCCTTGGCCCCCCCGCCCAGATGCTGGACAGGGTTGCTCCCTCCAGGCCTGGGGCCCTCCTCACATTGCGCGTCCTCCGTGAGCTGCTACCCAGAGGTCCCCAGTAGGTGGATAGCCCCATGGCCAGGCTCCCTAGCCCCTTTCAAATCCCCTTATTTTGAGTTTTCTTGGTCTCCTGGGCCCCTCCAGCCCCAGTCACGTGTCACACGGAGAATCAAGTCCTGCCGGTCGGCCGTGGCCGAGTCTTCAGGCGTGTTGGGCTCGCTGGCTCAGCTGCTGCCGGTAGACGCTCCCTGGAGCCCTGGCTCAGGTCCTTCCCAGAGAGGCAGGGCTGGGGCCCTGGTGAGCCTCCGCTGCACCCGGGCCCCCAAGGTCCTGCTCCTGGCTCGTGTGGCCACTCTTGGCATGGACTCTGGGTCCCGCATCCCTGCTCCCAGCACAGCAGGGCTCAGGCAGCAGGAGGAGTGGTGGTCCCGACGCTGCCTATCACGCTGGGTGAGGGTCAGCGGGGAAGCGCCACACGGGATGAGAACAGAGGCCCAGGTAGCCGGGCGGGGGGACAGCTGGGCGTGGTGGGGCCGGCGGTGACCAGGGGGACAGCTGGGCGTGGTGGGGCCGGCGGTGACCAAGGCTGTGCCACGTCCTCCCGATGTTTCCTGTGCTCACAAGCTGCCGCTTTAGATTCTCCGGGAAAGTCCCCCTGAAGGGACTAAGGAGCCCGCGTTCCCCTCGGGACAGCTTGGCCGGCAGCCCCAGCATTTCCTTCCCCATCCCTGCTCCGCAGATTCATGCTGGTCCTGGCCAGCCGCGACCCCAAGCAGTTACACCAGGACATCCACGACCGCATCGACGTGATGTTTTACTTCGACCCGCCCGGGCCAGAGGAGCGGGAGCGCCTGCTGAGAATGTATCTTGACAAGTATGTTCTTATGCCGGCAACAGAAGGAAAGCAGTAAGTGTCTCCCCTCACCCACCCCTGTCCAGGGACCCTCGCTCTGGGCCCACCCCCGGCCCTGCTCTCCGGACGCACACAGCAGGCCCAGTCTCCGGGGTGGCACCGCCTCCCTGCTTTGCGGTTTCGCACAGGAGCCCTGTGGGCCCCAAGGGTCCCAGAGGCTGCACCCAGGGATGTGCCACCACCCTTTCCTCATCCCCACCTGAGAACAGCCTGGTGGTGTCTCCTCGGGTTTGGGGGGCAGAGCCCACCATCACTTACAAACCTTCAACtttttgtttttgagacaaagtcttgctctgtgccccaggctggagtgcagtggcacgatctcagctgactgcaacctccgcctcctgggttcacgcgattctcctgcctcagcctcctgagtagctgcgattataggtgcctgccaccacgccccactgcttttcgcctttttgtagagatgcagtttcaccatgttggccagggtggtctcgaaaccctgacctcgggtgatctgcccgccttggcctcctacagtgcagggattacagatgccagccactgtgcccgaccACCCTCAGGCCCTGGCAGTGCAGGGAGGTGACGTGGAGTGTTGCTCTGAGACCCCCATGTTGGGATTTGAGGGAGACGCTCCTCATGAGAGCCCCGTGTTGGGACTGGAGAGGATCCTCACGGTCCCCTGCTGGCTGCTGGCCTGGCCTTCCTCCAGCTGCCACGCCTGGCCCTGGAGCCTCATGGTGTGGGGCGCGGCTCCGGCTGCACTTGTGCCCTGAGGCCTTCAGGCTCCCTGTCGCTGGCGGTGGGTGCAGCAGGCACGGCGGGCAGAGCCCTCCAGGTGATGACAGGCCCTGGGGCTGCACGCCGGCTGCCTCAGGAACACTCCAGATGAGCAGTGGCTGCTCCACCTCTTGGCGTCCCCAGGTCCCAGGTTTCTGAGTCCTTCTGTCCACCTGACCTAAATTCCTGCTCTCTCCA
GTGACAGCAAAAGCCGCTCTGTTCCAGAGAGAGCCTGGTTCCCCCTGCCAACCGCTCCGTGGCTGCCTGCTTCATGCTAGCCCAGCTGTCCCGGCCTCAGTTTCCCTTTGGCCCTCCCCTGCCCTGGGCTCTCCCACTCCCACGGCTGCTCATAGACCTGGCACAGTGACTTGGCTTCTATGACCTCCAGGGAGATGCTTTTGCTGGAATTCAGGGCTCTGCCACTGCCACTGTAACGGCCATGAGCCCTGTGGGTGCTGAGTGGGCAGGTGAGGGCAGGGCTGGTGTGAAGAGGGGGTGCGGCCATCTCCAGGCCCCACAGCAGCCACCACCTCCCTGCTCAGCCCAGACCTGGTTTGCATCAGGGAGAGGGCGGAGTTTGGCTGTCACAGGAAGAGTCCCTCCCAAGGGGGCATCTGGCATGGGTGCCCGCCTGGCTGCCTGTCTTCCAGCCCCCACCTCGTGGTGTGGGAGCCGCTGCCTTGGCCGGCCCACTTGGGAACTCCTTCCCCAGGCGCCTGAAGCTGGCCCAGTTTGACTATGGGAGGAAGTGCGAGGAGATCGCTGAGCTGACGAACGGCATGTCGGCCCGGGAGATCGCACAGCTGGCTCAGTCCTGGCAGGTGAGTGGGGCTCGGGCGCACCCACCCAGACAGGAGCCCAACTCCTGTGGAGACGCCGGGTTGCGCCTGTCCCAGCACCAGTGTCACACCGCAGCTTCTGTTGAGGGGTTTTCAGTGCACAGACGTGACACGGGGCACTCGCCCCAGTCGGCCACTCCACACACTGGCGCGCCCCTGCTCCTGCCCTGGGAAGTGTGGGGCATGTCCGTGGCTGACGGTCATAGGTCAGGAAGCCCGTCCGGCATCCTAGTATCCGGGCTCTGCCAGGTGGGGCGGGAGGCTTTCGATGCTCACCTTGGCAGACGGGCACCCCCTGGTGTGAATGGTCATCGGGACAGGCCCCGCCTGAGTTTGGTGGTGGGGCTGGAGGGATGTTGTGTTTCCCGGACCACGTCCGTTGGCTTGATCCTGCTTGACGGGCTCAGACACAGGGGCAGGAGTGACCTCTGATTGTCCCACAGCCGGCTGCTCCTTGGAGGACCCCCTCCTGCAGCTCCGTGGCTGCTGCAGGGACGGGGAGCCGGGACTCAGAGCAGTGTGGGCGTGGCCATCCAGAAAGCTTTGGTCTTTGGGGGTTGCTGGAAAAGCATAACCAGGTCTGTAGAAGGCACCAAGGCCATGCACAGGCATTGCTGCCTCTGGGGTCTGCAGAGTCTGTGACAACCTGGTCACTCAACCTAGCAGCGCTTTCGCGTGTGACAGGTTCATGAAGTAGCCAGTTACCTTGATTTGAACGTTGGAGCTGGGGACTATATGGGCTGTATTAGTCAGTTATGCCGCTGTGACAAAGAGCCTCAGATCTCAAACCCCATCCTTGTGGGTCAGCTGAGGTCTGTGTTCCAGGCCGTCTCCACTTGAGACCAGGTCTGTTTCCACAACTAAGCAAACAGAgaccgggccatggtgttgggctacatttgttcccagcatttgggaggtcgaagtcagcccagattatttgaaggcaggagtcaggaccagccttggggggggggggggggggggggggggggggaaagcaaggggagactccatctacaaaaaataaaaaaattagccggaccctaatgtggcacgcctgtaatgcagctcctgggagcctgaggtgggatgatcactgagtcccaggtaggccagaaatacagtgagcctgtggattgtgccactgcactccagcccgggttacagagcgagaccctggtctttaaaaataagaataaTTTGAgccgggcatggtggctcacgcctgtaatcccagcacgctgggaggccaaggggagaggatcacttgaggccaggagttcgagaccagcctggccaacatgtcgagccccacctctactaaaaatacaagaattggccgggcgcagtggtggt
gcatgcctgtattctcagctactcaggaggctgaggcaggagaatcgcttgaacccgggaggtggaggttgcagtgagctgagatggtgccattgaattccagcctggactattcaggatcctttgagattccataagaattttaggagtggttttcctatttttgtaaaacataatttgggttttcacagggaccgcgtttagtctctatgtcgctttgatgtctctcagcaatattCTGTGGttttctcttgttttcgagacggagtctcgctctgctgcccaggctggagtgcagtgttgtgatctcagctcactgcaacgttcccctcccgggttaaagtgattctcctgactcagcctcctgaggagctggaattccaggcaggcgccaccatgcccggctaatttttgtactaagagacggggttttgccatgttggccaggctggtctcgaacctctgacctcaggcaatccacccacctcagcctcctaaagtgctgagattaaaggcacgtgccaccacgcccggctaatttttgtatttttagtagagacgatgattcaccatgccggcgaggttggtcttgaactcctgacatgaggtaatccatctgcctctgcctcccaaagggctgggattcagacatgggccactgcgcccagccagttttcactgtacaagtctttcaccctcttggttaagtgaatttccaagcattttattcttgccgctgctgttgtaaatggaaacggtttcataattccccattcacattattcactgttgggatggagaactgcagctttctttgctgttgattttgtatcctgtaagtttgctgatgtcacggcattttttcttccaatatggattctaggattttctacatataagattatgtcatctgagaacaggtgatttttacctttcccttttcagtttggatgacttttctttttcttgtctaattgcactgtccagagcttccagtggtgtgtggaatagaagcggtaaagcattcttgcctggttccttacctcagaggaaaagctttgtttttcaccactgagtatgtcacctatgggcttgtgatgtgtggccttcattgtgtttagggtgtatccttcaattcttggtttggtgagtgtttttatcataaaagtgtgaggcgggtggatcacctgaggtcggcagttcgaggccagcctgaccaacgtgaagaaaccccatctctcctacaaatacaaacttagttgggcatggtggtgcatgcccgtaatcccagctactcgggaagctgagacaggagaatcgcatgaaggcggcaggcagaggttccagtgagccgagatcgcgccatttgcactccagcctgggcaagaagagcaaaattgtctccaaaaaaaaaaaaGTggccaggcacggtgactcacgcctgtaatcccagcactttgggaggccaaggtgggtggatcacgaggtcaggagatcgataccatcctggctaacacagtgaaaccctgtttctactataaatataaaacatcagctgggcatggtggcaggtgcctgtagtcccagctacctgggaggctggggcaaaagaatggcgtgaacccaggaagcggagcatgcagtgagctgagatgcctgggctacagagtgaggccccaactcaaaaaaaaaaaaaggtgttgtatttggtcgaatactttttctgcaacacttgagacagtcgtgtggtttccttcctccaccctgctaatatcgattgatttttgtatgttgaacatttcatatgcggaacattgattttcatatgttgaactatcgttgcattccaggaataaatcctgcttggtcggctgggcgcggtggctcaagcctgtaatcccagcactttgggaggccgagatgggcggatcacaaggtcaggagatcgagaccatcctgtctaacctggtgaaaccccgtctctactaaaaaatacaaaa
aactagccgggcgaggtggcgggcgcctgtagtcccagctactcaggaggctgaggcaggagaatggcgtgaacccgaaaggcggagcttgcagtgagctgagatgcggccactgcactccagcctgggtgacagagcgagactccgtctcaaaaaaaaaaaaaaaaaaaaaaaaaatcctgcttggtcagggtatagagtccttttagtgtgctgctgaattcactctgctggcattttgttgaggactttcccagtgatgctcatcagggatattggcctgtcatttttcttgtggtgtctttgtctgggtttgatatcagggtaatgctggcctcctaggatgagtgaggaaatgttcttcaatttgtccaagagtttgaggtgtgctgctgattcttcttaatgttttgtgaattgacacgtgaagacatcaggtccaggtcttgtgtttCaacttttacagcttgaagactttaggttcccagaaaaattgcaaaggtagcacagagagctcccgGGCCCGGGGCCTTGCCACGTAGTGAACGTCATGTGTCACTGTTGGCCCCACCTGGGACTGGGTCTTGCCCAGAATCCCACCCAGGAGGCCACGTGACATTTAGCTGTCACTTCTGGTGGGCTCTGCCAGGTCCCGTGCTTCCTGGTGGGGTGGCCCCATGAGCATCTGCTCATCCCCTTTCCTCCACTGGGCCCTGGGTGAGGTGCAGCCACTCGGGTGCACCCTGAGGGTTCCTGCACCTGTTTGAACTCTCTTGGGTCGGCTCAAGACCAAAAATGATGCTGAGCAGTCCTGGGCCTCTGATGCATAGTGGTGGTCCGGTTCCGGTCAGCGTCTCCTGCACTCCTGGGCCCCTGAGCCACAGTGGCGGTCCAGCTCCAGTCAGTGTCTCCCCACACAGTGGCTCTTGGCGAGGGGTGGGCGCTGTCAGTGGGGACGGGCACCACGTGGTCATCCCCATGGCAGGTCCCATCGTGGCAGCCGTGTTGTGGGAGGATGGTGCGCTGCTGCCCCTTTACCCTGTGAGATGAATCCTGCCTCTGGGAGGCACAGCCGGGATGGGGTGAGGGACCCCCTCAGCTGTCCGGGAAGCGTCCCCTGCCCTGTGCTTCCTCCAGGCGTCCTGGTGCACTCCCAAGCACGGTGCCCAGTGGGGGTGCCCAAACCTTCACCCTGACCCATGGGTGACTTCCCTTGGGGACTCCACGCCTTTCACTGGGACTGGGATGGAGAGCGACCTGTCCATGGCAGAAGGGCTGCACCTGAGGTGCTTGAAGCAACACCAAGGGCCACAGTCCCAGCAGCTCCAGCCTCCGCATGCTGGATGCCAAGTCCTGTGCCCAGGACAGGGAGGTGGAGGCACGGGTGATCTTGATGCTAGCACCTATGTGCCCCGAGGTTGGGCAGTGGCTGCCTCTGCTGTGGAGGCCTATGAAGGTGAGGGTCTGAGGATCTGTAGTGCACTGTGACCCGGGGGCACTGCCTGGCCACGGCTGAGACACGCAGAGGGTCTGCAATTCCCTCCTGCCTCTTGGGAGCTGCCCTGGGTCTGCAGTCAGTGGGGCTCGTCCTCGGGCTTTCCGTTATTAGAAAGTCACTGAGAAACTGCAGTGCTGAGGACGCAGGCAGGGCTGTGGCACTGCAGGGGCCGCTCCCGGTGTCCACACGCATGCTGGGCTCTGCCGAGGTGCCGGAAGCCTGTGTTTCACCCTGAGGCCGTCCTGGTGCCCCGGGTTTGGACCCTCCCCACCTCGGGGTCCTGGAGTGCGTTACGGGTGGGGGGTTCCCATGGTGGCCTCCCTCAGCTCCCTCTCTCCTCACTAGGACACGGCGTATGCCTCCGAGGATGGGGTCCTCACCGAGGCCATGTTGGATGCCCATGTTGAAGACTTTGTCGAGCAGCACCAGAAGAAAATGCGCTGGCTGAAGAGGGAGGGCCTGTCCTCATGGACCAGCACCCCTTAACCTGAGTCCGCGGTGA
GACCACACGTCACGGAGCCTGGCTGCGGACCCCTCCCACCCCTGCTTTTCCGGTCCCTGCACGTTTAGGAAATGCTTCCCCTAATAAACTCCCACAGGTGCCACAGCGCTGTGTCTATTGGCTGATGTGGTGCGGGGTTTGGGGTCCCCTAGTGTCCTTCTGGGGTCAAAGGTGATAGAAAAGACAGGCTGGAGCTTTCTGGAGAATTTAGGCACAGAAGGGTGGGCTTCACATGAGGTGCCTGCCACAGCGGGGTTGGCTGCCTGAATGCCACCCGGGACCGGCTGCTCGCGCTCCATCCTGCAGCTGTGGAGACGGGGGTGCCCCTTTGCCTCTCTCCACGAAGTGCAGGGCAAACAAGACACAGCGGTTTCAAACAGGCGATGGCCCGGACTGCGTGCCTCGCCGCCCCTGCGCCTTCCCCTGCCCCTGCTTTCCAGCTAGTCCCTGAAAACCTTGATGGggccgggcgcggtggcccatgatggattctcagcactttgtgaggccaaggcgggtggatcacctgaggttaagtgttccagcccagcctggccaacatggtgaaaccccatctctcctaaaaaaaaaaaagaaaagaaaaagaaaaattagccgagcgtcgtggcaggtgtctgaaatctcaggcactcaggaggctgaggcaggagaatcacttgaccccgggaagtggaggttgcagtaagctgagaccatgccattgcagtgcagcctggacaacaagagtcaaactctctcaaaaaaaaaaaaaGgccaggtcaggtggcatgtgcctgtggtcccagcttggtcccagattcttggtttggaggctgaggtaggaggatcacttgagcatgggaggatgaggttgcagtgagccaagatcgcttcagacactccagcctgggtgacagagtgagaccctgtctctaaataatcaaaaCCTTGATTACAGCCATGGGGTGGGGGTTGGGGGGCGTCTGGCTCGGCAGGGAACTATTGGGTTTTTCTGCTCTCtaatttttgtagagacagggtttctctttgttgcccaggctggtctccaactcctgggtcaagcgtcgatcttctgcctcggcctcccaagtggtgaggttacaggcgtgccaccgcacctgaccTGttttctttttttttttttttttttttttgagacggagtcagctctgtcacccagggctggagtgcagtgggcggtctcagctcactgcaagctccgcctcccgggttcacggccattctcctgcctcagcctcccgagtagctgggactacaggtgcgtgccacaacgcccggctaagtttttgtatttttagtagagacagggtttcactgtgttagccagggtggtctcaatctcctgaccttgggatccgcccgtctcggcctcccaaagtgctgggattacaggcttgagccaccgcccccggccCCttttttttttttttttttggcaagggagtcttgctcgcccagggtggagtgcagtgttgcaatctgggctcactgcaacctccacgtccagggtgtcaggcctctgagcccacgctaagccatcatatccccagtgacctgcatgtgtacatctgatggcctgaagcccctgaagatccgcagaagtgaaaacagtcttaactgatgacattccagccttgtgatttgttcctgccccaccctacctgatcaatgtactttgtaatgtcccccacccttaagaaggttctttgtaattctccccaccctggagaatgtactttgtgagatccacccccagcccccaaaatattgctcctaactccactgcctatcccaaaacctctcagaactaacggtaatcccagcaccctttgctgactctttttggactcagctggcctgcacccgggtgaagtaaacagccttgtggttcacacaaaacctgtttcgtggtgtcttcacacggacacgcgtgacacagggttcgaggaaatttcatg
cctgaacctccggagtagctgggattacaggcgaacggcaccatgcccaggttaatttttgtattttcggcagagacagaggcccaggtagccgggctggGGGACAGCTGGGTGTGGTGGGGCCGGCGGTGACCAGGGCTGTGCCGCGTCCTCCCGGTGTTTTCTGTGCCCACCAGCTGCCGCTTTAGATTCTCCGGGATAGTCTCCCTGAGGGGGCTGAGGAGCCTGTGTTCCCCTCGGGGCAGCTTGGCCGGCAGCCCCAACATTTCCTTCCTCATCCCTCCTCCGCAGATTCATGCTGGTCCTGGCCAGCCGCCACCCCGAGCAGTTGGACTGGGGCATCCATGACTGCATCGATGTGACGGTCCACTGCGACCTGCCACGGCAGGAGGGGCGGCAGCGCCAGGTGAGAATGTATTTTGACAAGTATGTTCTTAAGCCGGCCACAGAAGGGAAACAGTAAGTGTCCCGCCTCACCCGCCCCTGTCCAGGGACCCTCGCTCAGGGCCCACCCCGCCCCTGCTCTCCAGACGCACCCAGCAGGCCCAGTCTCCAGGGTGGGCACCACCTCCGTGCCCTGAGGTTTTGTGCGGGAGCCCTGTGGGCCCCGAGGGTCCCAGAGGCCGCATCCAGGAGGTCACGCCCCCTTTTCCTCATCCCCATCTGAGAACAGCCTGGTGGCGTCTCCTCAGGTTTGGGGGCAAAGTCCACCATCACTTAGAAACTTTCAGCAttccttttttttttttttcttaagacggactcttgctctgtcatccaggctggagtgcagtagcttgacctcggctcactgcaagctctgtctcccaggttcacgccgttctcctgcctcagcctcccaagtagctgggacaacaggcacccgacaccacgcccggctaatttttttgtgtttttttagtagagatgggtttgaccgtattagccaggatggtctcgatctcctgacctcgtgatccacctgcctcggcctcccaaagtggtgggattacaggtgtgagccaccgcatctgacctttttttgaggaagtctcactcttgtccccctggctggagtgcagtgccgggatctcagttcactgcaacctgtgcctcagcctcctgagtagttgggattataggtgcccgccaccgcgcctggctggtttttgtgtttttgtagagatggaatctaactccgtctcccaggctggagtacagtggtgtgatctcagcttactgcaacctccaccctccgggttcaaaccatcctcttgcctgagcctcctgaacagctgcgattacaggcgcccagcacaatgctcgcctcatttttttgtctttttagtagaaacagcttttcaccaaattgaccagactggtcttggacttctgatctcaagtgattcaccctcctcggcctccaaagtgcagggattgcagatgtgagccaccggacccggcctcttttatgttcctcttcagtaCTCAGAGGGCTGTGAGGAAATCCGGTGCCCGGCCACCCCCAGGCCCTGGCAGTGAGGGGAGGTGATGTGGAGTGTTACTCTGAGATTCCCATGTTTGGATTCGAGGGAGACGCTCATCATGAGACCCCTCCGTGTCGGGATTAGAGGGAGAGGCTCCTCATGGTCCCCTGCTGGCTGCTGGCCTGGCCTTCCTCCAGCTGCCACGCCCGGCCCTGGAGCCTCCTGGTGTGGGGCGCGGATCCGGCTGCACTTGTGCCTTGAGGCTCTCAGGCTCCCTGTCGCTGGCGGTGGGTGCAGCAGGCACGGCGGGCAGAGCCCTCCAGGTGATGAGAGCCCCCAGGAACACTCCAGATGAGCAGAGGCTGTTCCACCTCTTGGCGTCCCCAGGTCCCCGGTCTGAGTCCTTCTGTGCACCTGACCTAAATTCCTGCTGTCTCCTGTGACAACAAAAGCCACTCTGTTCCAGAGAGAGCCTGGTTCTCCCGTTGACCCCTCCGCTGCCGCCTGCTCCAT
GCTAGCCCAGCCGTCCAGGCCTCAGTTTCCCTTTGGCTCTCCCCTGCCCCGGTtcccagctgcttgggaggctgaggtaggaggatcatttgagtccaggagcttgaggttgcactgagctgtgactgtgccactgtactccagccttggcaacagagtgagacactgtcttaaaaaagaagaaTTTGggccagatgctgtgtttcatgcctgttcccagcatgctgggaggctgaggagagaagatcactcgaggccaggggttccagaccagcctgccaacatgttgaaccccgcctctacgaaaaatacaaaaattagccgggcgtggtgggtgggtgggtgccagtaatcccagctactcaggaggctgaggcagcaaaatctcttgaacctgggaggtggagattgtggtgagctgagatagtgccgctgtacttcaacctgagcaacagagtgagactccttatcaaaataaagaaaTCAATCAATCAATAAAAATAATCACAATAATTTGggctgggcgtggtggctcactcctgtaatcccagcactttgggaggcgtggatcggttgagttcgaggcaagcctggccaatgtggcgaaaccccatctccactacaaatacaaaaattagccaggtgtggtgacaggcacctgtaatcccagctgctcgggaggctgagacaggagaatctctggaacctaggaggcggaggttgcagtgagccaagatcacgtcagtgcgctccagcctgggtgacagagactgtctcaaaaaagaataataataaTTTgactgggtgtggcggctcactcttgtcatcccacactttgggaggccgaggcaggaggattgcttcagctcaggatttcgagactggcctggacaactggcctggacaacatggtgaaactccatctctacaaaaaatacaaaaattagccaggcatggtatcatgtgcctgtgatctcagctactcaggaagcagagatgggagcattgctggagcctgggagttggaggctgcaatgaaccatgttcgtgccactgcactccagtgtgggtgacagagtgagaccctgtctccaaaaggcatggtggctcacgcctgtaatccctgcactttgggaggccaagctgggtggatcacctgaggtcaagagttggagaccagcctggctaacgtggtgaaaccccatctctaggaaaaatagaaaaaATTggccaggtgcagtggctcacacctgtaatcccggcactttgggaggccgaggcgggcgaatgacctgagatcaggaattccagaccaaccacaccaatatggagaatccccgtctctactcaaaatacaaaatcagccgggcatggtagcaatcccagttactcaggaggccgaggcaggagaatcactggaggtgagccgagaccacgccattgcactgaagcctgagcaacgagagggaaactgtctcaaaaaataaTGCTAATAACAAGGGGGAGAGAACAGGAGTGTGGTCAGCAGCTGGGCCTGCCATAACCCCTGGGTCGTGTGTCCCCACAGCTCTGAAGGCTAGAGGCCCGAGGTCAGGGTGCCAGCTCGGTCCCCCCCGTGGAGTGTTCTCTGTTAGCTTCTCACATGGCAGGGAGAGTGACTGAGCTCTCGCTCTGGTGTCCCTTACGAGGACGTTCATCCCCCACTGCTCAGAGCGGCGGTGAGCCACCACGCCCAGCGCCAACTTTGTCCTTCAAGAGTTGTTTTTTTGTgccgggctcagtggctcatgcctggaatcccagcactttgaaatgccaaggtgggtggagcacctgaggtcaggagtttgactccagcctggtctaaatggtgaaaacctgcctctactaaacataaaaaaatcagctgggcatgttggtgtgtgcctgtaatcccagccactcgggaggctgaggcaggagaatcacttgaacccaagaggtggaggttgcagtgaact
gagatcatgtcactgcactgcagcctggatgacaagagtgagactcccttgcaagaaaaaacaaaaattaaaaaagaaGTTGTTGTcttttttttttttttttcccttggacaattcaagatgcctagagattccatatcaattttagtaatgcttcttctatattttaaaaagtaatttgggtttttacagggattgcattcagtctctgtattgccttAATGACTCTTAGCAATGttgttttttttatttattattttttttctagagatggagtctcactctgtcagccaggctggagtttagttgttggccaggatgggcccaatctaatgacgtcaggtgatccgcctgcccctggctcccaaattgctgggattcagacgtgggccaccatgcccagccagtttacattgtacatttctttcaccttcttggttcagtgaagctccaagtattttattctttcggatgctcttgtaaatggaaatggtttcgtcattccccgttcagattatacacttactatgaagaactgcagctttctttgctgttgattttgtatcctgtaactttgctgatgtcgtggggttgttttttccaatatggattctagattttcCTTTTCTTTTTCTtttttttgtttttttgttttttttttttttgatatggggtctccctctgtggcccaagctggagtggaatgcagcggcacgatcttgaatctgcgagctcctctgcccgggtccacgccattctcctgcctcagcctcctgagtagctgagactacaggtgcctgccatcacggccggctaattttgtgtattttttgtgcagatgaggtttcaccgtgttagccaggatggtctcgatctcctaactttgtgatcggcccgcctcggcctcccaatgctgAATGCTGTTGGGACTGGGTCTTGCCCCAGAATCCCACCCAGGAGGCCACCTGACGTTTAGCTGTGACTTCTGGTGGGCTCTGCCAGGTCCCATGCTTCCTGGTGGGGTGGCCCCGTGAACGTCTTCTCAGGCCCTTTCCTCCATTGGGCCCTGGGTGAGGTGCAGCCACTCGGGGGCACCCTGAGGGTTCCTGCACCTGTTTGAAGTCTCTTCGGTCGGCTTGAGACCAAAAATGATGTTTAGCAGCCCTGGCCCCCTGACGCACAGTGGCGGTCCTTCTCCGGTCAGTGTCCCCTGCACCCTTGGGCTCCTGACGCACAGTGGCGGTCCAGCTCCAGTCAGTGTCTCCCCACACAGTGGCTCTTGGCGAGGTGTGGGCGCTGCCAGAGGGGACGGGCACCACGTGGTCATCCCCATGGCAGGTCTGGTCGTGGCGGCCGTGTTGTGGGAGGATGGTGTGCTGCTGCCTCTGCACCCTGTGAGATGAATCCTGCCTCTGGGAGGCACAGCTGGGATGGGGTGAGGGACCCCCTCAGCTGTCCGGGAAGCGTCCCCTACCCTGTGCTTCCTCCAGGCGTCCTGGTGCACTCCCGAGCTCGGTGCCCTGTGGGCGTCCCCATGCCCAGACCCTGACCCACAGGTGCCTCCCCTTGGGGTCTCCACGCCTTTCCCTGGCCCTGGGATGCAGAGTGACCTGTCCATGGTAGAAGGGCTGGACCTGAGGTGCCTGAGACAGCACCAAGGGCACTGGTCCCAGCAGCTCCAGCCTCTGTGTGCTGGATGCCACACAGACACAAGACTCTTGGGAGACGCATTTTCCATCTGGCTCAGAGGGGGAGGGGGAGGCTTTGCAACCCAGCCCCTGCCCAGGCCCCTGGGAGGGTGGGTGCCTGCTGAGCCCCCGGGGCAGCAGGAGCGGGGCAGGCGGGGTCTTTGTTCTCACTCCCACAGCAGAGGCAGATGTGGGGGCGCCTGCTGGGGCCAGACCAAGGTGGGGTGGCCTGGAGACTGCTTCCAACCGTGGCCGGGAAGCAGGGAACCTGCCCGGCGTGTCTGAGGCCACACTCTCAGCTGGCCGGTCCAA
GCCTGCGGCTGGAGCTGGTGTCTGTTTAGCTAATAAAGTCCCACAGTTGCCTCACTGCCGTGTCTATTTGCTGATGCTGCGCGGGGTTTCAGGGGCCGCCTAGCCTCCTCCTGGGGTCAAAGGTGACAGAAGAGGCAGAGGCGGGAGCTTT
## >chr1:519473:<DUP>:22560:2560.16::chr1:519473-542033
## CTTTCTGGAGAATTTACTGACCACAGCGTGGTGCACTTGACATCAGGCGCCCGCCATGGCCGGGCCTGGGTCTGAATGCTGCCCGGGACCAGCTGCCTGCGCTCCAGCAGCCCCTCCCTCCTGAAGGCCAGGTCCCCCGAGAAGAACGAGGCTGCAGAGTGATGTGGGGGCCAGCGGTGACTTCCTACCACACTGTTCTCAGGTGTAAGAGGGCTCGCTTCTGCCCAGGCATTGTCCGTGGAAGACACACAGCCGGCCACTGCAGCCTCAGTCCTGGGATGCCCTGGGGCTGGGTCACAGGGGGCCACGGGCCACGCTGGGAGGCCACAGTCCTGTCGTGCCATGCAGCTCCCTGTCCCCAGATGTCCGCTCAGGGATGCAGAGGGCAGAAACCACACTCGCTGCCTGAATTCTGGGAGCAGAGCCCGGTACCCACTGCCTGGCCGGGGCCTACCCTGGGACTCCAGCCCCTGTTCCCGCTGGCCCGGGCTTCCGGAGGCAACTGTGTCCCTATCCTGGCTCAAGGTCCAGGCTGCACCTGGAACCTGCACGGTCACTCCTCCAGGTCCTCAATGCTGGAGGACTCTCTCAGACAGGAAACCTTTGCGTTGGGCGCAGGGCGGGGTGCGGGGTGGTCACGGGGAATCGCAGGGCAAAACAGCACAGTGCAATCGCGCAGAGCCTGATATTGGCGGATGAAACATAAACTGCTTTCTGCACTTTGTGTCCTTAGGAAGGGTGTGGGGTGTTGGCGGAAGTAGGAAACAGAAGAGGAGCCTGGGCATGCAGCGGGTCTGTCAGAGAGCAGAGCCCTCGGAGCTGCAGTGCTTGGAGGGAGGCGGTTCACCTCTGCCCACTCTCTCCAtttctctctctctcattttccttttagagatggattcttgctctgagcctaggctggagtgcagtggtgtgattatagctcattgcagcctcgcccttccaggctcaagtgatcctcctgcctcagcctgtccagtagcCATACCCTACTAGGTCCTAGTTAGCCCCCAGAGGCGTGCACCACCACGCCCACTAATTGCAAAAATTTGTTggctgggcgcgatggctaacatctgtaatctttgggaggccaaggcgggcggatcacgaggtcaagagatggagaccatcctggctaacacggtgaaacccggtctctactaaaaatacaaaaaattagccgggtgtggtggcgggggcctgtagtcccagctactcaggaggctgaggcaggagaatggcgggaacccgggaagtggagcttgcagtgatctgagatcactccactgcactccagtctgggggacagagcgagactccgtctcaaaataaataaataaataaatatataataaataaataaaaataaaaataaaaCTAAGCCCTTCCTGATGGTCATTGGGGGGTTTGGGGGTTGGGGGGGGTGTCTGGCTATGGCTGGGGAACTCATTTGGTTTTCCTCCTCCTCCTCtttttattttttggtagagacggggtctcttgatttcccaggctgatctccaactcctgggctcaagcaatcctcctgcctcagcctcccaaagtgttgggattacaggcctgagacaccgtagctagccAGCtttctttttttttttgagacggagtcctgctgtcacccaggctggagtgcagtggcgagatctcagcggatcactgtgttatacgtaaattttcggtgtcgcaaaagaagtagcactcgaatgtacacttttctcagctaggaaatttacttctatagaaggggggtctcatagatggagcaatggtgagcatttggacaagggaggggaaggttcttattcctgacgcaggtagcgcctactgctgtgtggttcccttattggacagcgttagacctcacaatctaaatccgattggcCtttttttttttttgagatggagtcttgctgtgtcgcccagactggagtacagtggtgcgatcttggctc
actgcaagctctgcctcctgggttcatgccattcttctgccttagcctcctgagtagttgagactacaagtgtatgccatcatgtgcggctaatttttgtgtttttggtagaaagagatttcaccacgttggccaggatggtctcgatctcctgacctcgagatccacctgcctcggcctcccacagtgctgggattataggcatgagccactgcacctggccttaagtggttctttaaagtctgattcgttgtttctactttccctgatgagggtgggtgtcaaggagtgtggtattcttacataatgtctgatgtttggaatagcAttttttttttttttgaggcagagtctcactctgtcgcccatgctggagtgtagtggcaccatcttgtctcactgtaacctttgcctcccgggttcaaacgatcctcctgcctcagccttccacgtagctaggattacaggcgtccaccaccacggccggctagcttttatatttttagtagagacggggtttcaccatgttggccaggctgtacttgaacttctgacctcaatgatctgcccccctcagcctcccgaagtgctgggatacaggtgtgagccaccactccTCGCTCAAGTAATATGTTAAACTTATGCTTTCTTCTTTTCTTCTTTCttttttttttttttttttttggatggagtcttgttctgtctgcccaggcttgagggcatggcataactcggctcactgccctccgccgttccagtcatgcatatctgctgccttcagcctcctttagtacgggacacgaggccacctgccacccgtgcctggctatttttttatttttttttttttttttttttttttttttttttATCAGgacagagtctggctctgccgccaggctggagcttgcagtggcgtcagctcaacctgcaagctccgctccgcgggttcaacgccattctcatgcctcctcagcctccccgagtaattgggactacagcgcgcccgccaccgccccgctcagtttttgtattttttagcagagaggggttaccgtgtagccaggatgggtctcgattcctgacgcctcgtgatccgcccgtctcggctcccaagctgggattacaggcttgagccacgcgccccggcccggcatttttttcatttttagtaagaaacagggtttcaccgtgtttagccaggattggtgtcgatttcctgacccgtgatccgcccccctcggcctcccaaagtgctggattccaggcctgagcctgcaagccgggccTACTCTTTGGCTTTTAAAAGAATGGGCAACATTGCTTTTCTTTACTAACTTCTAATCTTTCCCTCTCTGACTCATCTCTCCTCCCACTTCTCTTGTTCTCCCTGTCAGTGTTCCTTTCCTAAGAGTTTTTCCCTGTCTATGATCTTTTTTTATAGGCTTTTTTCTAGTTTCTCTTTCTTTGTAATTGTGCGTTAATACTGGCCAATTGTTAGTGACAAATTCCTTGCCAAGAGATCCCTGACCCTAAACCAGCATATTCTGTCCATTCGTTTTAATCTGTACtttatttttcttgagatggagttccgctctgtcgccaggtgtggatggtgtagtggcacgttctcgctcactgtcaactcgccctccagggtcaacccgcaccatcctcgctgccttagcctccgagtacggggattgtacaagcgtccaccacccggcctggcgaggcgcttgatttttttatttcagtagagatgggggttttcatcgtgttagccagatggtcccccccatctcctggactcatgctccgcgcaccgccccttggcctccgcaagtgcgcgattaTGATCTCTCTCAAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNtctcctgcctcagccctctgagtagct
gggattacaggcattttttgtatttttagtacagatggggtttcaccattttggtcaggctggtggaggactcctgacctcaaatgacctccccgcctggcctcccgaaatgctgggattacatgcgtgatccaccacgcccagccATACAGttcttatgttaagacaggctctctgtcgcccaggctggagtgcagtggcgcgatcacagctcactgtttgcctcgacctttcaagctcaagctgtcctcctgcctgagccgcccgcgtagccaggactgcaggggcacagtgccatgcccggctaatttttttttttgtgatggcgttttgctcttgttgcccaggctggagtgcggtggcgcaatcttggctcactgcaacctcctccccctgggttcaagcaattctcctgcctcagcctcccaagtagctgggattacagtcatgtaccaccacgcccggttcattttgtatttttttttagtagacaagggatttctccatgtcagtcaggctagtcctaaactcctgacctcaggtgacccgcccacctcagcctcccaaactgctgggattacaggcgtgagccactgtgcctggtcCTGGCTAATAttttttttttttttgagacggagcctcgctctgtcacccagactaaagtacagcggcgcaatctcagctcactgcaagctccgcctcccgggttcatggcattctcctgcctcagcctcccaagtagctgggactacaggctcctgtcacctcgcccggctaattttttgtatttttgtagagacggggtttcacagtgttagccaggatggcctcaatctcctgacctcgtgatccgcccacctcggcctcccaaagtgctgggattataggcgtgagccaccgcgcccagctgtttttttgtaatgttagtagacatggactttccccttgttacccaggctgggctcaaacttctgaggtataagagatgctcccgccttgaccttgtgaagttctgggattacagacgtgagcccccatgcccagtcAGGGGTttgtttgttttggtttttgtttttgtttttgagacagagtctcactctgtcgcccatgctggagtgcagccgtgcaattttggctcgctgcaacctctgcctcccgggttgaagtgattctcctgcttcagcctcccacgtagctgagaccacaggtgtgccaccgcgcctggctgatttttgtatttttagtggagacggggtctcaccatattggccaggatggcctcaaactccctacctcaggtgatctgcccgcctcggcctcccaaaatactacgttacatgcatgagccaccgtccctggcTGTGGTCAGGCTTTTGAGTTTAGATCCATGAAAGTGTGGCCGCGTCCCTGCTCCCTGCAGGAGGGAGGCCTGTGGGACCTTCTGCTGTGGCTGTTTACAAGGCTTTGCTCCTGGTGCCTAAGGCTGGAACCTTCTCTCTGCAGGAGGAGATGAGCAATTACTACCTCAGAGTCACCCAGAACGCCTTCCTAAACCACACGAGGCAACGCAGCAACAAGTGAGGGAGCCCCTCGGGTCCTGGGCCCCCGGGTAGGGCTGTGCAGCCGTCGCCCTTGGTTCCCACAGAGGGACCTCAGAGGCCCTGGATCACAGTGCTGGGCAGCACCCGTGGCCTCAACGTGTCCACCTCGGATGTCCCCTAGGAATGTCCCAGCTCGGGACAGCATGGGGCGTCACTGAGGAACATGCGGGGGCCTCCTGGGCAGAGCCGGGGTCAGTCCCGTCCTCACGGCCCTGTGCGATGCCGCCCCAGCTTGCACGTCCCTCTGCCCCTGGGTTTCCGCGGTCCTGTGCCAGCAAGGGAGGCGGTCTGATTGTCTGAGGCTCTGCTGGGGCCTCCATTGCAGGCTGTGGGTGCCCTGGGGTGGGAGATGGAGACACTTTTGCTCCCACGGGAAGCTGGGCACGAGCAGGTCCTGTG
TGTTTGGGCGGAGCCTGGGGCCTTGGCCCCCCCGCCCAGATGCTGGACAGGGTTGCTCCCTCCAGGCCTGGGGCCCTCCTCACATTGCGCGTCCTCCGTGAGCTGCTACCCAGAGGTCCCCAGTAGGTGGATAGCCCCATGGCCAGGCTCCCTAGCCCCTTTCAAATCCCCTTATTTTGAGTTTTCTTGGTCTCCTGGGCCCCTCCAGCCCCAGTCACGTGTCACACGGAGAATCAAGTCCTGCCGGTCGGCCGTGGCCGAGTCTTCAGGCGTGTTGGGCTCGCTGGCTCAGCTGCTGCCGGTAGACGCTCCCTGGAGCCCTGGCTCAGGTCCTTCCCAGAGAGGCAGGGCTGGGGCCCTGGTGAGCCTCCGCTGCACCCGGGCCCCCAAGGTCCTGCTCCTGGCTCGTGTGGCCACTCTTGGCATGGACTCTGGGTCCCGCATCCCTGCTCCCAGCACAGCAGGGCTCAGGCAGCAGGAGGAGTGGTGGTCCCGACGCTGCCTATCACGCTGGGTGAGGGTCAGCGGGGAAGCGCCACACGGGATGAGAACAGAGGCCCAGGTAGCCGGGCGGGGGGACAGCTGGGCGTGGTGGGGCCGGCGGTGACCAGGGGGACAGCTGGGCGTGGTGGGGCCGGCGGTGACCAAGGCTGTGCCACGTCCTCCCGATGTTTCCTGTGCTCACAAGCTGCCGCTTTAGATTCTCCGGGAAAGTCCCCCTGAAGGGACTAAGGAGCCCGCGTTCCCCTCGGGACAGCTTGGCCGGCAGCCCCAGCATTTCCTTCCCCATCCCTGCTCCGCAGATTCATGCTGGTCCTGGCCAGCCGCGACCCCAAGCAGTTACACCAGGACATCCACGACCGCATCGACGTGATGTTTTACTTCGACCCGCCCGGGCCAGAGGAGCGGGAGCGCCTGCTGAGAATGTATCTTGACAAGTATGTTCTTATGCCGGCAACAGAAGGAAAGCAGTAAGTGTCTCCCCTCACCCACCCCTGTCCAGGGACCCTCGCTCTGGGCCCACCCCCGGCCCTGCTCTCCGGACGCACACAGCAGGCCCAGTCTCCGGGGTGGCACCGCCTCCCTGCTTTGCGGTTTCGCACAGGAGCCCTGTGGGCCCCAAGGGTCCCAGAGGCTGCACCCAGGGATGTGCCACCACCCTTTCCTCATCCCCACCTGAGAACAGCCTGGTGGTGTCTCCTCGGGTTTGGGGGGCAGAGCCCACCATCACTTACAAACCTTCAACtttttgtttttgagacaaagtcttgctctgtgccccaggctggagtgcagtggcacgatctcagctgactgcaacctccgcctcctgggttcacgcgattctcctgcctcagcctcctgagtagctgcgattataggtgcctgccaccacgccccactgcttttcgcctttttgtagagatgcagtttcaccatgttggccagggtggtctcgaaaccctgacctcgggtgatctgcccgccttggcctcctacagtgcagggattacagatgccagccactgtgcccgaccACCCTCAGGCCCTGGCAGTGCAGGGAGGTGACGTGGAGTGTTGCTCTGAGACCCCCATGTTGGGATTTGAGGGAGACGCTCCTCATGAGAGCCCCGTGTTGGGACTGGAGAGGATCCTCACGGTCCCCTGCTGGCTGCTGGCCTGGCCTTCCTCCAGCTGCCACGCCTGGCCCTGGAGCCTCATGGTGTGGGGCGCGGCTCCGGCTGCACTTGTGCCCTGAGGCCTTCAGGCTCCCTGTCGCTGGCGGTGGGTGCAGCAGGCACGGCGGGCAGAGCCCTCCAGGTGATGACAGGCCCTGGGGCTGCACGCCGGCTGCCTCAGGAACACTCCAGATGAGCAGTGGCTGCTCCACCTCTTGGCGTCCCCAGGTCCCAGGTTTCTGAGTCCTTCTGTCCACCTGACCTAAATTCCTGCTCTCTCCAGTGACAGCAAAAGCCGCTCTGTTCCAGAGAGAGCCTGGT
TCCCCCTGCCAACCGCTCCGTGGCTGCCTGCTTCATGCTAGCCCAGCTGTCCCGGCCTCAGTTTCCCTTTGGCCCTCCCCTGCCCTGGGCTCTCCCACTCCCACGGCTGCTCATAGACCTGGCACAGTGACTTGGCTTCTATGACCTCCAGGGAGATGCTTTTGCTGGAATTCAGGGCTCTGCCACTGCCACTGTAACGGCCATGAGCCCTGTGGGTGCTGAGTGGGCAGGTGAGGGCAGGGCTGGTGTGAAGAGGGGGTGCGGCCATCTCCAGGCCCCACAGCAGCCACCACCTCCCTGCTCAGCCCAGACCTGGTTTGCATCAGGGAGAGGGCGGAGTTTGGCTGTCACAGGAAGAGTCCCTCCCAAGGGGGCATCTGGCATGGGTGCCCGCCTGGCTGCCTGTCTTCCAGCCCCCACCTCGTGGTGTGGGAGCCGCTGCCTTGGCCGGCCCACTTGGGAACTCCTTCCCCAGGCGCCTGAAGCTGGCCCAGTTTGACTATGGGAGGAAGTGCGAGGAGATCGCTGAGCTGACGAACGGCATGTCGGCCCGGGAGATCGCACAGCTGGCTCAGTCCTGGCAGGTGAGTGGGGCTCGGGCGCACCCACCCAGACAGGAGCCCAACTCCTGTGGAGACGCCGGGTTGCGCCTGTCCCAGCACCAGTGTCACACCGCAGCTTCTGTTGAGGGGTTTTCAGTGCACAGACGTGACACGGGGCACTCGCCCCAGTCGGCCACTCCACACACTGGCGCGCCCCTGCTCCTGCCCTGGGAAGTGTGGGGCATGTCCGTGGCTGACGGTCATAGGTCAGGAAGCCCGTCCGGCATCCTAGTATCCGGGCTCTGCCAGGTGGGGCGGGAGGCTTTCGATGCTCACCTTGGCAGACGGGCACCCCCTGGTGTGAATGGTCATCGGGACAGGCCCCGCCTGAGTTTGGTGGTGGGGCTGGAGGGATGTTGTGTTTCCCGGACCACGTCCGTTGGCTTGATCCTGCTTGACGGGCTCAGACACAGGGGCAGGAGTGACCTCTGATTGTCCCACAGCCGGCTGCTCCTTGGAGGACCCCCTCCTGCAGCTCCGTGGCTGCTGCAGGGACGGGGAGCCGGGACTCAGAGCAGTGTGGGCGTGGCCATCCAGAAAGCTTTGGTCTTTGGGGGTTGCTGGAAAAGCATAACCAGGTCTGTAGAAGGCACCAAGGCCATGCACAGGCATTGCTGCCTCTGGGGTCTGCAGAGTCTGTGACAACCTGGTCACTCAACCTAGCAGCGCTTTCGCGTGTGACAGGTTCATGAAGTAGCCAGTTACCTTGATTTGAACGTTGGAGCTGGGGACTATATGGGCTGTATTAGTCAGTTATGCCGCTGTGACAAAGAGCCTCAGATCTCAAACCCCATCCTTGTGGGTCAGCTGAGGTCTGTGTTCCAGGCCGTCTCCACTTGAGACCAGGTCTGTTTCCACAACTAAGCAAACAGAgaccgggccatggtgttgggctacatttgttcccagcatttgggaggtcgaagtcagcccagattatttgaaggcaggagtcaggaccagccttggggggggggggggggggggggggggggggaaagcaaggggagactccatctacaaaaaataaaaaaattagccggaccctaatgtggcacgcctgtaatgcagctcctgggagcctgaggtgggatgatcactgagtcccaggtaggccagaaatacagtgagcctgtggattgtgccactgcactccagcccgggttacagagcgagaccctggtctttaaaaataagaataaTTTGAgccgggcatggtggctcacgcctgtaatcccagcacgctgggaggccaaggggagaggatcacttgaggccaggagttcgagaccagcctggccaacatgtcgagccccacctctactaaaaatacaagaattggccgggcgcagtggtggtgcatgcctgtattctcagctactcaggaggctgaggcag
gagaatcgcttgaacccgggaggtggaggttgcagtgagctgagatggtgccattgaattccagcctggactattcaggatcctttgagattccataagaattttaggagtggttttcctatttttgtaaaacataatttgggttttcacagggaccgcgtttagtctctatgtcgctttgatgtctctcagcaatattCTGTGGttttctcttgttttcgagacggagtctcgctctgctgcccaggctggagtgcagtgttgtgatctcagctcactgcaacgttcccctcccgggttaaagtgattctcctgactcagcctcctgaggagctggaattccaggcaggcgccaccatgcccggctaatttttgtactaagagacggggttttgccatgttggccaggctggtctcgaacctctgacctcaggcaatccacccacctcagcctcctaaagtgctgagattaaaggcacgtgccaccacgcccggctaatttttgtatttttagtagagacgatgattcaccatgccggcgaggttggtcttgaactcctgacatgaggtaatccatctgcctctgcctcccaaagggctgggattcagacatgggccactgcgcccagccagttttcactgtacaagtctttcaccctcttggttaagtgaatttccaagcattttattcttgccgctgctgttgtaaatggaaacggtttcataattccccattcacattattcactgttgggatggagaactgcagctttctttgctgttgattttgtatcctgtaagtttgctgatgtcacggcattttttcttccaatatggattctaggattttctacatataagattatgtcatctgagaacaggtgatttttacctttcccttttcagtttggatgacttttctttttcttgtctaattgcactgtccagagcttccagtggtgtgtggaatagaagcggtaaagcattcttgcctggttccttacctcagaggaaaagctttgtttttcaccactgagtatgtcacctatgggcttgtgatgtgtggccttcattgtgtttagggtgtatccttcaattcttggtttggtgagtgtttttatcataaaagtgtgaggcgggtggatcacctgaggtcggcagttcgaggccagcctgaccaacgtgaagaaaccccatctctcctacaaatacaaacttagttgggcatggtggtgcatgcccgtaatcccagctactcgggaagctgagacaggagaatcgcatgaaggcggcaggcagaggttccagtgagccgagatcgcgccatttgcactccagcctgggcaagaagagcaaaattgtctccaaaaaaaaaaaaGTggccaggcacggtgactcacgcctgtaatcccagcactttgggaggccaaggtgggtggatcacgaggtcaggagatcgataccatcctggctaacacagtgaaaccctgtttctactataaatataaaacatcagctgggcatggtggcaggtgcctgtagtcccagctacctgggaggctggggcaaaagaatggcgtgaacccaggaagcggagcatgcagtgagctgagatgcctgggctacagagtgaggccccaactcaaaaaaaaaaaaaggtgttgtatttggtcgaatactttttctgcaacacttgagacagtcgtgtggtttccttcctccaccctgctaatatcgattgatttttgtatgttgaacatttcatatgcggaacattgattttcatatgttgaactatcgttgcattccaggaataaatcctgcttggtcggctgggcgcggtggctcaagcctgtaatcccagcactttgggaggccgagatgggcggatcacaaggtcaggagatcgagaccatcctgtctaacctggtgaaaccccgtctctactaaaaaatacaaaaaactagccgggcgaggtggcgggcgcctgtagtcccagc
tactcaggaggctgaggcaggagaatggcgtgaacccgaaaggcggagcttgcagtgagctgagatgcggccactgcactccagcctgggtgacagagcgagactccgtctcaaaaaaaaaaaaaaaaaaaaaaaaaatcctgcttggtcagggtatagagtccttttagtgtgctgctgaattcactctgctggcattttgttgaggactttcccagtgatgctcatcagggatattggcctgtcatttttcttgtggtgtctttgtctgggtttgatatcagggtaatgctggcctcctaggatgagtgaggaaatgttcttcaatttgtccaagagtttgaggtgtgctgctgattcttcttaatgttttgtgaattgacacgtgaagacatcaggtccaggtcttgtgtttCaacttttacagcttgaagactttaggttcccagaaaaattgcaaaggtagcacagagagctcccgGGCCCGGGGCCTTGCCACGTAGTGAACGTCATGTGTCACTGTTGGCCCCACCTGGGACTGGGTCTTGCCCAGAATCCCACCCAGGAGGCCACGTGACATTTAGCTGTCACTTCTGGTGGGCTCTGCCAGGTCCCGTGCTTCCTGGTGGGGTGGCCCCATGAGCATCTGCTCATCCCCTTTCCTCCACTGGGCCCTGGGTGAGGTGCAGCCACTCGGGTGCACCCTGAGGGTTCCTGCACCTGTTTGAACTCTCTTGGGTCGGCTCAAGACCAAAAATGATGCTGAGCAGTCCTGGGCCTCTGATGCATAGTGGTGGTCCGGTTCCGGTCAGCGTCTCCTGCACTCCTGGGCCCCTGAGCCACAGTGGCGGTCCAGCTCCAGTCAGTGTCTCCCCACACAGTGGCTCTTGGCGAGGGGTGGGCGCTGTCAGTGGGGACGGGCACCACGTGGTCATCCCCATGGCAGGTCCCATCGTGGCAGCCGTGTTGTGGGAGGATGGTGCGCTGCTGCCCCTTTACCCTGTGAGATGAATCCTGCCTCTGGGAGGCACAGCCGGGATGGGGTGAGGGACCCCCTCAGCTGTCCGGGAAGCGTCCCCTGCCCTGTGCTTCCTCCAGGCGTCCTGGTGCACTCCCAAGCACGGTGCCCAGTGGGGGTGCCCAAACCTTCACCCTGACCCATGGGTGACTTCCCTTGGGGACTCCACGCCTTTCACTGGGACTGGGATGGAGAGCGACCTGTCCATGGCAGAAGGGCTGCACCTGAGGTGCTTGAAGCAACACCAAGGGCCACAGTCCCAGCAGCTCCAGCCTCCGCATGCTGGATGCCAAGTCCTGTGCCCAGGACAGGGAGGTGGAGGCACGGGTGATCTTGATGCTAGCACCTATGTGCCCCGAGGTTGGGCAGTGGCTGCCTCTGCTGTGGAGGCCTATGAAGGTGAGGGTCTGAGGATCTGTAGTGCACTGTGACCCGGGGGCACTGCCTGGCCACGGCTGAGACACGCAGAGGGTCTGCAATTCCCTCCTGCCTCTTGGGAGCTGCCCTGGGTCTGCAGTCAGTGGGGCTCGTCCTCGGGCTTTCCGTTATTAGAAAGTCACTGAGAAACTGCAGTGCTGAGGACGCAGGCAGGGCTGTGGCACTGCAGGGGCCGCTCCCGGTGTCCACACGCATGCTGGGCTCTGCCGAGGTGCCGGAAGCCTGTGTTTCACCCTGAGGCCGTCCTGGTGCCCCGGGTTTGGACCCTCCCCACCTCGGGGTCCTGGAGTGCGTTACGGGTGGGGGGTTCCCATGGTGGCCTCCCTCAGCTCCCTCTCTCCTCACTAGGACACGGCGTATGCCTCCGAGGATGGGGTCCTCACCGAGGCCATGTTGGATGCCCATGTTGAAGACTTTGTCGAGCAGCACCAGAAGAAAATGCGCTGGCTGAAGAGGGAGGGCCTGTCCTCATGGACCAGCACCCCTTAACCTGAGTCCGCGGTGAGACCACACGTCACGGAGCCTGGCTGCGGACCCCTCCCAC
CCCTGCTTTTCCGGTCCCTGCACGTTTAGGAAATGCTTCCCCTAATAAACTCCCACAGGTGCCACAGCGCTGTGTCTATTGGCTGATGTGGTGCGGGGTTTGGGGTCCCCTAGTGTCCTTCTGGGGTCAAAGGTGATAGAAAAGACAGGCTGGAGCTTTCTGGAGAATTTAGGCACAGAAGGGTGGGCTTCACATGAGGTGCCTGCCACAGCGGGGTTGGCTGCCTGAATGCCACCCGGGACCGGCTGCTCGCGCTCCATCCTGCAGCTGTGGAGACGGGGGTGCCCCTTTGCCTCTCTCCACGAAGTGCAGGGCAAACAAGACACAGCGGTTTCAAACAGGCGATGGCCCGGACTGCGTGCCTCGCCGCCCCTGCGCCTTCCCCTGCCCCTGCTTTCCAGCTAGTCCCTGAAAACCTTGATGGggccgggcgcggtggcccatgatggattctcagcactttgtgaggccaaggcgggtggatcacctgaggttaagtgttccagcccagcctggccaacatggtgaaaccccatctctcctaaaaaaaaaaaagaaaagaaaaagaaaaattagccgagcgtcgtggcaggtgtctgaaatctcaggcactcaggaggctgaggcaggagaatcacttgaccccgggaagtggaggttgcagtaagctgagaccatgccattgcagtgcagcctggacaacaagagtcaaactctctcaaaaaaaaaaaaaGgccaggtcaggtggcatgtgcctgtggtcccagcttggtcccagattcttggtttggaggctgaggtaggaggatcacttgagcatgggaggatgaggttgcagtgagccaagatcgcttcagacactccagcctgggtgacagagtgagaccctgtctctaaataatcaaaaCCTTGATTACAGCCATGGGGTGGGGGTTGGGGGGCGTCTGGCTCGGCAGGGAACTATTGGGTTTTTCTGCTCTCtaatttttgtagagacagggtttctctttgttgcccaggctggtctccaactcctgggtcaagcgtcgatcttctgcctcggcctcccaagtggtgaggttacaggcgtgccaccgcacctgaccTGttttctttttttttttttttttttttttgagacggagtcagctctgtcacccagggctggagtgcagtgggcggtctcagctcactgcaagctccgcctcccgggttcacggccattctcctgcctcagcctcccgagtagctgggactacaggtgcgtgccacaacgcccggctaagtttttgtatttttagtagagacagggtttcactgtgttagccagggtggtctcaatctcctgaccttgggatccgcccgtctcggcctcccaaagtgctgggattacaggcttgagccaccgcccccggccCCttttttttttttttttttggcaagggagtcttgctcgcccagggtggagtgcagtgttgcaatctgggctcactgcaacctccacgtccagggtgtcaggcctctgagcccacgctaagccatcatatccccagtgacctgcatgtgtacatctgatggcctgaagcccctgaagatccgcagaagtgaaaacagtcttaactgatgacattccagccttgtgatttgttcctgccccaccctacctgatcaatgtactttgtaatgtcccccacccttaagaaggttctttgtaattctccccaccctggagaatgtactttgtgagatccacccccagcccccaaaatattgctcctaactccactgcctatcccaaaacctctcagaactaacggtaatcccagcaccctttgctgactctttttggactcagctggcctgcacccgggtgaagtaaacagccttgtggttcacacaaaacctgtttcgtggtgtcttcacacggacacgcgtgacacagggttcgaggaaatttcatgcctgaacctccggagtagctgggattacaggcgaacggc
accatgcccaggttaatttttgtattttcggcagagacagaggcccaggtagccgggctggGGGACAGCTGGGTGTGGTGGGGCCGGCGGTGACCAGGGCTGTGCCGCGTCCTCCCGGTGTTTTCTGTGCCCACCAGCTGCCGCTTTAGATTCTCCGGGATAGTCTCCCTGAGGGGGCTGAGGAGCCTGTGTTCCCCTCGGGGCAGCTTGGCCGGCAGCCCCAACATTTCCTTCCTCATCCCTCCTCCGCAGATTCATGCTGGTCCTGGCCAGCCGCCACCCCGAGCAGTTGGACTGGGGCATCCATGACTGCATCGATGTGACGGTCCACTGCGACCTGCCACGGCAGGAGGGGCGGCAGCGCCAGGTGAGAATGTATTTTGACAAGTATGTTCTTAAGCCGGCCACAGAAGGGAAACAGTAAGTGTCCCGCCTCACCCGCCCCTGTCCAGGGACCCTCGCTCAGGGCCCACCCCGCCCCTGCTCTCCAGACGCACCCAGCAGGCCCAGTCTCCAGGGTGGGCACCACCTCCGTGCCCTGAGGTTTTGTGCGGGAGCCCTGTGGGCCCCGAGGGTCCCAGAGGCCGCATCCAGGAGGTCACGCCCCCTTTTCCTCATCCCCATCTGAGAACAGCCTGGTGGCGTCTCCTCAGGTTTGGGGGCAAAGTCCACCATCACTTAGAAACTTTCAGCAttccttttttttttttttcttaagacggactcttgctctgtcatccaggctggagtgcagtagcttgacctcggctcactgcaagctctgtctcccaggttcacgccgttctcctgcctcagcctcccaagtagctgggacaacaggcacccgacaccacgcccggctaatttttttgtgtttttttagtagagatgggtttgaccgtattagccaggatggtctcgatctcctgacctcgtgatccacctgcctcggcctcccaaagtggtgggattacaggtgtgagccaccgcatctgacctttttttgaggaagtctcactcttgtccccctggctggagtgcagtgccgggatctcagttcactgcaacctgtgcctcagcctcctgagtagttgggattataggtgcccgccaccgcgcctggctggtttttgtgtttttgtagagatggaatctaactccgtctcccaggctggagtacagtggtgtgatctcagcttactgcaacctccaccctccgggttcaaaccatcctcttgcctgagcctcctgaacagctgcgattacaggcgcccagcacaatgctcgcctcatttttttgtctttttagtagaaacagcttttcaccaaattgaccagactggtcttggacttctgatctcaagtgattcaccctcctcggcctccaaagtgcagggattgcagatgtgagccaccggacccggcctcttttatgttcctcttcagtaCTCAGAGGGCTGTGAGGAAATCCGGTGCCCGGCCACCCCCAGGCCCTGGCAGTGAGGGGAGGTGATGTGGAGTGTTACTCTGAGATTCCCATGTTTGGATTCGAGGGAGACGCTCATCATGAGACCCCTCCGTGTCGGGATTAGAGGGAGAGGCTCCTCATGGTCCCCTGCTGGCTGCTGGCCTGGCCTTCCTCCAGCTGCCACGCCCGGCCCTGGAGCCTCCTGGTGTGGGGCGCGGATCCGGCTGCACTTGTGCCTTGAGGCTCTCAGGCTCCCTGTCGCTGGCGGTGGGTGCAGCAGGCACGGCGGGCAGAGCCCTCCAGGTGATGAGAGCCCCCAGGAACACTCCAGATGAGCAGAGGCTGTTCCACCTCTTGGCGTCCCCAGGTCCCCGGTCTGAGTCCTTCTGTGCACCTGACCTAAATTCCTGCTGTCTCCTGTGACAACAAAAGCCACTCTGTTCCAGAGAGAGCCTGGTTCTCCCGTTGACCCCTCCGCTGCCGCCTGCTCCATGCTAGCCCAGCCGTCCAGGCCTCAGTTTCCCTTTGGCTC
TCCCCTGCCCCGGTtcccagctgcttgggaggctgaggtaggaggatcatttgagtccaggagcttgaggttgcactgagctgtgactgtgccactgtactccagccttggcaacagagtgagacactgtcttaaaaaagaagaaTTTGggccagatgctgtgtttcatgcctgttcccagcatgctgggaggctgaggagagaagatcactcgaggccaggggttccagaccagcctgccaacatgttgaaccccgcctctacgaaaaatacaaaaattagccgggcgtggtgggtgggtgggtgccagtaatcccagctactcaggaggctgaggcagcaaaatctcttgaacctgggaggtggagattgtggtgagctgagatagtgccgctgtacttcaacctgagcaacagagtgagactccttatcaaaataaagaaaTCAATCAATCAATAAAAATAATCACAATAATTTGggctgggcgtggtggctcactcctgtaatcccagcactttgggaggcgtggatcggttgagttcgaggcaagcctggccaatgtggcgaaaccccatctccactacaaatacaaaaattagccaggtgtggtgacaggcacctgtaatcccagctgctcgggaggctgagacaggagaatctctggaacctaggaggcggaggttgcagtgagccaagatcacgtcagtgcgctccagcctgggtgacagagactgtctcaaaaaagaataataataaTTTgactgggtgtggcggctcactcttgtcatcccacactttgggaggccgaggcaggaggattgcttcagctcaggatttcgagactggcctggacaactggcctggacaacatggtgaaactccatctctacaaaaaatacaaaaattagccaggcatggtatcatgtgcctgtgatctcagctactcaggaagcagagatgggagcattgctggagcctgggagttggaggctgcaatgaaccatgttcgtgccactgcactccagtgtgggtgacagagtgagaccctgtctccaaaaggcatggtggctcacgcctgtaatccctgcactttgggaggccaagctgggtggatcacctgaggtcaagagttggagaccagcctggctaacgtggtgaaaccccatctctaggaaaaatagaaaaaATTggccaggtgcagtggctcacacctgtaatcccggcactttgggaggccgaggcgggcgaatgacctgagatcaggaattccagaccaaccacaccaatatggagaatccccgtctctactcaaaatacaaaatcagccgggcatggtagcaatcccagttactcaggaggccgaggcaggagaatcactggaggtgagccgagaccacgccattgcactgaagcctgagcaacgagagggaaactgtctcaaaaaataaTGCTAATAACAAGGGGGAGAGAACAGGAGTGTGGTCAGCAGCTGGGCCTGCCATAACCCCTGGGTCGTGTGTCCCCACAGCTCTGAAGGCTAGAGGCCCGAGGTCAGGGTGCCAGCTCGGTCCCCCCCGTGGAGTGTTCTCTGTTAGCTTCTCACATGGCAGGGAGAGTGACTGAGCTCTCGCTCTGGTGTCCCTTACGAGGACGTTCATCCCCCACTGCTCAGAGCGGCGGTGAGCCACCACGCCCAGCGCCAACTTTGTCCTTCAAGAGTTGTTTTTTTGTgccgggctcagtggctcatgcctggaatcccagcactttgaaatgccaaggtgggtggagcacctgaggtcaggagtttgactccagcctggtctaaatggtgaaaacctgcctctactaaacataaaaaaatcagctgggcatgttggtgtgtgcctgtaatcccagccactcgggaggctgaggcaggagaatcacttgaacccaagaggtggaggttgcagtgaactgagatcatgtcactgcactgcagcctggatgacaagagt
gagactcccttgcaagaaaaaacaaaaattaaaaaagaaGTTGTTGTcttttttttttttttttcccttggacaattcaagatgcctagagattccatatcaattttagtaatgcttcttctatattttaaaaagtaatttgggtttttacagggattgcattcagtctctgtattgccttAATGACTCTTAGCAATGttgttttttttatttattattttttttctagagatggagtctcactctgtcagccaggctggagtttagttgttggccaggatgggcccaatctaatgacgtcaggtgatccgcctgcccctggctcccaaattgctgggattcagacgtgggccaccatgcccagccagtttacattgtacatttctttcaccttcttggttcagtgaagctccaagtattttattctttcggatgctcttgtaaatggaaatggtttcgtcattccccgttcagattatacacttactatgaagaactgcagctttctttgctgttgattttgtatcctgtaactttgctgatgtcgtggggttgttttttccaatatggattctagattttcCTTTTCTTTTTCTtttttttgtttttttgttttttttttttttgatatggggtctccctctgtggcccaagctggagtggaatgcagcggcacgatcttgaatctgcgagctcctctgcccgggtccacgccattctcctgcctcagcctcctgagtagctgagactacaggtgcctgccatcacggccggctaattttgtgtattttttgtgcagatgaggtttcaccgtgttagccaggatggtctcgatctcctaactttgtgatcggcccgcctcggcctcccaatgctgAATGCTGTTGGGACTGGGTCTTGCCCCAGAATCCCACCCAGGAGGCCACCTGACGTTTAGCTGTGACTTCTGGTGGGCTCTGCCAGGTCCCATGCTTCCTGGTGGGGTGGCCCCGTGAACGTCTTCTCAGGCCCTTTCCTCCATTGGGCCCTGGGTGAGGTGCAGCCACTCGGGGGCACCCTGAGGGTTCCTGCACCTGTTTGAAGTCTCTTCGGTCGGCTTGAGACCAAAAATGATGTTTAGCAGCCCTGGCCCCCTGACGCACAGTGGCGGTCCTTCTCCGGTCAGTGTCCCCTGCACCCTTGGGCTCCTGACGCACAGTGGCGGTCCAGCTCCAGTCAGTGTCTCCCCACACAGTGGCTCTTGGCGAGGTGTGGGCGCTGCCAGAGGGGACGGGCACCACGTGGTCATCCCCATGGCAGGTCTGGTCGTGGCGGCCGTGTTGTGGGAGGATGGTGTGCTGCTGCCTCTGCACCCTGTGAGATGAATCCTGCCTCTGGGAGGCACAGCTGGGATGGGGTGAGGGACCCCCTCAGCTGTCCGGGAAGCGTCCCCTACCCTGTGCTTCCTCCAGGCGTCCTGGTGCACTCCCGAGCTCGGTGCCCTGTGGGCGTCCCCATGCCCAGACCCTGACCCACAGGTGCCTCCCCTTGGGGTCTCCACGCCTTTCCCTGGCCCTGGGATGCAGAGTGACCTGTCCATGGTAGAAGGGCTGGACCTGAGGTGCCTGAGACAGCACCAAGGGCACTGGTCCCAGCAGCTCCAGCCTCTGTGTGCTGGATGCCACACAGACACAAGACTCTTGGGAGACGCATTTTCCATCTGGCTCAGAGGGGGAGGGGGAGGCTTTGCAACCCAGCCCCTGCCCAGGCCCCTGGGAGGGTGGGTGCCTGCTGAGCCCCCGGGGCAGCAGGAGCGGGGCAGGCGGGGTCTTTGTTCTCACTCCCACAGCAGAGGCAGATGTGGGGGCGCCTGCTGGGGCCAGACCAAGGTGGGGTGGCCTGGAGACTGCTTCCAACCGTGGCCGGGAAGCAGGGAACCTGCCCGGCGTGTCTGAGGCCACACTCTCAGCTGGCCGGTCCAAGCCTGCGGCTGGAGCTGGTGTCTGTTTAGCTAATAAAGT
CCCACAGTTGCCTCACTGCCGTGTCTATTTGCTGATGCTGCGCGGGGTTTCAGGGGCCGCCTAGCCTCCTCCTGGGGTCAAAGGTGACAGAAGAGGCAGAGGCGGGAGCTTTCTGGAGAATTTACTGACCACAGCGTGGTGCACTTGACATCAGGTGCCCGCCATGGCCGGGCCGTGGTCTGAAGGCTGCCCGGGACCAGCTGCCTGCGCTCCAGCAGCCCCTCCCTCCTGAAGGCCGGGCCCCCGAGAAGAACGAGGCTGCAGAGTGATGTGGGGGCCAGCGGTGACTTCCTACCACACTGTTCTCAGGTGTAAGAGGCCGCTTCTGCCCAGGCATTGTCCATGGAAGACACACAGCCGGCCACTGCAGCCTCGGTTCTGGGATGCCCTGCGGCTGGGTCACAGGGGGCCACGGGCCACGCTGGGAGGCCACAGTCCTGTCGTGCCACGCAGCTCCCTGTCCCCAGAGGTCTGCTCAGATGCAGAGATCAGAAACCACACTCGCTGCCTGAATTCTGGGAGCAGAGCCCGGTACCCACTGCCTGGCCGGGGCCTACCCTGGG

bedtools merge

As we said, bedtools has a ton of features, and we could write a whole workshop about it. I want to give one more example before we move on. Something else we might want to do with the regions in a bed file is merge those that overlap or lie within some distance of each other. For instance, we may suspect that the method we used to call SVs is slightly inaccurate and is calling the same polymorphism as separate mutations in different individuals, so we want to merge overlapping events.

For this we can use bedtools merge. There is one catch, however.

Run the code block below to see what happens when we run bedtools merge on the bed file with macaque SVs:


bedtools merge -i data2/macaque-svs-filtered.bed
# bedtools: A suite of programs to process bed files
# merge   : The sub-program of bedtools to execute
# -i      : The input bed file

The input bed file must be sorted! There are a couple of ways we could do this. If you look at the documentation for bedtools merge, they suggest using the native Unix sort command. However, bedtools itself also has a sort command. Let’s try that.

Run the code block below to sort the bed file with macaque SVs and then merge overlapping SV calls:


bedtools sort -i data2/macaque-svs-filtered.bed | bedtools merge > macaque-svs-filtered.sorted.merged.bed
# bedtools: A suite of programs to process bed files
# sort: The sub-program of bedtools to execute
# -i: The input bed file
# | : The Unix pipe operator to pass output from one command as input to another command
# bedtools: A suite of programs to process bed files
# merge: The sub-program of bedtools to execute
# > : The Unix redirect operator to write the output of the command to the following file

wc -l data2/macaque-svs-filtered.bed
wc -l macaque-svs-filtered.sorted.merged.bed
# Use wc -l to count the number of un-merged SVs in the original file and the number after merging
## 3646 data2/macaque-svs-filtered.bed
## 3372 macaque-svs-filtered.sorted.merged.bed

So we merged a few hundred calls. Note that because bedtools merge only requires one input file, we can default back to the standard Unix piping procedure without having to use the - shortcut (though we still could specify -i -).
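To make it clearer why the input has to be sorted, here is a minimal sketch of the merging idea using only sort and awk on a small made-up bed file (the file name toy.bed and its coordinates are invented for illustration). This is not how bedtools is implemented, just the general logic: once intervals are sorted by chromosome and start, each one only needs to be compared against the interval currently being built.

```shell
# Hypothetical toy data: four intervals, deliberately out of order
tmpdir=$(mktemp -d)
printf 'chr1\t500\t700\nchr1\t100\t300\nchr2\t50\t80\nchr1\t250\t400\n' > "$tmpdir/toy.bed"

# Sort by chromosome, then numerically by start, then merge in a single pass
merged=$(sort -k1,1 -k2,2n "$tmpdir/toy.bed" | awk 'BEGIN{OFS="\t"}
  # First line: start building an interval
  NR==1 {chrom=$1; start=$2+0; end=$3+0; next}
  # Same chromosome and starts at or before the current end: extend the interval
  $1==chrom && $2+0<=end {if($3+0>end){end=$3+0}; next}
  # Otherwise: print the finished interval and start a new one
  {print chrom, start, end; chrom=$1; start=$2+0; end=$3+0}
  END{if(NR>0){print chrom, start, end}}')

printf '%s\n' "$merged"
rm -r "$tmpdir"
```

In this toy example the two overlapping chr1 intervals (100-300 and 250-400) collapse into one, while the others are printed unchanged. If the input were not sorted, the single pass would miss overlaps between lines that are far apart in the file.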

Of course, in actuality we would only want to merge duplications with other duplications and deletions with other deletions.

Exercise: In the code block below, write a command that merges only duplications with other duplications. Save the result in a file called macaque-svs-filtered-dups.sorted.merged.bed. BONUS: Adjust the settings to merge any duplications within 1000bp of each other as well as directly overlapping (Hint: Check the help menu of bedtools merge!).


## Use the tools you've learned to merge only duplications with other duplications
# data2/macaque-svs-filtered.bed
grep "<DUP>" data2/macaque-svs-filtered.bed | bedtools sort | bedtools merge -d 1000 > macaque-svs-filtered-dups.sorted.merged.bed
# grep "<DUP>": Keep only the lines for duplications
# bedtools sort and bedtools merge with no -i: Read the piped input
# -d 1000: Also merge features that are up to 1000bp apart (the BONUS part)

grep -c "<DUP>" data2/macaque-svs-filtered.bed 
wc -l macaque-svs-filtered-dups.sorted.merged.bed
# Count the number of lines in the original file and the new file to confirm we merged some duplications
## 432
## 379 macaque-svs-filtered-dups.sorted.merged.bed

GFF

In the context of our macaque SVs, a natural question would be how many of the mutations affect genic regions, and may therefore affect some cellular function. To know this, we need another file that contains the regions of the macaque genome that contain genes. This information could easily be contained in a bed file, but genes are complex, structured regions of the genome: they have exons, introns, multiple transcripts, and may have other information associated with them that is difficult to encode in a bed file.

The format for encoding information about genic regions (commonly called a genome annotation) is the GFF format. GFF stands for General Feature Format. In GFF files, we refer to the regions in the file as features.

There is a related format, the GTF format, which stands for General Transfer Format, but it is very similar to GFF and slightly dated, so we will only talk about GFF files today.

GFF files are also tab delimited files, with each row in the file referencing a particular region in the genome and each column holding a piece of information about that feature. This probably sounds similar to the bed format, but GFF has more required columns. GFF files by definition have the following columns:

  1. Chromosome or assembly scaffold ID: The sequence name in the genome assembly file
  2. Annotation source: The name of the data source or program that annotated this feature
  3. Feature type: A categorical name for the type of feature defined in this row (e.g. “gene”, “transcript”, “exon”)
  4. Feature start coordinate: The start position of the feature defined in this row
  5. Feature end coordinate: The end position of the feature defined in this row
  6. Score: The score of the feature if quality is assessed during annotation, otherwise .
  7. Strand: Either + (forward strand) or - (reverse strand)
  8. Frame: For coding exons, this indicates the frame as either 0, 1, or 2
  9. Attribute: A semi-colon separated list of any other information related to the feature defined in this row
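As a small illustration of how these nine columns map onto awk's fields, the sketch below pulls a few values out of one made-up feature line (the ID, name, and coordinates are invented, not taken from the macaque annotation). Because GFF coordinates are 1-based and include both the start and end positions, the feature length is end - start + 1.

```shell
# Hypothetical single GFF line; the nine columns are tab-separated
gff_line=$(printf '1\tensembl\tgene\t1000\t1999\t.\t+\t.\tID=gene:EXAMPLE0001;Name=TOY1')

# -F'\t' splits on tabs only, so spaces inside column 9 do not create extra fields
# Prints: feature type, start, end, strand, length
summary=$(printf '%s\n' "$gff_line" | awk -F'\t' '{print $3, $4, $5, $7, $5-$4+1}')
printf '%s\n' "$summary"

# Column 9 is a semicolon-separated list of key=value pairs; split it and pull out the ID
feature_id=$(printf '%s\n' "$gff_line" | awk -F'\t' '{
  n=split($9, attrs, ";")
  for(i=1;i<=n;i++){ if(attrs[i] ~ /^ID=/){ sub(/^ID=/, "", attrs[i]); print attrs[i] } }
}')
printf '%s\n' "$feature_id"
```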

For more detailed information on GFF files, see the following links:

Let’s take a look at a GFF file and talk about it a bit.

Run the code block below to view the first few lines of a GFF file:


grep -v "biological_region" -m50 data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3
# grep: The Unix string search command
# -v: This option tells grep to print lines that DO NOT contain the following string
# "biological_region": The string to search for in the provided file - we just don't want to display these for this demonstration
# -m50: This option tells grep to only display the first 50 matches
## ##gff-version   3
## ##sequence-region   1 1 225584828
## ##sequence-region   10 1 92844088
## ##sequence-region   11 1 133663169
## ##sequence-region   12 1 125506784
## ##sequence-region   13 1 108979918
## ##sequence-region   14 1 127894412
## ##sequence-region   15 1 111343173
## ##sequence-region   16 1 77216781
## ##sequence-region   17 1 95684472
## ##sequence-region   18 1 70235451
## ##sequence-region   19 1 53671032
## ##sequence-region   2 1 204787373
## ##sequence-region   20 1 74971481
## ##sequence-region   3 1 185818997
## ##sequence-region   4 1 172585720
## ##sequence-region   5 1 190429646
## ##sequence-region   6 1 180051392
## ##sequence-region   7 1 169600520
## ##sequence-region   8 1 144306982
## ##sequence-region   9 1 129882849
## ##sequence-region   MT 1 16564
## ##sequence-region   X 1 149150640
## ##sequence-region   Y 1 11753682
## #!genome-build  Mmul_8.0.1
## #!genome-version Mmul_8.0.1
## #!genome-date 2015-11
## #!genome-build-accession NCBI:GCA_000772875.3
## #!genebuild-last-updated 2016-02
## 1    Mmul_8.0.1  chromosome  1   225584828   .   .   .   ID=chromosome:1;Alias=CM002977.3,NC_027893.1
## ###
## 1    ensembl gene    25432   42232   .   +   .   ID=gene:ENSMMUG00000005947;Name=SAMD11;biotype=protein_coding;description=sterile alpha motif domain containing 11 [Source:HGNC Symbol%3BAcc:HGNC:28706];gene_id=ENSMMUG00000005947;logic_name=ensembl;version=3
## 1    ensembl mRNA    25432   35202   .   +   .   ID=transcript:ENSMMUT00000015569;Parent=gene:ENSMMUG00000005947;Name=SAMD11-208;biotype=protein_coding;transcript_id=ENSMMUT00000015569;version=3
## 1    ensembl exon    25432   25503   .   +   .   Parent=transcript:ENSMMUT00000015569;Name=ENSMMUE00000311984;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=ENSMMUE00000311984;rank=1;version=2
## 1    ensembl CDS 25432   25503   .   +   0   ID=CDS:ENSMMUP00000014582;Parent=transcript:ENSMMUT00000015569;protein_id=ENSMMUP00000014582
## 1    ensembl exon    29573   29754   .   +   .   Parent=transcript:ENSMMUT00000015569;Name=ENSMMUE00000311983;constitutive=1;ensembl_end_phase=2;ensembl_phase=0;exon_id=ENSMMUE00000311983;rank=2;version=1
## 1    ensembl CDS 29573   29754   .   +   0   ID=CDS:ENSMMUP00000014582;Parent=transcript:ENSMMUT00000015569;protein_id=ENSMMUP00000014582
## 1    ensembl exon    30429   30479   .   +   .   Parent=transcript:ENSMMUT00000015569;Name=ENSMMUE00000311982;constitutive=0;ensembl_end_phase=2;ensembl_phase=2;exon_id=ENSMMUE00000311982;rank=3;version=1
## 1    ensembl CDS 30429   30479   .   +   1   ID=CDS:ENSMMUP00000014582;Parent=transcript:ENSMMUT00000015569;protein_id=ENSMMUP00000014582
## 1    ensembl exon    34224   34348   .   +   .   Parent=transcript:ENSMMUT00000015569;Name=ENSMMUE00000339755;constitutive=0;ensembl_end_phase=1;ensembl_phase=2;exon_id=ENSMMUE00000339755;rank=4;version=1
## 1    ensembl CDS 34224   34348   .   +   1   ID=CDS:ENSMMUP00000014582;Parent=transcript:ENSMMUT00000015569;protein_id=ENSMMUP00000014582
## 1    ensembl exon    35177   35202   .   +   .   Parent=transcript:ENSMMUT00000015569;Name=ENSMMUE00000394552;constitutive=0;ensembl_end_phase=0;ensembl_phase=1;exon_id=ENSMMUE00000394552;rank=5;version=1
## 1    ensembl CDS 35177   35202   .   +   2   ID=CDS:ENSMMUP00000014582;Parent=transcript:ENSMMUT00000015569;protein_id=ENSMMUP00000014582
## 1    ensembl mRNA    25432   40770   .   +   .   ID=transcript:ENSMMUT00000047681;Parent=gene:ENSMMUG00000005947;Name=SAMD11-207;biotype=protein_coding;transcript_id=ENSMMUT00000047681;version=2
## 1    ensembl exon    25432   25503   .   +   .   Parent=transcript:ENSMMUT00000047681;Name=ENSMMUE00000311984;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=ENSMMUE00000311984;rank=1;version=2
## 1    ensembl CDS 25432   25503   .   +   0   ID=CDS:ENSMMUP00000040704;Parent=transcript:ENSMMUT00000047681;protein_id=ENSMMUP00000040704
## 1    ensembl exon    29573   29754   .   +   .   Parent=transcript:ENSMMUT00000047681;Name=ENSMMUE00000311983;constitutive=1;ensembl_end_phase=2;ensembl_phase=0;exon_id=ENSMMUE00000311983;rank=2;version=1
## 1    ensembl CDS 29573   29754   .   +   0   ID=CDS:ENSMMUP00000040704;Parent=transcript:ENSMMUT00000047681;protein_id=ENSMMUP00000040704
## 1    ensembl exon    30429   30479   .   +   .   Parent=transcript:ENSMMUT00000047681;Name=ENSMMUE00000311982;constitutive=0;ensembl_end_phase=2;ensembl_phase=2;exon_id=ENSMMUE00000311982;rank=3;version=1
## 1    ensembl CDS 30429   30479   .   +   1   ID=CDS:ENSMMUP00000040704;Parent=transcript:ENSMMUT00000047681;protein_id=ENSMMUP00000040704

We’ll just point out a couple of things. First, this file also has a header, like a BAM file, though this is not required for GFF files. In general, the GFF format is less standardized than others we’ve gone over in the workshop. Next, you’ll note that columns 1, 4, and 5 hold the same information as the three columns that define a bed file (chromosome, start, and end), although with a different interval encoding: GFF coordinates are 1-based and include the end position, while bed coordinates are 0-based and exclude it. This makes GFF files (sort of) easy to convert to bed files, though with loss of information. It also means some bedtools programs can process GFF files as well.
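As a rough sketch of such a conversion, the awk command below turns two made-up GFF lines into bed lines. Since GFF starts are 1-based and bed starts are 0-based, 1 is subtracted from the start; the end is left alone because a 1-based inclusive end and a 0-based exclusive end are the same number. The feature lines are invented, and putting the feature type in the bed name column is just one possible choice.

```shell
# Hypothetical GFF lines (tab-separated), piped straight into awk
bed=$(printf '1\tensembl\tgene\t1000\t1999\t.\t+\t.\tID=gene:TOY1\n1\tensembl\tgene\t5000\t5100\t.\t-\t.\tID=gene:TOY2\n' |
  awk -F'\t' 'BEGIN{OFS="\t"} {print $1, $4-1, $5, $3, $6, $7}')
# Columns printed: chrom, start (now 0-based), end, name, score, strand
printf '%s\n' "$bed"
```

Everything in the attributes column is dropped here, which is the loss of information mentioned above.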

Features in a GFF file are generally nested: genes are made up of transcripts, and transcripts are made up of exons. All of these features are encoded in this file and are usually linked to each other by IDs in the last column (the ID and Parent attributes in the file above), though this is not always standardized. The strand column can make the order of features nested under the same parental feature slightly confusing to work with. For features on the positive strand (+), it is straightforward: they are ordered by start coordinate. For features nested under the same parental feature on the negative strand (-), though, the correct order is the reverse sorting by the end coordinate. Many of the tools we work with will consider and correct for strand, but it is always worth keeping in mind if you ever parse GFF files on your own.
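As a sketch of that ordering rule, the made-up exons below belong to one hypothetical transcript on the negative strand. Sorting them in reverse by end coordinate (column 5) lists them in the order they would be transcribed. The transcript ID and coordinates are invented for illustration.

```shell
# Hypothetical exons of one transcript on the - strand, listed by ascending start
exons=$(printf '1\ttoy\texon\t100\t200\t.\t-\t.\tParent=transcript:TOY\n1\ttoy\texon\t400\t500\t.\t-\t.\tParent=transcript:TOY\n1\ttoy\texon\t700\t900\t.\t-\t.\tParent=transcript:TOY\n')

# -t sets tab as the delimiter; -k5,5nr sorts numerically on column 5 in reverse
# cut -f4,5 keeps only the start and end columns for display
ordered=$(printf '%s\n' "$exons" | sort -t "$(printf '\t')" -k5,5nr | cut -f4,5)
printf '%s\n' "$ordered"
```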

Because of all the quirks with GFF files, there are many tools out there to help process and analyze them, with gffread being a relatively stable one. We won’t be demonstrating these today though.

Exercise: In the code block below, write an awk command that counts the number of genes in the macaque annotation. Be sure to only check the feature name column (third column) because any feature that has a gene as a parent will also have a “gene id” in the last column that would return that line if it was searched for the string “gene”. Also be sure to only get exact matches for the word “gene”, else pseudogenes might be included in the count:



## Write awk command to count the number of genes in the macaque annotation
# data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3
awk 'BEGIN{g=0} $3=="gene"{g++}; END{print g}' data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3
# $3=="gene": Only count lines whose third column is exactly "gene"
## 20852
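To see why the exact match on column 3 matters, the sketch below compares three ways of counting on a few made-up feature lines (the feature lines and IDs are invented). Only the == comparison on the third column counts genes alone.

```shell
# Hypothetical features: two genes, one pseudogene, and one mRNA whose attributes mention a gene
toy=$(printf '1\ttoy\tgene\t1\t10\t.\t+\t.\tID=gene:A\n1\ttoy\tpseudogene\t20\t30\t.\t+\t.\tID=gene:B\n1\ttoy\tmRNA\t1\t10\t.\t+\t.\tParent=gene:A\n1\ttoy\tgene\t40\t50\t.\t-\t.\tID=gene:C\n')

# Exact match on column 3: counts only the two gene features
exact=$(printf '%s\n' "$toy" | awk -F'\t' '$3=="gene"{n++} END{print n+0}')
# Pattern match on column 3: also counts the pseudogene
partial=$(printf '%s\n' "$toy" | awk -F'\t' '$3 ~ /gene/{n++} END{print n+0}')
# grep on the whole line: counts every line, because "gene" also appears in column 9
anywhere=$(printf '%s\n' "$toy" | grep -c "gene")

echo "$exact $partial $anywhere"
```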

BONUS Exercise: In the code block below, write an awk command that calculates the average number of transcripts per gene in the macaque annotation. This requires initializing 2 counter variables at the beginning and searching for 2 patterns separately within your awk script, and then doing some math at the end:


## Write awk command to calculate average number of transcripts per gene
# data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3
# Option 1: A single action block that uses if statements
awk 'BEGIN{g=0;t=0} {if($3=="gene"){g++};if($3=="mRNA"){t++}} END{print t, g, t/g}' data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3

# Option 2: The same logic written as two separate pattern-action pairs
awk 'BEGIN{g=0;t=0} $3=="gene"{g++}; $3=="mRNA"{t++} END{print t, g, t/g}' data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3
# Both print: the number of mRNAs, the number of genes, and the mRNAs per gene
## 44732 20852 2.14521
## 44732 20852 2.14521

bedtools intersect

So, how many of the SVs in our macaque population overlap with genes? For this we can use bedtools intersect, which takes two interval files (either bed or GFF) and reports the features that overlap between them. Even though it accepts GFF as input, our annotation contains many feature types, so we first need to pull out only the gene features.

Run the code block below to retrieve only the genes from the macaque annotation GFF file:


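awk 'BEGIN{OFS="\t"} $3=="gene"{print "chr"$0}' data2/Macaca_mulatta.Mmul_8.0.1.86.chr.gff3 > macaque-genes.gff
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
# $3=="gene": Only act on lines whose third column (feature type) is exactly "gene"
# print "chr"$0: Print the whole line with "chr" added to the front, so chromosome 1 becomes chr1
# > : The Unix redirect operator to write the output of the command to the following file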

head macaque-genes.gff
# Display the first few lines of the new file with head
## chr1 ensembl gene    25432   42232   .   +   .   ID=gene:ENSMMUG00000005947;Name=SAMD11;biotype=protein_coding;description=sterile alpha motif domain containing 11 [Source:HGNC Symbol%3BAcc:HGNC:28706];gene_id=ENSMMUG00000005947;logic_name=ensembl;version=3
## chr1 ensembl gene    40822   57414   .   -   .   ID=gene:ENSMMUG00000015800;Name=NOC2L;biotype=protein_coding;description=NOC2 like nucleolar associated transcriptional repressor [Source:HGNC Symbol%3BAcc:HGNC:24517];gene_id=ENSMMUG00000015800;logic_name=ensembl;version=3
## chr1 ensembl gene    58784   63064   .   +   .   ID=gene:ENSMMUG00000015802;Name=KLHL17;biotype=protein_coding;description=kelch like family member 17 [Source:HGNC Symbol%3BAcc:HGNC:24023];gene_id=ENSMMUG00000015802;logic_name=ensembl;version=3
## chr1 ensembl gene    64366   72839   .   +   .   ID=gene:ENSMMUG00000015804;Name=PLEKHN1;biotype=protein_coding;description=pleckstrin homology domain containing N1 [Source:HGNC Symbol%3BAcc:HGNC:25284];gene_id=ENSMMUG00000015804;logic_name=ensembl;version=3
## chr1 ensembl gene    73276   79439   .   -   .   ID=gene:ENSMMUG00000022525;Name=PERM1;biotype=protein_coding;description=PPARGC1 and ESRR induced regulator%2C muscle 1 [Source:HGNC Symbol%3BAcc:HGNC:28208];gene_id=ENSMMUG00000022525;logic_name=ensembl;version=3
## chr1 ensembl gene    87794   89166   .   -   .   ID=gene:ENSMMUG00000008350;biotype=protein_coding;gene_id=ENSMMUG00000008350;logic_name=ensembl;version=3
## chr1 ensembl gene    97478   101905  .   -   .   ID=gene:ENSMMUG00000001817;Name=HES4;biotype=protein_coding;description=hes family bHLH transcription factor 4 [Source:HGNC Symbol%3BAcc:HGNC:24149];gene_id=ENSMMUG00000001817;logic_name=ensembl;version=2
## chr1 ensembl gene    116734  118310  .   +   .   ID=gene:ENSMMUG00000001819;Name=ISG15;biotype=protein_coding;description=ISG15 ubiquitin-like modifier [Source:HGNC Symbol%3BAcc:HGNC:4053];gene_id=ENSMMUG00000001819;logic_name=ensembl;version=3
## chr1 ensembl gene    120996  155534  .   +   .   ID=gene:ENSMMUG00000000838;Name=AGRN;biotype=protein_coding;description=agrin [Source:HGNC Symbol%3BAcc:HGNC:329];gene_id=ENSMMUG00000000838;logic_name=ensembl;version=3
## chr1 ensembl gene    173834  176671  .   -   .   ID=gene:ENSMMUG00000032293;Name=RNF223;biotype=protein_coding;description=ring finger protein 223 [Source:HGNC Symbol%3BAcc:HGNC:40020];gene_id=ENSMMUG00000032293;logic_name=ensembl;version=2
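The chr prefix matters because bedtools only reports overlaps between features whose chromosome names match exactly. As a minimal sketch of how one might check this before intersecting, the block below compares the chromosome names in two small toy files. The file names and contents are made up for illustration and simply stand in for the real bed and GFF files.

```shell
# Create toy stand-ins for the SV bed file and the unmodified GFF file (hypothetical data)
tmpdir=$(mktemp -d)
printf '%s\t%s\t%s\t%s\n' \
  chr1 100 200 sv1 \
  chr2 300 400 sv2 > "$tmpdir/toy-svs.bed"
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  1 ensembl gene 150 250 . + . ID=gene:A \
  2 ensembl gene 500 600 . - . ID=gene:B > "$tmpdir/toy-genes.gff"

# List the unique chromosome names (column 1) in each file
cut -f1 "$tmpdir/toy-svs.bed" | sort -u > "$tmpdir/bed-chroms.txt"
cut -f1 "$tmpdir/toy-genes.gff" | sort -u > "$tmpdir/gff-chroms.txt"

# comm -12 prints only the lines that are present in both sorted files
shared_before=$(comm -12 "$tmpdir/bed-chroms.txt" "$tmpdir/gff-chroms.txt" | wc -l | tr -d ' ')

# Add the chr prefix to every GFF line, then compare the names again
awk '{print "chr"$0}' "$tmpdir/toy-genes.gff" | cut -f1 | sort -u > "$tmpdir/gff-chroms-fixed.txt"
shared_after=$(comm -12 "$tmpdir/bed-chroms.txt" "$tmpdir/gff-chroms-fixed.txt" | wc -l | tr -d ' ')

echo "Shared chromosome names before: $shared_before, after: $shared_after"
```

If the number of shared names is zero, an intersect between the two files would come back empty even when the coordinates overlap.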

Now we can get the overlaps between genes and SVs in our sample of macaques.

Run the code block below to use bedtools intersect to get the overlapping regions between two interval files:


bedtools intersect -a data2/macaque-svs-filtered.bed -b macaque-genes.gff > macaque-svs-genes-intersect.bed
# bedtools: A suite of programs to process bed files
# intersect: The sub-program of bedtools to execute
# -a : The first interval file to check for overlaps
# -b : The second interval file to check for overlaps
# > : The Unix redirect operator to write the output of the command to the following file

wc -l data2/macaque-svs-filtered.bed
wc -l macaque-svs-genes-intersect.bed
# Use wc -l to count the number of lines in the original bed file and those in the bed file that overlaps with genes
## 3646 data2/macaque-svs-filtered.bed
## 1702 macaque-svs-genes-intersect.bed
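To make the idea of an overlap concrete, here is a minimal sketch of the same calculation in awk for one SV and two genes taken from our data. This is only an illustration of the logic under simple assumptions (small files, every pair compared), not how bedtools is implemented, and the toy file names are made up. Note that bed coordinates are 0-based with the end not included, while GFF coordinates are 1-based with the end included, so the sketch subtracts 1 from the GFF start before comparing.

```shell
# Toy input: one SV (bed) and two genes (GFF) with shortened attribute columns
tmpdir=$(mktemp -d)
printf '%s\t%s\t%s\t%s\n' \
  chr1 541132 590572 'chr1:541132:<DEL>:49440:316.41' > "$tmpdir/toy-svs.bed"
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1 ensembl gene 562049 562264 . + . ID=gene:ENSMMUG00000045301 \
  chr1 ensembl gene 569144 591870 . + . ID=gene:ENSMMUG00000001549 > "$tmpdir/toy-genes.gff"

# First file (genes): store each gene, converting the 1-based GFF start to a 0-based start
# Second file (SVs): compare each SV to every stored gene on the same chromosome
awk 'BEGIN{FS=OFS="\t"}
  NR==FNR { n++; gchr[n]=$1; gstart[n]=$4-1; gend[n]=$5+0; next }
  { for (i=1; i<=n; i++) {
      if ($1 != gchr[i]) continue
      s = ($2+0 > gstart[i]) ? $2+0 : gstart[i]   # larger of the two starts
      e = ($3+0 < gend[i]) ? $3+0 : gend[i]       # smaller of the two ends
      if (s < e) print $1, s, e, $4               # the intervals overlap
    } }' "$tmpdir/toy-genes.gff" "$tmpdir/toy-svs.bed" > "$tmpdir/toy-intersect.bed"

cat "$tmpdir/toy-intersect.bed"
```

Each overlapping SV and gene pair produces one output line whose coordinates are clipped to the shared region, so a single SV can contribute more than one line.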

Ok great, we’ve now got only the regions of SVs that overlap with genes in the macaque genome. Let’s take a look at this file.

Run the code block below to view the first few lines of the bed file with SVs that overlap with genes:


head macaque-svs-genes-intersect.bed
# Display the first few lines of the bed file containing SVs that overlap with genes
## chr1 130740  131675  chr1:130740:<DEL>:935:285.63
## chr1 562048  562264  chr1:541132:<DEL>:49440:316.41
## chr1 569143  590572  chr1:541132:<DEL>:49440:316.41
## chr1 562048  562264  chr1:552968:<DUP>:29266:189.32
## chr1 569143  582234  chr1:552968:<DUP>:29266:189.32
## chr1 1117696 1122022 chr1:1117696:<DEL>:4326:201.55
## chr1 1151866 1154542 chr1:1151866:<DEL>:2676:11284.32
## chr1 1166390 1167586 chr1:1166390:<DEL>:1196:15253.03
## chr1 1408621 1409766 chr1:1408621:<DEL>:1145:1112.53
## chr1 1409564 1410074 chr1:1409564:<DEL>:510:13091.76
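Notice that the same SV name can appear on more than one line of this output, once for each gene it overlaps. The line count of the file is therefore the number of overlaps, not necessarily the number of distinct SVs. As a minimal sketch, the block below counts both for the first five lines shown above, copied into a toy file with a made-up name.

```shell
# Toy file containing the first five lines of the intersect output
tmpdir=$(mktemp -d)
printf '%s\t%s\t%s\t%s\n' \
  chr1 130740 131675 'chr1:130740:<DEL>:935:285.63' \
  chr1 562048 562264 'chr1:541132:<DEL>:49440:316.41' \
  chr1 569143 590572 'chr1:541132:<DEL>:49440:316.41' \
  chr1 562048 562264 'chr1:552968:<DUP>:29266:189.32' \
  chr1 569143 582234 'chr1:552968:<DUP>:29266:189.32' > "$tmpdir/toy-intersect.bed"

# Number of lines, i.e. the number of SV and gene overlaps
total_lines=$(wc -l < "$tmpdir/toy-intersect.bed" | tr -d ' ')

# Number of distinct SVs: take the name column (4), keep unique values, count them
distinct_svs=$(cut -f4 "$tmpdir/toy-intersect.bed" | sort -u | wc -l | tr -d ' ')

echo "$total_lines overlaps from $distinct_svs distinct SVs"
```

The same cut, sort -u and wc -l pipeline could be run on macaque-svs-genes-intersect.bed. bedtools intersect also has a -u option that reports each entry in -a only once if it has any overlap in -b.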

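The output has the same format as the input bed file, but note that the coordinates are those of the overlapping region rather than of the whole SV. For example, the SV chr1:541132:<DEL>:49440:316.41 is reported with the coordinates 562048 to 562264. bedtools intersect can also add columns with more information about each overlap, and the criteria for what counts as an overlap can be made stricter. Let’s try it out.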

**Exercise:** Read the documentation of bedtools intersect and do the following. Don’t save the output to a file, just pipe it to wc -l:

  1. Count only the SVs that DO NOT overlap with any genes.

  2. Count only SVs that have at least 90% of their sequence overlapping a gene.

  3. Count only SVs that have at least 90% of their sequence overlapping a gene AND for which that overlap also encompasses at least 90% of the gene.


# Input SV file: data2/macaque-svs-filtered.bed

## 1. Count SVs that DO NOT overlap with any gene
bedtools intersect -v -a data2/macaque-svs-filtered.bed -b macaque-genes.gff | wc -l
# -v : Only report entries in -a that have no overlap with any entry in -b

## 2. Count SVs that have at least 90% of their sequence overlapping a gene
bedtools intersect -f 0.9 -a data2/macaque-svs-filtered.bed -b macaque-genes.gff | wc -l
# -f 0.9 : Require the overlap to cover at least 90% of the -a feature

## 3. Count SVs that have at least 90% of their sequence overlapping a gene, where the overlap also covers at least 90% of the gene
bedtools intersect -f 0.9 -r -a data2/macaque-svs-filtered.bed -b macaque-genes.gff | wc -l
# -r : Require the -f fraction to be reciprocal, that is, to hold for both the -a and the -b feature
## 2112
## 1437
## 7

With the -wo option, bedtools intersect writes out the original entries from both files for each overlap, followed by the number of overlapping base pairs.

Run the code block below to perform an intersect between macaque SVs and genes with the -wo option:


bedtools intersect -wo -a data2/macaque-svs-filtered.bed -b macaque-genes.gff | head
# bedtools: A suite of programs to process bed files
# intersect: The sub-program of bedtools to execute
# -wo : A bedtools intersect option that writes the original entry from both files plus the number of overlapping base pairs
# -a : The first interval file to check for overlaps
# -b : The second interval file to check for overlaps
# | : The Unix pipe operator to pass output from one command as input to another command
# head: Display only the first few lines of the output
## chr1 130740  131675  chr1:130740:<DEL>:935:285.63    chr1    ensembl gene    120996  155534  .   +   .   ID=gene:ENSMMUG00000000838;Name=AGRN;biotype=protein_coding;description=agrin [Source:HGNC Symbol%3BAcc:HGNC:329];gene_id=ENSMMUG00000000838;logic_name=ensembl;version=3   935
## chr1 541132  590572  chr1:541132:<DEL>:49440:316.41  chr1    ensembl gene    562049  562264  .   +   .   ID=gene:ENSMMUG00000045301;biotype=protein_coding;gene_id=ENSMMUG00000045301;logic_name=ensembl;version=1   216
## chr1 541132  590572  chr1:541132:<DEL>:49440:316.41  chr1    ensembl gene    569144  591870  .   +   .   ID=gene:ENSMMUG00000001549;biotype=protein_coding;gene_id=ENSMMUG00000001549;logic_name=ensembl;version=3   21429
## chr1 552968  582234  chr1:552968:<DUP>:29266:189.32  chr1    ensembl gene    562049  562264  .   +   .   ID=gene:ENSMMUG00000045301;biotype=protein_coding;gene_id=ENSMMUG00000045301;logic_name=ensembl;version=1   216
## chr1 552968  582234  chr1:552968:<DUP>:29266:189.32  chr1    ensembl gene    569144  591870  .   +   .   ID=gene:ENSMMUG00000001549;biotype=protein_coding;gene_id=ENSMMUG00000001549;logic_name=ensembl;version=3   13091
## chr1 1117696 1122022 chr1:1117696:<DEL>:4326:201.55  chr1    ensembl gene    1085483 1236570 .   +   .   ID=gene:ENSMMUG00000018911;Name=PRKCZ;biotype=protein_coding;description=protein kinase C zeta [Source:HGNC Symbol%3BAcc:HGNC:9412];gene_id=ENSMMUG00000018911;logic_name=ensembl;version=3 4326
## chr1 1151866 1154542 chr1:1151866:<DEL>:2676:11284.32    chr1    ensembl gene    1085483 1236570 .   +   .   ID=gene:ENSMMUG00000018911;Name=PRKCZ;biotype=protein_coding;description=protein kinase C zeta [Source:HGNC Symbol%3BAcc:HGNC:9412];gene_id=ENSMMUG00000018911;logic_name=ensembl;version=3 2676
## chr1 1166390 1167586 chr1:1166390:<DEL>:1196:15253.03    chr1    ensembl gene    1085483 1236570 .   +   .   ID=gene:ENSMMUG00000018911;Name=PRKCZ;biotype=protein_coding;description=protein kinase C zeta [Source:HGNC Symbol%3BAcc:HGNC:9412];gene_id=ENSMMUG00000018911;logic_name=ensembl;version=3 1196
## chr1 1408621 1409766 chr1:1408621:<DEL>:1145:1112.53 chr1    ensembl gene    1407566 1447126 .   -   .   ID=gene:ENSMMUG00000012345;Name=MORN1;biotype=protein_coding;description=MORN repeat containing 1 [Source:HGNC Symbol%3BAcc:HGNC:25852];gene_id=ENSMMUG00000012345;logic_name=ensembl;version=3 1145
## chr1 1409564 1410074 chr1:1409564:<DEL>:510:13091.76 chr1    ensembl gene    1407566 1447126 .   -   .   ID=gene:ENSMMUG00000012345;Name=MORN1;biotype=protein_coding;description=MORN repeat containing 1 [Source:HGNC Symbol%3BAcc:HGNC:25852];gene_id=ENSMMUG00000012345;logic_name=ensembl;version=3 510
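The last column of the -wo output makes it possible to see what the -f and -r options from the exercise are measuring. As a rough sketch, the block below takes three rows from the output above (with the attribute column shortened to just the gene ID, and a made-up file name) and uses awk to calculate the fraction of each SV and of each gene that is covered by the overlap. It only mimics the idea behind the filters and is not how bedtools implements them.

```shell
# Toy file: three rows of -wo output with shortened gene attributes
tmpdir=$(mktemp -d)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1 130740 131675 'chr1:130740:<DEL>:935:285.63' chr1 ensembl gene 120996 155534 . + . ID=gene:ENSMMUG00000000838 935 \
  chr1 541132 590572 'chr1:541132:<DEL>:49440:316.41' chr1 ensembl gene 562049 562264 . + . ID=gene:ENSMMUG00000045301 216 \
  chr1 541132 590572 'chr1:541132:<DEL>:49440:316.41' chr1 ensembl gene 569144 591870 . + . ID=gene:ENSMMUG00000001549 21429 \
  > "$tmpdir/toy-wo.txt"

# Columns 1-4 are the SV (bed), 5-13 are the gene (GFF), 14 is the number of overlapping bases
awk 'BEGIN{FS=OFS="\t"} {
  sv_len = $3 - $2          # bed: 0-based start, end not included
  gene_len = $9 - $8 + 1    # GFF: 1-based start, end included
  printf "%s\t%s\t%.3f\t%.3f\n", $4, $13, $14 / sv_len, $14 / gene_len
}' "$tmpdir/toy-wo.txt" > "$tmpdir/toy-fractions.txt"

# Output columns: SV name, gene ID, fraction of SV covered, fraction of gene covered
cat "$tmpdir/toy-fractions.txt"

# Overlaps covering at least 90% of the SV (the idea behind -f 0.9)
n_sv90=$(awk -F'\t' '$3 >= 0.9' "$tmpdir/toy-fractions.txt" | wc -l | tr -d ' ')

# Overlaps covering at least 90% of both the SV and the gene (the idea behind -f 0.9 -r)
n_recip=$(awk -F'\t' '$3 >= 0.9 && $4 >= 0.9' "$tmpdir/toy-fractions.txt" | wc -l | tr -d ' ')

echo "At least 90% of SV: $n_sv90, reciprocal 90%: $n_recip"
```

In this toy example only the first SV, which lies entirely inside a gene, passes the 90% threshold, and no overlap passes the reciprocal threshold because that gene is far longer than the SV.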

End of Day 2

That’s it for day 2! Join us next week to learn about VCF files, shell scripts, conda environments, and the cluster.