Part of becoming an efficient data scientist is trying out and learning the tools that work best for you. There are definitely plenty out there to try. Here we have assembled lists of popular FREE software for common data science tasks. If you feel any information is inaccurate or out of date, or if you want to recommend a program to add to the lists, please contact me.
Programs listed with a GREEN BACKGROUND are ones used in this workshop.
Your text editor will be your most-used program and will be how you interact with your data, so it's important to find one that does exactly what you need!
Editor | Type | Platform | Link |
---|---|---|---|
nano | Command line | Linux/Unix | Website |
vi/vim | Command line | Linux/Mac/Windows | Website |
Emacs | Command line | Linux/Mac/Windows | Website |
Visual Studio Code | GUI | Linux/Mac/Windows | Website |
BBEdit | GUI | Mac | Website |
gedit | GUI | Linux/Mac/Windows | Website |
Sublime Text | GUI | Linux/Mac/Windows | Website |
Atom | GUI | Linux/Mac/Windows | Website |
TextMate | GUI | Mac | Website |
Notepad++ | GUI | Windows | Website |
RStudio | IDE | Linux/Mac/Windows | Website |
Visual Studio | IDE | Windows | Website |
File transfer programs allow you to move files between your machine and your lab's or institution's server, or between servers. Sometimes you'll only want to move one or a few files to inspect them, and a graphical, drag-and-drop program is sufficient. Other times, though, you'll need to move thousands of files or very large files, and a command line file transfer program may be required to automate the process. Below is a list of some popular programs of each type.
Cloud services like Box, Dropbox, OneDrive, or Google Drive are also extremely useful for syncing folders across devices, but can be difficult to set up on a server. I personally use Box to store all of my active documents and project folders (without large data), and am able to sync between my home and work computers. These programs may not be free, but be sure to check with your institution, which may offer free accounts with large or even unlimited storage while you work there.
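As a rough illustration of the automated command line transfers mentioned above, the sketch below assembles an `rsync` command in Python. It assumes `rsync` is installed on both machines; the function name `build_rsync_command` and the server address and paths are hypothetical placeholders, not part of any workshop material.

```python
import shlex
from typing import List


def build_rsync_command(source: str, destination: str, dry_run: bool = False) -> List[str]:
    """Assemble an rsync command as a list of arguments (hypothetical helper)."""
    # -a: archive mode (recurse, keep permissions and timestamps)
    # -v: verbose output, -z: compress data during transfer
    # --partial: keep partially transferred files so large transfers can resume
    cmd = ["rsync", "-avz", "--partial"]
    if dry_run:
        # --dry-run: report what would be transferred without copying anything
        cmd.append("--dry-run")
    cmd += [source, destination]
    return cmd


# Placeholder source folder and remote destination; replace with your own.
cmd = build_rsync_command(
    "results/",
    "user@server.example.edu:/home/user/results/",
    dry_run=True,
)

# Print the command so it can be inspected before running it.
print(shlex.join(cmd))

# To actually run the transfer, one option is:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list and printing it first makes it easy to check a transfer before committing to it, which matters when thousands of files are involved.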
SSH is the protocol that allows us to connect our local machine to a remote machine and run commands on it in the terminal. The most widely used SSH client is OpenSSH. Until recently, however, it was not available on Windows, and a third-party client was required. PuTTY is by far the best SSH client for Windows, and is still a great option for older versions of Windows or installations without OpenSSH.
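To make the idea of running a command on a remote machine concrete, here is a minimal Python sketch that wraps the OpenSSH `ssh` client. It assumes `ssh` is on your PATH and that key-based login is already configured; the helper names, user name, and host are hypothetical.

```python
import subprocess
from typing import List


def build_ssh_command(user: str, host: str, remote_command: str) -> List[str]:
    """Assemble an ssh invocation that runs one command on a remote machine."""
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # which is useful in scripts (assumes key-based authentication).
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", remote_command]


def run_remote(user: str, host: str, remote_command: str) -> str:
    """Run the command remotely and return its standard output as text."""
    result = subprocess.run(
        build_ssh_command(user, host, remote_command),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if ssh or the remote command fails
    )
    return result.stdout


# Placeholder user and host; this only builds and prints the argument list.
ssh_cmd = build_ssh_command("user", "server.example.edu", "ls -lh data/")
print(ssh_cmd)
```

Calling `run_remote("user", "server.example.edu", "ls -lh data/")` would then return the remote directory listing, provided the placeholder host is replaced with a server you can log in to.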
Program | Author | Year | Use cases | Link | Paper |
---|---|---|---|---|---|
bedtools | Quinlan and Hall | 2010 | Perform operations on sets of genomic coordinates. | Website | Paper |
bcftools | NA | NA | Perform operations on VCF and BCF formatted files. | Website | NA |
samtools | Li et al. | 2009 | Perform operations on SAM/BAM/CRAM formatted files. | Website | Paper |
Picard tools | Broad Institute | 2019 | Performs many operations on SAM/BAM/CRAM and VCF files. | Website | Paper |
gffread | Pertea & Pertea | 2020 | General purpose GFF file manipulation | Website | Paper |
seqtk | Li | NA | A fast and lightweight tool for processing sequences in the FASTA or FASTQ format | Website | NA |
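To give a sense of what "operations on sets of genomic coordinates" means in the table above, the sketch below computes overlaps between two small sets of intervals in plain Python. It is only an illustration of the concept, not how bedtools is implemented, and the interval values and the `intersect` function are made up for the example.

```python
from typing import List, Tuple

# (chromosome, start, end), using 0-based, half-open coordinates as in BED files
Interval = Tuple[str, int, int]


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the overlapping portion of every pair of intervals from a and b."""
    overlaps = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue  # intervals on different chromosomes never overlap
            start = max(start_a, start_b)
            end = min(end_a, end_b)
            if start < end:  # a non-empty overlap exists
                overlaps.append((chrom_a, start, end))
    return overlaps


# Made-up example intervals
genes = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
peaks = [("chr1", 150, 250), ("chr2", 0, 60), ("chr3", 10, 20)]

print(intersect(genes, peaks))
# prints [('chr1', 150, 200), ('chr2', 50, 60)]
```

This nested loop compares every pair of intervals, which is fine for a handful of records but far too slow for genome-scale files; dedicated tools such as those in the table are the practical choice for real data.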