Welcome to the Harvard Informatics Healthy Habits for Data Science workshop!

This web page will guide you through some of the activities we have planned for you today!

Instructors

Adam Freedman: A bioinformatics scientist in the FAS Informatics group at Harvard University.

Danielle Khost: A bioinformatics scientist in the FAS Informatics group at Harvard University.

Gregg Thomas: A bioinformatics scientist in the FAS Informatics group at Harvard University and a recent postdoc at the University of Montana, where he studied the phylogenetics and comparative genomics of the mouse and rat radiation. He earned his PhD at Indiana University, where he worked on comparative genomics of arthropods, mutation rate evolution in primates, and convergent evolution. In general, Gregg uses and develops computational methods to study molecular evolution and phylogenetics to determine what forces drive divergence and adaptation between species.

Lei Ma: Lei received her PhD from the MIT-WHOI Joint Program in Oceanography/Applied Ocean Science and Engineering. Her dissertation focused on the ecology of marine microorganisms in coral reefs and in Atlantic killifish. She is particularly interested in genotype-environment-microbiome interactions in animal hosts, such as the influence of host evolution on its microbiome. Other interests include mentoring, finding coding shortcuts, cats, video games, sci-fi, and knitting.

Tim Sackton: Director of the FAS Informatics group at Harvard University.

Workshop Summary & Outline

This workshop aims to introduce students to concepts and modern tools for project organization to facilitate high-throughput analyses and reproducibility. We will emulate a typical project workflow by reproducing a published analysis (Favate et al. 2022) using publicly available scripts and data.

Additionally, this workshop includes two optional BONUS days covering topics that are not strictly necessary for the project but are useful for general computational work. These days will be less structured and follow more of a "drop-in" format. Topics include optimizing and customizing your data analysis tools, AI-assisted coding and debugging, and more.

Here is a brief outline of the topics we'll be covering:

Day 1: Project organization

Wednesday March 13th, 9:30 am - 12:30 pm
Location: Northwest Building room 453
  • Intro to file systems
  • Navigating file systems from the command line
  • Best practices for file organization
  • Downloading and transferring data

BONUS Day 1: Optimizing and customizing your data analysis tools

Thursday March 14th, 9:30 am - 12:30 pm
Location: Biolabs room 2062/2064
  • The importance of text editors and why we recommend VSCode
  • Editing files remotely with VSCode
  • Transferring files with FTP clients
  • Shell profiles
  • Terminal multiplexers (e.g. screen, tmux)

Day 2: Package managers and software environments

Wednesday March 20th, 9:30 am - 12:30 pm
Location: Northwest Building room 453
  • Why is installing software so hard: permissions and dependencies
  • Concepts of containerization (e.g. Docker, Singularity)
  • Package managers (e.g. pip, CRAN, conda)
  • Creating and managing software environments with mamba

Day 3: Version control with git

Thursday March 21st, 9:30 am - 12:30 pm
Location: Biolabs room 2062/2064
  • What is version control and why is it important?
  • Intro to git and GitHub
  • Managing your project history with git
  • Collaborating on GitHub

Day 4: Code notebooks and automating jobs on the cluster

Wednesday March 27th, 9:30 am - 12:30 pm
Location: Northwest Building room 453

BONUS Day 2: AI-assisted coding and debugging

Thursday March 28th, 9:30 am - 12:30 pm
Location: Biolabs room 2062/2064
  • Overview of large language models (LLMs; e.g. ChatGPT, GitHub Copilot)
  • Writing tests for your code
  • Debugging skills

Click the appropriate Get Started link below to read some info before class. Additional links to resources will appear for each day of the workshop.


Get Started