Big Data Analytics for Life Scientists

Slide and teaching material availability

Teaching materials are available through the Big Data Analytics for Life Scientists GitHub repository.

Summer School ‘Big Data Analytics for Life Scientists’

This summer school is intended for doctoral students in the life sciences who are interested in
learning fundamental skills for big data projects. No previous experience is required, but high
motivation is expected. The one-week program is listed below. Lectures covering the theoretical
background (about 2 h) will be followed by practical exercises to gain hands-on experience
(3 to 6 h).

This summer school will take place from June 17th to 21st in R021B (Mendelssohnstraße 4).
Lectures are planned to start at 9 am. Exercises will be conducted before and after a lunch
break. Participation in this summer school is free of charge, but an application is required.
There are 12 slots available per cohort, and priority will be given to students working on
projects related to plant-microbe interactions. Depending on the number of applications, a
second week for another cohort might be offered.

Please send your application with the following details via email to Boas Pucker:

  • Name
  • Affiliation
  • Current situation (progress in doctoral studies)
  • Summary of your research / research interests
  • Expectations / motivation for participating in this summer school

We are looking forward to a great data analytics week!

Tentative program

Day 1

Lecture

  • Exponential growth of databases
  • Systems Biology & Omics (Genomics, Transcriptomics, Proteomics, Metabolomics)
  • Potential for data upcycling & Data Life Cycle
  • Recent progress and current challenges in big data analytics
  • How to find the right software
  • Open science principles
  • Introduction to Linux (Ubuntu)
  • Data management (File and folder structures for big data projects)
  • Backup and archiving strategies
  • Documentation
  • Important scripting languages in bioinformatics: Python, R
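As a rough illustration of the data management topic above (file and folder structures for big data projects), the following Python sketch creates a simple project skeleton. The folder names and the `create_project` helper are assumptions chosen for illustration, not a layout prescribed by the course.

```python
from pathlib import Path

# Hypothetical folder layout; adapt the names to your own project conventions.
SUBFOLDERS = ["raw_data", "processed_data", "scripts", "results", "docs"]


def create_project(root):
    """Create the project root with one subfolder per category and a README stub."""
    root = Path(root)
    for name in SUBFOLDERS:
        # parents=True creates the root if needed; exist_ok=True makes reruns safe
        (root / name).mkdir(parents=True, exist_ok=True)
    readme = root / "README.txt"
    if not readme.exists():
        # Placeholder text only; real documentation belongs here
        readme.write_text("Project description, data sources, and contact person.\n")
    # Return the names of the subfolders that now exist, sorted alphabetically
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

Calling `create_project("my_project")` would create the folders below `my_project` in the current working directory. Keeping raw data in its own folder makes it easier to protect the original files from accidental modification.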

Practical course

  • How to get help from AI tools (and their limitations)
  • Working in a virtual machine
  • Installing computational tools
  • Finding usage information for computational tools
  • Transferring files (FileZilla, scp)
  • Jupyter Notebook

Day 2

Lecture

  • Introduction to genome biology
  • Sequencing technologies (Sanger, Illumina, ONT, PacBio)
  • Long read sequencing workflow (ONT)
  • Genome sequence assembly
  • Structural and functional annotation
  • Comparative genomics
  • Read mapping, variant calling, variant annotation
  • Genome-Wide Association Studies (GWAS) / Mapping-By-Sequencing (MBS)
  • File formats: FASTA, FASTQ, GFF, SAM/BAM, VCF
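To make two of the file formats listed above more concrete, here is a minimal Python sketch that reads FASTA and FASTQ records from text. The function names and the toy records are assumptions for illustration. The FASTQ reader assumes exactly four lines per record, which is the common layout but not the only valid one.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequences may be wrapped over several lines
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples; assumes four lines per record."""
    while True:
        name = handle.readline().strip()
        if not name:
            break  # end of file (or an empty line) stops the reader
        sequence = handle.readline().strip()
        handle.readline()  # separator line starting with '+'
        quality = handle.readline().strip()
        yield name[1:], sequence, quality


# Toy data, invented for this example
fasta_text = ">seq1 toy example\nACGT\nACGT\n>seq2\nGGCC\n"
fastq_text = "@read1\nACGT\n+\nIIII\n"

fasta_records = list(read_fasta(io.StringIO(fasta_text)))
fastq_records = list(read_fastq(io.StringIO(fastq_text)))
```

For real projects, an established parser such as Biopython's `SeqIO` module is usually a safer choice than a hand-written reader.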

Practical course

  • QC of long reads
  • Trimming/filtering of reads
  • Assembly (Shasta)
  • Gene prediction (BRAKER3)
  • Functional annotation (Mercator)
  • Biosynthesis pathway annotation (KIPEs3)
  • Long read mapping (minimap2)
  • Variant calling and annotation (SnpEff, NAVIP)

Day 3

Lecture

  • Introduction to transcriptomics
  • History of transcriptomics (microarrays, RT-qPCR)
  • Concept of RNA-seq & workflow
  • Experimental design considerations
  • Quality control (RNA & data)
  • Read mapping & quantification
  • PCA, Heatmaps, DEG identification
  • Direct RNA sequencing & full-length cDNA sequencing
  • scRNA-seq
  • Re-using public datasets
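As a small illustration of the quantification topic listed above, the following Python sketch converts read counts into TPM (transcripts per million). The helper name `counts_to_tpm` and the toy values are assumptions. The sketch uses plain transcript lengths, whereas quantification tools typically work with effective lengths, so it is a simplification rather than a reimplementation of any tool.

```python
def counts_to_tpm(counts, lengths):
    """Convert read counts per transcript to TPM (transcripts per million)."""
    if counts.keys() != lengths.keys():
        raise ValueError("counts and lengths must cover the same transcripts")
    # Step 1: normalize each count by transcript length (reads per base)
    rates = {t: counts[t] / lengths[t] for t in counts}
    total = sum(rates.values())
    if total == 0:
        # No reads at all: every transcript gets a TPM of zero
        return {t: 0.0 for t in counts}
    # Step 2: scale the rates so that they sum to one million
    return {t: rate / total * 1_000_000 for t, rate in rates.items()}


# Toy values, invented for this example
toy_counts = {"tx1": 100, "tx2": 300, "tx3": 0}
toy_lengths = {"tx1": 1000, "tx2": 3000, "tx3": 500}
toy_tpm = counts_to_tpm(toy_counts, toy_lengths)
```

In this toy case, `tx1` and `tx2` receive the same TPM although `tx2` has three times as many reads, because `tx2` is also three times as long.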

Practical course

  • QC and trimming of reads (FastQC, Trimmomatic)
  • De novo transcriptome assembly (Trinity)
  • Split read mapping (STAR, HISAT2)
  • Quantification (kallisto)
  • Identification of DEGs (DESeq2)
  • Co-expression analysis (ppb-tools.de)

Day 4

Lecture

  • How to reduce complexity?
  • Considerations for designing scientific figures
  • Types of figures
  • Matplotlib, plotly, ggplot2
  • Examples: circos plots, synteny figures, PCA, phylogenetic trees
  • File formats: PNG, JPEG, TIFF, PDF, SVG
  • Manual editing (Inkscape)
  • Visualizing complex networks (Cytoscape)
  • Web examples: eFP browser
  • Designing figures with BioRender

Practical course

  • How to generate figures in Python, R (with AI support)
  • Circos plots & synteny figures
  • DEG plots & enrichment analysis
  • Visualizing co-expression networks with Cytoscape
  • Phylogenetic tree construction (FastTree2) + visualization (iTOL)

Day 5

Lecture

  • Introduction to the scientific publishing business
  • How to publish your data (enable reuse)
  • Importance of metadata
  • FAIR data
  • Details about methods (#OpenMethods)
  • Sharing protocols through protocols.io
  • Publishing data sets through LeoPARD
  • Submission of sequencing data to ENA
  • Depositing scripts in GitHub and Zenodo

Practical course

  • Sharing protocols through protocols.io
  • Create a GitHub repository
  • Prepare data for submission to ENA
  • Complete the LeoPARD template

Q & A

  • Opportunity to ask remaining questions
  • Collection of feedback about content / structure