Informatics on High Throughput Sequencing Data
Date: June 11-12, 2012
Location: Downtown Toronto, ON
Lead Faculty (2012): Michael Brudno, Michael Stromberg, Malachi Griffith, Marc Fiume & Francis Ouellette
Registration Fee for Applications received before May 11, 2012: $500 + HST
Registration Fee for Applications received after May 11, 2012: $700 + HST
Awards available for 2012.
Target Audience
This workshop is geared to graduate students, post-doctoral fellows, clinical fellows and investigators involved in analyzing data from HT sequencing platforms.
Prerequisites: Basic UNIX skills will be helpful. You will also need your own laptop computer. If you do not have access to a laptop, you may borrow one from the CBW. Please contact course_info@bioinformatics.ca for more information.
Course Objectives
With the introduction of high-throughput sequencing platforms from Illumina, Roche and ABI, sequencing has become a feasible approach for many research projects. How to manage and interpret the large volume of sequence data these technologies produce, however, is less clear. The CBW has developed a 2-day course covering the bioinformatics tools available for managing and interpreting high-throughput sequencing data.
Beginning with an understanding of the workflow involved in moving from platform images to sequence generation, participants will gain practical experience and the skills to:
- Assess sequence quality
- Map sequence data onto a reference genome
- Quantify sequence data
- Integrate biological context with sequence information
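As a taste of the first skill, the sketch below decodes a FASTQ quality string into Phred scores and averages them. It is a minimal illustration and not course material: the function names are hypothetical, and the ASCII offset of 33 (the Sanger convention) is an assumption, since some older Illumina pipelines used an offset of 64.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    offset=33 assumes Sanger-style encoding; pass 64 for older
    Illumina data that used the Phred+64 convention.
    """
    return [ord(ch) - offset for ch in quality_line]


def mean_quality(quality_line, offset=33):
    """Return the mean Phred score of one read (0.0 for an empty read)."""
    scores = phred_scores(quality_line, offset)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical quality string for a six-base read.
    example = "IIII##"
    print(phred_scores(example))
    print(round(mean_quality(example), 2))
```

A read whose mean score falls below a chosen threshold could then be flagged or trimmed before alignment; the threshold itself depends on the platform and the downstream analysis.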
Course Outline
Day 1
Module 1: Genome Alignment (M. Stromberg)
- Lecture
- Laboratory - Genome alignment
Module 2: Genome Variation (M. Stromberg)
Single Nucleotide Polymorphism (SNP) Sequence Data
- What are SNPs, SNVs, and short-INDELs? Why would I want to look for them?
- What should I have done up to this point? (e.g. BQ recalibration, duplicate removal, aligner choice)
- How are these variants detected? What factors are taken into account by the SNP callers?
- Different types of SNP calling: haploid/diploid, trio, somatic mutations, pooled
- YAY, WE FOUND MILLIONS OF SNPs!!!! How do I know if any of these are good?
- INDEL cleaning
- Are there any standard file formats for SNPs?
- Introduction to SNP calling tools and how they compare with each other.
Structural Genome Variations/Chromosome rearrangements
- What are SVs? What are the different types? Discuss the biological processes behind SVs.
- What should I have done up to this point? (e.g. duplicate removal, aligner choice)
- How are these variants detected? Discuss detection strategies (read pair, read depth, combined approach, local de novo assembly). Which SV types are detectable by which strategies?
- Are there any standard file formats for SVs?
- Introduction to SV detection tools and how they compare with each other.
Module 3: Genome Variation (M. Brudno)
Copy Number Variation (CNVs)
- What are CNVs? Discuss the biological processes behind CNVs.
- What should I have done up to this point? (e.g. duplicate removal, aligner choice)
- How are these variants detected? Discuss detection strategies (read pair, read depth, combined approach, local de novo assembly)
- Are there any standard file formats for CNVs?
- Introduction to CNV detection tools and how they compare with each other.
Module 4: Genome visualization (M. Fiume)
- Genomic data - common file formats (FASTA, SAM/BAM, BED, WIG, GFF, etc)
- Introduction to Genome Browsers
- Terminology
- Common browsers: UCSC, IGV, Savant, GBrowse
- Visualizing HT-seq data
- Visualizing unpaired data
- Visualizing paired data
- Finding genetic variants by eye
- Integrating other data sets into a browser
- Making annotations to the data
- Automatic SNP finding
- Laboratory - Variant detection and visualization within the genome using Savant
- Set up a project using local and remote data
- Visualize HT-seq alignments of paired and unpaired reads
- Perform open-ended visual analytical tasks (e.g. identify potential genetic variants in a specified region) using a number of available plugins
- Access remote data (e.g. 1000 Genomes, UCSC)
Day 2
Module 5: RNA Sequence Analysis (M. Griffith)
- Introduction to RNA-seq data
- Various applications of RNA-seq data
- Identification of expressed genes and differential gene expression analysis
- Isoform discovery
- Alternative expression (i.e. differential splicing) analysis
- Allele specific expression analysis
- Fusion gene discovery
- Identification of expressed point mutations and indels
- Overview of analysis strategy for alignment of RNA-seq data to genome, transcriptome and junction databases as it relates to:
- Expression analysis
- Differential gene expression analysis
- Laboratory - Differential expression analysis using ALEXA-seq and R (DESeq)
Module 6: Galaxy (F. Ouellette)
- Lecture - A pipeline tool for high-throughput sequence data analysis
- Laboratory - Galaxy analysis of HT-seq data

