Exploratory Data Analysis and Essential Statistics using R

Workshop Details

Date: September 6-7, 2012
Location: Downtown Toronto, ON
Lead Faculty (2012): Boris Steipe
Registration Fee for Applications received before August 3, 2012: $500 + HST
Registration Fee for Applications received after August 3, 2012: $700 + HST

Awards available for 2012.
Apply now!


Target Audience
Graduates, postgraduates and PIs who need to design and execute strategies for data analysis but have little or no formal training in statistics and/or little familiarity with the R statistical workbench.

Prerequisites:

  • Your own laptop with R installed. If you do not have access to a laptop, you may borrow one from CBW. Please contact course_info@bioinformatics.ca for more information.
  • Completion of an online tutorial on the installation and basic use of R before the workshop.

Course Objectives
Before we can begin to apply rigorous statistical tools to research data, we often need to approach our data intuitively and look for meaningful associations, surprising patterns, or irregularities in order to formulate hypotheses. This is commonly referred to as Exploratory Data Analysis (EDA). This workshop introduces the essential tools and strategies that are available through the free statistical workbench R. Participants should be able to modify the scripts and protocols we discuss for their own research tasks, identify potential problems with their own data, and define their statistics needs for cases in which expert advice is required. Case studies drawn from common research scenarios, such as microarray data and flow cytometry, will emphasize practical skills. Writing your own R functions and analysis scripts will be introduced at the beginning of the workshop, and these skills will be built on gradually over the course of the lectures. Plotting and visualization are key elements of EDA, and we will build skills step by step: from the elementary built-in routines, via their (sometimes bewildering) array of parameters, to sophisticated, publication-ready presentations.


Course Outline
Each module contains a lecture, break and lab. A comprehensive lecture and laboratory manual will be provided.

Day 1
On the first day, we will work through the big picture: R, the principal strategies for EDA, and basic hypothesis testing.

Module 1: The R Landscape
Students will have installed the program and worked through a self-guided introduction specifically developed for the workshop as part of the pre-reading. The first module expands on the basic use of R and includes:

  • Ice breaking session for participants (promote networking)
  • An overview of R's capabilities, how to expand them through large, community-contributed resources such as CRAN and Bioconductor, and how to keep abreast of best practices
  • Reading and writing data from common biological file-formats, including numeric data, sequences, annotations, and networks
  • The difference between the various types of data objects in R and when each one is appropriate
  • Conditional selections and other filtering approaches
  • First experiments with writing R scripts.
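
To give a flavour of the conditional selections mentioned above, here is a minimal sketch. The small data frame, its gene labels and the threshold of 2 are made up for illustration and are not taken from the workshop material.

```r
# Hypothetical toy table; in practice this might come from read.csv() or read.delim()
df <- data.frame(
  gene = c("A", "B", "C", "D"),
  expr = c(2.5, 8.1, 0.4, 5.0),
  stringsAsFactors = FALSE
)

# A logical vector: TRUE wherever the condition holds
high <- df$expr > 2

# Use the logical vector to select rows, keeping all columns
df_high <- df[high, ]

# subset() expresses the same filter in a more readable form
df_high2 <- subset(df, expr > 2)

print(df_high)
```

Both forms select the same rows; which one reads better is largely a matter of taste.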

Module 2: Exploratory data analysis for biological data
In this module we will discuss the principles of Exploratory Data Analysis (EDA), how to compute descriptive statistical measures, how to smooth and transform data and how to visualize data using R's powerful and flexible plotting routines. Topics include:

  • EDA principles
  • Descriptive statistics: mean/median and variance, quantiles, outliers
  • Transformations and smoothing techniques (e.g. Lowess)
  • Plotting in R: basics, advanced options, special packages and best practices
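
The topics above might look roughly like this in practice. This is only a sketch using R's built-in `cars` data set as a stand-in for research data; the 1.5 × IQR rule used to flag outliers is one common rule of thumb, not necessarily the approach taught in the workshop.

```r
# Built-in example data: speed (mph) and stopping distance (ft)
x <- cars$speed
y <- cars$dist

# Basic descriptive statistics
mean(y)
median(y)
var(y)
quantile(y, probs = c(0.25, 0.5, 0.75))

# Flag candidate outliers with the 1.5 * IQR rule of thumb
q <- quantile(y, probs = c(0.25, 0.75))
fence <- 1.5 * IQR(y)
outliers <- y[y < q[1] - fence | y > q[2] + fence]

# Scatter plot with a lowess smoother drawn on top
plot(x, y, xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
lines(lowess(x, y), col = "red")
```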

Module 3: Hypothesis testing for EDA
Once we have an idea of how to approach our data, we need to establish whether our observations are significant. For example, most genes show differences in expression under testing and control conditions, but how large must a difference be to warrant experimental follow-up? This is the domain of hypothesis testing in statistics. R has many testing protocols built in, and many more can be installed. Topics include:

  • Common statistical tests and their underlying assumptions about the data
  • p-values, distributions, Z-scores and "significance"
  • False positive and false negative error rates
  • Bootstrap and resampling techniques
  • Multiple testing corrections: Bonferroni, family-wise error rate, false discovery rate
  • Non-parametric alternatives
  • Power calculation and sample size
  • Lab Practical: Working with your own data
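
Several of the topics listed above map directly onto functions in base R. The following sketch uses simulated values and an invented vector of p-values purely for illustration; the group sizes, means and thresholds are assumptions, not workshop data.

```r
set.seed(1)  # make the simulated (hypothetical) data reproducible
control <- rnorm(10, mean = 5, sd = 1)
treated <- rnorm(10, mean = 6, sd = 1)

# Welch two-sample t-test (assumes approximately normal data)
t_res <- t.test(treated, control)

# Non-parametric alternative: Wilcoxon rank-sum test
w_res <- wilcox.test(treated, control)

# Multiple testing: adjust an (invented) vector of p-values
p <- c(0.001, 0.01, 0.03, 0.04, 0.20)
p_bonf <- p.adjust(p, method = "bonferroni")  # controls the family-wise error rate
p_fdr  <- p.adjust(p, method = "BH")          # controls the false discovery rate

# Power calculation: sample size per group needed to detect a 1 SD difference
power.t.test(delta = 1, sd = 1, sig.level = 0.05, power = 0.8)
```

Comparing `p_bonf` with `p_fdr` shows how much more conservative the Bonferroni correction is for the same input.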

Day 2
On the second day we will pursue three topical areas of particular importance for EDA:

Module 4: Data reduction
Much of our biological data is very high-dimensional, and accordingly difficult to assess. However, powerful methods exist to simplify the problem. Topics include:

  • Visualizing multi-dimensional data
  • Data reduction with Principal Components Analysis
  • Using explicit models for data reduction
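
As an illustration of data reduction with Principal Components Analysis, the sketch below applies `prcomp()` to the built-in `iris` data. The choice of data set and the decision to scale the variables are assumptions made for this example only.

```r
# Four numeric measurements per flower from the built-in iris data
X <- iris[, 1:4]

# Centre and scale so that no variable dominates because of its units
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
summary(pca)

# Plot the observations in the plane of the first two components
plot(pca$x[, 1], pca$x[, 2],
     col = as.integer(iris$Species),
     xlab = "PC1", ylab = "PC2")
```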

Module 5: Clustering Analysis
A great many clustering methods are in common use in the biological sciences, and that fact alone should warn you that none is appropriate for all data under all conditions. Topics include:

  • Calculating "distance" between (high-dimensional) data points
  • Clustering principles and methods: hierarchical, centroid-based, and information-based approaches in R
  • Assessing the quality of clustering results
  • Density estimation as an alternative
  • Outlook: classification
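
A minimal sketch of two of the clustering approaches named above, again using the built-in `iris` data as a stand-in. The distance measure, linkage method and number of clusters (3) are arbitrary choices for illustration.

```r
# Scale the four measurements so that each contributes comparably to distances
X <- scale(iris[, 1:4])

# Hierarchical clustering on Euclidean distances
d <- dist(X, method = "euclidean")
hc <- hclust(d, method = "average")
groups_hc <- cutree(hc, k = 3)  # cut the tree into 3 groups

# Centroid-based clustering (k-means) with several random starts
set.seed(1)
km <- kmeans(X, centers = 3, nstart = 20)

# Informal check: how far do the two partitions agree?
table(hierarchical = groups_hc, kmeans = km$cluster)

# Density estimation on a single variable as a complementary view
dens <- density(iris$Petal.Length)
plot(dens, main = "Petal length")
```

That the two methods need not agree on the same data is exactly the kind of caution this module raises.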

Module 6: Regression Analysis

  • Types of models for regression analysis in R
  • Linear regression
  • Calculating and plotting residuals
  • Predictions
  • Non-linear regression with arbitrary functions
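
The regression topics above could be sketched as follows. The linear example uses the built-in `cars` data; the non-linear example fits an exponential decay to simulated values, where the model form, the starting values and the noise level are all assumptions chosen for illustration.

```r
# Linear regression: stopping distance as a function of speed
fit <- lm(dist ~ speed, data = cars)
coef(fit)

# Residuals against fitted values; structure here hints at model problems
plot(fitted(fit), resid(fit), xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2)

# Predictions for new speeds, with confidence intervals
predict(fit, newdata = data.frame(speed = c(10, 20)), interval = "confidence")

# Non-linear regression with a user-specified function (simulated data)
set.seed(1)
x <- 1:20
y <- 5 * exp(-0.2 * x) + rnorm(20, sd = 0.05)
nfit <- nls(y ~ a * exp(b * x), start = list(a = 4, b = -0.1))
coef(nfit)
```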



Pre-Reading
You need to complete our introductory R tutorial for the course beforehand. The tutorial is very accessible and designed for students who have never used R before. Please navigate to: http://www.biochemistry.utoronto.ca/steipe/R