Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
An introduction to quantitative biology and R David Quigley, Ph.D. Helen Diller Comprehensive Cancer Center, UCSF Department of Epidemiology and Biostatistics, UCSF What’s quantitative biology? The process of data analysis Reproducible research An glance at R analysis walk-through High-performance computing at UCSF David Quigley [email protected] What’s quantitative biology? “how many samples?” BIOMEDICAL SCIENCE Cell Biology Molecular Biology Evolutionary & comparative biology “Is it a significant effect?” David Quigley [email protected] Chemical Biology Genetics Clinical trials biostatisticans Study design BIOMEDICAL SCIENCE biologists Cell Biology Evolutionary & comparative biology biostatisticians David Quigley [email protected] statistical inference Molecular Biology Chemical Biology Genetics Clinical trials New techniques bring new data Gene expression microarrays Arrayed SNP chips pooled and arrayed genetic shRNA short read sequencing DNA RNA etc... High throughput mass spectroscopy David Quigley [email protected] Wet lab quantitative David Quigley [email protected] Nik-Zainal Cell 2012 Fullwood Nature 2009 CGAN Nature 2012 New challenges statistical sophistication in study design in interpretation many data points 1,000 to 1,000,000 measurements many false positives software becomes part of the experiment engineering thinking vs. biology thinking David Quigley [email protected] biostatisticans Study design BIOMEDICAL SCIENCE Bioinformatics biologists Cell Biology hybrid researchers bioinformaticians biostatisticians David Quigley [email protected] Evolutionary & comparative biology statistical inference Molecular Biology Chemical Biology Genetics Clinical trials Bioinformatics is multi-disciplinary Biology Computer science statistics machine learning Bioinformatics Working in isolation biologist wet lab experiments bioinformatics data analysis biostatisticans statistical inference biologist biological interpretation The process of quantitative data analysis cells in a dish DMSO library preparation sequencing produce DNA reads drug align reads to genome alignment pipeline normalize data interpret results library preparation cells in a dish DMSO sequencing produce DNA reads drug Primary analysis align reads to genome alignment pipeline Secondary analysis normalize data interpret results Primary analysis specialized tools and packages standardized pipelines develop over time driven by methods Secondary analysis general tools open-ended driven by statistics and biology David Quigley [email protected] Acquiring primary data Pipeline of command-line tools Instrument Tool 1 Tool 2 tool 3 quality control statistical normalization Acquiring primary data engineering instrument native output format spectrographs qPCR cycle files microarray images short sequences David Quigley [email protected] Primary Analysis: what happened? engineering instrument David Quigley [email protected] bioinformatics native output format standardized output spectrographs qPCR cycle files microarray images short sequences protein assignments expression matrixes genome variants Considerations during primary analysis batch effects sample quantity biological artifacts (e.g. GC content) individual assay quality sample quality platform effects operator effects David Quigley [email protected] Normalization challenges vary solved problems microarray expression level taqman expression level genotypes from SNP chips best practices SNV calling from sequence gene-level RNA-seq ChIP-seq open problems mRNA isoform reconstruction tumor clonality analysis from sequence David Quigley [email protected] Secondary analysis: what might it mean? To which DNA sequences does TP53 bind? What mutations are frequent in basal-like breast cancer? Which kinases does my tool compound target? David Quigley [email protected] Chosing quantitative tools Cost Learning curve Ease of use Flexibility ecosystem people other tools David Quigley [email protected] Traditional programming languages Python, C++, Java, others can solve any computable problem creates the fastest tools free requires programming expertise complex to write and test high effort David Quigley [email protected] Specialized single-purpose programs command line tools academic research type commands at a prompt or run scripts PLINK, bowtie, GATK, bedtools GUI (point and click) commercial software for a vendor’s platform slick, opaque, hard/impossible to automate David Quigley [email protected] Commercial statistics programs STATA, SPSS, GraphPad, others 1) Load one dataset 2) Select analysis by clicking on a GUI 3) Generate a report may have a built-in language mature tools Not free David Quigley [email protected] R: a “software environment” Using R is like writing and using software Traditionally, biologists did not do this. David Quigley [email protected] Why is R popular? Open-ended Most statistical problems can be solved Pro statisticians use R Large library of packages package: easy-to use published methods like a Qiagen kit Free! David Quigley [email protected] You use R by typing at the prompt There is no pull-down menu of statistical commands David Quigley [email protected] What’s good about this approach? easy to reproduce runs on anything makes sense to computer programmers re-use and chain analyses together work with multiple datasets use packages of code David Quigley [email protected] What’s hard about this approach? hard to get started cryptic commands built-in help is hard to use David Quigley [email protected] RStudio makes it easier David Quigley [email protected] bioconductor Curated collection of R packages Microarrays, aCGH, sequence analysis, advanced statistics, graphics, lots more bioconductor.org David Quigley [email protected] A gentle introduction to R Excel vs. R Excel Easy tasks are easy non-trivial tasks impossible or expensive No paper trail Mangles gene names Plots look terrible Hard to reproduce and standarize David Quigley [email protected] Comparing Excel and R R Easy jobs are hard at first Non-trivial things are possible Easy to make a paper trail Bioinformatics tools often written for R Can create publication-ready plots David Quigley [email protected] Working with R is interactive type in a line of text R evaluates your words as a command R returns a response, e.g.: printed message new variable picture of a plot Working with R is interactive Simple functions on number lists Create a vector using the c() function Operate on the vector Commands are functions Organizing data Each subject has a row Each feature is a column David Quigley [email protected] R calls columns vectors vectors ordered collections of numbers or words name: [“Flopsy”, “Mopsy”, “Cottontail”, “Peter”] age: [2.5, 2.6, 2.5, 4] David Quigley [email protected] R calls the data set a data frame data frame a list of vectors (columns) that have names elements can be read and written by row & column David Quigley [email protected] I can slice and dice the data frame David Quigley [email protected] I can slice and dice the data frame David Quigley [email protected] I can slice and dice the data frame David Quigley [email protected] I can slice and dice the data frame David Quigley [email protected] I can slice and dice the data frame David Quigley [email protected] I can slice and dice the data frame David Quigley [email protected] I can slice and dice the data frame David Quigley [email protected] I can slice and dice the data frame David Quigley [email protected] You can define your own functions function_name( details about how to do it ) generate sequence from 1 to 5 counting by 0.5 parameters for seq are named from, to, and by David Quigley [email protected] You can define your own functions function_name( details about how to do it ) report the mean of my.data. Result of one function is fed into another one. David Quigley [email protected] You can define your own functions function_name( details about how to do it ) define a new function that adds 2 to whatever it’s passed compare to original value of my.data David Quigley [email protected] Code is a protocol for the computer A program is a series of operations on data Short programs (scripts) are often linear Large programs have decision points “flow control” David Quigley [email protected] Most jobs: data preparation & scripts tools that manipulate text text editing programs (TextPad, BBEdit, Emacs) Python Old-school command line tools (awk) David Quigley [email protected] Reproducible research Replicate a wet lab experiment detailed protocols (not printed in the methods) extensive optimization reagents that might be unique or hard to get techniques that require years of experience David Quigley [email protected] Replicate a dry lab experiment published algorithms (if novel) published source code sometimes “available from the authors” well-specified input and deterministic output no reagents Okay, maybe a supercomputer or cloud How hard could it be? David Quigley [email protected] Many chances to make honest errors Bookkeeping errors Transposed column headers Out-of-date/changed annotations Off-by-one Misunderstood sample labels Batch effects Cryptic cohort stratification Inappropriate analytical methods David Quigley [email protected] Your notebook should be the final product hand-curate metadata; automate the analysis primary data metadata David Quigley [email protected] analysis script figures tables R Markdown David Quigley [email protected] R Markdown David Quigley [email protected] Walk-through a straightforward analysis Primary data from METABRIC study gene expression TP53 sequence 1,400 samples from 5 hospitals Is there an association between breast cancer subtype and TP53 mutation? David Quigley [email protected] Tasks Normalize data batch effects unwanted inter-sample variation Identify outliers associations between p53 and subtype David Quigley [email protected] Quantile Normalization (limma) Force every array to have the same distribution of expression intensities > library(limma) > raw = read.table('raw_extract.txt’, ...) > raw.normalized = normalize.quantiles( raw ) > normalized = log2( raw.normalized ) David Quigley [email protected] Identify batch effects in microarrays gene 1 Principle Components Analysis Identify strongest variation in a matrix gene 2 David Quigley [email protected] Identify batch effects in microarrays gene 1 Principle Components Analysis Identify axes of maximal variation in a matrix gene 2 David Quigley [email protected] Identify batch effects in microarrays Principle Components Analysis Identify axes of maximal variation in a matrix gene 1 gene 1 group A group B gene 2 David Quigley [email protected] gene 2 second component PCA identifies a batch effect hospital 3 (yellow) first component > my.pca = prcmp( t( expression.data ) ) > plot( my.pca, ... ) David Quigley [email protected] batch correction reduces bias second principle component ComBat package reduces user-defined batch effects first principle component David Quigley [email protected] Molecular subtypes of breast carcinoma, defined by gene expression ER status Luminal A N=507 Luminal B N=379 Her2 N=161 > sa = read.table(‘patients.txt’, ...) > tumor.counts = table( sa$ER.status, sa$PAM50Subtype) (convert counts to percentages) > barplot( c( tumor.counts[1], tumor.counts[2] ), col=c(“red”,”green”), ... ) David Quigley [email protected] Basal N=234 Find interactions: TP53 and subtype Fit a linear model: > fitted.model = lm( dependent ~ independent ) Perform Analysis of Variance: > anova( fitted.model ) general form of my analysis: > anova( lm( gene.expression ~ PAM * TP53 ) 18,000 genes PAM: {LumA, LumB, Her2, Basal} TP53: {mutant, WT} David Quigley [email protected] Automate with loops Calculate anova for 18,000 genes by looping through each gene and storing result. > n_genes = 18000 > result = rep( 0, n_genes ) > for( counter in 1:n_genes ){ result[counter] = anova(...) } sort results identify significant interaction David Quigley [email protected] repeat 18,000 times Immune infiltration in TP53-WT Basal CD3E log2 expression log2 expression Does p53 have a role in immune surveillance? absent mild severe infiltration David Quigley [email protected] High-performance computing resources Cluster computing 1 computer 20 hours 20 computers 1 hour Clusters available on campus Institute for Human Genetics Recharge ~800 cores, plenty of disk space HDFCC Cluster Free for small jobs to cancer center members Contribute resources for big jobs ~800 cores, plenty of disk space QB3 Free for small jobs to QB3 members Lots of cores, not much disk space Amazon AWS Infinite capacity, pay by the hour David Quigley [email protected] Next steps: getting help and learning more UCSF resources Library classes and information Formal courses (BMI, Biostatistics) Cores (Computational Biology, Genomics) QBDG monthly methods discussion group David Quigley [email protected] I’m teaching a biostatistics class... Next year’s first year students Online and in-person material Want to TA? Get in touch. David Quigley [email protected] Online classes and blogs Free courses on data analysis http://jhudatascience.org simplystatistics.org Coursera etc... Good tutorials on sequence analysis http://evomics.org/learning David Quigley [email protected] online forums: expert help for free biostars.org all of bioinformatics David Quigley [email protected] online forums: expert help for free biostars.org all of bioinformatics David Quigley [email protected] seqanswers.com Nextgen sequencing online forums: expert help for free seqanswers.com biostars.org Nextgen sequencing all of bioinformatics stats.stackexchange.com statistics David Quigley [email protected] Questions? [email protected]